Overview
Wirekite supports Apache Cassandra as a source database for:
- Schema Extraction: Read a keyspace’s table definitions into Wirekite’s intermediate .skt format.
- Data Extraction: Bulk extract table data for an initial load into the target.
- Target Sync: Diff the Cassandra source against the target and emit the change records that reconcile them.
How Cassandra Fits Target Sync
A Cassandra source run follows the standard Target Sync pipeline:
1. Schema migrate: The target’s schema loader turns the .skt into target DDL, which is applied to create the target tables (with createMergeTables=true).
2. Data extract & load: The data extractor writes a primary-key-ordered .dkt snapshot; the target’s data loader bulk-loads it.
3. Validate & sync: The TableValidator diffs source against target; any drift is emitted as change records and applied, then re-validated.
Prerequisites
Connectivity
- Driver: Wirekite connects with the gocql driver over the native CQL protocol (default port 9042).
- Contact points: One or more reachable ring nodes. Wirekite discovers the rest of the ring from the contact points unless host lookup is disabled (see Connection String).
- Consistency: Reads default to LOCAL_QUORUM. Target Sync assumes a static source during a run and does not take a point-in-time snapshot; a later run re-converges any changes made during the previous one.
Permissions
The connecting role needs:
- SELECT on the keyspace tables being extracted. The data extractor runs full token-range scans over them.
- Read access to system_schema. The schema extractor reads system_schema.columns (and system_schema.types for UDTs). Authenticated roles can read the system keyspaces by default.
Wirekite only reads from Cassandra — it creates no tables, triggers, or tracking state on the source. Extraction progress is tracked in local files on the Wirekite host, not in the cluster.
Connection String
The Cassandra DSN is a single line in the dsnFile, made up of the following components:
| Component | Default | Notes |
|---|---|---|
| user:pass@ | none | Optional. Enables password authentication. |
| host1,host2,… | required | One or more contact points, comma-separated. |
| :port | 9042 | A trailing numeric port applies to all hosts. |
| /keyspace | none | Optional. The schema extractor reads system_schema regardless of the session keyspace. |
| ?consistency= | LOCAL_QUORUM | One of ANY, ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, LOCAL_ONE. |
| ?disable_host_lookup= | false | When true, use only the listed contact point(s) and skip system.peers discovery. Needed when the ring’s advertised IPs aren’t routable from the Wirekite host (e.g. a Docker-bridged cluster). |
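To make the component table concrete, here is a minimal Python sketch of how such a DSN could be decomposed. The exact DSN grammar is not shown in this section, so the layout `[user:pass@]host1,host2[:port][/keyspace][?key=value&...]` with no scheme prefix is an assumption, as are the function name `parse_cassandra_dsn` and the keys of the returned dictionary. It is an illustration of the defaults in the table, not Wirekite’s parser.

```python
from urllib.parse import parse_qs

# Consistency levels listed in the component table.
VALID_CONSISTENCY = {
    "ANY", "ONE", "TWO", "THREE", "QUORUM",
    "ALL", "LOCAL_QUORUM", "EACH_QUORUM", "LOCAL_ONE",
}


def parse_cassandra_dsn(dsn: str) -> dict:
    """Split an ASSUMED '[user:pass@]hosts[:port][/keyspace][?opts]' DSN."""
    rest = dsn.strip()

    # Query options come last; keep the final value if a key repeats.
    rest, _, query = rest.partition("?")
    opts = {key: values[-1] for key, values in parse_qs(query).items()}

    # Optional credentials come first.
    user = password = None
    if "@" in rest:
        creds, _, rest = rest.rpartition("@")
        user, _, password = creds.partition(":")

    # Optional keyspace follows the host list.
    rest, _, keyspace = rest.partition("/")

    # A trailing numeric port applies to every host; default is 9042.
    port = 9042
    head, sep, tail = rest.rpartition(":")
    if sep and tail.isdigit():
        rest, port = head, int(tail)

    hosts = [h for h in rest.split(",") if h]
    if not hosts:
        raise ValueError("DSN needs at least one contact point")

    consistency = opts.get("consistency", "LOCAL_QUORUM").upper()
    if consistency not in VALID_CONSISTENCY:
        raise ValueError(f"unknown consistency level: {consistency}")

    lookup_off = opts.get("disable_host_lookup", "false").lower() == "true"
    return {
        "user": user,
        "password": password,
        "hosts": hosts,
        "port": port,
        "keyspace": keyspace or None,
        "consistency": consistency,
        "disable_host_lookup": lookup_off,
    }
```

The sketch does not handle IPv6 literals or credentials containing `?`; a real parser would need to define escaping rules for both.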
Schema Extractor
The Schema Extractor reads table definitions from Cassandra’s system_schema and outputs them to Wirekite’s intermediate schema format (wirekite_schema.skt). For each keyspace.table it maps each CQL column type to a Wirekite type, derives the primary key from the partition and clustering columns, and synthesizes sensible bounds for the precisionless CQL types (see the Datatype Matrix).
Required Parameters
- Path to a file containing the Cassandra connection string (one line). See Connection String.
- Path to a file listing the tables to extract, one per line in keyspace.table format. Blank lines and lines starting with # are ignored.
- Absolute path to the directory where Wirekite writes wirekite_schema.skt. The directory must exist and be writable.
- Absolute path to the schema extractor’s log file.
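The table-list format described above (one keyspace.table per line, blank lines and # comments ignored) can be sketched in a few lines of Python. The function name `parse_tables_file` and the choice to reject malformed lines with an error are assumptions for illustration; the source does not say how Wirekite reports a bad entry.

```python
def parse_tables_file(text: str) -> list[tuple[str, str]]:
    """Return (keyspace, table) pairs from the body of a tables file."""
    tables = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Assumption: surrounding whitespace is not significant.
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        keyspace, sep, table = line.partition(".")
        if not sep or not keyspace or not table:
            raise ValueError(
                f"line {lineno}: expected keyspace.table, got {line!r}"
            )
        tables.append((keyspace, table))
    return tables
```

For example, a file containing a comment line, `shop.orders`, a blank line, and `shop.customers` yields two pairs in file order.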
Optional Parameters
- Map a source keyspace to a different target schema name (schemaRename). Format: srcKeyspace:tgtSchema[,src2:tgt2,...]. For example, app_keyspace:app_schema writes app_schema as the schema name in the output. Must match the schemaRename used in the data extractor and the emit stage.
A CQL primary key (partition_key…, clustering…) is emitted as the target’s composite primary key in declaration order. Primary-key columns are non-nullable; all other columns are nullable. Static columns are emitted as ordinary (denormalized) columns.
Data Extractor
The Data Extractor performs the initial bulk extraction, writing a per-table snapshot to Wirekite’s intermediate data format (.dkt files) that the target’s data loader applies. For parallelism it uses the standard Cassandra full-scan pattern: it splits the Murmur3 token ring into ranges and scans each concurrently across worker threads, spread over the ring’s nodes. Each (table, range) writes its own schema.table.N.dkt file.
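The full-scan pattern described above can be sketched as follows: divide the Murmur3 token range into contiguous slices, then issue one token-bounded query per slice. This is a minimal Python illustration, not Wirekite’s code. The helper names, the `SELECT *` projection, and the inlined literal bounds are assumptions made for readability (a real driver call would use bind parameters), and the slice count uses the maxThreads × 8 default given in this document.

```python
# Murmur3Partitioner tokens are signed 64-bit integers.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def default_num_splits(max_threads: int) -> int:
    """Default slice count stated in this document: maxThreads x 8."""
    return max_threads * 8


def split_token_ring(num_splits: int) -> list[tuple[int, int]]:
    """Return num_splits contiguous (start, end] slices covering the ring."""
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    span = MAX_TOKEN - MIN_TOKEN
    # Integer arithmetic keeps the boundaries exact at 64-bit scale.
    bounds = [MIN_TOKEN + span * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(keyspace, table, partition_key_cols, start, end, first=False):
    """Build the CQL text for one slice; only the first slice includes start."""
    token = f"token({', '.join(partition_key_cols)})"
    lower = ">=" if first else ">"
    return (
        f"SELECT * FROM {keyspace}.{table} "
        f"WHERE {token} {lower} {start} AND {token} <= {end}"
    )
```

Each slice is an independent unit of work, which is what allows the slices to be handed to separate workers and written to separate output files.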
Required Parameters
- Path to a file containing the Cassandra connection string.
- Path to a file listing the tables to extract, one per line in keyspace.table format.
- Absolute path to the directory where Wirekite writes data files (schema.table.N.dkt).
- Absolute path to the data extractor’s log file.
Optional Parameters
- Number of parallel scan workers (maxThreads). Token ranges are distributed across these workers, so raising it increases concurrency across the ring. Bound it by the cluster’s capacity.
- Number of token-ring slices per table (the parallelism granularity). Defaults to maxThreads × 8, which provides enough jobs to keep every worker busy and even out token skew. Set it explicitly to tune for very large or very skewed tables.
- When true, binary and string data is encoded as hexadecimal instead of base64. This produces larger files, but may be required for certain target loaders.
- Map a source keyspace to a different output schema name. Format: srcKeyspace:tgtSchema[,...]. Must match the value used elsewhere in the run.
Extraction is crash-recoverable: each .dkt file is tracked locally, so a re-run skips ranges that already finished and resumes where it left off. A DATA.DONE marker is written only once every range completes.
Validation & Emit
The diff/emit stage is the TableValidator reading a Cassandra source. Because Cassandra cannot stream rows in primary-key order (it orders by token(), a hash), the validator uses a bounded-footprint (pk, hash) projection: it projects each source row to a primary key plus a row hash, sorts and merges those against the target’s same projection, and re-fetches the full rows only for the mismatches. This keeps memory bounded (≈30M rows validate in ~3 GB) instead of materializing whole tables.
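The (pk, hash) technique described above can be illustrated with a small Python sketch: reduce every row to its primary key and a digest, sort both sides, and walk them in a single merge pass. The function names, the SHA-256 digest over a sorted `repr` of the row, and the insert/update/delete labels are assumptions chosen for the example; the source does not specify the TableValidator’s hash function or change-record format.

```python
import hashlib


def project(rows, pk_col):
    """Reduce each row (a dict) to (pk, row_hash) and sort by pk."""
    out = []
    for row in rows:
        # Hash a canonical rendering so column order does not matter.
        payload = repr(sorted(row.items())).encode("utf-8")
        out.append((row[pk_col], hashlib.sha256(payload).hexdigest()))
    out.sort()
    return out


def diff_projections(source, target):
    """Merge two pk-sorted projections; return pks to insert, update, delete."""
    inserts, updates, deletes = [], [], []
    i = j = 0
    while i < len(source) and j < len(target):
        (spk, shash), (tpk, thash) = source[i], target[j]
        if spk == tpk:
            if shash != thash:
                updates.append(spk)  # same key, different content
            i += 1
            j += 1
        elif spk < tpk:
            inserts.append(spk)  # present in source only
            i += 1
        else:
            deletes.append(tpk)  # present in target only
            j += 1
    inserts.extend(pk for pk, _ in source[i:])
    deletes.extend(pk for pk, _ in target[j:])
    return inserts, updates, deletes
```

Only the keys returned by the merge need their full rows re-read, which is why the working set stays proportional to the projection rather than to the table.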
Set these keys on the emit/validate config (see Target Sync § Stage 1):
- When true, uses the bounded (pk, hash) diff. Recommended. Applies to single-PK tables; tables with a composite primary key automatically fall back to a whole-table sort-merge.
- In-memory ceiling for the (pk, hash) projection before it spills to a k-way merge on disk. Trade RAM for spill I/O.
- In-memory ceiling for the whole-row sort (the composite-PK fallback path).
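The spill behavior behind these ceilings is a form of external sort: fill a buffer up to the limit, write it out as a sorted run, then k-way merge the runs. The Python sketch below shows the idea under stated assumptions. The ceiling is counted in items here (the source does not say whether Wirekite’s ceiling is in rows or bytes), and the function names and pickle-based run files are hypothetical.

```python
import heapq
import os
import pickle
import tempfile


def _write_run(sorted_items, tmpdir):
    """Persist one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for item in sorted_items:
            pickle.dump(item, f)
    return path


def _read_run(path):
    """Stream the items of one run back in their stored (sorted) order."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def sorted_with_spill(items, max_in_memory):
    """Yield items in sorted order, buffering at most max_in_memory of them."""
    with tempfile.TemporaryDirectory() as tmpdir:
        runs, buffer = [], []
        for item in items:
            buffer.append(item)
            if len(buffer) >= max_in_memory:
                buffer.sort()
                runs.append(_write_run(buffer, tmpdir))  # spill to disk
                buffer = []
        buffer.sort()
        # k-way merge of the on-disk runs plus the in-memory remainder.
        yield from heapq.merge(buffer, *(_read_run(p) for p in runs))
```

A lower ceiling produces more, smaller runs and therefore more merge I/O, which is the RAM-for-spill trade-off mentioned above.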
Resilience
A full token-range scan over a large ring takes minutes, and a stressed ring can stall mid-scan. Wirekite wraps the Cassandra reads in a transient-retry and session-recreate layer: transient errors are retried with backoff, and a wedged connection pool (no hosts available) is healed by recreating the gocql session. Re-scanned boundary rows are deduplicated by primary key during the sort. This lets a long scan survive a degraded cluster — a 10M-row run has survived 16 pool-wedge recoveries and still passed validation.
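The retry-and-recreate behavior described above can be pictured with the following Python sketch. It is a generic pattern, not Wirekite’s implementation. The exception classes stand in for the driver’s transient and no-hosts-available errors, and `read_with_recovery`, `session_factory`, `read_fn`, the attempt limit, and the backoff values are all hypothetical.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable read error (for example, a timeout)."""


class NoHostsAvailable(Exception):
    """Stand-in for a wedged connection pool."""


def read_with_recovery(session_factory, read_fn, max_attempts=5,
                       base_delay=0.5, sleep=time.sleep):
    """Run read_fn(session), retrying with backoff and healing the session."""
    session = session_factory()
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn(session)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: surface the error
        except NoHostsAvailable:
            if attempt == max_attempts:
                raise
            session = session_factory()  # replace the wedged session
        # Exponential backoff before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
```

Because a retried range may return rows that were already read, a pattern like this relies on downstream deduplication by primary key, as noted above.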
Limitations
- Source only, Target Sync only. No CDC, and Cassandra is not supported as a target.
- Text is bounded to 4000 characters. Cassandra text has no declared length; Wirekite synthesizes a bound so strict target loaders accept it. Lossy for genuinely large text; see the Datatype Matrix.
- Collections and UDTs become opaque JSON. list, set, map, tuple, and user-defined types are serialized to a single JSON column; they are not exploded into child tables or normalized, and cannot be part of a primary key.
- Static columns are not modeled as static; they are denormalized onto every row in the partition.
- counter columns are read as their current point-in-time value.
- No point-in-time snapshot. Target Sync assumes a static source during a run; changes made mid-run are reconciled by a later run.
Related
- Target Sync: The one-shot reconciliation feature Cassandra sources flow through.
- Cassandra Datatype Matrix: Full CQL-to-target type mapping and the synthesized bounds.
