Cassandra - Wirekite Docs

Overview

Wirekite supports Apache Cassandra as a source database for:

Schema Extraction — Read a keyspace’s table definitions into Wirekite’s intermediate .skt format.
Data Extraction — Bulk extract table data for an initial load into the target.
Target Sync — Diff the Cassandra source against the target and emit the change records that reconcile them.

Cassandra is supported in one direction only: as a source, and only through Target Sync. There is no CDC or streaming replication path for Cassandra. Commit-log CDC is per-node and requires replica deduplication and back-pressure handling; Target Sync sidesteps all of that by comparing the source and target in primary-key order and emitting only the differences.

Cassandra is mostly relational in shape, which is what makes this work: CQL tables have a fixed, declared schema; the primary key (partition key + clustering columns) becomes a relational composite primary key; and scalar columns map one-to-one. The only impedance mismatch is complex column types (collections, UDTs, tuples), which are carried as JSON. See the Cassandra Datatype Matrix for the full mapping.

How Cassandra Fits Target Sync

A Cassandra source run follows the standard Target Sync pipeline:

Schema extract

The schema extractor reads system_schema and writes wirekite_schema.skt.

Schema migrate

The target’s schema loader turns the .skt into target DDL, which is applied to create the target tables (with createMergeTables=true).

Data extract & load

The data extractor writes a primary-key-ordered .dkt snapshot; the target’s data loader bulk-loads it.

Validate & sync

The TableValidator diffs source against target; any drift is emitted as change records and applied, then re-validated.

For the end-to-end mechanics (emit + apply stages, config files, resume), see the Target Sync guide. This page covers the Cassandra-specific source configuration.

Prerequisites

Connectivity

Driver: Wirekite connects with the gocql driver over the native CQL protocol (default port 9042).
Contact points: One or more reachable ring nodes. Wirekite discovers the rest of the ring from the contact points unless host lookup is disabled (see Connection String).
Consistency: Reads default to LOCAL_QUORUM. Target Sync assumes a static source during a run — it does not take a point-in-time snapshot; a later run re-converges any changes made during the previous one.

Permissions

The connecting role needs:

SELECT on the keyspace tables being extracted — the data extractor runs full token-range scans over them.
Read access to system_schema — the schema extractor reads system_schema.columns (and system_schema.types for UDTs). Authenticated roles can read the system keyspaces by default.

-- Grant read access to the keyspace being extracted
GRANT SELECT ON KEYSPACE app_keyspace TO wirekite;

Wirekite only reads from Cassandra — it creates no tables, triggers, or tracking state on the source. Extraction progress is tracked in local files on the Wirekite host, not in the cluster.

Connection String

The Cassandra DSN is a single line in the dsnFile:

cassandra://[user:pass@]host1[,host2,...][:port]/[keyspace][?option=value&...]

Examples:

# Single node, default port (9042) and consistency (LOCAL_QUORUM)
cassandra://10.0.0.5/app_keyspace

# Authenticated, multi-node contact points, explicit port
cassandra://wirekite:secret@10.0.0.5,10.0.0.6,10.0.0.7:9042/app_keyspace

# Reach a docker-bridged ring from another host
cassandra://10.0.0.5/app_keyspace?disable_host_lookup=true&consistency=LOCAL_QUORUM

| Component | Default | Notes |
| --- | --- | --- |
| `user:pass@` | none | Optional. Enables password authentication. |
| `host1,host2,…` | (required) | One or more contact points, comma-separated. |
| `:port` | `9042` | A trailing numeric port applies to all hosts. |
| `/keyspace` | none | Optional. The schema extractor reads `system_schema` regardless of the session keyspace. |
| `?consistency=` | `LOCAL_QUORUM` | One of `ANY`, `ONE`, `TWO`, `THREE`, `QUORUM`, `ALL`, `LOCAL_QUORUM`, `EACH_QUORUM`, `LOCAL_ONE`. |
| `?disable_host_lookup=` | `false` | When `true`, use only the listed contact point(s) and skip `system.peers` discovery. Needed when the ring’s advertised IPs aren’t routable from the Wirekite host (e.g. a Docker-bridged cluster). |
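
To make the DSN grammar concrete, here is a minimal Python sketch that splits a DSN into the components listed in the table. It is an illustration only, not Wirekite's parser: the `CassandraDSN` class and `parse_dsn` function are hypothetical names, and the handling of edge cases (percent-encoded credentials, case of option values) is an assumption.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, unquote, urlsplit

CONSISTENCY_LEVELS = {
    "ANY", "ONE", "TWO", "THREE", "QUORUM",
    "ALL", "LOCAL_QUORUM", "EACH_QUORUM", "LOCAL_ONE",
}


@dataclass
class CassandraDSN:
    """Hypothetical container for the parsed DSN components."""
    hosts: list
    port: int = 9042
    keyspace: str = ""
    user: str = ""
    password: str = ""
    consistency: str = "LOCAL_QUORUM"
    disable_host_lookup: bool = False


def parse_dsn(dsn):
    parts = urlsplit(dsn.strip())
    if parts.scheme != "cassandra":
        raise ValueError("DSN must start with cassandra://")

    # Optional user:pass@ prefix.
    netloc, user, password = parts.netloc, "", ""
    if "@" in netloc:
        creds, netloc = netloc.rsplit("@", 1)
        user, _, password = creds.partition(":")

    # A trailing numeric port applies to every host in the list.
    port = 9042
    head, sep, tail = netloc.rpartition(":")
    if sep and tail.isdigit():
        netloc, port = head, int(tail)

    hosts = [h for h in netloc.split(",") if h]
    if not hosts:
        raise ValueError("DSN needs at least one contact point")

    # Keep the last value if an option is repeated (an assumption).
    opts = {k: v[-1] for k, v in parse_qs(parts.query).items()}
    consistency = opts.get("consistency", "LOCAL_QUORUM").upper()
    if consistency not in CONSISTENCY_LEVELS:
        raise ValueError(f"unknown consistency level: {consistency}")

    return CassandraDSN(
        hosts=hosts,
        port=port,
        keyspace=parts.path.lstrip("/"),
        user=unquote(user),
        password=unquote(password),
        consistency=consistency,
        disable_host_lookup=opts.get("disable_host_lookup", "false").lower() == "true",
    )
```

Running `parse_dsn` on the three example DSNs above is a quick way to check that a connection string has the shape you intended before handing it to a real run.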

Schema Extractor

The Schema Extractor reads table definitions from Cassandra’s system_schema and outputs them to Wirekite’s intermediate schema format (wirekite_schema.skt). For each keyspace.table it maps each CQL column type to a Wirekite type, derives the primary key from the partition and clustering columns, and synthesizes sensible bounds for the precisionless CQL types (see the Datatype Matrix).

Required Parameters

dsnFile (string, required)

Path to a file containing the Cassandra connection string (one line). See Connection String.

tablesFile (string, required)

Path to a file listing the tables to extract, one per line in keyspace.table format. Blank lines and lines starting with # are ignored.

Example tablesFile contents:

app_keyspace.users
app_keyspace.orders
analytics.events
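
The tablesFile rules above (one keyspace.table per line, blank lines and # lines ignored) can be sketched in a few lines of Python. The function name `parse_tables_file` is hypothetical, and stripping surrounding whitespace before checking for # is an assumption rather than documented behavior.

```python
def parse_tables_file(text):
    """Return (keyspace, table) pairs from tablesFile contents."""
    tables = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Blank lines and comment lines are skipped.
        if not line or line.startswith("#"):
            continue
        keyspace, sep, table = line.partition(".")
        if not sep or not keyspace or not table or "." in table:
            raise ValueError(f"line {lineno}: expected keyspace.table, got {line!r}")
        tables.append((keyspace, table))
    return tables
```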

outputDirectory (string, required)

Absolute path to the directory where Wirekite writes wirekite_schema.skt. The directory must exist and be writable.

logFile (string, required)

Absolute path to the schema extractor’s log file.

Optional Parameters

schemaRename (string, default: "none")

Map a source keyspace to a different target schema name. Format: srcKeyspace:tgtSchema[,src2:tgt2,...]. For example, app_keyspace:app_schema writes app_schema as the schema name in the output. Must match the schemaRename used in the data extractor and the emit stage.
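
As an illustration of the srcKeyspace:tgtSchema format, the following Python sketch parses a schemaRename value and applies it to a qualified table name. The function names are hypothetical, and treating the literal value "none" as "no renames" is an assumption based on the documented default.

```python
def parse_schema_rename(value):
    """Parse 'src:tgt[,src2:tgt2,...]' into a dict of keyspace renames."""
    if not value or value == "none":
        return {}
    mapping = {}
    for pair in value.split(","):
        src, sep, tgt = pair.strip().partition(":")
        if not sep or not src or not tgt:
            raise ValueError(f"expected srcKeyspace:tgtSchema, got {pair!r}")
        mapping[src] = tgt
    return mapping


def apply_rename(qualified_name, mapping):
    """Rewrite keyspace.table using the mapping; unmapped keyspaces pass through."""
    keyspace, table = qualified_name.split(".", 1)
    return f"{mapping.get(keyspace, keyspace)}.{table}"
```

Because every stage must agree on the output schema name, it helps to keep the value in one place and pass the same string to each stage's configuration.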

A CQL primary key (partition_key…, clustering…) is emitted as the target’s composite primary key in declaration order. Primary-key columns are non-nullable; all other columns are nullable. Static columns are emitted as ordinary (denormalized) columns.
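
The primary-key derivation described above can be sketched from the `kind` and `position` fields that `system_schema.columns` stores for each column. This is a simplified illustration under the assumption that rows have already been fetched into dictionaries; the function names are hypothetical and this is not Wirekite's extractor code.

```python
# Partition-key columns come first, then clustering columns.
KIND_RANK = {"partition_key": 0, "clustering": 1}


def derive_primary_key(columns):
    """Return primary-key column names in declaration order."""
    key_cols = [c for c in columns if c["kind"] in KIND_RANK]
    key_cols.sort(key=lambda c: (KIND_RANK[c["kind"]], c["position"]))
    return [c["column_name"] for c in key_cols]


def nullable_columns(columns):
    """Every non-key column (regular or static) is treated as nullable."""
    return [c["column_name"] for c in columns if c["kind"] not in KIND_RANK]


# Sample rows shaped like system_schema.columns (values are made up).
sample = [
    {"column_name": "created_at", "kind": "clustering", "position": 0},
    {"column_name": "total", "kind": "regular", "position": -1},
    {"column_name": "region", "kind": "partition_key", "position": 1},
    {"column_name": "tenant_id", "kind": "partition_key", "position": 0},
    {"column_name": "plan", "kind": "static", "position": -1},
]
print(derive_primary_key(sample))
print(nullable_columns(sample))
```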

Data Extractor

The Data Extractor performs the initial bulk extraction, writing a per-table snapshot to Wirekite’s intermediate data format (.dkt files) that the target’s data loader applies. For parallelism it uses the standard Cassandra full-scan pattern: it splits the Murmur3 token ring into ranges and scans them concurrently across worker threads, spreading the load over the ring’s nodes. Each (table, range) pair is written to its own schema.table.N.dkt file.
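
The full-scan pattern can be illustrated with a short Python sketch that divides the Murmur3 token space (signed 64-bit integers) into contiguous ranges and builds one range query per slice. The function names and the exact query text are assumptions for illustration; Wirekite's own range boundaries and column projection may differ.

```python
MIN_TOKEN = -(2 ** 63)      # smallest Murmur3 token
MAX_TOKEN = 2 ** 63 - 1     # largest Murmur3 token


def split_token_ring(n):
    """Split the token ring into n contiguous (start, end] ranges."""
    if n < 1:
        raise ValueError("need at least one range")
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // n for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(keyspace, table, partition_cols, index, start, end):
    """Build the CQL scan for one slice; the first slice includes its lower bound."""
    tok = f"token({', '.join(partition_cols)})"
    lower = ">=" if index == 0 else ">"
    return (
        f"SELECT * FROM {keyspace}.{table} "
        f"WHERE {tok} {lower} {start} AND {tok} <= {end}"
    )


# Example: 5 workers x 8 slices each = 40 ranges for one table.
ranges = split_token_ring(40)
queries = [
    range_query("app_keyspace", "users", ["id"], i, start, end)
    for i, (start, end) in enumerate(ranges)
]
```

Each query touches a disjoint slice of the ring, so the slices can be handed to workers in any order and their outputs written to separate files.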

Required Parameters

dsnFile (string, required)

Path to a file containing the Cassandra connection string.

tablesFile (string, required)

Path to a file listing the tables to extract, one per line in keyspace.table format.

outputDirectory (string, required)

Absolute path to the directory where Wirekite writes data files (schema.table.N.dkt).

logFile (string, required)

Absolute path to the data extractor’s log file.

Optional Parameters

maxThreads (integer, default: 5)

Number of parallel scan workers. Token ranges are distributed across these workers, so raising the value increases concurrency across the ring. Keep it within what the cluster can serve without degrading other workloads.

tokenRanges (integer)

Number of token-ring slices per table (the parallelism granularity). Defaults to maxThreads × 8, which provides enough jobs to keep every worker busy and even out token skew. Set it explicitly to tune for very large or very skewed tables.

hexEncoding (boolean, default: false)

When true, binary and string data is encoded as hexadecimal instead of base64. This produces larger files, but may be required by certain target loaders.
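
The size trade-off between the two encodings is easy to see in a small Python sketch. The `encode_value` helper is hypothetical and only demonstrates the encodings themselves; it makes no claim about how values are laid out inside a .dkt file.

```python
import base64


def encode_value(raw, hex_encoding=False):
    """Encode raw bytes as hexadecimal or base64 text."""
    if hex_encoding:
        return raw.hex()
    return base64.b64encode(raw).decode("ascii")


payload = b"wirekite"                      # 8 bytes
as_base64 = encode_value(payload)          # 4 characters per 3 bytes, padded
as_hex = encode_value(payload, True)       # 2 characters per byte
print(len(as_base64), len(as_hex))
```

Hexadecimal always doubles the byte count, while base64 grows it by about a third, which is where the larger files come from.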

schemaRename (string, default: "none")

Map a source keyspace to a different output schema name. Format: srcKeyspace:tgtSchema[,...]. Must match the value used elsewhere in the run.

Extraction is crash-recoverable: each .dkt file is tracked locally, so a re-run skips ranges that already finished and resumes where it left off. A DATA.DONE marker is written only once every range completes.

Validation & Emit

The diff/emit stage is the TableValidator reading a Cassandra source. Because Cassandra cannot stream rows in primary-key order — it orders by token(), a hash — the validator uses a bounded-footprint (pk, hash) projection: it projects each source row to a primary key plus a row hash, sorts and merges those against the target’s same projection, and re-fetches the full rows only for the mismatches. This keeps memory bounded (≈30M rows validate in ~3 GB) instead of materializing whole tables. Set these keys on the emit/validate config (see Target Sync § Stage 1):

cassandraPKHashValidation (boolean)

When true, uses the bounded (pk, hash) diff. Recommended. It applies to tables with a single-column primary key; tables with a composite primary key automatically fall back to a whole-table sort-merge.

cassandraPKHashMemRows (integer)

In-memory ceiling for the (pk, hash) projection before it spills to a k-way merge on disk. A higher value uses more RAM and reduces spill I/O; a lower value does the opposite.

cassandraSortMemRows (integer)

In-memory ceiling for the whole-row sort (the composite-PK fallback path).
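
The (pk, hash) diff described in this section can be sketched as a sort followed by a two-pointer merge. The Python below is a simplified, fully in-memory illustration: the function names, the use of SHA-256 over a `repr` of the row, and the insert/update/delete labels are all assumptions, and it omits the disk spill and k-way merge that bound memory in a real run.

```python
import hashlib


def project(rows, pk_col):
    """Reduce each row to (primary key, row hash) and sort by primary key."""
    projected = []
    for row in rows:
        canonical = repr(sorted(row.items())).encode("utf-8")
        projected.append((row[pk_col], hashlib.sha256(canonical).hexdigest()))
    projected.sort()
    return projected


def diff(source, target):
    """Merge two sorted projections and return keys that need changes."""
    inserts, updates, deletes = [], [], []
    i = j = 0
    while i < len(source) and j < len(target):
        (spk, shash), (tpk, thash) = source[i], target[j]
        if spk == tpk:
            if shash != thash:
                updates.append(spk)   # same key, different content
            i += 1
            j += 1
        elif spk < tpk:
            inserts.append(spk)       # present only in the source
            i += 1
        else:
            deletes.append(tpk)       # present only in the target
            j += 1
    inserts.extend(pk for pk, _ in source[i:])
    deletes.extend(pk for pk, _ in target[j:])
    return inserts, updates, deletes


src = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
tgt = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}, {"id": 4, "v": "d"}]
print(diff(project(src, "id"), project(tgt, "id")))
```

Only the keys returned by the merge need their full rows re-fetched, which is what keeps the footprint small relative to comparing whole tables.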

Resilience

A full token-range scan over a large ring takes minutes, and a stressed ring can stall mid-scan. Wirekite wraps the Cassandra reads in a transient-retry and session-recreate layer: transient errors are retried with backoff, and a wedged connection pool (no hosts available) is healed by recreating the gocql session. Re-scanned boundary rows are deduplicated by primary key during the sort. This lets a long scan survive a degraded cluster — a 10M-row run has survived 16 pool-wedge recoveries and still passed validation.
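
The retry and session-recreate behavior can be pictured with a small Python sketch. Everything here is a stand-in: the exception classes, `read_with_recovery`, and its parameters are hypothetical, and the exponential backoff schedule is an assumption, since the actual layer is built around the gocql session and its retry policy is not documented here.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable read error such as a timeout."""


class NoHostsAvailable(Exception):
    """Stand-in for a wedged connection pool."""


def read_with_recovery(read, make_session, max_attempts=5,
                       base_delay=0.5, sleep=time.sleep):
    """Run read(session), retrying with backoff and healing a wedged pool."""
    session = make_session()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return read(session)
        except NoHostsAvailable as err:
            last_error = err
            session = make_session()      # recreate the session to heal the pool
        except TransientError as err:
            last_error = err              # keep the session, just retry
        if attempt + 1 < max_attempts:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"read failed after {max_attempts} attempts") from last_error
```

Because a retried range may re-read rows near its boundary, the caller still needs the primary-key deduplication step mentioned above.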

Limitations

Source only, Target Sync only. No CDC, and Cassandra is not supported as a target.
Text is bounded to 4000 characters. Cassandra text has no declared length, so Wirekite synthesizes a bound that strict target loaders accept. This is lossy for text values longer than the bound; see the Datatype Matrix.
Collections and UDTs become opaque JSON. list, set, map, tuple, and user-defined types are serialized to a single JSON column — they are not exploded into child tables or normalized, and cannot be part of a primary key.
Static columns are not modeled as static — they are denormalized onto every row in the partition.
counter columns are read as their current point-in-time value.
No point-in-time snapshot. Target Sync assumes a static source during a run; changes made mid-run are reconciled by a later run.

Related

Target Sync

The one-shot reconciliation feature Cassandra sources flow through.

Cassandra Datatype Matrix

Full CQL-to-target type mapping and the synthesized bounds.
