> ## Documentation Index > Fetch the complete documentation index at: https://docs.wirekite.io/llms.txt > Use this file to discover all available pages before exploring further. # Cassandra > This guide explains how to configure Apache Cassandra as a source database for Wirekite Target Sync. ## Overview Wirekite supports Apache Cassandra as a source database for: * **Schema Extraction** — Read a keyspace's table definitions into Wirekite's intermediate `.skt` format. * **Data Extraction** — Bulk extract table data for an initial load into the target. * **Target Sync** — Diff the Cassandra source against the target and emit the change records that reconcile them. Cassandra is supported **one-directionally, as a source, for [Target Sync](/run/targetsync) only.** There is **no CDC / streaming replication** path for Cassandra. Commit-log CDC is per-node, needs replica deduplication and back-pressure handling; Target Sync sidesteps all of it by walking the source and target in primary-key order and emitting only the differences. Cassandra is mostly relational-shaped, which is what makes this work: CQL tables have a fixed, declared schema; the primary key (partition key + clustering columns) becomes a relational composite primary key; and scalar columns map one-to-one. The only impedance is complex column types (collections, UDTs, tuples), which are carried as JSON. See the [Cassandra Datatype Matrix](/datatypes/cassandra) for the full mapping. ## How Cassandra Fits Target Sync A Cassandra source run follows the standard Target Sync pipeline: The schema extractor reads `system_schema` and writes `wirekite_schema.skt`. The target's schema loader turns the `.skt` into target DDL, which is applied to create the target tables (with `createMergeTables=true`). The data extractor writes a primary-key-ordered `.dkt` snapshot; the target's data loader bulk-loads it. The [TableValidator](/run/validation) diffs source against target; any drift is emitted as change records and applied, then re-validated. For the end-to-end mechanics (emit + apply stages, config files, resume), see the [Target Sync guide](/run/targetsync). This page covers the Cassandra-specific source configuration. ## Prerequisites ### Connectivity * **Driver:** Wirekite connects with the `gocql` driver over the native CQL protocol (default port `9042`). * **Contact points:** One or more reachable ring nodes. Wirekite discovers the rest of the ring from the contact points unless host lookup is disabled (see [Connection String](#connection-string)). * **Consistency:** Reads default to `LOCAL_QUORUM`. Target Sync assumes a **static source during a run** — it does not take a point-in-time snapshot; a later run re-converges any changes made during the previous one. ### Permissions The connecting role needs: * **`SELECT` on the keyspace tables** being extracted — the data extractor runs full token-range scans over them. * **Read access to `system_schema`** — the schema extractor reads `system_schema.columns` (and `system_schema.types` for UDTs). Authenticated roles can read the system keyspaces by default. ```sql theme={null} -- Grant read access to the keyspace being extracted GRANT SELECT ON KEYSPACE app_keyspace TO wirekite; ``` Wirekite **only reads** from Cassandra — it creates no tables, triggers, or tracking state on the source. Extraction progress is tracked in local files on the Wirekite host, not in the cluster. ## Connection String The Cassandra DSN is a single line in the `dsnFile`: ``` cassandra://[user:pass@]host1[,host2,...][:port]/[keyspace][?option=value&...] ``` **Examples:** ``` # Single node, default port (9042) and consistency (LOCAL_QUORUM) cassandra://10.0.0.5/app_keyspace # Authenticated, multi-node contact points, explicit port cassandra://wirekite:secret@10.0.0.5,10.0.0.6,10.0.0.7:9042/app_keyspace # Reach a docker-bridged ring from another host cassandra://10.0.0.5/app_keyspace?disable_host_lookup=true&consistency=LOCAL_QUORUM ``` | Component | Default | Notes | | ----------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `user:pass@` | none | Optional. Enables password authentication. | | `host1,host2,…` | — | One or more contact points, comma-separated. | | `:port` | `9042` | A trailing numeric port applies to all hosts. | | `/keyspace` | none | Optional. The schema extractor reads `system_schema` regardless of the session keyspace. | | `?consistency=` | `LOCAL_QUORUM` | One of `ANY`, `ONE`, `TWO`, `THREE`, `QUORUM`, `ALL`, `LOCAL_QUORUM`, `EACH_QUORUM`, `LOCAL_ONE`. | | `?disable_host_lookup=` | `false` | When `true`, use only the listed contact point(s) and skip `system.peers` discovery. Needed when the ring's advertised IPs aren't routable from the Wirekite host (e.g. a docker-bridged cluster). | ## Schema Extractor The Schema Extractor reads table definitions from Cassandra's `system_schema` and outputs them to Wirekite's intermediate schema format (`wirekite_schema.skt`). For each `keyspace.table` it maps each CQL column type to a Wirekite type, derives the primary key from the partition and clustering columns, and synthesizes sensible bounds for the precisionless CQL types (see the [Datatype Matrix](/datatypes/cassandra)). ### Required Parameters Path to a file containing the Cassandra connection string (one line). See [Connection String](#connection-string). Path to a file listing the tables to extract, one per line in `keyspace.table` format. Blank lines and lines starting with `#` are ignored. **Example tablesFile contents:** ``` app_keyspace.users app_keyspace.orders analytics.events ``` Absolute path to the directory where Wirekite writes `wirekite_schema.skt`. The directory must exist and be writable. Absolute path to the schema extractor's log file. ### Optional Parameters Map a source keyspace to a different target schema name. Format: `srcKeyspace:tgtSchema[,src2:tgt2,...]`. For example, `app_keyspace:app_schema` writes `app_schema` as the schema name in the output. Must match the `schemaRename` used in the data extractor and the emit stage. A CQL primary key (`partition_key…`, `clustering…`) is emitted as the target's composite primary key in declaration order. Primary-key columns are non-nullable; all other columns are nullable. Static columns are emitted as ordinary (denormalized) columns. ## Data Extractor The Data Extractor performs the initial bulk extraction, writing a per-table snapshot to Wirekite's intermediate data format (`.dkt` files) that the target's data loader applies. For parallelism it uses the standard Cassandra full-scan pattern: it splits the Murmur3 token ring into ranges and scans each concurrently across worker threads, spread over the ring's nodes. Each `(table, range)` writes its own `schema.table.N.dkt` file. ### Required Parameters Path to a file containing the Cassandra connection string. Path to a file listing the tables to extract, one per line in `keyspace.table` format. Absolute path to the directory where Wirekite writes data files (`schema.table.N.dkt`). Absolute path to the data extractor's log file. ### Optional Parameters Number of parallel scan workers. Token ranges are distributed across these workers, so raising it increases concurrency across the ring. Bound it by the cluster's capacity. Number of token-ring slices per table (the parallelism granularity). Defaults to `maxThreads × 8`, which provides enough jobs to keep every worker busy and even out token skew. Set it explicitly to tune for very large or very skewed tables. When `true`, binary and string data is encoded as hexadecimal instead of base64. Larger files, but may be required for certain target loaders. Map a source keyspace to a different output schema name. Format: `srcKeyspace:tgtSchema[,...]`. Must match the value used elsewhere in the run. Extraction is crash-recoverable: each `.dkt` file is tracked locally, so a re-run skips ranges that already finished and resumes where it left off. A `DATA.DONE` marker is written only once every range completes. ## Validation & Emit The diff/emit stage is the [TableValidator](/run/validation) reading a Cassandra source. Because Cassandra cannot stream rows in primary-key order — it orders by `token()`, a hash — the validator uses a **bounded-footprint `(pk, hash)` projection**: it projects each source row to a primary key plus a row hash, sorts and merges those against the target's same projection, and re-fetches the full rows only for the mismatches. This keeps memory bounded (≈30M rows validate in \~3 GB) instead of materializing whole tables. Set these keys on the emit/validate config (see [Target Sync § Stage 1](/run/targetsync#stage-1--emit-tablevalidator)): When `true`, uses the bounded `(pk, hash)` diff. Recommended. Applies to **single-PK tables**; tables with a composite primary key automatically fall back to a whole-table sort-merge. In-memory ceiling for the `(pk, hash)` projection before it spills to a k-way merge on disk. Trade RAM for spill I/O. In-memory ceiling for the whole-row sort (the composite-PK fallback path). ## Resilience A full token-range scan over a large ring takes minutes, and a stressed ring can stall mid-scan. Wirekite wraps the Cassandra reads in a transient-retry and session-recreate layer: transient errors are retried with backoff, and a wedged connection pool (`no hosts available`) is healed by recreating the `gocql` session. Re-scanned boundary rows are deduplicated by primary key during the sort. This lets a long scan survive a degraded cluster — a 10M-row run has survived 16 pool-wedge recoveries and still passed validation. ## Limitations * **Source only, Target Sync only.** No CDC, and Cassandra is not supported as a target. * **Text is bounded to 4000 characters.** Cassandra `text` has no length; Wirekite synthesizes a bound so strict target loaders accept it. Lossy for genuinely large text — see the [Datatype Matrix](/datatypes/cassandra). * **Collections and UDTs become opaque JSON.** `list`, `set`, `map`, `tuple`, and user-defined types are serialized to a single JSON column — they are not exploded into child tables or normalized, and cannot be part of a primary key. * **Static columns are not modeled as static** — they are denormalized onto every row in the partition. * **`counter` columns** are read as their current point-in-time value. * **No point-in-time snapshot.** Target Sync assumes a static source during a run; changes made mid-run are reconciled by a later run. ## Related The one-shot reconciliation feature Cassandra sources flow through. Full CQL-to-target type mapping and the synthesized bounds.