> ## Documentation Index
> Fetch the complete documentation index at: https://docs.wirekite.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Cassandra

> This guide explains how to configure Apache Cassandra as a source database for Wirekite Target Sync.

## Overview

Wirekite supports Apache Cassandra as a source database for:

* **Schema Extraction** — Read a keyspace's table definitions into Wirekite's intermediate `.skt` format.
* **Data Extraction** — Bulk extract table data for an initial load into the target.
* **Target Sync** — Diff the Cassandra source against the target and emit the change records that reconcile them.

<Warning>
  Cassandra is supported **one-directionally, as a source, for [Target Sync](/run/targetsync) only.** There is **no CDC / streaming replication** path for Cassandra. Commit-log CDC is per-node, needs replica deduplication and back-pressure handling; Target Sync sidesteps all of it by walking the source and target in primary-key order and emitting only the differences.
</Warning>

Cassandra is mostly relational-shaped, which is what makes this work: CQL tables have a fixed, declared schema; the primary key (partition key + clustering columns) becomes a relational composite primary key; and scalar columns map one-to-one. The only impedance is complex column types (collections, UDTs, tuples), which are carried as JSON. See the [Cassandra Datatype Matrix](/datatypes/cassandra) for the full mapping.

## How Cassandra Fits Target Sync

A Cassandra source run follows the standard Target Sync pipeline:

<Steps>
  <Step title="Schema extract">
    The schema extractor reads `system_schema` and writes `wirekite_schema.skt`.
  </Step>

  <Step title="Schema migrate">
    The target's schema loader turns the `.skt` into target DDL, which is applied to create the target tables (with `createMergeTables=true`).
  </Step>

  <Step title="Data extract & load">
    The data extractor writes a primary-key-ordered `.dkt` snapshot; the target's data loader bulk-loads it.
  </Step>

  <Step title="Validate & sync">
    The [TableValidator](/run/validation) diffs source against target; any drift is emitted as change records and applied, then re-validated.
  </Step>
</Steps>

For the end-to-end mechanics (emit + apply stages, config files, resume), see the [Target Sync guide](/run/targetsync). This page covers the Cassandra-specific source configuration.

## Prerequisites

### Connectivity

* **Driver:** Wirekite connects with the `gocql` driver over the native CQL protocol (default port `9042`).
* **Contact points:** One or more reachable ring nodes. Wirekite discovers the rest of the ring from the contact points unless host lookup is disabled (see [Connection String](#connection-string)).
* **Consistency:** Reads default to `LOCAL_QUORUM`. Target Sync assumes a **static source during a run** — it does not take a point-in-time snapshot; a later run re-converges any changes made during the previous one.

### Permissions

The connecting role needs:

* **`SELECT` on the keyspace tables** being extracted — the data extractor runs full token-range scans over them.
* **Read access to `system_schema`** — the schema extractor reads `system_schema.columns` (and `system_schema.types` for UDTs). Authenticated roles can read the system keyspaces by default.

```sql theme={null}
-- Grant read access to the keyspace being extracted
GRANT SELECT ON KEYSPACE app_keyspace TO wirekite;
```

<Note>
  Wirekite **only reads** from Cassandra — it creates no tables, triggers, or tracking state on the source. Extraction progress is tracked in local files on the Wirekite host, not in the cluster.
</Note>

## Connection String

The Cassandra DSN is a single line in the `dsnFile`:

```
cassandra://[user:pass@]host1[,host2,...][:port]/[keyspace][?option=value&...]
```

**Examples:**

```
# Single node, default port (9042) and consistency (LOCAL_QUORUM)
cassandra://10.0.0.5/app_keyspace

# Authenticated, multi-node contact points, explicit port
cassandra://wirekite:secret@10.0.0.5,10.0.0.6,10.0.0.7:9042/app_keyspace

# Reach a docker-bridged ring from another host
cassandra://10.0.0.5/app_keyspace?disable_host_lookup=true&consistency=LOCAL_QUORUM
```

| Component               | Default        | Notes                                                                                                                                                                                              |
| ----------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `user:pass@`            | none           | Optional. Enables password authentication.                                                                                                                                                         |
| `host1,host2,…`         | —              | One or more contact points, comma-separated.                                                                                                                                                       |
| `:port`                 | `9042`         | A trailing numeric port applies to all hosts.                                                                                                                                                      |
| `/keyspace`             | none           | Optional. The schema extractor reads `system_schema` regardless of the session keyspace.                                                                                                           |
| `?consistency=`         | `LOCAL_QUORUM` | One of `ANY`, `ONE`, `TWO`, `THREE`, `QUORUM`, `ALL`, `LOCAL_QUORUM`, `EACH_QUORUM`, `LOCAL_ONE`.                                                                                                  |
| `?disable_host_lookup=` | `false`        | When `true`, use only the listed contact point(s) and skip `system.peers` discovery. Needed when the ring's advertised IPs aren't routable from the Wirekite host (e.g. a docker-bridged cluster). |

## Schema Extractor

The Schema Extractor reads table definitions from Cassandra's `system_schema` and outputs them to Wirekite's intermediate schema format (`wirekite_schema.skt`). For each `keyspace.table` it maps each CQL column type to a Wirekite type, derives the primary key from the partition and clustering columns, and synthesizes sensible bounds for the precisionless CQL types (see the [Datatype Matrix](/datatypes/cassandra)).

### Required Parameters

<ResponseField name="dsnFile" type="string" required>
  Path to a file containing the Cassandra connection string (one line). See [Connection String](#connection-string).
</ResponseField>

<ResponseField name="tablesFile" type="string" required>
  Path to a file listing the tables to extract, one per line in `keyspace.table` format. Blank lines and lines starting with `#` are ignored.
</ResponseField>

**Example tablesFile contents:**

```
app_keyspace.users
app_keyspace.orders
analytics.events
```

<ResponseField name="outputDirectory" type="string" required>
  Absolute path to the directory where Wirekite writes `wirekite_schema.skt`. The directory must exist and be writable.
</ResponseField>

<ResponseField name="logFile" type="string" required>
  Absolute path to the schema extractor's log file.
</ResponseField>

### Optional Parameters

<ResponseField name="schemaRename" type="string" default="none">
  Map a source keyspace to a different target schema name. Format: `srcKeyspace:tgtSchema[,src2:tgt2,...]`. For example, `app_keyspace:app_schema` writes `app_schema` as the schema name in the output. Must match the `schemaRename` used in the data extractor and the emit stage.
</ResponseField>

<Note>
  A CQL primary key (`partition_key…`, `clustering…`) is emitted as the target's composite primary key in declaration order. Primary-key columns are non-nullable; all other columns are nullable. Static columns are emitted as ordinary (denormalized) columns.
</Note>

## Data Extractor

The Data Extractor performs the initial bulk extraction, writing a per-table snapshot to Wirekite's intermediate data format (`.dkt` files) that the target's data loader applies. For parallelism it uses the standard Cassandra full-scan pattern: it splits the Murmur3 token ring into ranges and scans each concurrently across worker threads, spread over the ring's nodes. Each `(table, range)` writes its own `schema.table.N.dkt` file.

### Required Parameters

<ResponseField name="dsnFile" type="string" required>
  Path to a file containing the Cassandra connection string.
</ResponseField>

<ResponseField name="tablesFile" type="string" required>
  Path to a file listing the tables to extract, one per line in `keyspace.table` format.
</ResponseField>

<ResponseField name="outputDirectory" type="string" required>
  Absolute path to the directory where Wirekite writes data files (`schema.table.N.dkt`).
</ResponseField>

<ResponseField name="logFile" type="string" required>
  Absolute path to the data extractor's log file.
</ResponseField>

### Optional Parameters

<ResponseField name="maxThreads" type="integer" default="5">
  Number of parallel scan workers. Token ranges are distributed across these workers, so raising it increases concurrency across the ring. Bound it by the cluster's capacity.
</ResponseField>

<ResponseField name="tokenRanges" type="integer">
  Number of token-ring slices per table (the parallelism granularity). Defaults to `maxThreads × 8`, which provides enough jobs to keep every worker busy and even out token skew. Set it explicitly to tune for very large or very skewed tables.
</ResponseField>

<ResponseField name="hexEncoding" type="boolean" default="false">
  When `true`, binary and string data is encoded as hexadecimal instead of base64. Larger files, but may be required for certain target loaders.
</ResponseField>

<ResponseField name="schemaRename" type="string" default="none">
  Map a source keyspace to a different output schema name. Format: `srcKeyspace:tgtSchema[,...]`. Must match the value used elsewhere in the run.
</ResponseField>

<Note>
  Extraction is crash-recoverable: each `.dkt` file is tracked locally, so a re-run skips ranges that already finished and resumes where it left off. A `DATA.DONE` marker is written only once every range completes.
</Note>

## Validation & Emit

The diff/emit stage is the [TableValidator](/run/validation) reading a Cassandra source. Because Cassandra cannot stream rows in primary-key order — it orders by `token()`, a hash — the validator uses a **bounded-footprint `(pk, hash)` projection**: it projects each source row to a primary key plus a row hash, sorts and merges those against the target's same projection, and re-fetches the full rows only for the mismatches. This keeps memory bounded (≈30M rows validate in \~3 GB) instead of materializing whole tables.

Set these keys on the emit/validate config (see [Target Sync § Stage 1](/run/targetsync#stage-1--emit-tablevalidator)):

<ResponseField name="cassandraPKHashValidation" type="boolean">
  When `true`, uses the bounded `(pk, hash)` diff. Recommended. Applies to **single-PK tables**; tables with a composite primary key automatically fall back to a whole-table sort-merge.
</ResponseField>

<ResponseField name="cassandraPKHashMemRows" type="integer">
  In-memory ceiling for the `(pk, hash)` projection before it spills to a k-way merge on disk. Trade RAM for spill I/O.
</ResponseField>

<ResponseField name="cassandraSortMemRows" type="integer">
  In-memory ceiling for the whole-row sort (the composite-PK fallback path).
</ResponseField>

## Resilience

A full token-range scan over a large ring takes minutes, and a stressed ring can stall mid-scan. Wirekite wraps the Cassandra reads in a transient-retry and session-recreate layer: transient errors are retried with backoff, and a wedged connection pool (`no hosts available`) is healed by recreating the `gocql` session. Re-scanned boundary rows are deduplicated by primary key during the sort. This lets a long scan survive a degraded cluster — a 10M-row run has survived 16 pool-wedge recoveries and still passed validation.

## Limitations

* **Source only, Target Sync only.** No CDC, and Cassandra is not supported as a target.
* **Text is bounded to 4000 characters.** Cassandra `text` has no length; Wirekite synthesizes a bound so strict target loaders accept it. Lossy for genuinely large text — see the [Datatype Matrix](/datatypes/cassandra).
* **Collections and UDTs become opaque JSON.** `list`, `set`, `map`, `tuple`, and user-defined types are serialized to a single JSON column — they are not exploded into child tables or normalized, and cannot be part of a primary key.
* **Static columns are not modeled as static** — they are denormalized onto every row in the partition.
* **`counter` columns** are read as their current point-in-time value.
* **No point-in-time snapshot.** Target Sync assumes a static source during a run; changes made mid-run are reconciled by a later run.

## Related

<CardGroup cols={2}>
  <Card title="Target Sync" icon="arrows-left-right-to-line" href="/run/targetsync">
    The one-shot reconciliation feature Cassandra sources flow through.
  </Card>

  <Card title="Cassandra Datatype Matrix" icon="table-cells" href="/datatypes/cassandra">
    Full CQL-to-target type mapping and the synthesized bounds.
  </Card>
</CardGroup>
