# Data Loading

> How Wirekite's extractors and loaders work together to move data from source to target databases.

## The Pipeline

Wirekite moves data through a pipeline of independent components. The extractor reads from the source database and writes intermediate files to disk. The mover optionally transports those files to a location accessible by the target. The loader reads the intermediate files and applies them to the target database.

```
Source DB → Extractor → Files → Mover → Files → Loader → Target DB
```

The intermediate files are the decoupling point. Extractors know nothing about the target database, and loaders know nothing about the source database. Because of this separation, any supported source can be paired with any supported target without custom integration logic.

In data mode, the orchestrator starts the extractor, mover, and loader in parallel. The extractor writes files continuously, the mover picks them up and transfers them, and the loader consumes them as they arrive. Running the stages concurrently keeps the intermediate storage footprint small, because files are loaded and cleared while extraction is still in progress.
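The overlap between the three stages can be sketched as follows. This is only an illustration of the concurrency pattern, not Wirekite's implementation: the real components are separate processes that exchange files on disk, whereas this sketch assumes in-process threads and queues, and every function and file name in it is hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the file stream


def extractor(out_q, file_names):
    # Stand-in for reading the source and writing intermediate files.
    for name in file_names:
        out_q.put(name)
    out_q.put(SENTINEL)


def mover(in_q, out_q):
    # Stand-in for transferring each file to the target side.
    while (name := in_q.get()) is not SENTINEL:
        out_q.put(name)
    out_q.put(SENTINEL)


def loader(in_q, loaded):
    # Stand-in for applying each file to the target, then clearing it.
    while (name := in_q.get()) is not SENTINEL:
        loaded.append(name)


def run_pipeline(file_names):
    extracted, moved, loaded = queue.Queue(), queue.Queue(), []
    stages = [
        threading.Thread(target=extractor, args=(extracted, file_names)),
        threading.Thread(target=mover, args=(extracted, moved)),
        threading.Thread(target=loader, args=(moved, loaded)),
    ]
    # All three stages start together, so loading begins before extraction ends.
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return loaded
```

Because each stage consumes work as soon as the previous stage produces it, no stage has to wait for a complete extract before it starts.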

## Standalone Binaries

Each extractor and loader is compiled as its own standalone binary. The orchestrator invokes these binaries as child processes, passing each one a configuration file. The orchestrator constructs these child configuration files by extracting the relevant section from the main configuration. For example, `source.data.dsnFile` in the orchestrator config becomes simply `dsnFile` in the data extractor's config.

Because the binaries are standalone, they can also be run independently outside the orchestrator for testing or debugging purposes. Each binary takes a single argument: the path to its configuration file.
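The section extraction described above can be pictured as stripping a prefix from dotted keys. The sketch below assumes the main configuration is a flat set of dotted keys; `child_config` and the sample path are hypothetical, and the real orchestrator may build child configurations differently.

```python
def child_config(main_config, section):
    """Return the keys under `section` with the section prefix removed."""
    prefix = section + "."
    return {
        key[len(prefix):]: value
        for key, value in main_config.items()
        if key.startswith(prefix)
    }


main = {
    "source.data.dsnFile": "/path/to/source.dsn",  # hypothetical path
    "source.data.maxThreads": "8",
    "target.data.maxThreads": "4",
}

# Only the source.data section reaches the data extractor, without its prefix.
extractor_config = child_config(main, "source.data")
```

With this input, `extractor_config` holds `dsnFile` and `maxThreads`, and the `target.data` key is left out.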

### Binary Selection

The orchestrator uses an internal binary map to determine which binary to invoke for a given source, target, and mode combination. For example, a MySQL source in data mode invokes the MySQL data extractor binary, while a Snowflake target in data mode invokes the Snowflake data loader binary. The same orchestrator binary handles all source and target combinations.
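Conceptually, such a map is a lookup keyed by database, mode, and role. The sketch below is a guess at the shape only: the binary names are hypothetical placeholders, and the real map is internal to the orchestrator.

```python
# Hypothetical binary names; the actual names are not documented here.
BINARY_MAP = {
    ("mysql", "data", "extractor"): "mysql_data_extractor",
    ("snowflake", "data", "loader"): "snowflake_data_loader",
}


def select_binary(database, mode, role):
    """Look up the binary for a database, mode, and role combination."""
    try:
        return BINARY_MAP[(database, mode, role)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {database} {role} in {mode} mode"
        ) from None
```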

## Local and Remote Database Access

When Wirekite runs on the same host as the database (or shares a filesystem), it can use faster server-side file operations instead of streaming data through the client connection. The `databaseRemote` parameter controls this behavior.

<ResponseField name="databaseRemote" type="boolean" default="true">
  When `true` (the default), Wirekite streams data through the client connection. This works with any database regardless of where it is hosted, including cloud-managed databases like RDS, Cloud SQL, and Azure Database.

  When `false`, Wirekite uses server-side file operations where the database server reads or writes files directly on its local filesystem. This is faster but requires that Wirekite and the database share a filesystem.
</ResponseField>

### How It Works

The mechanism differs between extractors and loaders:

**Extractors** (source side):

* **Remote** (`databaseRemote=true`): The extractor runs a query and streams the result set through the client connection, writing rows to files locally.
* **Local** (`databaseRemote=false`): The extractor instructs the database server to write query results directly to a file on the server's filesystem.

**Loaders** (target side):

* **Remote** (`databaseRemote=true`): The loader reads files locally and streams the data to the database server through the client connection.
* **Local** (`databaseRemote=false`): The loader instructs the database server to read files directly from its own filesystem.

### Database-Specific Mechanisms

The underlying SQL mechanism used depends on the database:

| Database                 | Remote (default)                    | Local                        |
| ------------------------ | ----------------------------------- | ---------------------------- |
| **MySQL Extractor**      | `SELECT` with client-side streaming | `SELECT ... INTO OUTFILE`    |
| **PostgreSQL Extractor** | `COPY ... TO STDOUT`                | `COPY ... TO '<filepath>'`   |
| **MySQL Loader**         | `LOAD DATA LOCAL INFILE`            | `LOAD DATA INFILE`           |
| **PostgreSQL Loader**    | `COPY ... FROM STDIN`               | `COPY ... FROM '<filepath>'` |
| **SingleStore Loader**   | `LOAD DATA LOCAL INFILE`            | `LOAD DATA INFILE`           |
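As a rough illustration of the loader rows in this table, the sketch below builds the statement form for each setting. `load_statement` is a hypothetical helper, not part of Wirekite; it performs no identifier or path escaping and omits format options such as delimiters, which a real loader would need.

```python
def load_statement(database, table, file_path, database_remote=True):
    """Build the bulk-load statement form for a loader (illustrative only)."""
    if database in ("mysql", "singlestore"):
        # LOCAL makes the client send the file; without it the server reads the file.
        local = "LOCAL " if database_remote else ""
        return f"LOAD DATA {local}INFILE '{file_path}' INTO TABLE {table}"
    if database == "postgresql":
        # STDIN streams through the connection; a path is read by the server.
        source = "STDIN" if database_remote else f"'{file_path}'"
        return f"COPY {table} FROM {source}"
    raise ValueError(f"{database} does not use databaseRemote")
```

For example, `load_statement("postgresql", "orders", "/data/orders.csv", database_remote=False)` yields the server-side `COPY orders FROM '/data/orders.csv'` form, while the default yields `COPY orders FROM STDIN`.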

### Which Databases Support It

The `databaseRemote` parameter is relevant only to databases that offer both a client-side and a server-side file transfer mechanism:

| Component            | Supports databaseRemote                 |
| -------------------- | --------------------------------------- |
| MySQL extractor      | Yes                                     |
| PostgreSQL extractor | Yes                                     |
| Oracle extractor     | No -- always streams through the client |
| SQL Server extractor | No -- always streams through the client |
| MySQL loader         | Yes                                     |
| PostgreSQL loader    | Yes                                     |
| SingleStore loader   | Yes                                     |
| Oracle loader        | No -- always streams through the client |
| SQL Server loader    | No -- uses native bulk loader           |

Cloud data warehouses and other cloud-native targets (Snowflake, BigQuery, Databricks, Firebolt, Spanner) do not use `databaseRemote`. They have their own staging and upload mechanisms -- for example, Snowflake uses an internal stage with `PUT` and `COPY INTO`, and BigQuery loads through Google Cloud Storage.

### When to Use Each

<Note>
  For most deployments, the default (`databaseRemote=true`) is the right choice. Only set `databaseRemote=false` if Wirekite is running on the same host as the database server or they share a mounted filesystem.
</Note>

* **Cloud-managed databases** (Amazon RDS, Google Cloud SQL, Azure Database): Use `databaseRemote=true`. Server-side file access is not available on managed instances.
* **Self-hosted databases on the same host as Wirekite**: Set `databaseRemote=false` for better performance through server-side file operations.
* **Self-hosted databases on a different host**: Use `databaseRemote=true`. The database server cannot access files on the Wirekite host.
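The guidance above, combined with the component support table, reduces to a small decision rule. The helper below is a hypothetical sketch of that rule for planning purposes; Wirekite does not detect the deployment type itself, so the value still has to be set in the configuration.

```python
# Components that honor databaseRemote, per the support table.
SUPPORTS_DATABASE_REMOTE = {
    "mysql extractor",
    "postgresql extractor",
    "mysql loader",
    "postgresql loader",
    "singlestore loader",
}


def recommended_database_remote(component, cloud_managed, shares_filesystem):
    """Return the suggested databaseRemote value, or None if it does not apply."""
    if component not in SUPPORTS_DATABASE_REMOTE:
        return None  # the component ignores the parameter
    if cloud_managed:
        return True  # managed instances offer no server-side file access
    # Self-hosted: use server-side file operations only with a shared filesystem.
    return not shares_filesystem
```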

## Thread Counts

Data extractors and data loaders process multiple tables concurrently using configurable thread counts.

<ResponseField name="maxThreads" type="integer" default="5">
  The number of concurrent threads used by a data extractor or data loader. Each thread processes one table at a time. All data extractors and data loaders support this parameter, regardless of the source or target database.
</ResponseField>

The extractor and loader thread counts are configured independently. For example, the extractor can use 8 threads while the loader uses 4. The right values depend on the hardware, network bandwidth, and how much concurrent load the source and target databases can handle.

```shellscript
# Extractor threads
source.data.maxThreads=8

# Loader threads
target.data.maxThreads=4
```
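The "one table per thread" model resembles a bounded worker pool. The sketch below shows that pattern with Python's standard library, assuming a hypothetical `process_table` callable that handles a single table; it illustrates the concurrency limit and is not the extractors' or loaders' actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def process_tables(tables, process_table, max_threads=5):
    """Process tables concurrently, at most `max_threads` at a time."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Each worker takes one table at a time; results come back in input order.
        return list(pool.map(process_table, tables))
```

When there are more tables than threads, the remaining tables wait until a worker is free, so raising `maxThreads` increases concurrent load on the database as well as throughput.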

Change extractors are single-threaded because they must read the transaction log sequentially to preserve commit ordering. Change loaders run merge operations across tables in parallel on a fixed internal thread pool, whose size is not user-configurable.

## Handover: Transitioning to Change Replication

When data and change modes are combined in a single configuration, the orchestrator automatically handles the transition from bulk data loading to continuous change replication. After the data phase completes, a handover process captures the exact position in the source database's transaction log. The change extractor then begins from that position, ensuring no data is missed or duplicated.

Each source database type has its own position mechanism -- MySQL uses binlog coordinates, PostgreSQL uses a WAL log sequence number (LSN), Oracle uses a system change number (SCN), and SQL Server uses an LSN.

For full details on change replication, handover mechanics, and crash recovery, see [CDC Replication](/run/replication).
