
The Pipeline

Wirekite moves data through a pipeline of independent components. The extractor reads from the source database and writes intermediate files to disk. The mover optionally transports those files to a location accessible by the target. The loader reads the intermediate files and applies them to the target database.
Source DB → Extractor → Files → Mover → Files → Loader → Target DB
The intermediate files are the decoupling point. Extractors know nothing about the target database and loaders know nothing about the source database. This separation means any supported source can be paired with any supported target without custom integration logic.

In data mode, the orchestrator starts the extractor, mover, and loader in parallel. The extractor writes files continuously, the mover picks them up and transfers them, and the loader consumes them as they arrive. This pipeline architecture minimizes the storage footprint, since files are loaded and cleared while extraction is still in progress.
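For example, pairing a MySQL source with a Snowflake target in data mode is purely a matter of configuration. Here is a minimal sketch in the configuration style used later on this page; source.type and target.type are hypothetical key names, not Wirekite's documented schema:

# Illustrative pairing: no extractor/loader integration code is needed
source.type=mysql                        # hypothetical key
source.data.dsnFile=/etc/wirekite/mysql.dsn
target.type=snowflake                    # hypothetical key
target.data.maxThreads=4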

Standalone Binaries

Each extractor and loader is compiled as its own standalone binary. The orchestrator invokes these binaries as child processes, passing each one a configuration file. The orchestrator constructs these child configuration files by extracting the relevant section from the main configuration. For example, source.data.dsnFile in the orchestrator config becomes simply dsnFile in the data extractor’s config. Because the binaries are standalone, they can also be run independently outside the orchestrator for testing or debugging purposes. Each binary takes a single argument: the path to its configuration file.
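To make the derivation concrete, here is a sketch of the mapping. The dsnFile case is the documented example above; extending it to maxThreads is an assumption by analogy:

# Orchestrator config (excerpt)
source.data.dsnFile=/etc/wirekite/source.dsn
source.data.maxThreads=8

# Child config the orchestrator writes for the data extractor
# (the maxThreads mapping is assumed by analogy)
dsnFile=/etc/wirekite/source.dsn
maxThreads=8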

Binary Selection

The orchestrator uses an internal binary map to determine which binary to invoke for a given source, target, and mode combination. For example, a MySQL source in data mode invokes the MySQL data extractor binary, while a Snowflake target in data mode invokes the Snowflake data loader binary. The same orchestrator binary handles all source and target combinations.
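A rough sketch of the lookup, shown as comments; the binary names are hypothetical, not Wirekite's actual file names:

# (role, database, mode)      binary (illustrative names)
# source  mysql      data     mysql-data-extractor
# source  mysql      change   mysql-change-extractor
# target  snowflake  data     snowflake-data-loader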

Local and Remote Database Access

When Wirekite runs on the same host as the database (or shares a filesystem), it can use faster server-side file operations instead of streaming data through the client connection. The databaseRemote parameter controls this behavior.
databaseRemote (boolean, default: true)

When true (the default), Wirekite streams data through the client connection. This works with any database regardless of where it is hosted, including cloud-managed databases like RDS, Cloud SQL, and Azure Database.

When false, Wirekite uses server-side file operations: the database server reads or writes files directly on its local filesystem. This is faster, but requires that Wirekite and the database share a filesystem.

How It Works

The mechanism differs between extractors and loaders:

Extractors (source side):
  • Remote (databaseRemote=true): The extractor runs a query and streams the result set through the client connection, writing rows to files locally
  • Local (databaseRemote=false): The extractor instructs the database server to write query results directly to a file on the server’s filesystem
Loaders (target side):
  • Remote (databaseRemote=true): The loader reads files locally and streams the data to the database server through the client connection
  • Local (databaseRemote=false): The loader tells the database server to read files directly from its own filesystem

Database-Specific Mechanisms

The underlying SQL mechanism used depends on the database:
Database               Remote (default)                     Local
MySQL extractor        SELECT with client-side streaming    SELECT ... INTO OUTFILE
PostgreSQL extractor   COPY ... TO STDOUT                   COPY ... TO '<filepath>'
MySQL loader           LOAD DATA LOCAL INFILE               LOAD DATA INFILE
PostgreSQL loader      COPY ... FROM STDIN                  COPY ... FROM '<filepath>'
SingleStore loader     LOAD DATA LOCAL INFILE               LOAD DATA INFILE
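As a concrete illustration of the PostgreSQL extractor row, the two modes correspond to statements of roughly this shape (table name, path, and format options are placeholders; the statements Wirekite generates may differ):

-- Remote: the result set streams through the client connection,
-- and the extractor writes the rows to files on its own host
COPY orders TO STDOUT WITH (FORMAT csv);

-- Local: the PostgreSQL server writes the file on its own filesystem,
-- which must also be reachable by Wirekite
COPY orders TO '/shared/wirekite/orders.csv' WITH (FORMAT csv);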

Which Databases Support It

The databaseRemote parameter is only relevant for databases that have both a client-side and server-side file transfer mechanism:
Component              Supports databaseRemote
MySQL extractor        Yes
PostgreSQL extractor   Yes
Oracle extractor       No — always streams through the client
SQL Server extractor   No — always streams through the client
MySQL loader           Yes
PostgreSQL loader      Yes
SingleStore loader     Yes
Oracle loader          No — always streams through the client
SQL Server loader      No — uses native bulk loader
Cloud data warehouses (Snowflake, BigQuery, Databricks, Firebolt, Spanner) do not use databaseRemote. They have their own staging and upload mechanisms — for example, Snowflake uses an internal stage with PUT and COPY INTO, and BigQuery loads through Google Cloud Storage.
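For instance, Snowflake's stage-based load roughly takes this shape (table and file names are placeholders; the exact statements Wirekite issues may differ):

-- Upload the intermediate file to the table's internal stage
PUT file:///tmp/wirekite/orders.csv @%orders;

-- Load the staged file into the table
COPY INTO orders FROM @%orders FILE_FORMAT = (TYPE = CSV);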

When to Use Each

For most deployments, the default (databaseRemote=true) is the right choice. Only set databaseRemote=false if Wirekite is running on the same host as the database server or they share a mounted filesystem.
  • Cloud-managed databases (Amazon RDS, Google Cloud SQL, Azure Database): Use databaseRemote=true. Server-side file access is not available on managed instances.
  • Self-hosted databases on the same host as Wirekite: Set databaseRemote=false for better performance through server-side file operations (see the sketch after this list).
  • Self-hosted databases on a different host: Use databaseRemote=true. The database server cannot access files on the Wirekite host.
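A minimal sketch of the same-host case; the exact key path is an assumption here, so check the parameter reference for your component:

# Wirekite runs on the database host, so the server can read and
# write the intermediate files directly (key path is illustrative)
source.data.databaseRemote=false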

Thread Counts

Data extractors and data loaders process multiple tables concurrently using configurable thread counts.
maxThreads (integer, default: 5)

The number of concurrent threads used by a data extractor or data loader. Each thread processes one table at a time. All source extractors and all target loaders support this parameter.
The extractor and loader thread counts are configured independently. For example, the extractor can use 8 threads while the loader uses 4. The right values depend on the hardware, network bandwidth, and how much concurrent load the source and target databases can handle.
# Extractor threads
source.data.maxThreads=8

# Loader threads
target.data.maxThreads=4
Change extractors are single-threaded because they must read the transaction log sequentially to preserve commit ordering. Change loaders use a fixed internal thread pool for parallel merge operations across tables, but this is not user-configurable.

Handover: Transitioning to Change Replication

When data and change modes are combined in a single configuration, the orchestrator automatically handles the transition from bulk data loading to continuous change replication. After the data phase completes, a handover process captures the exact position in the source database’s transaction log. The change extractor then begins from that position, ensuring no data is missed or duplicated. Each source database type has its own position mechanism — MySQL uses binlog coordinates, PostgreSQL uses WAL LSN, Oracle uses SCN, and SQL Server uses LSN. For full details on change replication, handover mechanics, and crash recovery, see CDC Replication.