Overview

Wirekite performance is controlled by three main levers: thread counts, batch sizes, and compression. This guide covers how each parameter affects performance and provides recommendations for common scenarios.

Thread Configuration

Threads control parallelism across all components. Each component runs its threads independently, so total resource usage is the sum of each active component's thread count.

Extraction Threads

source.data.maxThreads (integer, default: 5)
Number of parallel threads for extracting tables from the source database. Each thread extracts one table at a time.

source.change.maxThreads (integer, default: 5)
Number of parallel threads for the change extractor. Applies to CDC extraction from binlogs, WAL, or transaction logs.

Loading Threads

target.data.maxThreads (integer, default: 5)
Number of parallel threads for loading data files into the target database. Each thread loads one file at a time.

target.change.maxThreads (integer, default: 5)
Number of parallel threads for applying change files to the target database.

Mover Threads

mover.maxThreads (integer, default: 10)
Number of parallel threads for transferring files between source and target staging areas (S3, GCS, or Snowflake internal stage). Movers default to a higher thread count since file transfers are I/O-bound rather than CPU-bound.

Thread Recommendations

Scenario | Extraction | Loading | Mover
Small database (under 10GB) | 4 | 4 | 4
Medium database (10-100GB) | 8 | 6 | 8
Large database (over 100GB) | 8-12 | 8 | 10
Resource-constrained host | 2-4 | 2-4 | 4
Each extraction thread opens a database connection. Ensure your source database can handle the total connection count (the sum of maxThreads across all active components), and monitor your database connection limits before increasing threads.
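As a rough illustration, a medium-sized migration running both bulk data and change extraction might be configured as below. This is a sketch using the parameters documented above; the change-extractor value is illustrative, so adjust all values to your own connection limits.

# Example: medium database (10-100GB) with data + change extraction active
# Source connections used ≈ source.data.maxThreads + source.change.maxThreads = 12
source.data.maxThreads=8
source.change.maxThreads=4
target.data.maxThreads=6
mover.maxThreads=8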

Batch Sizes

Batch sizes control how much data is processed per unit of work. Larger batches improve throughput but use more memory.

Data Extraction

source.data.maxRowsPerDump (integer, default: 200000)
Maximum rows per extracted data file. Tables larger than this threshold are split into multiple files. Applies to the MySQL, Oracle, and SQL Server extractors.

source.data.maxPages (integer, default: 8000)
PostgreSQL-specific alternative to maxRowsPerDump, measured in 8KB database pages rather than rows. The default of 8,000 pages produces files of approximately 64MB.
Sizing guidance:
Row width | Recommended maxRowsPerDump | Approximate file size
Narrow (under 200 bytes) | 200,000-500,000 | 40-100MB
Medium (200-1000 bytes) | 100,000-200,000 | 50-200MB
Wide (over 1000 bytes) | 50,000-100,000 | 50-100MB
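For example, a table with wide rows (over 1000 bytes each) would sit at the lower end of the range. A minimal sketch based on the sizing table above:

# Wide rows: fewer rows per file keeps files in the 50-100MB range
source.data.maxRowsPerDump=100000

# PostgreSQL sources size by pages instead: 8,000 pages x 8KB ≈ 64MB per file
source.data.maxPages=8000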

Change Loading

target.change.maxFilesPerBatch (integer, default: 30)
Number of change files to process in a single merge operation. Higher values increase throughput but create larger transactions.

CDC Event Buffering

source.change.eventsPerFlush (integer, default: 5000)
Number of change events buffered in memory before flushing to disk. Lower values reduce memory usage but increase I/O operations.

SQL Server CDC

source.change.lsnBatchSize (integer, default: 1000000)
Number of LSN (Log Sequence Number) records to process per batch. SQL Server specific.
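If the Wirekite host is memory-constrained during CDC, these buffering and batching parameters can be reduced together. The values below are illustrative, not tested recommendations:

# Flush CDC events to disk more often to cap memory usage
source.change.eventsPerFlush=2000

# Smaller merge batches on the target
target.change.maxFilesPerBatch=15

# SQL Server only: process fewer LSN records per batch
source.change.lsnBatchSize=250000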

Compression

mover.gzipFiles (boolean, default: false)
Compress data files with gzip before transferring to cloud staging areas. Applies to the AWS S3, GCS, and Snowflake internal stage movers.
When to enable compression:
Condition | Recommendation
Network bandwidth is limited | Enable
Cloud egress costs are a concern | Enable
CPU is the bottleneck | Disable
Files are already small (under 10MB) | Disable
Large dataset with high compressibility (text-heavy) | Enable
Tabular data typically compresses 5-10x with gzip. A 200MB data file may compress to 20-40MB, significantly reducing transfer time on constrained networks.
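Enabling compression is a single setting; the same parameter appears in the network-limited scenario later in this guide:

# Compress files before upload; trades CPU time for bandwidth
mover.gzipFiles=true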

Database-Specific Tuning

MySQL

  • The extractor uses SQL_NO_CACHE to bypass the query cache and avoid polluting it during bulk reads
  • Use loadLocal=true if the MySQL server lacks the FILE privilege (uses LOAD DATA LOCAL INFILE instead)
  • Binary data encoding (hexEncoding or base64Encoding) adds overhead; only enable if required by the target
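A sketch of the MySQL-related settings mentioned above. The exact property prefixes for loadLocal and the binary-encoding options are not shown in this guide, so the bare key names below are assumptions; confirm them against the connector reference:

# Use LOAD DATA LOCAL INFILE when the server lacks the FILE privilege
# (full property prefix assumed; check your connector configuration)
loadLocal=true

# Leave binary-column encoding disabled unless the target requires it
# hexEncoding=false
# base64Encoding=false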

PostgreSQL

  • Uses the COPY protocol for high-speed bulk data movement
  • Tune maxPages instead of maxRowsPerDump for page-aligned extraction
  • sortFiles=true sorts output by primary key after extraction, which can improve target load performance for ordered storage engines
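A sketch of a PostgreSQL extraction tuned with the options above. The sortFiles key is written without its full prefix, as in the bullet, so treat that as an assumption:

# Page-aligned extraction: 8,000 pages x 8KB ≈ 64MB files
source.data.maxPages=8000

# Sort extracted files by primary key to help ordered target engines
sortFiles=true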

Oracle

  • CDC uses System Change Numbers (SCN) for precise position tracking
  • sortFiles=true enables primary key ordering for extracted data
  • LOB (Large Object) columns are handled inline and can significantly increase row width
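If LOB columns inflate row width, the sizing table earlier suggests dropping the rows-per-file count. A minimal sketch (the sortFiles key is written as in the bullet above):

# LOB-heavy tables: fewer rows per file keeps file sizes manageable
source.data.maxRowsPerDump=50000

# Order extracted data by primary key
sortFiles=true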

SQL Server

  • The change extractor maintains a connection pool of 50 open connections and 25 idle connections
  • useExternalDumper=true uses the BCP utility for faster extraction when available
  • lsnBatchSize controls the granularity of CDC processing; reduce if memory is constrained
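A sketch combining the SQL Server options above. The useExternalDumper key is written without its full prefix, as in the bullet, so verify the exact name; the LSN value is illustrative:

# Use the BCP utility for extraction when it is installed
useExternalDumper=true

# Reduce LSN batch size if CDC processing strains memory
source.change.lsnBatchSize=500000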

Snowflake

  • Loading performance depends on the Snowflake warehouse size specified in the connection string
  • Uses atomic COPY operations with ON_ERROR=ABORT_STATEMENT for data integrity
  • Each loading thread performs a PUT followed by COPY; the warehouse compute scales with thread count

BigQuery

  • Requires a GCS bucket for staging (gcs_bucket parameter)
  • Uses a two-stage load process: CSV files are uploaded to GCS, then loaded via BigQuery load jobs
  • Loading thread count should be conservative to avoid exceeding BigQuery quotas

Spanner

  • Be conservative with maxThreads (start at 4-5) as higher values can trigger Spanner rate limiting
  • Uses mutation batching for CDC operations
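A conservative starting point per the note above:

# Spanner: start low to stay under rate limits, then raise gradually
target.data.maxThreads=4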

Pipeline Architecture

Understanding how the orchestrator coordinates components helps with tuning:
Source DB → [Extractor] → Local Files → [Mover] → Cloud Stage → [Loader] → Target DB
               ↑                           ↑                        ↑
          maxThreads=5              maxThreads=10              maxThreads=5
All three components run in parallel. The extractor produces files, the mover transfers them, and the loader consumes them. The pipeline naturally balances itself:
  • If extraction is fastest, the mover and loader work through a growing backlog
  • If loading is slowest, files accumulate in the staging area until the loader catches up
  • The removeFiles parameter controls whether files are cleaned up after loading
For combined data + change mode, the orchestrator completes the bulk data load first, captures the source position, then starts change replication from that position. Thread counts are configured independently for each phase.
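Because the phases are configured independently, a combined data + change run can size the bulk load for throughput and the replication phase for steady-state latency. A sketch using the parameters documented earlier (values illustrative):

# Bulk data phase: size for throughput
source.data.maxThreads=8
target.data.maxThreads=8

# Change replication phase: size for latency and steady state
source.change.maxThreads=4
target.change.maxThreads=4

# Movers serve both phases
mover.maxThreads=10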

Tuning by Scenario

Maximize Throughput (Large Migration)

When migrating a large database and resources are plentiful:
# Aggressive extraction
source.data.maxThreads=12
source.data.maxRowsPerDump=200000

# Fast loading
target.data.maxThreads=8

# Fast file transfer
mover.maxThreads=10
mover.gzipFiles=false

Minimize Resource Usage

When running on a constrained host or sharing resources:
# Conservative extraction
source.data.maxThreads=2
source.data.maxRowsPerDump=50000

# Conservative loading
target.data.maxThreads=2

# Conservative transfer
mover.maxThreads=4
mover.gzipFiles=true

Optimize for Network-Limited Environments

When bandwidth between source and target is limited:
# Normal extraction
source.data.maxThreads=8
source.data.maxRowsPerDump=100000

# Normal loading
target.data.maxThreads=6

# Maximize transfer efficiency
mover.maxThreads=10
mover.gzipFiles=true

CDC Replication Tuning

For low-latency continuous replication:
# Change extraction
source.change.maxThreads=4
source.change.eventsPerFlush=5000

# Change loading - smaller batches for lower latency
target.change.maxThreads=4
target.change.maxFilesPerBatch=10

# Fast change file transfer
mover.maxThreads=8

Monitoring

During a migration, monitor these indicators to identify bottlenecks:
Indicator | What to Watch | Action if Bottleneck
Extraction rate | Files produced per minute in the output directory | Increase source.data.maxThreads
File backlog | Files waiting in the staging area | Increase mover.maxThreads or target.data.maxThreads
Load rate | Files consumed per minute by the loader | Increase target.data.maxThreads or reduce maxRowsPerDump
Database connections | Active connections on source/target | Reduce maxThreads if approaching limits
Memory usage | System memory on the Wirekite host | Reduce maxRowsPerDump, maxThreads, or maxFilesPerBatch
Disk space | Local file staging directory | Enable removeFiles=true or increase mover threads
Use the Web Interface to monitor real-time extraction and loading progress with per-table breakdown, or query the wirekite_progress table directly.