Overview

Wirekite performance is controlled by three main levers: thread counts, batch sizes, and compression. This guide covers how each parameter affects performance and provides recommendations for common scenarios.

Thread Configuration

Threads control parallelism across all components. Each component runs its threads independently, so total resource usage is the sum of each active component's thread count.

Extraction Threads

source.data.maxThreads (integer, default: 5)
Number of parallel threads for extracting tables from the source database. Each thread extracts one table at a time.

source.change.maxThreads (integer, default: 5)
Number of parallel threads for the change extractor. Applies to CDC extraction from binlogs, WAL, or transaction logs.

Loading Threads

target.data.maxThreads (integer, default: 5)
Number of parallel threads for loading data files into the target database. Each thread loads one file at a time.

target.change.maxThreads (integer, default: 5)
Number of parallel threads for applying change files to the target database.

Mover Threads

mover.maxThreads (integer, default: 10)
Number of parallel threads for transferring files between source and target staging areas (S3, GCS, or Snowflake internal stage). Movers default to a higher thread count since file transfers are I/O-bound rather than CPU-bound.

Thread Recommendations

Scenario | Extraction | Loading | Mover
Small database (under 10GB) | 4 | 4 | 4
Medium database (10-100GB) | 8 | 6 | 8
Large database (over 100GB) | 8-12 | 8 | 10
Resource-constrained host | 2-4 | 2-4 | 4
Each extraction thread opens a database connection. Ensure your source database can handle the total connection count (the sum of maxThreads across all active components), and monitor your database connection limits before increasing threads.
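As a rough illustration, a medium-sized migration running both bulk data and change extraction might be configured as below. This is a sketch using the parameters documented above; the change-extractor value is illustrative, so adjust all values to your own connection limits.

# Example: medium database (10-100GB) with data + change extraction active
# Source connections used ≈ source.data.maxThreads + source.change.maxThreads = 12
source.data.maxThreads=8
source.change.maxThreads=4
target.data.maxThreads=6
mover.maxThreads=8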

Batch Sizes

Batch sizes control how much data is processed per unit of work. Larger batches improve throughput but use more memory.

Data Extraction

source.data.maxRowsPerDump (integer, default: 200000)
Maximum rows per extracted data file. Tables larger than this threshold are split into multiple files. Applies to the MySQL, Oracle, and SQL Server extractors.

source.data.maxPages (integer, default: 8000)
PostgreSQL-specific alternative to maxRowsPerDump, measured in 8KB database pages rather than rows. The default of 8,000 pages produces files of approximately 64MB.
Sizing guidance:
Row width | Recommended maxRowsPerDump | Approximate file size
Narrow (under 200 bytes) | 200,000-500,000 | 40-100MB
Medium (200-1000 bytes) | 100,000-200,000 | 50-200MB
Wide (over 1000 bytes) | 50,000-100,000 | 50-100MB
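For example, a table with wide rows (over 1000 bytes each) would sit at the lower end of the range. A minimal sketch based on the sizing table above:

# Wide rows: fewer rows per file keeps files in the 50-100MB range
source.data.maxRowsPerDump=100000

# PostgreSQL sources size by pages instead: 8,000 pages x 8KB ≈ 64MB per file
source.data.maxPages=8000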

Change Loading

target.change.maxFilesPerBatch (integer, default: 30)
Number of change files to process in a single merge operation. Higher values increase throughput but create larger transactions.

CDC Event Buffering

source.change.eventsPerFlush (integer, default: 5000)
Number of change events buffered in memory before flushing to disk. Lower values reduce memory usage but increase I/O operations.

SQL Server CDC

source.change.lsnBatchSize (integer, default: 1000000)
Number of LSN (Log Sequence Number) records to process per batch. SQL Server specific.
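If the Wirekite host is memory-constrained during CDC, these buffering and batching parameters can be reduced together. The values below are illustrative, not tested recommendations:

# Flush CDC events to disk more often to cap memory usage
source.change.eventsPerFlush=2000

# Smaller merge batches on the target
target.change.maxFilesPerBatch=15

# SQL Server only: process fewer LSN records per batch
source.change.lsnBatchSize=250000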

Compression

mover.gzipFiles (boolean, default: false)
Compress data files with gzip before transferring to cloud staging areas. Applies to the AWS S3, GCS, and Snowflake internal stage movers.
When to enable compression:
Condition | Recommendation
Network bandwidth is limited | Enable
Cloud egress costs are a concern | Enable
CPU is the bottleneck | Disable
Files are already small (under 10MB) | Disable
Large dataset with high compressibility (text-heavy) | Enable
Tabular data typically compresses 5-10x with gzip. A 200MB data file may compress to 20-40MB, significantly reducing transfer time on constrained networks.
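Enabling compression is a single setting; the same parameter appears in the network-limited scenario later in this guide:

# Compress files before upload; trades CPU time for bandwidth
mover.gzipFiles=true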

Database-Specific Tuning

MySQL

  • The extractor uses SQL_NO_CACHE to bypass the query cache and avoid polluting it during bulk reads
  • Use loadLocal=true if the MySQL server lacks the FILE privilege (uses LOAD DATA LOCAL INFILE instead)
  • Binary data encoding (hexEncoding or base64Encoding) adds overhead; only enable if required by the target
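A sketch of the MySQL-related settings mentioned above. The exact property prefixes for loadLocal and the binary-encoding options are not shown in this guide, so the bare key names below are assumptions; confirm them against the connector reference:

# Use LOAD DATA LOCAL INFILE when the server lacks the FILE privilege
# (full property prefix assumed; check your connector configuration)
loadLocal=true

# Leave binary-column encoding disabled unless the target requires it
# hexEncoding=false
# base64Encoding=false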

PostgreSQL

  • Uses the COPY protocol for high-speed bulk data movement
  • Tune maxPages instead of maxRowsPerDump for page-aligned extraction
  • sortFiles=true sorts output by primary key after extraction, which can improve target load performance for ordered storage engines
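A sketch of a PostgreSQL extraction tuned with the options above. The sortFiles key is written without its full prefix, as in the bullet, so treat that as an assumption:

# Page-aligned extraction: 8,000 pages x 8KB ≈ 64MB files
source.data.maxPages=8000

# Sort extracted files by primary key to help ordered target engines
sortFiles=true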

Oracle

  • CDC uses System Change Numbers (SCN) for precise position tracking
  • sortFiles=true enables primary key ordering for extracted data
  • LOB (Large Object) columns are handled inline and can significantly increase row width
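If LOB columns inflate row width, the sizing table earlier suggests dropping the rows-per-file count. A minimal sketch (the sortFiles key is written as in the bullet above):

# LOB-heavy tables: fewer rows per file keeps file sizes manageable
source.data.maxRowsPerDump=50000

# Order extracted data by primary key
sortFiles=true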

SQL Server

  • The change extractor maintains a connection pool of 50 open connections and 25 idle connections
  • useExternalDumper=true uses the BCP utility for faster extraction when available
  • lsnBatchSize controls the granularity of CDC processing; reduce if memory is constrained
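A sketch combining the SQL Server options above. The useExternalDumper key is written without its full prefix, as in the bullet, so verify the exact name; the LSN value is illustrative:

# Use the BCP utility for extraction when it is installed
useExternalDumper=true

# Reduce LSN batch size if CDC processing strains memory
source.change.lsnBatchSize=500000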

Snowflake

  • Loading performance depends on the Snowflake warehouse size specified in the connection string
  • Uses atomic COPY operations with ON_ERROR=ABORT_STATEMENT for data integrity
  • Each loading thread performs a PUT followed by COPY; the warehouse compute scales with thread count

BigQuery

  • Requires a GCS bucket for staging (gcs_bucket parameter)
  • Uses a two-stage load process: CSV files are uploaded to GCS, then loaded via BigQuery load jobs
  • Loading thread count should be conservative to avoid exceeding BigQuery quotas

Spanner

  • Be conservative with maxThreads (start at 4-5) as higher values can trigger Spanner rate limiting
  • Uses mutation batching for CDC operations
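A conservative starting point per the note above:

# Spanner: start low to stay under rate limits, then raise gradually
target.data.maxThreads=4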

Pipeline Architecture

Understanding how the orchestrator coordinates components helps with tuning:
Source DB → [Extractor] → Local Files → [Mover] → Cloud Stage → [Loader] → Target DB
               ↑                           ↑                        ↑
          maxThreads=5              maxThreads=10              maxThreads=5
All three components run in parallel. The extractor produces files, the mover transfers them, and the loader consumes them. The pipeline naturally balances itself:
  • If extraction is fastest, the mover and loader work through a growing backlog
  • If loading is slowest, files accumulate in the staging area until the loader catches up
  • The removeFiles parameter controls whether files are cleaned up after loading
For combined data + change mode, the orchestrator completes the bulk data load first, captures the source position, then starts change replication from that position. Thread counts are configured independently for each phase.
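Because the phases are configured independently, a combined data + change run can size the bulk load for throughput and the replication phase for steady-state latency. A sketch using the parameters documented earlier (values illustrative):

# Bulk data phase: size for throughput
source.data.maxThreads=8
target.data.maxThreads=8

# Change replication phase: size for latency and steady state
source.change.maxThreads=4
target.change.maxThreads=4

# Movers serve both phases
mover.maxThreads=10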

Tuning by Scenario

Maximize Throughput (Large Migration)

When migrating a large database and resources are plentiful:
# Aggressive extraction
source.data.maxThreads=12
source.data.maxRowsPerDump=200000

# Fast loading
target.data.maxThreads=8

# Fast file transfer
mover.maxThreads=10
mover.gzipFiles=false

Minimize Resource Usage

When running on a constrained host or sharing resources:
# Conservative extraction
source.data.maxThreads=2
source.data.maxRowsPerDump=50000

# Conservative loading
target.data.maxThreads=2

# Conservative transfer
mover.maxThreads=4
mover.gzipFiles=true

Optimize for Network-Limited Environments

When bandwidth between source and target is limited:
# Normal extraction
source.data.maxThreads=8
source.data.maxRowsPerDump=100000

# Normal loading
target.data.maxThreads=6

# Maximize transfer efficiency
mover.maxThreads=10
mover.gzipFiles=true

CDC Replication Tuning

For low-latency continuous replication:
# Change extraction
source.change.maxThreads=4
source.change.eventsPerFlush=5000

# Change loading - smaller batches for lower latency
target.change.maxThreads=4
target.change.maxFilesPerBatch=10

# Fast change file transfer
mover.maxThreads=8

Monitoring

During a migration, monitor these indicators to identify bottlenecks:
Indicator | What to Watch | Action if Bottleneck
Extraction rate | Files produced per minute in the output directory | Increase source.data.maxThreads
File backlog | Files waiting in the staging area | Increase mover.maxThreads or target.data.maxThreads
Load rate | Files consumed per minute by the loader | Increase target.data.maxThreads or reduce maxRowsPerDump
Database connections | Active connections on source/target | Reduce maxThreads if approaching limits
Memory usage | System memory on the Wirekite host | Reduce maxRowsPerDump, maxThreads, or maxFilesPerBatch
Disk space | Local file staging directory | Enable removeFiles=true or increase mover threads
Use the Web Interface to monitor real-time extraction and loading progress with per-table breakdown, or query the wirekite_progress table directly.