Extractor Prerequisites

To configure PostgreSQL as an extraction source, make sure the following prerequisites are met.
  1. Make sure the database version is 10 or above, since the Wirekite Change Extractor uses the “pgoutput” logical replication plugin.
  2. Create a user that will execute the extraction operations. The discussion below assumes that user is called wirekite.
  3. Make sure the tables being extracted are readable by wirekite.
  4. As a PostgreSQL superuser - typically postgres - run grant all on schema public to wirekite; (see the SQL sketch after this list).
  5. Also make sure that the tables being extracted are configured to use replica identity full.
  6. Wirekite will also create some internal tables: wirekite_progress and wirekite_action. Make sure the wirekite user has read and write access to those tables.
  7. Make sure that the postgres user has write permission to the directories that wirekite will write to, and that there is enough space to store the extracted data.
  8. When doing a Data Extract, make sure that replication is stopped and the database instance being dumped from is quiescent. This ensures that the entire dump is consistent as of a specific position.
  9. After doing a Data Extract, but before re-enabling replication, you should create a REPLICATION SLOT for the Change Extractor.
  10. It is highly preferable to run wirekite on a **replica** of a production-facing database instance, since wirekite can generate load on the database server, particularly as the instance needs to be quiesced during Data extraction as mentioned above.
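
The SQL below sketches steps 1, 4, 5, and 9 in one place. It assumes the extraction user is called wirekite as in step 2; the password, the table name public.orders, and the slot name wirekite_slot are illustrative placeholders, not names wirekite requires.

-- Step 1: confirm the server is version 10 or above (required for pgoutput)
SELECT version();

-- Steps 2-4: create the extraction user and grant it access
CREATE USER wirekite WITH PASSWORD 'change-me';
GRANT ALL ON SCHEMA public TO wirekite;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO wirekite;

-- Step 5: REPLICA IDENTITY is set per table
ALTER TABLE public.orders REPLICA IDENTITY FULL;

-- Step 9: after the Data Extract, and before re-enabling replication,
-- create a logical replication slot that uses the pgoutput plugin
SELECT pg_create_logical_replication_slot('wirekite_slot', 'pgoutput');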

Schema Extractor

The Schema Extractor is a wirekite utility that extracts the schema-related information for some or all of the tables on a database instance. It writes a database-independent representation of the schema for those tables to a file.

The Schema Extractor can be run independently or as part of a workflow by the Orchestrator (see the Orchestrator section). However you run it, the Schema Extractor needs certain configuration variables, which take the same values whether you run it directly or through the Orchestrator.
Here are the configuration variables required to run the Schema Extractor.
dsnFile
string
required
dsnFile is a file containing the connection information for the instance that will be accessed by the Schema Extractor.
The connection information has the following format
postgres://username:password@host:port/database
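For example, a dsnFile for a user wirekite connecting to a database called sales on host dbhost.example.com would contain a single line like the following (all values are illustrative):
postgres://wirekite:secret@dbhost.example.com:5432/sales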
outputDirectory
string
required
outputDirectory is the file system directory to which the Schema Extractor will write its output files. The path should be absolute and writable by the operating-system user running the wirekite binaries. Wirekite will write a file called wirekite_schema.skt in the outputDirectory.
tablesFile
string
required
tablesFile is a file with a list of all the tables whose schemas are to be extracted.
The format of the tablesFile is as follows.
database1.table1
database2.table2
database3.table3
logFile
string
required
logFile is the full path of the log file where the Schema Extractor will write the logging information.

Data Extractor

The Data Extractor is a wirekite utility that extracts the data for given tables. It can be run independently or as part of a workflow by the Orchestrator (see the Orchestrator section). The Data Extractor needs certain configuration variables, which take the same values whether you run it directly or through the Orchestrator.
dsnFile
string
required
dsnFile is a file containing the connection information for the PostgreSQL database.
The connection information has the following format
postgres://username:password@host:port/database
outputDirectory
string
required
outputDirectory is the directory to which wirekite will write the output files. The path should be absolute and writable by the operating-system user running the wirekite binaries. Wirekite will write files called table_name.number.dkt in the outputDirectory.
tablesFile
string
required
tablesFile is a file with a list of all the tables that are being extracted.
The format of the tablesFile is as follows.
database1.table1
database2.table2
database3.table3
logFile
string
required
logFile refers to the full path of the logfile where the extractor will write the logging information.
maxThreads
integer
maxThreads is the number of parallel threads that will run to extract data. One thread usually uses one CPU, and there may be contention if the number of threads is much higher than the number of CPUs. If you’re using a dedicated 32-CPU host for the extraction, we recommend setting this to 32.
maxPages
integer
maxPages is the maximum number of PostgreSQL storage pages visited by a single extraction thread at a time. Because pages contain varying numbers of rows, this corresponds only loosely to the number of rows extracted at a time.
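To see how pages relate to rows for your own tables, you can consult the planner statistics in PostgreSQL's pg_class catalog; relpages and reltuples are standard columns, and public.orders is an illustrative table name:
-- Approximate page and row counts for one table (statistics, not exact counts)
SELECT relpages, reltuples FROM pg_class WHERE oid = 'public.orders'::regclass;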
hexEncoding
boolean
default:"false"
hexEncoding is used to specify whether string data will be extracted as hex. The default is base64. hexEncoding should only be used if you’re loading to a target that doesn’t handle base64, such as Snowflake.
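As an illustration of the two encodings (not of wirekite’s exact output format), PostgreSQL’s built-in encode() function renders the same bytes in both forms:
-- The same value in the default base64 encoding and in hex
SELECT encode('hello'::bytea, 'base64') AS base64_form,
       encode('hello'::bytea, 'hex') AS hex_form;
-- base64_form: aGVsbG8=   hex_form: 68656c6c6f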

Change Extractor

The Change Extractor is a wirekite utility that extracts ongoing changes to rows - inserts, updates, and deletes - for given tables. It can be run independently or as part of a workflow by the Orchestrator (see the Orchestrator section). Either way, the Change Extractor needs certain configuration variables, which take the same values whether you run it directly or through the Orchestrator.
dsnFile
string
required
dsnFile is the file containing the connection information for the PostgreSQL database.
The connection information has the following format
postgres://username:password@host:port/database
outputDirectory
string
required
outputDirectory is the directory to which wirekite will write the output files. The path should be absolute and writable by the operating-system user executing the wirekite binaries. Wirekite will write files called number.ckt in the outputDirectory, where number is an ever-increasing integer starting from 0.
tablesFile
string
required
tablesFile is a file with a list of all the tables whose changes are being extracted.
The format of the tablesFile is as follows.
database1.table1
database2.table2
database3.table3
logFile
string
required
logFile refers to the full path of the logfile where the extractor will write the logging information.
processCommits
boolean
default:"false"
Whether or not to output syntax marking BEGIN and COMMIT. We guarantee that a single output file will completely contain its transactions and that no transaction will “span” output files.
eventsPerFlush
number
default:"5000"
The number of change events after which the Change Extractor will flush its buffer to the file. This ensures that the output files are flushed regularly, for performance reasons. Note that the buffer is flushed at a transaction boundary after eventsPerFlush events have been processed, so an individual file may contain more events than this.
exitWhenIdle
boolean
default:"false"
Whether to exit when all pending changes have been processed, or to wait for new changes to show up and “run forever”.
replicationSlot
string
required
The replication slot from which to start extracting changes. Generally this slot marks the position at which the Data Extract was taken.
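To confirm that the slot named here exists and uses the pgoutput plugin, you can query the standard pg_replication_slots view; wirekite_slot is the illustrative slot name used in the prerequisites section:
-- Verify the slot exists, uses pgoutput, and shows the expected positions
SELECT slot_name, plugin, slot_type, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'wirekite_slot';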