
Overview

Wirekite supports Databricks (using Delta Lake) as a target data lakehouse for:
  • Schema Loading - Create target tables from Wirekite’s intermediate schema format
  • Data Loading - Bulk load extracted data via cloud storage staging
  • Change Loading (CDC) - Apply ongoing changes using MERGE operations
Databricks loaders stage data in AWS S3 or Google Cloud Storage buckets and then load it into Databricks using COPY INTO commands. Tables are created in Delta Lake format.

Prerequisites

Before configuring Databricks as a Wirekite target, ensure the following requirements are met:

Databricks Configuration

  1. Workspace: Have a Databricks workspace with a SQL warehouse or cluster
  2. Catalog & Schema: Create the target catalog and schema
  3. Cloud Storage: Configure an S3 bucket or GCS bucket for staging
  4. Authentication: Configure a service principal or personal access token

Cloud Storage Requirements

Either an AWS S3 bucket or GCS bucket is required for staging data. The bucket must be accessible from both the loader host and Databricks workspace.
For AWS, ensure the IAM credentials have read/write access to the S3 bucket. For GCS, use a service account with the Storage Object Admin role.

Schema Loader

The Schema Loader reads Wirekite’s intermediate schema format (.skt file) and generates Databricks-appropriate DDL statements for creating Delta Lake tables.
Tables are created in Delta Lake format (USING DELTA). The Schema Loader also generates merge tables (with a _wkm suffix) for CDC operations.
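As an illustration only (the catalog, schema, and table names below are hypothetical, and the exact column layout of the merge table is determined by Wirekite), the generated DDL follows this general shape:

-- Base table created from the intermediate schema (hypothetical names)
CREATE TABLE main.sales.customers (
  customer_id BIGINT,
  name        STRING,
  created_at  TIMESTAMP
) USING DELTA;

-- Companion merge table used for CDC (note the _wkm suffix)
CREATE TABLE main.sales.customers_wkm (
  customer_id BIGINT,
  name        STRING,
  created_at  TIMESTAMP
) USING DELTA;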

Required Parameters

schemaFile
string
required
Path to the Wirekite schema file (.skt) generated by the Schema Extractor. Must be an absolute path.
createTableFile
string
required
Output file for CREATE TABLE statements using Delta Lake format.
createConstraintFile
string
required
Output file for CHECK constraints (Databricks has limited constraint support).
createForeignKeyFile
string
required
Output file for FOREIGN KEY constraints (informational, commented out in output).
logFile
string
required
Absolute path to the log file for Schema Loader operations.

Optional Parameters

dropTableFile
string
default:"none"
Output file for DROP TABLE IF EXISTS statements. Set to “none” to skip generation.
createRecoveryTablesFile
string
default:"none"
Output file for recovery table creation SQL. Set to “none” to skip.
createMergeTables
boolean
default:"true"
When true, generates merge tables (_wkm suffix) for CDC operations. Set to false if only doing data loads.

Data Loader

The Data Loader reads data files from cloud storage and loads them into Databricks tables using COPY INTO operations.
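The sketch below is a rough illustration of this staged-load pattern, not the exact SQL the Data Loader issues; the bucket path, table name, and format options are assumptions, and credential handling and temporary staging tables are managed internally:

-- Hypothetical COPY INTO from an S3 staging prefix into a Delta table
COPY INTO main.sales.customers
FROM 's3://my-staging-bucket/wirekite/customers/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'false');
-- Encoded columns are decoded during load (base64 by default, unhex() when
-- hexEncoding=true); the exact projection is Wirekite-internal.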

Required Parameters

dsnFile
string
required
Path to a file containing the Databricks connection string.
Connection string format:
databricks://token:TOKEN@HOSTNAME:443/default?catalog=CATALOG&schema=SCHEMA
schemaFile
string
required
Path to the Wirekite schema file used by the Schema Loader. Required for table structure information.
logFile
string
required
Absolute path to the log file for Data Loader operations.

Cloud Storage Parameters (one required)

awsBucket
string
AWS S3 bucket name for staging data. Required if using AWS.
gcsBucket
string
GCS bucket name for staging data. Required if using Google Cloud.

Optional AWS Parameters

awsRegion
string
default:"us-east-1"
AWS region where the S3 bucket resides.
awsCredentials
string
AWS credentials in the format aws_access_key_id=KEY,aws_secret_access_key=SECRET. Not required if using IAM roles.

Optional GCS Parameters

gcsCredentials
string
Path to GCS service account credentials JSON file. Uses Application Default Credentials if not specified.

General Optional Parameters

maxThreads
integer
default:"5"
Maximum number of parallel threads for loading tables.
hexEncoding
boolean
default:"false"
Set to true if data was extracted using hex encoding instead of base64. Uses unhex() for decoding.
The Data Loader creates temporary staging tables that are automatically cleaned up after successful loads.

Change Loader

The Change Loader applies ongoing data changes (INSERT, UPDATE, DELETE) to Databricks tables using MERGE operations with shadow tables.
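A minimal sketch of this pattern is shown below; it assumes a hypothetical key column (customer_id) and operation-flag column (wk_op), and the actual MERGE that Wirekite generates depends on the source table's keys and its own change metadata:

-- Hypothetical merge of staged changes from the _wkm shadow table
MERGE INTO main.sales.customers AS t
USING main.sales.customers_wkm AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.wk_op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND s.wk_op <> 'D' THEN INSERT *;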

Required Parameters

dsnFile
string
required
Path to a file containing the Databricks connection string.
inputDirectory
string
required
Directory containing change files (.ckt) from the Change Extractor.
workDirectory
string
required
Working directory for temporary CSV files during merge operations. Must be writable.
schemaFile
string
required
Path to the Wirekite schema file for table structure information.
logFile
string
required
Absolute path to the log file for Change Loader operations.

Cloud Storage Parameters (one required)

awsBucket
string
AWS S3 bucket name for staging change data. Required if using AWS.
gcsBucket
string
GCS bucket name for staging change data. Required if using Google Cloud.

Optional AWS Parameters

awsRegion
string
default:"us-east-1"
AWS region where the S3 bucket resides.
awsCredentials
string
AWS credentials in the format aws_access_key_id=KEY,aws_secret_access_key=SECRET.

Optional GCS Parameters

gcsCredentials
string
Path to GCS service account credentials JSON file.

General Optional Parameters

maxFilesPerBatch
integer
default:"30"
Maximum number of change files to process in a single batch.
hexEncoding
boolean
default:"false"
Set to true if change data was extracted using hex encoding.
The Change Loader should not start until the Data Loader has successfully completed the initial full load.

Orchestrator Configuration

When using the Wirekite Orchestrator, prefix parameters with mover., target.schema., target.data., or target.change. as appropriate. Example orchestrator configuration for a Databricks target (using AWS S3):
# Main configuration
source=postgres
target=databricks

# Data mover (S3)
mover.awsBucket=my-staging-bucket
mover.awsRegion=us-east-1
mover.dataDirectory=/opt/wirekite/output/data
mover.logFile=/var/log/wirekite/data-mover.log
mover.maxThreads=10
mover.removeFiles=true
mover.awsCredentials=aws_access_key_id=KEY,aws_secret_access_key=SECRET

# Schema loading
target.schema.schemaFile=/opt/wirekite/output/schema/wirekite_schema.skt
target.schema.createTableFile=/opt/wirekite/output/schema/create_tables.sql
target.schema.createConstraintFile=/opt/wirekite/output/schema/constraints.sql
target.schema.createForeignKeyFile=/opt/wirekite/output/schema/foreign_keys.sql
target.schema.logFile=/var/log/wirekite/schema-loader.log

# Data loading
target.data.dsnFile=/opt/wirekite/config/databricks.dsn
target.data.awsBucket=my-staging-bucket
target.data.awsRegion=us-east-1
target.data.schemaFile=/opt/wirekite/output/schema/wirekite_schema.skt
target.data.logFile=/var/log/wirekite/data-loader.log
target.data.maxThreads=8
target.data.awsCredentials=aws_access_key_id=KEY,aws_secret_access_key=SECRET

# Change loading (CDC)
target.change.dsnFile=/opt/wirekite/config/databricks.dsn
target.change.awsBucket=my-staging-bucket
target.change.awsRegion=us-east-1
target.change.inputDirectory=/opt/wirekite/output/changes
target.change.workDirectory=/opt/wirekite/work
target.change.schemaFile=/opt/wirekite/output/schema/wirekite_schema.skt
target.change.logFile=/var/log/wirekite/change-loader.log
target.change.maxFilesPerBatch=30
target.change.awsCredentials=aws_access_key_id=KEY,aws_secret_access_key=SECRET
For complete Orchestrator documentation, see the Execution Guide.