Overview
Wirekite supports Databricks (using Delta Lake) as a target data lakehouse for:
- Schema Loading - Create target tables from Wirekite’s intermediate schema format
- Data Loading - Bulk load extracted data via cloud storage staging
- Change Loading (CDC) - Apply ongoing changes using MERGE operations
The Databricks loaders stage data in an AWS S3 or Google Cloud Storage bucket, then load it into Databricks with COPY INTO commands. Tables are created in Delta Lake format.
Prerequisites
Before configuring Databricks as a Wirekite target, ensure the following requirements are met:
Databricks Configuration
- Workspace: Have a Databricks workspace with SQL warehouse or cluster
- Catalog & Schema: Create the target catalog and schema
- Cloud Storage: Configure an S3 bucket or GCS bucket for staging
- Authentication: Configure service principal or personal access token
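As a minimal sketch of the catalog and schema prerequisite, assuming Unity Catalog, the Databricks SQL below creates a catalog and schema and grants access to the principal the loaders connect as. The names analytics, wirekite_target, and wirekite-loader are placeholders, not Wirekite defaults.

```sql
-- Placeholder names; substitute your own catalog, schema, and principal.
CREATE CATALOG IF NOT EXISTS analytics;
CREATE SCHEMA IF NOT EXISTS analytics.wirekite_target;

-- Allow the loader's service principal to create and modify tables in the schema.
GRANT USE CATALOG ON CATALOG analytics TO `wirekite-loader`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY, SELECT ON SCHEMA analytics.wirekite_target TO `wirekite-loader`;
```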
Cloud Storage Requirements
Schema Loader
The Schema Loader reads Wirekite’s intermediate schema format (.skt file) and generates Databricks-appropriate DDL statements for creating Delta Lake tables.
Tables are created with the USING DELTA format. The Schema Loader also generates merge tables with a _wkm suffix for CDC operations.
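To illustrate the kind of DDL the Schema Loader produces (not its verbatim output), a base table and its _wkm merge table might look like the following; the table and column names are hypothetical.

```sql
-- Hypothetical example of Schema Loader-style DDL; names are illustrative.
CREATE TABLE IF NOT EXISTS analytics.wirekite_target.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  status      STRING,
  order_total DECIMAL(18,2),
  updated_at  TIMESTAMP
) USING DELTA;

-- Companion merge table (note the _wkm suffix) used to stage CDC changes.
-- Its exact column layout is Wirekite-internal; shown here mirroring the base table.
CREATE TABLE IF NOT EXISTS analytics.wirekite_target.orders_wkm (
  order_id    BIGINT,
  customer_id BIGINT,
  status      STRING,
  order_total DECIMAL(18,2),
  updated_at  TIMESTAMP
) USING DELTA;
```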
Required Parameters
- Path to the Wirekite schema file (.skt) generated by the Schema Extractor. Must be an absolute path.
- Output file for CREATE TABLE statements using Delta Lake format.
- Output file for CHECK constraints (Databricks has limited constraint support).
- Output file for FOREIGN KEY constraints (informational, commented out in output).
- Absolute path to the log file for Schema Loader operations.
Optional Parameters
- Output file for DROP TABLE IF EXISTS statements. Set to “none” to skip generation.
- Output file for recovery table creation SQL. Set to “none” to skip.
- When true, generates merge tables (_wkm suffix) for CDC operations. Set to false if only doing data loads.
Data Loader
The Data Loader reads data files from cloud storage and loads them into Databricks tables using COPY INTO operations.
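For context, a single table load corresponds roughly to a Databricks COPY INTO statement like the one below; the bucket path, table name, and file format options are illustrative assumptions, not the loader's exact output.

```sql
-- Hypothetical COPY INTO for one staged table; paths and options are illustrative.
COPY INTO analytics.wirekite_target.orders
FROM 's3://my-staging-bucket/wirekite/orders/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'false', 'delimiter' = ',')
COPY_OPTIONS ('force' = 'true');
```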
Required Parameters
- Path to a file containing the Databricks connection string.
- Path to the Wirekite schema file used by Schema Loader. Required for table structure information.
- Absolute path to the log file for Data Loader operations.
Cloud Storage Parameters (one required)
- AWS S3 bucket name for staging data. Required if using AWS.
- GCS bucket name for staging data. Required if using Google Cloud.
Optional AWS Parameters
- AWS region where the S3 bucket resides.
- AWS credentials in format: aws_access_key_id=KEY,aws_secret_access_key=SECRET. Not required if using IAM roles.
Optional GCS Parameters
- Path to GCS service account credentials JSON file. Uses Application Default Credentials if not specified.
General Optional Parameters
- Maximum number of parallel threads for loading tables.
- Set to true if data was extracted using hex encoding instead of base64. Uses unhex() for decoding.
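As a quick illustration of the unhex() decoding applied when this option is enabled, the Databricks SQL below converts a hex string back to text; the literal value is just an example.

```sql
-- unhex() turns a hex string into binary; casting to STRING yields the original text.
SELECT CAST(unhex('576972656B697465') AS STRING);  -- returns 'Wirekite'
```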
Change Loader
The Change Loader applies ongoing data changes (INSERT, UPDATE, DELETE) to Databricks tables using MERGE operations with shadow tables.
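Conceptually, each batch of changes is staged into a table's _wkm shadow table and then merged into the target, roughly as sketched below. The key column, the wk_op change-operation column, and its I/U/D values are assumptions for illustration, not Wirekite's actual column names.

```sql
-- Sketch of a CDC merge from the shadow (_wkm) table into the target table.
-- 'order_id' is the key; 'wk_op' marking insert/update/delete is a hypothetical metadata column.
MERGE INTO analytics.wirekite_target.orders AS t
USING analytics.wirekite_target.orders_wkm AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.wk_op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  t.customer_id = s.customer_id,
  t.status      = s.status,
  t.order_total = s.order_total,
  t.updated_at  = s.updated_at
WHEN NOT MATCHED AND s.wk_op <> 'D' THEN INSERT
  (order_id, customer_id, status, order_total, updated_at)
  VALUES (s.order_id, s.customer_id, s.status, s.order_total, s.updated_at);
```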
Required Parameters
- Path to a file containing the Databricks connection string.
- Directory containing change files (.ckt) from the Change Extractor.
- Working directory for temporary CSV files during merge operations. Must be writable.
- Path to the Wirekite schema file for table structure information.
- Absolute path to the log file for Change Loader operations.
Cloud Storage Parameters (one required)
- AWS S3 bucket name for staging change data. Required if using AWS.
- GCS bucket name for staging change data. Required if using Google Cloud.
Optional AWS Parameters
- AWS region where the S3 bucket resides.
- AWS credentials in format: aws_access_key_id=KEY,aws_secret_access_key=SECRET.
Optional GCS Parameters
- Path to GCS service account credentials JSON file.
General Optional Parameters
- Maximum number of change files to process in a single batch.
- Set to true if change data was extracted using hex encoding.
Orchestrator Configuration
When using the Wirekite Orchestrator, prefix each parameter with mover., target.schema., target.data., or target.change., depending on the component it configures.
Example orchestrator configuration for Databricks target (using AWS S3):
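The sketch below only shows the prefixing convention, in a simple key=value style; the parameter names, paths, and bucket names are hypothetical placeholders, so consult the Wirekite parameter reference for the actual keys and file format.

```
# Hypothetical keys shown only to illustrate the prefixes; the real parameter
# names and file format come from the Wirekite documentation, not this sketch.
target.schema.schema_file=/opt/wirekite/schemas/sales.skt
target.schema.ddl_file=/opt/wirekite/out/create_tables.sql
target.schema.log_file=/opt/wirekite/logs/schema_loader.log

target.data.connection_file=/opt/wirekite/secrets/databricks.conn
target.data.s3_bucket=my-staging-bucket
target.data.s3_region=us-east-1
target.data.log_file=/opt/wirekite/logs/data_loader.log

target.change.connection_file=/opt/wirekite/secrets/databricks.conn
target.change.change_dir=/opt/wirekite/changes
target.change.s3_bucket=my-staging-bucket
target.change.log_file=/opt/wirekite/logs/change_loader.log
```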
