SPARK
type: spark
Distributed compute engine for large-scale data processing. Used with Hive Metastore, HDFS, and cloud storage.
PREREQUISITES
Driver: pyhive — installed automatically by:
dvt sync
CONFIGURATION FIELDS
| FIELD | TYPE | REQUIRED | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| type | string | yes | — | Must be `spark` |
| host | string | yes | — | Thrift server hostname |
| port | integer | no | 10000 | Thrift server port |
| user | string | no | — | Username (if auth enabled) |
| schema | string | no | default | Default database/schema |
| threads | integer | no | 4 | Number of parallel threads |
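The required/default rules in the table above can be sketched as a small validation helper. This is a hypothetical illustration, not part of dvt; the function name and structure are assumptions:

```python
# Hypothetical helper (not part of dvt): checks a Spark target dict
# against the field rules in the table above and applies defaults.
DEFAULTS = {"port": 10000, "schema": "default", "threads": 4}
REQUIRED = {"type", "host"}

def validate_spark_target(target: dict) -> dict:
    """Return a copy of `target` with defaults filled in, or raise ValueError."""
    missing = REQUIRED - target.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if target["type"] != "spark":
        raise ValueError("type must be 'spark'")
    # Explicit values win over defaults.
    return {**DEFAULTS, **target}
```

For example, a target with only `type` and `host` set comes back with `port: 10000`, `schema: default`, and `threads: 4` applied.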
PROFILES.YML EXAMPLE
my_project:
  target: spark_dev
  outputs:
    spark_dev:
      type: spark
      host: spark-thrift.internal.com
      port: 10000
      schema: analytics

SOURCES.YML EXAMPLE
sources:
  - name: hadoop_data
    connection: spark_dev
    schema: raw
    tables:
      - name: logs
      - name: events

INCREMENTAL STRATEGIES
✓ Append
✓ Delete+Insert
✓ Merge
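As a rough illustration of how the three strategies differ, here is a plain-Python sketch (not actual Spark SQL) of combining a new batch of rows with an existing table, keyed on an assumed `id` column:

```python
# Plain-Python sketch of the three incremental strategies, keyed on "id".
# Rows are dicts; in Spark these operations run as SQL against the table.

def append(existing: list, new: list) -> list:
    # Append: new rows are added as-is; duplicate keys are possible.
    return existing + new

def delete_insert(existing: list, new: list) -> list:
    # Delete+Insert: rows whose key appears in the new batch are deleted,
    # then the entire new batch is inserted.
    new_ids = {row["id"] for row in new}
    return [row for row in existing if row["id"] not in new_ids] + new

def merge(existing: list, new: list) -> list:
    # Merge: matching keys are updated, unmatched keys are inserted.
    # (Real MERGE can also update only selected columns.)
    by_id = {row["id"]: row for row in existing}
    for row in new:
        by_id[row["id"]] = row
    return list(by_id.values())
```

With whole-row updates, delete+insert and merge end up with the same rows; they differ in how the engine executes the change and in merge's ability to update individual columns.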
KNOWN LIMITATIONS
- ⚠ Requires a running Thrift server
- ⚠ Performance depends on cluster configuration
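Because a running Thrift server is a hard prerequisite, a quick TCP reachability check can save debugging time before running dvt. This is a generic socket probe, not a dvt feature; the host and port are placeholders:

```python
import socket

def thrift_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Usage: `thrift_reachable("spark-thrift.internal.com", 10000)` should return True before attempting a run. Note this only confirms the port is open, not that the Thrift service is healthy.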