SPARK

type: spark

Distributed compute engine for large-scale data processing. Used with Hive Metastore, HDFS, and cloud storage.

PREREQUISITES

Driver: pyhive, installed automatically by:

dvt sync

CONFIGURATION FIELDS

FIELD    TYPE     REQUIRED  DEFAULT  DESCRIPTION
type     string   yes       -        Must be `spark`
host     string   yes       -        Thrift server hostname
port     integer  no        10000    Thrift server port
user     string   no        -        Username (if auth enabled)
schema   string   no        default  Default database/schema
threads  integer  no        4        Number of parallel threads
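
The required/default rules in the table can be sketched in a few lines of Python. This is only an illustration of how the documented defaults combine with a profile output. The helper `resolve_spark_output` and the constants are hypothetical names and are not part of DVT's API.

```python
# Hypothetical helper (not part of DVT): applies the documented defaults
# to one `outputs` entry and checks the two required fields.
FIELD_DEFAULTS = {"port": 10000, "schema": "default", "threads": 4}
REQUIRED_FIELDS = ("type", "host")


def resolve_spark_output(output: dict) -> dict:
    """Return a copy of `output` with documented defaults filled in."""
    missing = [name for name in REQUIRED_FIELDS if name not in output]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    if output["type"] != "spark":
        raise ValueError("type must be 'spark'")
    # Values given explicitly in the profile override the defaults.
    # `user` has no documented default, so it stays absent unless provided.
    return {**FIELD_DEFAULTS, **output}


spark_dev = {
    "type": "spark",
    "host": "spark-thrift.internal.com",
    "schema": "analytics",
}
print(resolve_spark_output(spark_dev))
```

Here `port` and `threads` fall back to 10000 and 4, while `schema` keeps the explicit value `analytics`.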

PROFILES.YML EXAMPLE

my_project:
  target: spark_dev
  outputs:
    spark_dev:
      type: spark
      host: spark-thrift.internal.com
      port: 10000
      schema: analytics

SOURCES.YML EXAMPLE

sources:
  - name: hadoop_data
    connection: spark_dev
    schema: raw
    tables:
      - name: logs
      - name: events

INCREMENTAL STRATEGIES

  • Append
  • Delete+Insert
  • Merge

KNOWN LIMITATIONS

  • Requires a running Thrift server, reachable at the configured host and port
  • Performance depends on cluster configuration
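
Because the adapter needs a reachable Thrift server, a quick TCP check can help rule out network problems before running a project. The sketch below uses only the Python standard library. The function name `thrift_port_open` is hypothetical, and the check confirms only that something accepts connections on the port, not that it is a working Thrift service.

```python
import socket


def thrift_port_open(host: str, port: int = 10000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and opens the socket;
        # the `with` block closes it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call, using the host from the profiles.yml example:
# thrift_port_open("spark-thrift.internal.com", 10000)
```

If this returns False, check the `host` and `port` values in the profile and confirm that the Thrift server process is running.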