
AMAZON S3

type: s3

Amazon Simple Storage Service. The most widely used cloud object store, ideal for data lakes, backups, and cross-region data distribution.

PREREQUISITES

SDK: boto3 — installed automatically by:

dvt sync

CONFIGURATION FIELDS

FIELD              REQUIRED  DEFAULT    DESCRIPTION
type               yes       —          Must be `s3`
bucket             yes       —          S3 bucket name
region             yes       us-east-1  AWS region (e.g., us-east-1, eu-west-1)
access_key_id      no        —          AWS access key ID. Omit to use IAM role or environment variables
secret_access_key  no        —          AWS secret access key. Omit to use IAM role or environment variables
format             no        parquet    Default file format: csv, parquet, json, jsonl, delta, avro
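
To illustrate how these fields combine, here is a minimal validation sketch in Python. It is a hypothetical helper, not part of dvt; the function name and defaults mirror the table above.

```python
# Illustrative validator for an `s3` output block (assumption: not dvt's actual code).
REQUIRED = {"type", "bucket", "region"}
DEFAULTS = {"region": "us-east-1", "format": "parquet"}
VALID_FORMATS = {"csv", "parquet", "json", "jsonl", "delta", "avro"}

def validate_s3_output(cfg: dict) -> dict:
    """Merge defaults into a profiles.yml output block and check required fields."""
    merged = {**DEFAULTS, **cfg}
    missing = REQUIRED - merged.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if merged["type"] != "s3":
        raise ValueError("type must be 's3'")
    if merged["format"] not in VALID_FORMATS:
        raise ValueError(f"unsupported format: {merged['format']}")
    return merged
```

With only `type` and `bucket` supplied, the sketch fills in `region: us-east-1` and `format: parquet`, matching the defaults in the table.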

PROFILES.YML EXAMPLE

my_project:
  target: prod_snowflake
  outputs:
    prod_snowflake:
      type: snowflake
      account: xy12345.us-east-1
      user: DVT_USER
      password: "{{ env_var('SF_PASSWORD') }}"
      database: ANALYTICS
      schema: PUBLIC
      warehouse: COMPUTE_WH

    data_lake:
      type: s3
      bucket: company-data-lake
      region: us-east-1
      access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
      secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
      format: parquet                  # default format for writes

SOURCES.YML EXAMPLE

sources:
  - name: raw_s3
    connection: data_lake            # must match profiles.yml output name
    tables:
      - name: customers
      - name: orders
      - name: events

MODEL EXAMPLE — BUCKET AS SOURCE

Sling extracts files from the bucket into the DuckDB cache. Model SQL executes in DuckDB (Postgres-like dialect).

-- models/staging/stg_customers.sql
-- Extraction: S3 → DuckDB cache → Snowflake
-- Written in DuckDB SQL dialect

{{ config(materialized='table') }}

SELECT
    id,
    name,
    email,
    created_at,
    CURRENT_TIMESTAMP AS dvt_loaded_at
FROM {{ source('raw_s3', 'customers') }}
WHERE name IS NOT NULL

MODEL EXAMPLE — BUCKET AS TARGET

Model executes on the default target. Sling streams the result to the bucket in the specified format. Use config(format='...') to override the default format set in profiles.yml.

-- models/archive/archive_orders.sql
-- Materialize to S3 bucket
-- format overrides the default set in profiles.yml

{{ config(
    materialized='table',
    target='data_lake',
    format='delta'
) }}

SELECT *
FROM {{ ref('fct_orders') }}
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)

-- Supported format values:
--   'csv'      — comma-separated values
--   'parquet'  — columnar (default, recommended)
--   'json'     — JSON objects
--   'jsonl'    — JSON Lines (one record per line)
--   'delta'    — Delta Lake (schema evolution, time travel)
--   'avro'     — Apache Avro

FILE FORMAT CONFIGURATION

The output format is resolved in priority order:

1. config(format='delta') in the model file — highest priority
2. format: in the profiles.yml output — project default
3. parquet — built-in default
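The resolution order above can be sketched in a few lines of Python. This is a hedged illustration of the precedence rules, not dvt's internal implementation; the function and parameter names are assumptions.

```python
# Sketch of the three-level format resolution (illustrative; not dvt internals).
BUILTIN_DEFAULT = "parquet"

def resolve_format(model_config_format=None, profile_format=None):
    # 1. config(format=...) in the model file wins
    if model_config_format:
        return model_config_format
    # 2. format: in the profiles.yml output is the project default
    if profile_format:
        return profile_format
    # 3. otherwise fall back to the built-in default
    return BUILTIN_DEFAULT
```

So a model with config(format='delta') writes Delta even if the profile says parquet, and a model with no format config inherits whatever the profile sets.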

FORMAT       VALUE    NOTES
Parquet      parquet  Columnar, compressed. Best for analytics. Default.
Delta Lake   delta    Schema evolution, time travel, ACID transactions.
CSV          csv      Universal compatibility. No schema enforcement.
JSON         json     Nested structures, human-readable.
JSON Lines   jsonl    One record per line. Streamable.
Avro         avro     Schema embedded, compact binary. Kafka-friendly.
IPC / Arrow  ipc      Arrow IPC format. Zero-copy reads.

INCREMENTAL EXTRACTION

Incremental models work with cloud storage sources. DVT resolves the watermark from the target database, extracts only changed rows, and merges them in the DuckDB cache.

-- models/staging/stg_orders.sql
-- Incremental extraction from S3

{{ config(
    materialized='incremental',
    unique_key='id'
) }}

SELECT id, customer_id, order_date, total, updated_at
FROM {{ source('raw_s3', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
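
The watermark logic behind that is_incremental() block can be sketched as follows. This is an illustrative Python model of the behavior described above (resolve MAX(updated_at) from the target, then keep only newer source rows); the function names are assumptions, not dvt internals.

```python
# Hedged sketch of watermark-based incremental extraction (illustrative only).

def resolve_watermark(target_rows, key="updated_at"):
    """MAX(updated_at) from the target table, or None on the first run."""
    values = [r[key] for r in target_rows if r.get(key) is not None]
    return max(values) if values else None

def incremental_filter(source_rows, watermark, key="updated_at"):
    """Keep only source rows newer than the watermark (all rows on first run)."""
    if watermark is None:
        return list(source_rows)
    return [r for r in source_rows if r[key] > watermark]
```

On the first run the watermark is None and every row is extracted; on later runs only rows with updated_at past the target's maximum are pulled from the bucket.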

NOTES

  • Credentials via env vars or IAM role — never hardcode keys
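
For example, credentials can be supplied as environment variables, which the env_var() calls in profiles.yml then pick up. The values below are placeholders; set real credentials in your shell profile or CI secret store.

```shell
# Placeholder values: substitute real credentials from a secret store
export AWS_ACCESS_KEY_ID="example-access-key-id"
export AWS_SECRET_ACCESS_KEY="example-secret-access-key"
```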

DEBUGGING

Use dvt debug to test bucket connectivity. DVT verifies read/write access for each cloud storage connection.

dvt debug