
AMAZON S3

type: s3

Amazon Simple Storage Service. The most widely used cloud object store, ideal for data lakes, backups, and cross-region data distribution.

PREREQUISITES

SDK: boto3 — installed automatically by:

dvt sync

CONFIGURATION FIELDS

FIELD              REQUIRED  DEFAULT    DESCRIPTION
type               yes       —          Must be `s3`
bucket             yes       —          S3 bucket name
region             yes       us-east-1  AWS region (e.g., us-east-1, eu-west-1)
access_key_id      no        —          AWS access key ID. Omit to use IAM role or environment variables
secret_access_key  no        —          AWS secret access key. Omit to use IAM role or environment variables
format             no        parquet    Default file format: csv, parquet, json, jsonl, delta, avro
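
To illustrate how these fields combine, here is a minimal validation sketch in Python. It is a hypothetical helper, not part of dvt; the function name and defaults mirror the table above.

```python
# Illustrative validator for an `s3` output block (assumption: not dvt's actual code).
REQUIRED = {"type", "bucket", "region"}
DEFAULTS = {"region": "us-east-1", "format": "parquet"}
VALID_FORMATS = {"csv", "parquet", "json", "jsonl", "delta", "avro"}

def validate_s3_output(cfg: dict) -> dict:
    """Merge defaults into a profiles.yml output block and check required fields."""
    merged = {**DEFAULTS, **cfg}
    missing = REQUIRED - merged.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if merged["type"] != "s3":
        raise ValueError("type must be 's3'")
    if merged["format"] not in VALID_FORMATS:
        raise ValueError(f"unsupported format: {merged['format']}")
    return merged
```

With only `type` and `bucket` supplied, the sketch fills in `region: us-east-1` and `format: parquet`, matching the defaults in the table.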

PROFILES.YML EXAMPLE

my_project:
  target: prod_snowflake
  outputs:
    prod_snowflake:
      type: snowflake
      account: xy12345.us-east-1
      user: DVT_USER
      password: "{{ env_var('SF_PASSWORD') }}"
      database: ANALYTICS
      schema: PUBLIC
      warehouse: COMPUTE_WH

    data_lake:
      type: s3
      bucket: company-data-lake
      region: us-east-1
      access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
      secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
      format: parquet                  # default format for writes

SOURCES.YML EXAMPLE

sources:
  - name: raw_s3
    connection: data_lake            # must match profiles.yml output name
    tables:
      - name: customers
      - name: orders
      - name: events

MODEL EXAMPLE — BUCKET AS SOURCE

Sling extracts files from the bucket into the DuckDB cache. Model SQL executes in DuckDB (Postgres-like dialect).

-- models/staging/stg_customers.sql
-- Extraction: S3 → DuckDB cache → Snowflake
-- Written in DuckDB SQL dialect

{{ config(materialized='table') }}

SELECT
    id,
    name,
    email,
    created_at,
    CURRENT_TIMESTAMP AS dvt_loaded_at
FROM {{ source('raw_s3', 'customers') }}
WHERE name IS NOT NULL

MODEL EXAMPLE — BUCKET AS TARGET

Model executes on the default target. Sling streams the result to the bucket in the specified format. Use config(format='...') to override the default format set in profiles.yml.

-- models/archive/archive_orders.sql
-- Materialize to S3 bucket
-- format overrides the default set in profiles.yml

{{ config(
    materialized='table',
    target='data_lake',
    format='delta'
) }}

SELECT *
FROM {{ ref('fct_orders') }}
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)

-- Supported format values:
--   'csv'      — comma-separated values
--   'parquet'  — columnar (default, recommended)
--   'json'     — JSON objects
--   'jsonl'    — JSON Lines (one record per line)
--   'delta'    — Delta Lake (schema evolution, time travel)
--   'avro'     — Apache Avro

FILE FORMAT CONFIGURATION

The output format is resolved in priority order:

1. config(format='delta') in the model file — highest priority
2. format: in the profiles.yml output — project default
3. parquet — built-in default
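The resolution order above can be sketched in a few lines of Python. This is a hedged illustration of the precedence rules, not dvt's internal implementation; the function and parameter names are assumptions.

```python
# Sketch of the three-level format resolution (illustrative; not dvt internals).
BUILTIN_DEFAULT = "parquet"

def resolve_format(model_config_format=None, profile_format=None):
    # 1. config(format=...) in the model file wins
    if model_config_format:
        return model_config_format
    # 2. format: in the profiles.yml output is the project default
    if profile_format:
        return profile_format
    # 3. otherwise fall back to the built-in default
    return BUILTIN_DEFAULT
```

So a model with config(format='delta') writes Delta even if the profile says parquet, and a model with no format config inherits whatever the profile sets.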

FORMAT       VALUE    NOTES
Parquet      parquet  Columnar, compressed. Best for analytics. Default.
Delta Lake   delta    Schema evolution, time travel, ACID transactions.
CSV          csv      Universal compatibility. No schema enforcement.
JSON         json     Nested structures, human-readable.
JSON Lines   jsonl    One record per line. Streamable.
Avro         avro     Schema embedded, compact binary. Kafka-friendly.
IPC / Arrow  ipc      Arrow IPC format. Zero-copy reads.

INCREMENTAL EXTRACTION

Incremental models work with cloud storage sources. DVT resolves the watermark from the target database, extracts only changed rows, and merges them in the DuckDB cache.

-- models/staging/stg_orders.sql
-- Incremental extraction from S3

{{ config(
    materialized='incremental',
    unique_key='id'
) }}

SELECT id, customer_id, order_date, total, updated_at
FROM {{ source('raw_s3', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
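
The watermark logic behind that is_incremental() block can be sketched as follows. This is an illustrative Python model of the behavior described above (resolve MAX(updated_at) from the target, then keep only newer source rows); the function names are assumptions, not dvt internals.

```python
# Hedged sketch of watermark-based incremental extraction (illustrative only).

def resolve_watermark(target_rows, key="updated_at"):
    """MAX(updated_at) from the target table, or None on the first run."""
    values = [r[key] for r in target_rows if r.get(key) is not None]
    return max(values) if values else None

def incremental_filter(source_rows, watermark, key="updated_at"):
    """Keep only source rows newer than the watermark (all rows on first run)."""
    if watermark is None:
        return list(source_rows)
    return [r for r in source_rows if r[key] > watermark]
```

On the first run the watermark is None and every row is extracted; on later runs only rows with updated_at past the target's maximum are pulled from the bucket.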

NOTES

  • Credentials via env vars or IAM role — never hardcode keys
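
For example, credentials can be supplied as environment variables, which the env_var() calls in profiles.yml then pick up. The values below are placeholders; set real credentials in your shell profile or CI secret store.

```shell
# Placeholder values: substitute real credentials from a secret store
export AWS_ACCESS_KEY_ID="example-access-key-id"
export AWS_SECRET_ACCESS_KEY="example-secret-access-key"
```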

DEBUGGING

Use dvt debug to test bucket connectivity. DVT verifies read/write access for each cloud storage connection.

dvt debug