AMAZON S3
type: s3
Amazon Simple Storage Service. The most widely used cloud object store, ideal for data lakes, backups, and cross-region data distribution.
PREREQUISITES
SDK: boto3 — installed automatically by:
dvt sync
CONFIGURATION FIELDS
| FIELD | REQUIRED | DEFAULT | DESCRIPTION |
|---|---|---|---|
| type | yes | — | Must be `s3` |
| bucket | yes | — | S3 bucket name |
| region | no | us-east-1 | AWS region (e.g., us-east-1, eu-west-1) |
| access_key_id | no | — | AWS access key ID. Omit to use IAM role or environment variables |
| secret_access_key | no | — | AWS secret access key. Omit to use IAM role or environment variables |
| format | no | parquet | Default file format: csv, parquet, json, jsonl, delta, avro, ipc |
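Because both key fields are optional, credentials are resolved through a chain: explicit values in profiles.yml, then the standard AWS environment variables, then the IAM role picked up by the SDK. A minimal pure-Python sketch of that order (illustrative only, not DVT or boto3 internals):

```python
import os

def resolve_credentials(profile: dict):
    """Illustrative resolution order for the optional key fields:
    explicit profiles.yml values, then AWS_* environment variables,
    then None (fall through to the IAM role)."""
    # 1. Explicit keys in the profiles.yml output win.
    if profile.get("access_key_id") and profile.get("secret_access_key"):
        return profile["access_key_id"], profile["secret_access_key"]
    # 2. Otherwise fall back to the standard AWS environment variables.
    env_id = os.environ.get("AWS_ACCESS_KEY_ID")
    env_secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if env_id and env_secret:
        return env_id, env_secret
    # 3. Nothing explicit: return None and let the SDK use the IAM role.
    return None
```

In practice boto3 applies a similar default chain on its own, which is why omitting both keys in profiles.yml is safe on EC2/ECS with an attached role.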
PROFILES.YML EXAMPLE
my_project:
  target: prod_snowflake
  outputs:
    prod_snowflake:
      type: snowflake
      account: xy12345.us-east-1
      user: DVT_USER
      password: "{{ env_var('SF_PASSWORD') }}"
      database: ANALYTICS
      schema: PUBLIC
      warehouse: COMPUTE_WH
    data_lake:
      type: s3
      bucket: company-data-lake
      region: us-east-1
      access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
      secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
      format: parquet  # default format for writes
SOURCES.YML EXAMPLE
sources:
  - name: raw_s3
    connection: data_lake  # must match profiles.yml output name
    tables:
      - name: customers
      - name: orders
      - name: events
MODEL EXAMPLE — BUCKET AS SOURCE
Sling extracts files from the bucket into the DuckDB cache. Model SQL executes in DuckDB (Postgres-like dialect).
-- models/staging/stg_customers.sql
-- Extraction: S3 → DuckDB cache → Snowflake
-- Written in DuckDB SQL dialect
{{ config(materialized='table') }}
SELECT
    id,
    name,
    email,
    created_at,
    CURRENT_TIMESTAMP AS dvt_loaded_at
FROM {{ source('raw_s3', 'customers') }}
WHERE name IS NOT NULL
MODEL EXAMPLE — BUCKET AS TARGET
Model executes on the default target. Sling streams the result to the bucket in the specified format. Use config(format='...') to override the default format set in profiles.yml.
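Conceptually the flow is: run the model query on the default target, encode the result set in the configured format, and upload the payload to the bucket. A plain-Python sketch of the encoding step (illustrative only, not Sling/DVT internals; `serialize_rows` is a made-up name, and the binary formats are stubbed out):

```python
import csv
import io
import json

def serialize_rows(rows: list, fmt: str) -> bytes:
    """Stand-in for the streaming step: encode a result set in the
    format chosen by config(format=...) or the profiles.yml default."""
    if fmt == "jsonl":  # one JSON object per line
        return "\n".join(json.dumps(r) for r in rows).encode()
    if fmt == "json":   # a single JSON array
        return json.dumps(rows).encode()
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue().encode()
    # parquet / delta / avro are binary and need pyarrow, deltalake,
    # fastavro, etc., so they are not sketched here.
    raise NotImplementedError(f"binary format {fmt!r} not sketched here")

rows = [{"id": 1, "total": 9.5}, {"id": 2, "total": 3.0}]
payload = serialize_rows(rows, "jsonl")
# the real writer would then upload payload to s3://<bucket>/<path>
```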
-- models/archive/archive_orders.sql
-- Materialize to S3 bucket
-- format overrides the default set in profiles.yml
{{ config(
    materialized='table',
    target='data_lake',
    format='delta'
) }}
SELECT *
FROM {{ ref('fct_orders') }}
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)
-- Supported format values:
-- 'csv' — comma-separated values
-- 'parquet' — columnar (default, recommended)
-- 'json' — JSON objects
-- 'jsonl' — JSON Lines (one record per line)
-- 'delta' — Delta Lake (schema evolution, time travel)
-- 'avro' — Apache Avro
FILE FORMAT CONFIGURATION
The output format is resolved in order of priority:
1. config(format='delta') in the model file — highest priority
2. format: in the profiles.yml output — project default
3. parquet — built-in default
| FORMAT | VALUE | NOTES |
|---|---|---|
| Parquet | parquet | Columnar, compressed. Best for analytics. Default. |
| Delta Lake | delta | Schema evolution, time travel, ACID transactions. |
| CSV | csv | Universal compatibility. No schema enforcement. |
| JSON | json | Nested structures, human-readable. |
| JSON Lines | jsonl | One record per line. Streamable. |
| Avro | avro | Schema embedded, compact binary. Kafka-friendly. |
| IPC / Arrow | ipc | Arrow IPC format. Zero-copy reads. |
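The three-level resolution (model-level config, then the profiles.yml output, then the parquet built-in) is simple enough to express directly; a minimal sketch with illustrative names:

```python
def resolve_format(model_config: dict, output_config: dict) -> str:
    """Pick the write format: model-level config(format=...) wins,
    then the output's format: key in profiles.yml, then parquet."""
    return (
        model_config.get("format")
        or output_config.get("format")
        or "parquet"  # built-in default
    )

# config(format='delta') in the model overrides everything:
resolve_format({"format": "delta"}, {"format": "csv"})  # -> 'delta'
resolve_format({}, {"format": "csv"})                   # -> 'csv'
resolve_format({}, {})                                  # -> 'parquet'
```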
INCREMENTAL EXTRACTION
Incremental models work with cloud storage sources. DVT resolves the watermark from the target database, extracts only changed rows, and merges them in the DuckDB cache.
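The watermark logic can be sketched in plain Python (again illustrative, not DVT's actual code): read the high-water mark from the target, then keep only source rows past it, which mirrors the `WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})` predicate below.

```python
from datetime import datetime

def incremental_filter(source_rows, target_rows, key="updated_at"):
    """Keep only source rows newer than the target's watermark."""
    if not target_rows:  # first run: no watermark, extract everything
        return list(source_rows)
    watermark = max(r[key] for r in target_rows)
    return [r for r in source_rows if r[key] > watermark]

target = [{"id": 1, "updated_at": datetime(2024, 1, 1)}]
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},  # unchanged: skipped
    {"id": 2, "updated_at": datetime(2024, 2, 1)},  # newer: extracted
]
changed = incremental_filter(source, target)  # only id 2 survives
```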
-- models/staging/stg_orders.sql
-- Incremental extraction from S3
{{ config(
    materialized='incremental',
    unique_key='id'
) }}
SELECT id, customer_id, order_date, total, updated_at
FROM {{ source('raw_s3', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
NOTES
- ⚠ Supply credentials via environment variables or an IAM role — never hardcode keys
DEBUGGING
Use dvt debug to test bucket connectivity. DVT verifies read/write access for each cloud storage connection.
dvt debug