GOOGLE CLOUD STORAGE

type: gcs

Google's unified object storage. Tightly integrated with BigQuery, Dataflow, and the GCP ecosystem.

PREREQUISITES

SDK: google-cloud-storage — installed automatically by:

dvt sync

CONFIGURATION FIELDS

FIELD     REQUIRED  DEFAULT  DESCRIPTION
type      yes                Must be `gcs`
bucket    yes                GCS bucket name
project   no                 GCP project ID
format    no        parquet  Default file format: csv, parquet, json, jsonl, delta, avro

PROFILES.YML EXAMPLE

my_project:
  target: prod_bigquery
  outputs:
    prod_bigquery:
      type: bigquery
      project: my-gcp-project
      dataset: analytics
      location: US

    gcs_lake:
      type: gcs
      bucket: my-gcs-data-lake
      project: my-gcp-project
      format: parquet                  # default format for writes

SOURCES.YML EXAMPLE

sources:
  - name: raw_gcs
    connection: gcs_lake             # must match profiles.yml output name
    tables:
      - name: web_events
      - name: app_sessions
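
The connection field must name an output defined in profiles.yml. That cross-check can be sketched as follows (hypothetical helper, not part of DVT; plain dicts stand in for the parsed YAML):

```python
def unresolved_connections(outputs: dict, sources: list) -> list:
    """Return source names whose `connection` matches no profiles.yml output."""
    return [s["name"] for s in sources if s.get("connection") not in outputs]

# Mirrors the profiles.yml and sources.yml examples above
profiles_outputs = {"prod_bigquery": {}, "gcs_lake": {}}
sources = [
    {"name": "raw_gcs", "connection": "gcs_lake"},
]

print(unresolved_connections(profiles_outputs, sources))  # → []
```

A typo in connection (e.g. `gcs_lak`) would surface here as an unresolved name rather than failing later at extraction time.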

MODEL EXAMPLE — BUCKET AS SOURCE

Sling extracts files from the bucket into the DuckDB cache. Model SQL executes in DuckDB (Postgres-like dialect).

-- models/staging/stg_web_events.sql
-- Extraction: GCS → DuckDB cache → BigQuery
-- Written in DuckDB SQL dialect

{{ config(materialized='table') }}

SELECT
    event_id,
    user_id,
    event_type,
    event_timestamp
FROM {{ source('raw_gcs', 'web_events') }}

MODEL EXAMPLE — BUCKET AS TARGET

Model executes on the default target. Sling streams the result to the bucket in the specified format. Use config(format='...') to override the default format set in profiles.yml.

-- models/export/export_report.sql
-- Materialize to GCS bucket
-- format overrides the default set in profiles.yml

{{ config(
    materialized='table',
    target='gcs_lake',
    format='csv'
) }}

SELECT *
FROM {{ ref('monthly_report') }}

-- Supported format values:
--   'csv'      — comma-separated values
--   'parquet'  — columnar (default, recommended)
--   'json'     — JSON objects
--   'jsonl'    — JSON Lines (one record per line)
--   'delta'    — Delta Lake (schema evolution, time travel)
--   'avro'     — Apache Avro

FILE FORMAT CONFIGURATION

The output format is resolved in order of priority:

1. config(format='delta') in the model file — highest priority
2. format: on the profiles.yml output — project default
3. parquet — built-in default
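
The resolution order can be sketched as a small helper (hypothetical, for illustration only):

```python
def resolve_format(model_config: dict, profile_output: dict) -> str:
    """Resolve the write format using the priority order described above (sketch)."""
    # 1. config(format=...) in the model file wins
    if "format" in model_config:
        return model_config["format"]
    # 2. format: on the profiles.yml output is the project default
    if "format" in profile_output:
        return profile_output["format"]
    # 3. built-in default
    return "parquet"

print(resolve_format({"format": "csv"}, {"format": "parquet"}))  # → csv
print(resolve_format({}, {"format": "delta"}))                   # → delta
print(resolve_format({}, {}))                                    # → parquet
```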

FORMAT       VALUE    NOTES
Parquet      parquet  Columnar, compressed. Best for analytics. Default.
Delta Lake   delta    Schema evolution, time travel, ACID transactions.
CSV          csv      Universal compatibility. No schema enforcement.
JSON         json     Nested structures, human-readable.
JSON Lines   jsonl    One record per line. Streamable.
Avro         avro     Schema embedded, compact binary. Kafka-friendly.
IPC / Arrow  ipc      Arrow IPC format. Zero-copy reads.

INCREMENTAL EXTRACTION

Incremental models work with cloud storage sources. DVT resolves the watermark from the target database, extracts only changed rows, and merges them in the DuckDB cache.

-- models/staging/stg_sessions.sql
-- Incremental extraction from GCS

{{ config(
    materialized='incremental',
    unique_key='session_id'
) }}

SELECT session_id, user_id, started_at, ended_at
FROM {{ source('raw_gcs', 'app_sessions') }}
{% if is_incremental() %}
WHERE started_at > (SELECT MAX(started_at) FROM {{ this }})
{% endif %}
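
The is_incremental() branch above compiles down to a watermark filter. A rough sketch of that behavior (hypothetical helper, not DVT's actual implementation):

```python
def apply_watermark(query: str, column: str, watermark):
    """Append a watermark filter on incremental runs; full extraction otherwise."""
    if watermark is None:
        # First run: no target table exists yet, so extract everything
        return query
    # Subsequent runs: only rows newer than the target's high-water mark
    return f"{query}\nWHERE {column} > '{watermark}'"

base = "SELECT session_id, user_id, started_at FROM app_sessions"
print(apply_watermark(base, "started_at", None))                   # full query
print(apply_watermark(base, "started_at", "2024-01-01 00:00:00"))  # filtered query
```

Here the watermark value stands in for the result of the MAX(started_at) subquery that DVT resolves from the target database.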

NOTES

  • Uses Application Default Credentials — run `gcloud auth application-default login`
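
For non-interactive environments (CI, production), ADC also honors a service-account key file via the standard GOOGLE_APPLICATION_CREDENTIALS variable; a typical setup (the key path is a placeholder):

```shell
# Local development: interactive ADC login
gcloud auth application-default login

# CI / production: point ADC at a service-account key (placeholder path)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```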

DEBUGGING

Use dvt debug to test bucket connectivity. DVT verifies read/write access for each cloud storage connection.

dvt debug