dbt Notes

I’ve recently created a learning log for this year (2025). The first topic I wanted to go over is dbt (data build tool). Here are some of my notes; I’ll add more materials and notes about dbt here as I go along.

To start, I’ve taken a Udemy course The Complete dbt (Data Build Tool) Bootcamp: Zero to Hero as an introduction to dbt. Here is the repo where I’ve been doing the hands-on work throughout the course: dbtlearn. I started the course on Jan 24 and finished on Feb 2 (sharing my certificate). I’m hoping these notes will be a good reference in the future.

dbt and Snowflake setup

Here’s what I did for the setup:

  • Created a Snowflake trial account. It’s good for 30 days with $400 credit.
  • Set up the Snowflake warehouse and permissions for dbt, and ingested Airbnb data hosted in AWS S3 into the raw tables.
  • On my local machine, installed Python 3.12.7 (I needed to downgrade from Python 3.13.0 as it’s not yet supported by dbt). I also installed virtualenv using Homebrew, but realized I could have just installed it with pip.
  • Installed dbt-snowflake in the virtual environment (always use a Python virtual environment!)
  • Created a dbt user directory: mkdir ~/.dbt
  • Initialized the dbt project (dbt init dbtlearn) and connected it to the Snowflake account (a sketch of the generated profile is shown after this list). I also learned to just delete the project directory when deleting a project (I made a configuration mistake! :p)
  • Updated dbt_project.yml to remove examples; deleted examples from the project directory
  • Installed dbt plugin “Power User for dbt core” in VS Code and set it up.
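
For reference, the Snowflake connection that dbt init creates lives in ~/.dbt/profiles.yml. Here’s a minimal sketch of what that profile roughly looks like for this project (the account, role, warehouse, and schema values are placeholders, not the exact ones from the course):

dbtlearn:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <snowflake_account_identifier>
      user: dbt
      password: <password>
      role: TRANSFORM
      database: AIRBNB
      warehouse: COMPUTE_WH
      schema: DEV
      threads: 1

Running dbt debug from the project directory verifies that this profile can connect.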

Models and materialization

tables-graph

Here’s the final outcome in the data warehouse

tables-in-wh

The project YAML file has been modified to set these defaults; the + sign before materialized marks it as a dbt keyword and not a directory:

models:
  dbtlearn:
    +materialized: view
    dim:
      +materialized: table
    src:
      +materialized: ephemeral

The following configuration was added to the two dim models so they materialize as views instead of tables, overriding the default set in the project YAML.

{{
    config(
        materialized = 'view'
    )
}}

Using the dbt plugin was also convenient: when testing SQL, I only needed to run it within the editor and it returns the query results (no need to switch to Snowflake’s interface). It can also be used to build models through the GUI.

tables-graph

dbt models are SQL definitions that can be materialized (or not) into tables or views. They are stored as SQL files under dbtlearn/models/. They consist of select statements but can also contain other features such as semantic dependencies on other models, scripts, and macros.

dbtlearn/target/run/dbtlearn/models contains the final compiled models. Files here can be investigated when troubleshooting.

CTEs (common table expressions) are widely used and recommended as they are easy to maintain and make complex queries readable. They are similar to views but are not stored.
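
To illustrate, here’s a minimal sketch of a model built around a CTE (the column names and the cleansing rule are assumptions based on the Airbnb example, not the exact course code):

with src_listings as (
    select * from {{ ref('src_listings') }}
)
select
    listing_id,
    listing_name,
    -- old listings may have minimum_nights = 0; treat those as 1
    case when minimum_nights = 0 then 1 else minimum_nights end as minimum_nights
from src_listings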

Commands to build models include:

  • dbt run - materializes the models
  • dbt run --full-refresh - rebuilds all tables including incremental tables

There are four (4) types of materialization in dbt:

  • view - use for lightweight representations and models that aren’t reused often; don’t use when the model is read several times
  • table - use when the model is read repeatedly; don’t use for single-use models or incremental loads
  • incremental - use for fact and event tables where rows are appended; don’t use when historical records need to be updated
  • ephemeral - no materialization, e.g. when you just want an alias to a column; don’t use when the model is read several times

Here’s an example of a dbt model using an incremental load, and Jinja

{{
    config(
        materialized = 'incremental',
        on_schema_change = 'fail'
    )
}}
with src_reviews as (
    select * from {{ ref('src_reviews') }}
)

select * from src_reviews
where review_text is not null

{% if is_incremental() %}
    and review_date > (select max(review_date) from {{this}})
{% endif %}

When switching from a materialized model to an ephemeral model, delete the tables manually in the data warehouse as dbt won’t delete the tables or views itself.

Seeds

  • local files (e.g. CSV) that get uploaded to the warehouse using dbt
  • contained in this path dbtlearn/seeds e.g. dbtlearn/seeds/seed_full_moon_dates.csv
  • in the project YAML file, the paths of seeds are defined as
seed-paths: ["seeds"]
  • the dbt seed command loads the seeds into Snowflake, resulting in tables (Snowflake can infer the column schema from the file)
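
Once loaded, a seed can be referenced from a model with ref() like any other model. Here’s a hypothetical sketch of joining reviews to the full-moon seed (the column names and the one-day offset are assumptions, not the exact course code):

with full_moon_dates as (
    select * from {{ ref('seed_full_moon_dates') }}
),
reviews as (
    select * from {{ ref('fct_reviews') }}
)
select
    r.*,
    -- flag reviews written the night after a full moon
    case when fm.full_moon_date is null then 'not full moon' else 'full moon' end as is_full_moon
from reviews r
left join full_moon_dates fm
    on to_date(r.review_date) = dateadd(day, 1, fm.full_moon_date)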

Source

  • an abstraction layer that sits on top of the input (raw) tables, e.g. the models defined in dbtlearn/models/src/src_*.sql
  • contains extra features such as checking for data freshness
  • to define sources, create a separate YAML file under the models directory, e.g. dbtlearn/models/sources.yml

e.g.

version: 2

sources:
    - name: airbnb
      schema: raw
      tables:
        - name: listings
          identifier: raw_listings

        - name: hosts
          identifier: raw_hosts

        - name: reviews
          identifier: raw_reviews
          loaded_at_field: date
          freshness:
            warn_after: {count: 1, period: hour}
            error_after: {count: 24, period: hour}
  • in models/src/src_listings.sql, instead of directly referencing the Snowflake table as AIRBNB.RAW.RAW_LISTINGS, it can now be abstracted to
    • {{ source('airbnb', 'listings') }} using the names listed in the source YAML file

e.g.

with raw_listings as (
select * from {{ source('airbnb', 'listings') }}
)

-- continue code here
  • dbt compile checks template tags and connections; dbt goes through models, YAML files, tests, etc. and checks that all references are correct

Source freshness

  • in the source YAML file, these lines under reviews define freshness
            loaded_at_field: date
            freshness:
              warn_after: {count: 1, period: hour}
              error_after: {count: 24, period: hour}
    
  • this means
    • give a warning if there is no data fresher than one hour
    • return an error if there is no new data within 24 hours
  • dbt source freshness is the command to use to check the freshness

Snapshots

  • Snapshots in dbt implement Type 2 SCD (slowly changing dimensions)
  • The path is defined in the project YAML file, and snapshots consist of SQL files
    snapshot-paths: ["snapshots"]
    
  • example code dbtlearn/snapshots/scd_raw_listings.sql
{% snapshot scd_raw_listings %}

{{
    config(
        target_schema = 'dev',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes = True
    )
}}

select * from {{ source('airbnb', 'listings') }}

{% endsnapshot %}
  • config line defines how and where to create the snapshot; in this example it is created as a new table in Snowflake AIRBNB.DEV.SCD_RAW_LISTINGS
  • invalidate_hard_deletes = True means that if a record is deleted from the source, dbt will set its dbt_valid_to to the snapshot date; if False, the delete will not be captured
  • the table is materialized with additional columns
  • dbt snapshot takes the snapshots
snapshot-table
  • as records are updated in the raw table, the SCD table captures the changes as new rows with the dbt_valid_to fields updated
snapshot-table

Tests

  • There are two kinds of tests in dbt
    • singular
      • SQL queries stored in the tests folder
      • the expected result set is empty
    • generic
      • built-in tests from the core installation of dbt
        • unique
        • not_null
        • accepted_values
        • relationships - takes a column value and makes sure it is a valid reference to records in another table
      • custom tests and tests imported from dbt packages
  • unique and not_null constraints are built into traditional warehouse databases, but may not be present in newer platforms like data lakes or lakehouses
  • dbt test runs the tests

Singular tests

  • SQL in tests folder e.g. dbtlearn/tests/dim_listings_minimum_nights.sql
  • example
select
    *
from
    {{ ref('dim_listings_cleansed') }}
where minimum_nights < 1 
limit 10
  • dbt test --help brings up the docs
  • dbt test --select dim_listings_cleansed - runs the tests associated with the dim_listings_cleansed model only

Generic tests

  • define schema.yml under models folder
    • name does not matter; can be one file or several files
    • can be used for extra configurations, tests, documentation
version: 2

models:
  - name: dim_listings_cleansed
    columns:

      - name: listing_id
        tests:
          - unique
          - not_null

      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleaned')
              field: host_id

      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt',
                        'Private room',
                        'Shared room',
                        'Hotel room']

Custom generic tests

  • macros with special signature
  • example dbtlearn/macros/positive_value.sql
    {% test positive_value(model, column_name) %}
    select
      *
    from
      {{ model }}
    where
      {{ column_name }} < 1
    {% endtest %}
    
  • add these in schema.yml under model
version: 2
models:
  - name: dim_listings_cleansed
    columns:

      - <other tests here>
	
      - name: minimum_nights
        tests:
          - positive_value

Macros

  • Jinja templates created in the macros folder
  • can be used in model definitions and tests
    • test is a special macro for implementing generic tests
  • example: dbtlearn/macros/no_nulls_in_columns.sql
    • the - in the for-loop tag (-%}) means strip the surrounding whitespace
{% macro no_nulls_in_columns(model) %}
    select * from {{ model }} where
    {% for col in adapter.get_columns_in_relation(model) -%}
        {{ col.column}} is null or
    {% endfor %}
    FALSE
{% endmacro %}
  • use in dbtlearn/tests/no_nulls_in_dim_listings.sql
    {{ no_nulls_in_columns(ref('dim_listings_cleansed')) }}
    

Third-party packages

  • extend dbt using dbt packages; allows access to a plethora of macros and tests
  • go to hub.getdbt.com
  • example dbt-utils > generate_surrogate_key
    • “cross-database way”: if the field values are the same in two or more instances of the database (even different types of databases), the generated hash will be the same

to install:

  • create a file in the dbt project root directory called dbtlearn/packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  • run dbt deps to install
    • might not work behind firewall
    • workaround: go to the GitHub repository of the package and copy it manually into the dbt packages folder dbtlearn/dbt_packages
  • example of usage: dbtlearn/models/fct/fct_reviews.sql
{{
    config(
        materialized = 'incremental',
        on_schema_change = 'fail'
    )
}}
with src_reviews as (
    select * from {{ ref('src_reviews') }}
)

select 
{{ dbt_utils.generate_surrogate_key(['listing_id', 'review_date', 'reviewer_name', 'review_text'])}} as review_id,
* from src_reviews
where review_text is not null

{% if is_incremental() %}
    and review_date > (select max(review_date) from {{this}})
{% endif %}
  • run dbt run --select fct_reviews --full-refresh - specify a full refresh so that the run doesn’t error out on the schema change (on_schema_change is set to 'fail')
  • review_id has been added to the table in Snowflake
surrogate-key

Documentation

  • Should be kept as close to the actual code as possible
  • Defined in two ways
    • in YAML files (e.g. schema.yml)
    • in stand alone markdown files
  • dbt compiles this documentation into an HTML site that can be hosted on your own server
  • otherwise, dbt ships with a lightweight documentation web server
  • overview.md is a special file that can be used to customize landing pages
  • dbt can also be configured to store assets like downloadables and images in a special folder that can be referenced in a markdown file

Basic documentation

  • update schema.yml to add descriptions
version: 2

models:
  - name: dim_listings_cleansed
    description: Cleansed table which contains Airbnb listings.
    columns:

      - name: listing_id
        description: Primary key for the listing
        tests:
          - unique
          - not_null
  • run dbt docs generate to generate docs at target folder
    • catalog will be written to dbtlearn/target/catalog.json
    • html in dbtlearn/target/index.html
  • run dbt docs serve to render the documentation in the built-in dbt webserver
dbt-docs

Markdown-based docs, custom overview page, and assets

  • create a markdown file e.g. in the models directory dbtlearn/models/docs.md
{% docs dim_listing_cleansed__minimum_nights %}
Minimum number of nights required to rent this property.

Keep in mind that old listings might have `minimum_nights` set to 0 in the source tables. Our cleansing algorithm updates this to `1`.

{% enddocs %}
  • add to description in schema.yml
      - name: minimum_nights
        description: '{{ doc("dim_listing_cleansed__minimum_nights") }}'
        tests:
          - positive_value
  • can also create an overview doc in dbtlearn/models/overview.md, e.g.
{% docs __overview__ %}
# Airbnb pipeline

Hey, welcome to our Airbnb pipeline documentation!

Here is the schema of our input data:
![input schema](https://dbtlearn.s3.us-east-2.amazonaws.com/input_schema.png)

{% enddocs %}
  • to reference assets locally, add a folder named assets in the project root
    • make sure to define this path in the project YAML file (dbt_project.yml)
asset-paths: ["assets"]
  • replace the url path in the markdown file e.g.
{% docs __overview__ %}
# Airbnb pipeline

Hey, welcome to our Airbnb pipeline documentation!

Here is the schema of our input data:
![input schema](assets/input_schema.png)

{% enddocs %}
  • running dbt docs generate will copy the assets into the target folder, e.g. target/assets/input_schema.png

Lineage graph (data flow DAG)

  • the lineage graph is accessible in the docs UI
dbt-lineage-graph
  • can set filter in the UI e.g.
    • --select +src_hosts+
      • the + preceding src_hosts - include everything src_hosts depends on (upstream)
      • the + after src_hosts - include everything that builds on src_hosts (downstream)
    • --exclude dim_listings_w_hosts+
      • hide dim_listings_w_hosts and everything that depends on it
  • these filters can also be used in other dbt commands (see dbt run --help), e.g.
    • e.g. dbt run --select src_hosts+
      • src_hosts and every model that builds from src_hosts will execute
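
The same selectors work directly on the command line; a couple of illustrative invocations:

# src_hosts plus everything upstream and downstream of it
dbt run --select +src_hosts+

# everything except dim_listings_w_hosts and the models that depend on it
dbt run --exclude dim_listings_w_hosts+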

Analyses

  • analyses let you run a query using dbt syntax and references without necessarily creating a model
  • they are version controlled and executed against the data warehouse
  • e.g. dbtlearn/analyses/full_moon_no_sleep.sql
with mart_fullmoon_reviews as (
    select * from {{ ref('mart_fullmoon_reviews') }}
)

select
    is_full_moon,
    review_sentiment,
    count(*) as reviews
from
    mart_fullmoon_reviews
group by
    is_full_moon,
    review_sentiment
order by
    is_full_moon,
    review_sentiment
  • run dbt compile
  • the compiled SQL will be written to the target folder, e.g. dbtlearn/target/compiled/dbtlearn/analyses/full_moon_no_sleep.sql, which can be run in the data warehouse

Hooks

  • SQL statements that are executed at predefined times
  • can be defined at the project, models directory, or individual model level
  • there are four types of hooks (a run-level example is sketched at the end of this section)
    • on_run_start: executed at the start of dbt {run, seed, snapshot}
    • on_run_end: executed at the end of dbt {run, seed, snapshot}
    • pre-hook: executed before a model/seed/snapshot is built
    • post-hook: executed after a model/seed/snapshot is built

e.g. add a REPORTER role in Snowflake; initially this role does not have access to the dbt views and tables

  • in the project YAML, add the hook under models
    • this will run for all models; {{ this }} resolves to the current model
    +post-hook:
      - "GRANT SELECT ON {{ this }} TO ROLE REPORTER"
  • run dbt run to execute the models
  • then go to Snowflake to check that the reporter role has select permission on all tables and views from dbt
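
Run-level hooks can also be declared at the top level of the project YAML file; here’s a minimal sketch (the SQL statements are illustrative, not from the course):

on-run-start:
  - "create schema if not exists {{ target.schema }}"

on-run-end:
  - "grant usage on schema {{ target.schema }} to role reporter"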

Exposure

  • configurations that can point to external resources like reports and dashboards
  • can be integrated and compiled into the documentation

  • e.g. create a yaml file dbtlearn/models/dashboards.yml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
version: 2

exposures:
  - name: executive_dashboard
    label: Executive Dashboard
    type: dashboard
    maturity: low
    url: <url to dashboard here>
    description: Executive Dashboard about Airbnb listings and hosts
      

    depends_on:
      - ref('dim_listings_w_hosts')
      - ref('mart_fullmoon_reviews')

    owner:
      name: M
      email: email@example.com
  • run dbt docs generate then dbt docs serve
  • the exposures also appear in the lineage graph

dbt-expectations

Great Expectations

  • data testing framework
  • provides a bunch of functions that can be applied to data

dbt-expectations

  • not a port of Great Expectations in dbt but inspired by it
  • see Available tests
  • to install, add the package to packages.yml and run dbt deps
packages:
  - package: calogica/dbt_expectations
    version: 0.10.1

examples

  • comparing row counts between models
    • add to schema.yml
    • run dbt test --select dim_listings_w_hosts to only run test for this model
  - name: dim_listings_w_hosts
    tests:
      - dbt_expectations.expect_table_row_count_to_equal_other_table:
          compare_model: source('airbnb', 'listings')
  • looking for outliers in your data
    columns:
      - name: price
        tests:
        - dbt_expectations.expect_column_quantile_values_to_be_between:
            quantile: .99
            min_value: 50
            max_value: 500
  • implementing test warnings
    • note that config can be set on any test, not just this one
        - dbt_expectations.expect_column_max_to_be_between:
            max_value: 5000
            config:
              severity: warn
  • validating column types
    • uses the types of the backend (in this case Snowflake)
        - dbt_expectations.expect_column_values_to_be_of_type:
            column_type: number
  • monitoring categorical variables in the source data
    • add to sources.yml file
    • to run a test for a source, run dbt test --select source:airbnb.listings
      • also runs tests that affect the source table e.g.
        • dbt_expectations_source_expect_column_distinct_count_to_equal_airbnb_listings_room_type__4
        • dbt_expectations_expect_table_row_count_to_equal_other_table_dim_listings_w_hosts_source_airbnb_listings_
version: 2

sources:
    - name: airbnb
      schema: raw
      tables:
        - name: listings
          identifier: raw_listings
          columns:
            - name: room_type
              tests:
                - dbt_expectations.expect_column_distinct_count_to_equal:
                    value: 4

Debugging

Debugging dbt tests

  • dbt --debug test --select source:airbnb.listings
    • in debug mode you will see all the SQL that dbt runs against the data warehouse in the terminal, which can get overwhelming
    • or check the compiled SQL in the target folder target/compiled/dbtlearn/models/sources.yml/dbt_expectations_source_expect_a60b59a84fbc4577a11df360c50013bb.sql
      • this can be executed in Snowflake for further debugging

Debugging and working with regular expressions

  • e.g. regex for price $90.00
    • $ and . need to be escaped for the regex, then escaped for YAML, then escaped again so the backslashes actually render in Snowflake; this happens when there are many interconnected technologies
    • use the compiled SQL command in targets directory to debug the regex in Snowflake
sources:
    - name: airbnb
      schema: raw
      tables:
        - name: listings
          identifier: raw_listings
          columns:
            - name: price
              tests: 
                - dbt_expectations.expect_column_values_to_match_regex:
                    regex: "^\\\\$[0-9][0-9\\\\.]+$"

Logging

  • can log messages to dbt’s standard log file, or the terminal, or disable a log message temporarily

Logging to the dbt log file

  • create a macro dbtlearn/macros/logging.sql
  • run the macro dbt run-operation learn_logging
  • this will be logged in the logging file dbtlearn/logs/dbt.log as [debug]
  • can add this Jinja command anywhere e.g. in the middle of a SQL command
{% macro learn_logging() %}
    {{ log("This is a logging message from the macro") }}
{% endmacro %}
dbt-logging-debug

Logging to screen

  • add a second parameter info=True
  • message will appear on the screen, and also in the dbt log file as [info]
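
The updated macro is just the earlier one with the extra parameter:

{% macro learn_logging() %}
    {{ log("This is a logging message from the macro", info=True) }}
{% endmacro %}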
21:08:00.947064 [info ] [MainThread]: This is a logging message from the macro
dbt-log-info

Disabling log messages

  • there are two layers of execution in dbt
    • when a macro is executed, the Jinja is rendered first
    • then the SQL produced from the Jinja is executed
  • so to comment out a log call, use a Jinja comment tag {# ... #} instead of the SQL comment tag --
    • -- {{ log("This is a logging message from the macro", info=True) }} - the SQL comment does not stop the Jinja from rendering, so this still logs
    • {# log("This is a logging message from the macro", info=True) #} - the Jinja comment disables the log
{% macro learn_logging() %}
   {# log("This is a logging message from the macro", info=True) #}
{% endmacro %}

Variables

There are two kinds

  • Jinja variables
    • define and use them in Jinja
  • dbt variables
    • dbt-specific variables, which can be passed to dbt through the command line or the project YAML

Jinja variables

  • can be defined using set
  • e.g. create a macro in dbtlearn/macros/variables.sql
{% macro learn_variables() %}

    {% set your_name_jinja = "FooBar" %}
    {{ log("Hello " ~ your_name_jinja, info=True) }}

{% endmacro %}

dbt variables

  • aka Project variables
  • variables that dbt manages for you
  • pass their values from the command line or project YAML
{% macro learn_variables() %}

    {{ log("Hello dbt user " ~ var("user_name") ~ "!", info=True)}}

{% endmacro %}
  • on the command line, specify the username: dbt run-operation learn_variables --vars "{user_name: foobar}"

Setting default values

  • a default can be passed as the second argument of var(), but it would need to be repeated wherever the variable is used
{% macro learn_variables() %}
    {{ log("Hello dbt user " ~ var("user_name", "NO USERNAME IS SET") ~ "!", info=True)}}

{% endmacro %}
  • another way is to define the default value in project YAML file
    • this project-level value is used (overriding the default in the macro) if no user_name is passed on the command line
vars:
  user_name: default_user_name_for_this_project

Using date ranges to make incremental models production-ready

  • it is standard practice to parameterize date ranges for backfills
  • check variable existence with {% if var("start_date", False) and var("end_date", False) %}
  • update dbtlearn/models/fct/fct_reviews.sql below
{{
    config(
        materialized = 'incremental',
        on_schema_change = 'fail'
    )
}}
with src_reviews as (
    select * from {{ ref('src_reviews') }}
)

select 
{{ dbt_utils.generate_surrogate_key(['listing_id', 'review_date', 'reviewer_name', 'review_text'])}} as review_id,
* from src_reviews
WHERE review_text is not null
{% if is_incremental() %}
  {% if var("start_date", False) and var("end_date", False) %}
    {{ log('Loading ' ~ this ~ ' incrementally (start_date: ' ~ var("start_date") ~ ', end_date: ' ~ var("end_date") ~ ')', info=True) }}
    AND review_date >= '{{ var("start_date") }}'
    AND review_date < '{{ var("end_date") }}'
  {% else %}
    AND review_date > (select max(review_date) from {{ this }})
    {{ log('Loading ' ~ this ~ ' incrementally (all missing dates)', info=True)}}
  {% endif %}
{% endif %}
  • on a full refresh, or on an incremental load without start_date and end_date parameters, proceed with the naive implementation, i.e. load from src to the fct model the reviews whose text is not null (and, when incremental, whose review_date is after the current max review_date)
  • on an incremental load with start_date and end_date set, add the extra condition that review_date falls between start_date and end_date

dbt run --select fct_reviews

12:53:32  Loading AIRBNB.DEV.fct_reviews incrementally (all missing dates)

dbt run --select fct_reviews --vars '{start_date: "2024-02-15 00:00:00", end_date: "2024-03-15 23:59:59"}'

2:54:46  Loading AIRBNB.DEV.fct_reviews incrementally (start_date: 2024-02-15 00:00:00, end_date: 2024-03-15 23:59:59)
  • review compiled code dbtlearn/target/compiled/dbtlearn/models/fct/fct_reviews.sql and it would have the date ranges
-- code above
WHERE review_text is not null
    AND review_date >= '2024-02-15 00:00:00'
    AND review_date < '2024-03-15 23:59:59'
  

Orchestrating dbt with dagster

Integrate with other orchestration tools

  • Airflow
    • large community; many answers out there
    • installation is a bit difficult
    • does not really integrate well with dbt other than building the models
    • manages the whole dbt workflow as a unit
  • Prefect
    • more modern ETL tool
    • easy to install; very Pythonic
    • simple integration with dbt
    • doesn’t have full-fledged integration, although better than Airflow
  • Azure Data Factory
    • point and click
    • good product in itself but doesn’t really have tight integration with dbt
  • dbt Cloud
    • proprietary offering from dbt Labs
  • dagster
    • new generation scheduling tool

dagster

  • very tight dbt integration; understands dbt project well
  • data concept of dagster (asset) is similar to dbt (model)
  • great UI
  • easy to debug - will see exactly what dagster executed when you execute the workflow

dagster installation

  • create a requirements.txt and pip install -r requirements.txt into a virtual environment
dbt-snowflake
dagster-dbt
dagster-webserver
  • create a dagster project from existing dbt project

dagster-dbt project scaffold --project-name dbt_dagster_project --dbt-project-dir=dbtlearn

  • before launching dagster, make sure that dbt connections are ok and that the dependencies in dbt have been installed
cd dbtlearn
dbt debug
dbt deps
  • add a schedule to the dagster project in dbt_dagster_project/dbt_dagster_project/schedules.py by removing the comment markers around the schedule definition (sketched below)
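
After uncommenting, schedules.py looks roughly like this (the exact asset variable name generated by the scaffold may differ, so treat this as a sketch):

from dagster_dbt import build_schedule_from_dbt_selection

from .assets import dbtlearn_dbt_assets

schedules = [
    build_schedule_from_dbt_selection(
        [dbtlearn_dbt_assets],
        # job name, cron expression, and dbt selection below are illustrative defaults
        job_name="materialize_dbt_models",
        cron_schedule="0 0 * * *",
        dbt_select="fqn:*",
    ),
]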

  • start the dagster project

cd 'dbt_dagster_project'                                    
dagster dev
  • once server is running, go to http://127.0.0.1:3000/

  • manage dbt on dagster without touching dbt in production directly
  • dagster can also include data integration e.g. airbyte with minimal setup
  • can also connect a python based data source (from API)
dagster

Here are some resources about dagster

Advanced Power User for dbt core

Here are some advanced features of the Power User for dbt core extension.

To get the API key, sign up to Altimate.ai. The key is free with the free account sign-up. The UI will show the API key and instance name, which need to be added to the extension in the settings file in VS Code.

With the AI integration, the features now include:

  • Generate documentation
  • Generate dbt model from source definition or SQL
  • Column level lineage
dbt-column-lineage
  • Generate dbt tests
  • Check the health of the dbt project
  • Query explanations
  • dbt project governance using the open source python package datapilot-cli
  • Query translation between SQL dialects

Introducing dbt to the company

As for introducing dbt to the company, I find these tips from the interview interesting:

  • Approach it as an experiment rather than a business case. Experimentation means you’re going to try it out and see whether it’s a good fit or not. This also puts less pressure on the team and gives them time to learn the tool and find out if they like it – rather than introducing it as a business case wherein it’s presented as the solution.
  • Involve the business in workshops to give them opportunities to ask questions, and see if the solution is able to give answers and/or spark discussions. It helps to have a data steward who can bridge the engineering team and the business domain.
  • Act in stages. Stage 1 is all about the business perspective - it needs to address a problem; if successful, go to Stage 2, the technical phase - how can dbt fit in the current stack, is additional infrastructure needed, are there added security measures or compliance issues, etc.
  • Work on a POC, e.g. migrate a data mart to dbt and evaluate the value from that; 3 months is a good timeline. It helps if the team can work with an external consultant proficient in dbt.

Characteristics of dbt that make it easy to recommend

  • Python ecosystem; lightweight, easy to put into production (containerized) and integrate into CI/CD
  • Lineage and documentation as first-class citizens in dbt make it easy to own the pipelines (ownership involves being able to answer questions about where the data came from, how it’s transformed, and where it was published)
  • The modular approach breaks down large stored procedures into interconnected models, which make good points of discussion (about the business logic) and help with maintenance
  • dbt abstracts the technicalities like parallelization, dependency of models, optimization of execution, so developers can focus on the logic and model building

dbt Certifications

There are currently two exams: one for analytics engineers, and one for cloud administrators. Details here.

Some tips from the interview:

  • It was useful to do the Udemy course as an introduction to dbt.
  • It helps if you’ve practiced dbt for a while before taking the exam, although not impossible to pass without it.
  • Do a lot of practice questions.
  • Questions include multiple choice, scenario-based, git, and dbt flags.
  • Most questions are about dbt core not dbt cloud.

Resources

Footnotes

This post is licensed under CC BY 4.0 by the author.