2025 Learning Log
In keeping with my goal to do more learning this year, I’m allotting some time outside the daily grind for it.
My overarching goal is to explore popular data engineering tools such as dbt, and cloud technologies such as Azure Fabric and Snowflake. I haven’t worked in Databricks since 2021 so it’ll be a good opportunity to re-learn and catch up with the new developments.
Here are some notes and impressions of how the learning is coming along so far, written in reverse chronological order.
- I’ve moved the dbt notes here: dbt-notes
- There’s a separate post for ERDs and Mermaid 🧜♀️
25. 2025-02-27 (Thursday) - Data sampling and tasks in Snowflake
Data sampling
Data sampling is useful in development, when using an entire database may not be cost-effective or may take too much time.
There are two sampling methods:
- ROW or BERNOULLI method
- BLOCK OR SYSTEM method
Row | System |
---|---|
Every row is chosen with a percentage p | Every block (micro-partition) is chosen with a percentage p |
More “randomness” | More efficient, faster processing |
Smaller tables | Larger tables |
-- row sampling
CREATE OR REPLACE VIEW ADDRESS_SAMPLE
AS
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.CUSTOMER_ADDRESS
SAMPLE ROW (1) SEED(27); --sample 1% of the data; use `SEED` to reproduce
-- system sampling
CREATE OR REPLACE VIEW ADDRESS_SAMPLE
AS
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.CUSTOMER_ADDRESS
SAMPLE SYSTEM (1) SEED(23);
Tasks
Tasks are objects that store SQL commands. A task can only have one SQL statement. Standalone tasks can be created and scheduled for simple operations. For complex workflows, trees of tasks can be used.
A virtual warehouse where the task will run can be specified. It can also be omitted so that task execution is “serverless”, i.e. managed by Snowflake. Snowflake-managed execution costs 1.2x the normal usage because of overhead. Specify a warehouse if the load is predictable or stable; use serverless if the load is a bit unpredictable or the warehouse would otherwise be underutilized.
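As a minimal sketch of the serverless option (assuming the CUSTOMERS table created in the example below; USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE gives Snowflake a starting size hint for the managed compute):
-- Serverless task: no WAREHOUSE specified, compute is managed by Snowflake
CREATE OR REPLACE TASK CUSTOMER_INSERT_SERVERLESS
USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
SCHEDULE = '5 MINUTE'
AS
INSERT INTO CUSTOMERS(CREATE_DATE) VALUES(CURRENT_TIMESTAMP);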
-- Prepare table
CREATE OR REPLACE TABLE CUSTOMERS (
CUSTOMER_ID INT AUTOINCREMENT START = 1 INCREMENT =1,
FIRST_NAME VARCHAR(40) DEFAULT 'JENNIFER' ,
CREATE_DATE DATE);
-- Using an interval schedule
CREATE OR REPLACE TASK CUSTOMER_INSERT
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE' -- interval
AS
INSERT INTO CUSTOMERS(CREATE_DATE) VALUES(CURRENT_TIMESTAMP);
-- Using a CRON schedule
CREATE OR REPLACE TASK CUSTOMER_INSERT
WAREHOUSE = COMPUTE_WH
SCHEDULE = 'USING CRON 0 7,10 * * 5L UTC'
AS
INSERT INTO CUSTOMERS(CREATE_DATE) VALUES(CURRENT_TIMESTAMP);
Tasks can also be scheduled with conditions
-- this will never run
CREATE OR REPLACE TASK CUSTOMER_INSERT
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
WHEN 1 = 2
AS
INSERT INTO CUSTOMERS(CREATE_DATE, FIRST_NAME) VALUES(CURRENT_TIMESTAMP, 'MIKE');
-- this will run
CREATE OR REPLACE TASK CUSTOMER_INSERT2
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
WHEN 1 = 1
AS
INSERT INTO CUSTOMERS(CREATE_DATE, FIRST_NAME) VALUES(CURRENT_TIMESTAMP, 'DEBIKA');
A task can be created from a stored procedure.
CREATE OR REPLACE PROCEDURE CUSTOMERS_INSERT_PROCEDURE (CREATE_DATE varchar)
RETURNS STRING NOT NULL
LANGUAGE JAVASCRIPT
AS
$$
var sql_command = 'INSERT INTO CUSTOMERS(CREATE_DATE) VALUES(:1);'
snowflake.execute(
{
sqlText: sql_command,
binds: [CREATE_DATE]
});
return "Successfully executed.";
$$;
CREATE OR REPLACE TASK CUSTOMER_TASK_PROCEDURE
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
AS CALL CUSTOMERS_INSERT_PROCEDURE (CURRENT_TIMESTAMP);
When a task is created, its state is suspended by default.
SHOW TASKS;
ALTER TASK CUSTOMER_INSERT RESUME;
ALTER TASK CUSTOMER_INSERT SUSPEND;
-- A task can be dropped no matter the state
DROP TASK CUSTOMER_INSERT;
A tree of tasks means that some tasks have dependencies. A child task will only run once the parent task finishes. A tree of tasks always has a root task; the root task is the one being scheduled. A parent task can have up to 100 children, while a child task can only have one parent. A tree of tasks can have up to 1000 task members.

Suspend the root task first before modifying the tree including creating child tasks.
-- child task 1
CREATE OR REPLACE TASK CUSTOMER_INSERT2
WAREHOUSE = COMPUTE_WH
AFTER CUSTOMER_INSERT
AS
INSERT INTO CUSTOMERS2 SELECT * FROM CUSTOMERS;
-- child of child task
CREATE OR REPLACE TASK CUSTOMER_INSERT3
WAREHOUSE = COMPUTE_WH
AFTER CUSTOMER_INSERT2
AS
INSERT INTO CUSTOMERS3 (CUSTOMER_ID,FIRST_NAME,CREATE_DATE) SELECT * FROM CUSTOMERS2;
To run a tree of tasks, first resume the child tasks then the root task. To recursively resume a task and all its dependencies:
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE( 'CUSTOMER_INSERT' );
To get a history and troubleshoot task runs:
select *
from table(information_schema.task_history())
order by scheduled_time desc;
-- result in a given time
select *
from table(information_schema.task_history(
scheduled_time_range_start=>dateadd('hour',-4,current_timestamp()),
result_limit => 5,
task_name=>'CUSTOMER_INSERT'));
-- result in a given time period
select *
from table(information_schema.task_history(
scheduled_time_range_start=>to_timestamp_ltz('2025-02-27 04:34:00.486 -0800'),
scheduled_time_range_end=>to_timestamp_ltz('2025-02-27 05:34:00.486 -0800')));
24. 2025-02-26 (Wednesday) - Data sharing in Snowflake
Data sharing in Snowflake, like cloning and swapping, is also a metadata process - the data is not copied but a reference to the storage layer containing the data is created.
Moreover, since compute and storage are decoupled, sharing data with other Snowflake accounts lets the consumer use their own compute resources; the cost of compute is on the consumer while the storage cost is on the provider. For non-Snowflake consumers, however, the data provider shoulders the cost of compute.
A data provider can also be a data consumer even on the same share object. Shared data is read-only on consumer side. Changes by the provider to the shared data will be immediately reflected on the consumer account.
Data sharing is available on all Snowflake editions except Virtual Private.
What can be shared
- Tables
- External tables
- Secure views
- Secure materialized views
- Secure UDFs
Set up share
- Create a share in the provider account (ACCOUNTADMIN role or CREATE SHARE privilege required)
CREATE SHARE my_share;
- Grant privileges on the database, schema, and table to the share
GRANT USAGE ON DATABASE my_db TO SHARE my_share;
GRANT USAGE ON SCHEMA my_db.my_schema TO SHARE my_share;
GRANT SELECT ON TABLE my_db.my_schema.my_table TO SHARE my_share;
- SELECT privileges can also be granted on all tables in a schema or database
GRANT SELECT ON ALL TABLES IN SCHEMA OUR_FIRST_DB.PUBLIC TO SHARE COMPLETE_SCHEMA_SHARE;
GRANT SELECT ON ALL TABLES IN DATABASE OUR_FIRST_DB TO SHARE COMPLETE_SCHEMA_SHARE;
- Add consumer account(s); it is possible for a share to have zero consumer accounts
ALTER SHARE my_share ADD ACCOUNT = <account-id>;
- The consumer needs to import the share by creating a database from the shared object (ACCOUNTADMIN role or IMPORT SHARE / CREATE DATABASE privileges required)
CREATE DATABASE my_db FROM SHARE my_share;
- The consumer can then grant privileges on the imported database to other roles
A share object can contain only one database; sharing data from multiple databases requires a (secure) view. Multiple schemas and tables within that one database can be shared, however.
Data shares can also be set up and modified using the Snowflake UI. Go to Data Products > Private Sharing > Shared by your Account / Reader accounts.
A reader account needs to be set up by the data provider when sharing data with a non-Snowflake consumer. This account is managed by the data provider, and is created solely for the purpose of data sharing. Both storage and compute costs in this case are shouldered by the data provider.
---- PROVIDER ACCOUNT -----
-- Create a reader account
CREATE MANAGED ACCOUNT tech_joy_account
ADMIN_NAME = tech_joy_admin,
ADMIN_PASSWORD = 'set-pwd',
TYPE = READER;
-- Share the data --
ALTER SHARE ORDERS_SHARE
ADD ACCOUNT = <reader-account-id>;
---- READER ACCOUNT -----
-- Setup virtual warehouse
CREATE WAREHOUSE READ_WH WITH
WAREHOUSE_SIZE='X-SMALL'
AUTO_SUSPEND = 180
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = TRUE;
-- Create and set up users --
-- Create user
CREATE USER USER_NAME PASSWORD = '<password>';
-- Grant usage on warehouse
GRANT USAGE ON WAREHOUSE READ_WH TO ROLE PUBLIC;
-- Grant privileges on a Shared Database for other users
GRANT IMPORTED PRIVILEGES ON DATABASE DATA_SHARE_DB TO ROLE PUBLIC;
-- Show all shares, get the provider account here
SHOW SHARES;
-- See details on share
DESC SHARE <provider_account>.ORDERS_SHARE;
-- Create a database in consumer account using the share
CREATE DATABASE DATA_SHARE_DB FROM SHARE <provider_account>.ORDERS_SHARE;
-- Validate table access
SELECT * FROM DATA_SHARE_DB.PUBLIC.ORDERS;
For sharing views, it is recommended to share secure views, as a standard view can reveal its underlying definition via SHOW VIEWS.
To create a secure view (provider account)
-- Create a secure view
CREATE OR REPLACE SECURE VIEW SECURE_VIEW AS
<select statement>
Use a view to reference multiple databases. A dedicated database can also be created for sharing views that reference multiple databases. In addition to granting usage on the database and schema, grant REFERENCE_USAGE on the other databases used in the view:
-- also grant reference usage to other dbs
GRANT REFERENCE_USAGE ON DATABASE OTHER_DB TO SHARE VIEW_SHARE;
-- grant select privilege on the view
GRANT SELECT ON VIEW SECURE_VIEW TO SHARE VIEW_SHARE;
-- create a database from view
CREATE DATABASE VIEW_DB FROM SHARE <provider_account>.VIEW_SHARE;
Other sources of shared data in Snowflake
- Marketplace
  - This is where third-party databases can be accessed
  - Some are free, while others are paid
  - Can be imported by ACCOUNTADMIN or an account with IMPORT SHARE privileges
- Data Exchange
  - A private hub for sharing data
  - Membership is through invite only
  - This needs to be enabled by reaching out to Snowflake support
23. 2025-02-25 (Tuesday) - Table types in Snowflake, Zero-copy cloning, Swapping
Table types
The different table types differ in Time Travel and Fail-safe capabilities, and can be used to manage storage cost.
Permanent | Transient | Temporary |
---|---|---|
CREATE TABLE | CREATE TRANSIENT TABLE | CREATE TEMPORARY TABLE |
Time travel and Failsafe | Time travel up to 1 day, and no Failsafe | Time travel up to 1 day, and no Failsafe |
Persists after session | Persists after session | Does not persist after session |
For important data that needs full data protection | For large data that does not need full data protection | For testing, or data that does not need to persist |
-- test_name is either a transient or temporary table
-- This will not work
ALTER TABLE test_name
SET DATA_RETENTION_TIME_IN_DAYS = 2; -- only 0 or 1 will work
No naming conflict occurs when a temporary table has the same name as a permanent or transient table; within the session, the temporary table takes precedence and the transient or permanent table is effectively hidden. Creating a permanent table after a transient table of the same name supersedes the transient table, and creating a transient table of the same name supersedes the permanent table. The temporary table persists until the session is closed.
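A minimal sketch of this precedence behaviour (table and column names are hypothetical):
-- Permanent table
CREATE OR REPLACE TABLE PRODUCTS (ID INT, NAME STRING);
-- Temporary table with the same name; for this session it hides the permanent table
CREATE OR REPLACE TEMPORARY TABLE PRODUCTS (ID INT, NAME STRING, NOTE STRING);
-- Resolves to the temporary table for the rest of the session
SELECT * FROM PRODUCTS;
-- Dropping the temporary table makes the permanent table visible again
DROP TABLE PRODUCTS;
SELECT * FROM PRODUCTS;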
Transient databases or schemas can be created, but not temporary databases or schemas
CREATE TRANSIENT DATABASE transient_db;
CREATE TRANSIENT SCHEMA transient_schema;
Objects in a permanent database can be permanent, transient, or temporary tables. Objects in a transient database can be either transient or temporary tables, but not permanent tables.
Zero-copy cloning
Zero-copy means the data isn’t really copied, just the metadata. It is a metadata operation handled by Snowflake’s cloud services layer. No additional storage cost is incurred unless updates, such as modifying data or adding data to the clone, are carried out.
CREATE TABLE table_name_clone
CLONE table_name;
Zero-copy cloning is commonly done on databases, schemas, and tables but can also be applied to other objects (such as stream, file format, sequence, external stages, pipes for external stages, tasks, and database roles).
Clones are snapshots of the source, and are independent objects. Sources are not affected if clones are modified. Likewise, clones are not affected if sources are modified. Clones are useful for development purposes. Cloning a database or schema will recursively clone contained objects.
Other considerations:
- Cloning does not include the history of the source, but the clone will have its own history.
- Cloning a temporary table or a transient table to a permanent table is not possible
- Cloning a permanent table to a temporary or transient table is possible
- Can clone within the same database, and between different databases.
More information in docs.
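For example, cloning at the database or schema level recursively clones the contained objects (database and schema names here are placeholders):
-- Clone a whole database (schemas, tables, etc. are cloned recursively)
CREATE DATABASE DEV_DB CLONE PROD_DB;
-- Clone a single schema within a database
CREATE SCHEMA PROD_DB.DEV_SCHEMA CLONE PROD_DB.PUBLIC;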
A clone can be created from a time travel point. A clone can also be cloned.
CREATE OR REPLACE TABLE time_travel_clone
CLONE time_travel BEFORE (statement => '<query_id>');
Swap
Swapping is also a metadata operation, and is useful when switching development to production.
ALTER TABLE table_name ...
SWAP WITH target_table_name;
ALTER SCHEMA schema_name ...
SWAP WITH target_schema_name;
In this case, after swapping, table_name would have the (meta)data of target_table_name and target_table_name would have the (meta)data of table_name. It is essentially a name swap.
Swapping a permanent or transient table with a temporary table is not allowed as this may result in a name conflict when the temporary table is renamed as a permanent or transient table.
22. 2025-02-24 (Monday) - Snowflake Time travel and fail safe
Time travel
Time travel lets users query deleted or updated data, and restore tables, schemas, and databases. The maximum retention period for time travel depends on the Snowflake edition; however, unless set, the default is 1 day.
Edition | Time travel period up to n days |
---|---|
Standard | 1 |
Enterprise | 90 |
Virtual Private | 90 |
There are several ways to do time travel, using:
- TIMESTAMP
- OFFSET in seconds
- QUERY_ID
SELECT * FROM table AT (TIMESTAMP => timestamp);
SELECT * FROM table AT(OFFSET => -10*60);
SELECT * FROM table BEFORE(STATEMENT => <query_id>);
Tables, schemas, and databases can be undropped. UNDROP will fail if there is an existing table with the same name. If multiple tables of the same name are dropped, the restore takes effect for the latest table only. Ownership privileges are needed to restore objects.
UNDROP TABLE table_name;
UNDROP SCHEMA schema_name;
UNDROP DATABASE database_name;
DO NOT restore an existing table by replacing it with a time-travel query (CREATE OR REPLACE) - this deletes the table's historical query ids. Instead, create a backup table with the restored data, then truncate the target table and insert into it from the backup table. The target table will retain its historical query ids.
-- Bad method
CREATE OR REPLACE TABLE DB_NAME.public.test as
SELECT * FROM DB_NAME.public.test before (statement => '<query_id>');
SELECT * FROM DB_NAME.public.test;
-- Good method
-- Create a back up table
CREATE OR REPLACE TABLE DB_NAME.public.test_backup as
SELECT * FROM DB_NAME.public.test before (statement => '<query_id>');
-- Truncate and insert into target table from backup table
TRUNCATE DB_NAME.public.test;
INSERT INTO DB_NAME.public.test
SELECT * FROM DB_NAME.public.test_backup;
SELECT * FROM DB_NAME.public.test ;
A larger retention time results in higher cost. Retention time can be set per table. A retention time of zero (0) on a table means the table cannot be recovered with Time travel; it goes straight into Fail safe.
-- Contains table info including retention period
SHOW TABLES LIKE '%CUSTOMERS%';
-- Update retention period
ALTER TABLE DB_NAME.PUBLIC.CUSTOMERS
SET DATA_RETENTION_TIME_IN_DAYS = 2;
-- Specify table retention period on creation
CREATE OR REPLACE TABLE DB_NAME.public.ret_example (
id int,
first_name string,
last_name string,
email string,
gender string,
Job string,
Phone string)
DATA_RETENTION_TIME_IN_DAYS = 3;
Fail safe
Fail safe ensures protection of historical data in case of disaster. This is a non-configurable 7-day period in which permanent tables can be restored. It starts immediately after the retention period is over, and requires reaching out to Snowflake support. It also contributes to storage cost.
Fail safe and Time travel storage can be viewed in Snowflake UI: Admin > Cost Management, choose the Snowflake account name, usage type: Storage, group by: Type

Storage can also be queried from Snowflake database.
-- Storage usage on account level
SELECT USAGE_DATE,
STORAGE_BYTES / (1024*1024*1024) AS STORAGE_GB,
STAGE_BYTES / (1024*1024*1024) AS STAGE_GB,
FAILSAFE_BYTES / (1024*1024*1024) AS FAILSAFE_GB
FROM SNOWFLAKE.ACCOUNT_USAGE.STORAGE_USAGE ORDER BY USAGE_DATE DESC;
-- Storage usage on table level
SELECT ID,
TABLE_NAME,
TABLE_SCHEMA,
ACTIVE_BYTES / (1024*1024*1024) AS STORAGE_USED_GB,
TIME_TRAVEL_BYTES / (1024*1024*1024) AS TIME_TRAVEL_STORAGE_USED_GB,
FAILSAFE_BYTES / (1024*1024*1024) AS FAILSAFE_STORAGE_USED_GB
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
ORDER BY FAILSAFE_STORAGE_USED_GB DESC;
21. 2025-02-21 (Friday) - Snowflake Snowpipe
Snowpipe enables automatically loading files from a cloud provider bucket or container into a Snowflake table. It is suitable for continuous loading rather than batch loading. (For batch loading, the COPY command is more suitable.)
General steps (AWS S3):
- In Snowflake, create a stage with the storage integration object and an appropriate file format object (see the sketch after this list)
- Create a table
- Create a pipe object, and see its description with DESC
- Test the COPY command independently to make sure it works
CREATE OR REPLACE pipe MANAGE_DB.pipes.pipe_name
auto_ingest = TRUE
AS
COPY INTO DB_NAME.PUBLIC.table_name
FROM @MANAGE_DB.external_stages.csv_folder;
// Describe pipe
DESC pipe pipe_name;
- In the DESC output, take note of (copy) the notification_channel value; AWS S3 will be set up with this value
- In AWS S3, configure the bucket by adding an event notification with the appropriate role
  - Add the notification_channel value to the SQS queue ARN
- Upload a file to the S3 bucket, and check the Snowflake table to verify the load
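A minimal sketch of the prerequisite objects from step 1 (the storage integration s3_int, the schemas, the bucket path, and the file format settings are assumptions for illustration):
-- File format and stage using an existing storage integration
CREATE OR REPLACE FILE FORMAT MANAGE_DB.file_formats.csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1;
CREATE OR REPLACE STAGE MANAGE_DB.external_stages.csv_folder
URL = 's3://<your-bucket-name>/<your-path>/'
STORAGE_INTEGRATION = s3_int
FILE_FORMAT = MANAGE_DB.file_formats.csv_format;
-- Test the COPY command independently before wrapping it in the pipe
COPY INTO DB_NAME.PUBLIC.table_name
FROM @MANAGE_DB.external_stages.csv_folder;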
For error handling, query which files have been loaded, and the respective errors in Snowpipe or the COPY command
// see what files have been processed
ALTER PIPE pipe_name refresh;
// Validate pipe is actually working
SELECT SYSTEM$PIPE_STATUS('pipe_name');
// Snowpipe error message
SELECT * FROM TABLE(VALIDATE_PIPE_LOAD(
PIPE_NAME => 'MANAGE_DB.pipes.pipe_name',
START_TIME => DATEADD(HOUR,-2,CURRENT_TIMESTAMP())));
// COPY command history from table to see error message
SELECT * FROM TABLE (INFORMATION_SCHEMA.COPY_HISTORY(
table_name => 'DB_NAME.PUBLIC.TABLE_NAME',
START_TIME =>DATEADD(HOUR,-2,CURRENT_TIMESTAMP())));
In Azure, aside from the equivalent steps above to grant appropriate permissions to the Snowflake app, a storage queue needs to be set up, as well as a notification integration object in Snowflake.
CREATE OR REPLACE NOTIFICATION INTEGRATION snowpipe_event
ENABLED = true
TYPE = QUEUE
NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
AZURE_STORAGE_QUEUE_PRIMARY_URI = 'https://<your-storage-account>.queue.core.windows.net/<your-queue-name>'
AZURE_TENANT_ID = '<your-tenant-id>';
20. 2025-02-20 (Thursday) - Loading data from AWS, Azure, and GCP into Snowflake
General steps:
- In the cloud provider, create a storage account. There are file-transfer savings when selecting the same region as Snowflake.
- Create a bucket or container
- Upload the file
- In Snowflake, set up a storage integration object
create or replace storage integration s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = '' -- Value from AWS Role
STORAGE_ALLOWED_LOCATIONS = ('s3://<your-bucket-name>/<your-path>/', 's3://<your-bucket-name>/<your-path>/')
COMMENT = 'This an optional comment';
DESC integration s3_int;
- In the cloud provider, set the Policies / Permissions and add the appropriate role for the Snowflake object. The values for these are returned by DESC.
To load the file into Snowflake:
- Create a file format
- Create a stage referencing the storage integration object
create or replace file format demo_db.public.fileformat_azure
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1; -- try 0 first to check if there is a header or not
create or replace stage demo_db.public.stage_azure
STORAGE_INTEGRATION = azure_integration
URL = '<azure://storageaccountname.blob.core.windows.net/containername>'
FILE_FORMAT = fileformat_azure;
LIST @demo_db.public.stage_azure;
- Create a table
- Load the data into the table from stage using the COPY command
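A minimal sketch of these last two steps, reusing the stage and file format created above (the table name and columns are made up):
-- Create a destination table (columns are illustrative)
CREATE OR REPLACE TABLE demo_db.public.customers (
ID INT,
FIRST_NAME VARCHAR,
LAST_NAME VARCHAR,
EMAIL VARCHAR);
-- Load from the stage with the COPY command
COPY INTO demo_db.public.customers
FROM @demo_db.public.stage_azure
FILE_FORMAT = (FORMAT_NAME = demo_db.public.fileformat_azure);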
It’s also possible to unload data from Snowflake to the cloud provider. The default unload is compressed with an automatically selected compression type, but it’s also possible to specify how to compress the data.
-- create storage object
-- create file format
-- create stage (e.g. @stage_gcp) referencing storage object and file format
COPY INTO @stage_gcp
FROM TABLE_NAME;
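A sketch of unloading with an explicitly specified compression and a header row (the stage path, table name, and option values here are illustrative):
-- Unload as gzip-compressed CSV files with a header row
COPY INTO @stage_gcp/unload/
FROM TABLE_NAME
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
HEADER = TRUE;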
19. 2025-02-19 (Wednesday) - Performance considerations in Snowflake, scaling up/down, scaling out, caching, and cluster keys
Many performance optimizations are managed automatically in Snowflake. However, there are still designs and best practices that users can do. These include
- Assigning appropriate data types
- Sizing virtual warehouses
- Taking advantage of caching
- Customizing cluster keys
Dedicated virtual warehouse
Have separate virtual warehouses according to different workloads or user groups e.g. dedicated virtual warehouses for BI team, Data Science team, and Marketing department
Consider:
- Not too many warehouses such that underutilization occurs
- Refine classifications as work patterns change
Scaling up vs Scaling Out
Scaling up | Scaling out |
---|---|
Increasing the size of virtual warehouses | Using additional warehouses / multi-cluster warehouses of the same size |
More complex query | More concurrent users / queries |
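A minimal sketch of both operations on a hypothetical warehouse (multi-cluster settings require at least the Enterprise edition):
-- Scale up: increase the warehouse size for more complex queries
ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = 'LARGE';
-- Scale out: allow the warehouse to spin up additional clusters for concurrency
ALTER WAREHOUSE COMPUTE_WH SET
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3
SCALING_POLICY = 'STANDARD';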
Caching
- If a query is executed twice, results are cached and can be re-used
- Speeds up the queries
- Results are cached for 24 hours or until underlying data has changed
- Ensure that similar queries go on the same warehouse
- Running the same query on the same data on the same warehouse will be faster than running it for the first time
- Group same usage patterns together
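A small way to see the result cache in action (the sample table is an assumption; USE_CACHED_RESULT is the session parameter that controls result caching):
-- Run a query twice: the second run is served from the result cache
SELECT COUNT(*), AVG(C_BIRTH_YEAR) FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.CUSTOMER;
-- Disable the result cache for the session to compare timings
ALTER SESSION SET USE_CACHED_RESULT = FALSE;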
Cluster key
- A subset of columns used to co-locate the data in micro-partitions
- For large tables, this improves the scan efficiency in queries
- Snowflake automatically maintains these cluster keys
- In general, Snowflake produces well-clustered tables
- Cluster keys are not always ideal and can change over time
- Manually customize these cluster keys
How to cluster
- Columns that are used most frequently in WHERE clauses (often date columns for event tables)
- If typically use filters on two columns then the table can also benefit from two cluster keys
- Column that is frequently used in Joins
- Large enough number of distinct values to enable effective pruning
- Small enough number of distinct values to allow effective grouping
CREATE TABLE TABLE_NAME ... CLUSTER BY (COLUMN_1 [, COLUMN_2]);
CREATE TABLE TABLE_NAME ... CLUSTER BY (EXPRESSION);
ALTER TABLE TABLE_NAME CLUSTER BY (EXPRESSION_1 [, EXPRESSION_2]);
ALTER TABLE TABLE_NAME DROP CLUSTERING KEY;
Partition pruning - based on the micro-partition metadata from cluster keys, partitions that contain no data of interest are eliminated from the scan.
18. 2025-02-18 (Tuesday) - Loading unstructured data into Snowflake
Steps to load unstructured data (such as JSON or Parquet) into Snowflake:
- Create stage, define file format
- Load raw data (type VARIANT can handle all kinds of unstructured data)
- Analyse & Parse
- Flatten & Load
Load JSON files
To load JSON file(s) into a Raw Table, create a stage, define a JSON file format, create a raw data table, then copy the raw JSON into the raw data table
USE DB_NAME.SCHEMA_NAME;
-- Load Raw JSON
CREATE OR REPLACE stage JSONSTAGE
url='s3://somejsonfiles';
LIST @JSONSTAGE;
CREATE OR REPLACE FILE FORMAT JSONFORMAT
TYPE = JSON;
-- Create table containing one column with type Variant
CREATE OR REPLACE TABLE JSON_RAW (
RAW_FILE VARIANT
);
-- Load staged JSON file into table using COPY command
COPY INTO JSON_RAW
FROM @JSONSTAGE
file_format= JSONFORMAT
files = ('data.json');
SELECT * FROM JSON_RAW;
Parse the raw JSON data using colon notation (:). Cast to a type using double colons (::). Arrays are indexed from 0. JSON keys are case-sensitive (the result is null when keys are not found).
SELECT
RAW_FILE:id::int as id,
RAW_FILE:first_name::STRING as first_name,
RAW_FILE:last_name::STRING as last_name,
RAW_FILE:gender::STRING as gender,
-- nested data
RAW_FILE:job.salary::INT as salary,
RAW_FILE:job.title::STRING as title,
-- array
RAW_FILE:prev_company[0]::STRING as prev_company
FROM JSON_RAW;
Use the TABLE() function to flatten hierarchical data
select
RAW_FILE:first_name::STRING as First_name,
f.value:language::STRING as "Language",
f.value:level::STRING as Level_spoken
from JSON_RAW,
table(flatten(RAW_FILE:spoken_languages)) f;
To load into a table, use CREATE TABLE AS or INSERT INTO
-- Create table
CREATE OR REPLACE TABLE Languages AS
select
RAW_FILE:first_name::STRING as First_name,
f.value:language::STRING as "Language",
f.value:level::STRING as Level_spoken
from JSON_RAW,
table(flatten(RAW_FILE:spoken_languages)) f;
-- Insert into. Table needs to be created beforehand
INSERT INTO Languages
select
RAW_FILE:first_name::STRING as First_name,
f.value:language::STRING as "Language",
f.value:level::STRING as Level_spoken
from JSON_RAW,
table(flatten(RAW_FILE:spoken_languages)) f;
Load Parquet files
Parquet files can be loaded from Stage
Create a stage and Parquet file format
-- Create file format and stage object
CREATE OR REPLACE FILE FORMAT PARQUET_FORMAT
TYPE = 'parquet';
CREATE OR REPLACE STAGE PARQUETSTAGE
url = 's3://someparquetfiles'
FILE_FORMAT = PARQUET_FORMAT;
-- Preview the data
LIST @PARQUETSTAGE;
SELECT * FROM @PARQUETSTAGE;
Query using colon notation
-- Query parquet from staging
SELECT *
FROM @PARQUETSTAGE
(file_format => PARQUET_FORMAT);
SELECT
$1:__index_level_0__::int as index_level,
$1:cat_id::VARCHAR(50) as category,
DATE($1:date::int ) as Date,
$1:"dept_id"::VARCHAR(50) as Dept_ID,
$1:"id"::VARCHAR(50) as ID,
$1:"item_id"::VARCHAR(50) as Item_ID,
$1:"state_id"::VARCHAR(50) as State_ID,
$1:"store_id"::VARCHAR(50) as Store_ID,
$1:"value"::int as value
FROM @PARQUETSTAGE;
Load into a table from Stage using COPY command.
-- Create table
CREATE OR REPLACE TABLE PARQUET_DATA (
ROW_NUMBER int,
index_level int,
cat_id VARCHAR(50),
date date,
dept_id VARCHAR(50),
id VARCHAR(50),
item_id VARCHAR(50),
state_id VARCHAR(50),
store_id VARCHAR(50),
value int,
Load_date timestamp default TO_TIMESTAMP_NTZ(current_timestamp));
-- Load using COPY command
COPY INTO DB_NAME.PUBLIC.PARQUET_DATA
FROM (SELECT
METADATA$FILE_ROW_NUMBER,
$1:__index_level_0__::int,
$1:cat_id::VARCHAR(50),
DATE($1:date::int ),
$1:"dept_id"::VARCHAR(50),
$1:"id"::VARCHAR(50),
$1:"item_id"::VARCHAR(50),
$1:"state_id"::VARCHAR(50),
$1:"store_id"::VARCHAR(50),
$1:"value"::int,
TO_TIMESTAMP_NTZ(current_timestamp)
FROM @PARQUETSTAGE
);
SELECT * FROM PARQUET_DATA;
17. 2025-02-17 (Monday) - Snowflake Copy options, rejected records, load history
Copy options modify the behaviour of the COPY INTO command.
General form:
COPY INTO TABLE_NAME
FROM @STAGE_NAME
FILES = ('file_name1', ...)
CopyOptions
VALIDATION_MODE
- validates the data files before loading them
VALIDATION_MODE = RETURN_n_ROWS | RETURN_ERRORS
SIZE_LIMIT
- Specifies the maximum size (in bytes) of data loaded in the command
- Combined file sizes of all files
- The first file will be loaded regardless; subsequent files are loaded until the limit is reached (the last loaded file can exceed the limit); files after that are not loaded
SIZE_LIMIT = num
RETURN_FAILED_ONLY
- Specifies whether to return only files that have failed to load in the statement result
RETURN_FAILED_ONLY = TRUE | FALSE
TRUNCATECOLUMNS
- Specifies whether to truncate text strings that exceed the target column length
- TRUE - strings are automatically truncated to the target column length
- FALSE - COPY produces an error if a loaded string exceeds the target column length
TRUNCATECOLUMNS = TRUE | FALSE
FORCE
- Specifies to load all files, regardless of whether they’ve been loaded previously and have not changed since they were loaded
- Note that this option reloads files, potentially duplicating data in a table
FORCE = TRUE | FALSE
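A sketch combining several of these copy options in one command (the pattern and limit values are illustrative, reusing the @aws_stage_copy stage from the examples below):
COPY INTO COPY_DB.PUBLIC.ORDERS
FROM @aws_stage_copy
file_format = (type = csv field_delimiter=',' skip_header=1)
pattern = '.*Order.*'
SIZE_LIMIT = 20000 -- stop picking up new files once ~20 KB have been loaded
RETURN_FAILED_ONLY = TRUE
TRUNCATECOLUMNS = TRUE
FORCE = FALSE;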
Rejected records can be queried, further processed, and saved to a table
---- 1) With VALIDATION_MODE ----
COPY INTO COPY_DB.PUBLIC.ORDERS
FROM @aws_stage_copy
file_format= (type = csv field_delimiter=',' skip_header=1)
pattern='.*Order.*'
VALIDATION_MODE = RETURN_ERRORS;
-- Storing rejected /failed results in a table
CREATE OR REPLACE TABLE rejected AS
select rejected_record from table(result_scan(last_query_id()));
select rejected_record from table(result_scan('01ba740e-030c-57ce-000b-8ec3000360e2'));
SELECT * FROM REJECTED;
---- 2) Without VALIDATION_MODE ----
COPY INTO COPY_DB.PUBLIC.ORDERS
FROM @aws_stage_copy
file_format= (type = csv field_delimiter=',' skip_header=1)
pattern='.*Order.*'
ON_ERROR=CONTINUE;
select * from table(validate(orders, job_id => '_last'));
--- can be processed further
SELECT REJECTED_RECORD FROM rejected;
CREATE OR REPLACE TABLE rejected_values as
SELECT
SPLIT_PART(rejected_record,',',1) as ORDER_ID,
SPLIT_PART(rejected_record,',',2) as AMOUNT,
SPLIT_PART(rejected_record,',',3) as PROFIT,
SPLIT_PART(rejected_record,',',4) as QUANTITY,
SPLIT_PART(rejected_record,',',5) as CATEGORY,
SPLIT_PART(rejected_record,',',6) as SUBCATEGORY
FROM rejected;
Load history can be queried from the database's INFORMATION_SCHEMA view or from the SNOWFLAKE database.
-- Query load history within a database --
USE COPY_DB;
SELECT * FROM information_schema.load_history;
-- Query load history globally from SNOWFLAKE database --
SELECT * FROM snowflake.account_usage.load_history;
16. 2025-02-13 (Thursday) - Snowflake COPY command, transformation, file format object
There are two main methods of loading data into Snowflake:
- Bulk - most frequent method; uses warehouse, COPY command
- Continuous - for small volumes of data; automatic once data is added to stage; Snowpipe (serverless feature)
Stages are objects in Snowflake that pertain to the location of data files. Not to be confused with staging tables in data warehouses.
There are two types of Stages
- External - from cloud provider
- Internal - from Snowflake-managed location
Example of Stage
-- Create external stage
CREATE OR REPLACE STAGE MANAGE_DB.external_stages.aws_stage
url='s3://somefilefolder'
credentials=(aws_key_id='ABCD_DUMMY_ID' aws_secret_key='1234abcd_key');
-- Describe properties
DESC STAGE MANAGE_DB.external_stages.aws_stage;
-- List files in stage
LIST @aws_stage
Example of COPY command
COPY INTO DB_NAME.PUBLIC.ORDERS
FROM @MANAGE_DB.external_stages.aws_stage
file_format= (type = csv field_delimiter=',' skip_header=1)
pattern='.*Order.*';
-- Error handling using the ON_ERROR
COPY INTO DB_NAME.PUBLIC.ORDERS_EX
FROM @MANAGE_DB.external_stages.aws_stage_errorex
file_format= (type = csv field_delimiter=',' skip_header=1)
files = ('OrderDetails_error.csv','OrderDetails_error2.csv')
ON_ERROR = 'SKIP_FILE';
-- ON_ERROR = { CONTINUE | SKIP_FILE | SKIP_FILE_<num> | 'SKIP_FILE_<num>%' | ABORT_STATEMENT }
Example of Transformation
COPY INTO DB_NAME.PUBLIC.ORDERS_EX
FROM (select
s.$1,
s.$2,
s.$3,
CASE WHEN CAST(s.$3 as int) < 0 THEN 'not profitable' ELSE 'profitable' END
from @MANAGE_DB.external_stages.aws_stage s)
file_format= (type = csv field_delimiter=',' skip_header=1)
files=('OrderDetails.csv');
Example of file format
CREATE OR REPLACE file format my_file_format
TYPE=CSV
FIELD_DELIMITER = '|'
SKIP_HEADER = 1;
--Describe file format
DESC FILE FORMAT my_file_format;
--Using file format object in Copy command
COPY INTO CUSTOMERS
FROM @aws_stage
file_format= (FORMAT_NAME= EXERCISE_DB.PUBLIC.my_file_format);
15. 2025-02-12 (Wednesday) - Snowflake editions, pricing and cost monitoring, roles
The different Snowflake editions are:
- Standard
- Enterprise
- Business-Critical
- Virtual Private
which have increasing data security, time travel days, failover support, but also cost. Virtual Private is an edition that needs to be coordinated directly with Snowflake.
Pricing can be broken down into Compute, Storage, and Data transfer costs. There is also a Cloud Services cost for behind-the-scenes cloud tasks; this is usually minimal and is only charged if cloud services consumption exceeds 10% of the warehouse consumption. Data transfer is charged on egress; ingress is free (which is usually the case for cloud providers).
The Snowflake UI has a Cost Management page accessible to the ACCOUNTADMIN. In it, there is a Resource Monitor feature wherein a quota per cycle (e.g. 10 credits per month) can be set. Actions can be taken depending on the percentage of quota consumed which can exceed 100%. Actions include: 1) notify, 2) suspend and notify (which waits for queries to finish before suspending), and 3) suspend immediately and notify.
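Resource monitors can also be created with SQL as ACCOUNTADMIN; a minimal sketch with a hypothetical 10-credit monthly quota and the three action types:
CREATE OR REPLACE RESOURCE MONITOR LEARNING_MONITOR
WITH CREDIT_QUOTA = 10
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 75 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND
ON 110 PERCENT DO SUSPEND_IMMEDIATE;
-- Attach the monitor to a warehouse
ALTER WAREHOUSE COMPUTE_WH SET RESOURCE_MONITOR = LEARNING_MONITOR;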
Lastly, the different roles were discussed along with their respective responsibilities
- ORGADMIN - manages multiple accounts
- ACCOUNTADMIN - manages all objects in the account
- SECURITYADMIN - manages grant privileges
- SYSADMIN - creates warehouses, databases, and other objects, and grants privileges on them
- USERADMIN - user and role management
- PUBLIC - automatically granted to new users by default
- Custom - assigned under SYSADMIN so the latter can manage the objects created by Custom
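A minimal sketch of that custom-role pattern (role, warehouse, database, and user names are made up): create the role, grant it to SYSADMIN so SYSADMIN can manage the objects it creates, then grant privileges and assign it to a user.
-- As USERADMIN (or SECURITYADMIN)
CREATE ROLE MARKETING_ANALYST;
GRANT ROLE MARKETING_ANALYST TO ROLE SYSADMIN;
-- As SYSADMIN: give the role access to compute and a database
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE MARKETING_ANALYST;
GRANT USAGE ON DATABASE OUR_FIRST_DB TO ROLE MARKETING_ANALYST;
-- Assign the role to a user
GRANT ROLE MARKETING_ANALYST TO USER SOME_USER;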
14. 2025-02-11 (Tuesday) - Snowflake setup, architecture overview, loading data
Taking a step towards my goals this year, I’ve started doing a deep-dive into Snowflake through learning about it in this course Snowflake Masterclass.
Today, I’ve set up a Business Critical Snowflake account with AWS, had a tour of the interface, and created a virtual warehouse, database, and schema using both the UI and SQL. I’ve also created a table, and loaded data from an S3 bucket.
There were two assignments:
- For the first one, I’ve created a virtual warehouse using SQL with a custom auto suspend time specified
- For the second one, I’ve created a new database and table, and loaded a customers CSV file from S3 bucket into that table.
My main learnings for today are:
- Snowflake consists of 3 layers: Storage, Query Processing, and Cloud Services
- Warehouse sizes range from XS to 6XL, each consisting of an increasing number of servers and costing an increasing number of credits per hour; increase the size based on the amount of compute required, e.g. for complex computations
- Multi-clustering uses more than one warehouse of the same size in a cluster (e.g. S-S-S); increase the cluster count based on the workload, e.g. an increased number of queries or users
- Cost is based on credits spent. The monetary cost of each credit depends on the cloud provider and region. Credits are spent based on uptime and the number of servers.
13. 2025-02-10 (Monday) - streamlit
I’ve checked out Streamlit, a Python library for creating web apps. Unlike other libraries or frameworks like Django or even Flask, Streamlit is able to spin up a web app fast using simple syntax. It is especially useful for data science and machine learning projects.
It is designed for quickly creating a data-driven web application. I’m not clear if it’s “production-quality” and opinions seem to be divided and depend on requirements or use-case.
e.g.
import streamlit as st
import numpy as np
import pandas as pd
st.title('A Sample Streamlit App')
st.markdown('## This is a line chart')
chart_data = pd.DataFrame(
np.random.randn(20, 3),
columns=['a', 'b', 'c'])
st.line_chart(chart_data)
st.markdown('## This is a table')
st.dataframe(chart_data)

More features in the Docs
12. 2025-02-09 (Sunday) - ERD, Mermaid
Today, I reviewed ERDs and revisited Mermaid 🧜♀️
11. 2025-02-06 (Thursday) - docker, dbt-duckdb, Duckdb resources
docker
I wanted to check Docker 🐳 and see if I can try and create a container for a data pipeline. Well, that is the goal but since I don’t use Docker these days, I needed to reacquaint myself with it first.
These are some of the intros I found
- The intro to Docker I wish I had when I started
- Learn Docker in 7 Easy Steps - Full Beginner’s Tutorial
- Containerize Python Applications with Docker
An example of a Dockerfile:
FROM python:3.9
ADD main.py .
RUN pip install scikit-learn
CMD ["python", "./main.py"]
.dockerignore specifies the files or paths that are excluded when copying to the container
Ideally, there is only one process per container
docker-compose.yml - for running multiple containers at the same time
version: '3'
services:
web:
build: .
ports:
- "8080:8080"
db:
image: "mysql"
environment:
MYSQL_ROOT_PASSWORD: password
volumes:
- db-data:/foo
volumes:
db-data:
docker-compose up to run all the containers together
docker-compose down to shut down all containers
Will revisit this topic in the succeeding days.
dbt-duckdb
Here is the repo for the adapter: dbt-duckdb
Installation should be pip3 install dbt-duckdb
Duckdb resources
Putting these resources here:
- Duckdb tutorial for beginners
- YT Channel @motherduckdb
I just realized that this feature solves some of my challenging tasks: since DuckDB is able to read a CSV and execute SQL queries on it, there’s no need to open a raw CSV just to check the sum of columns, for example. The computation is in-memory by default too, so there’s no need to persist a database for quick analysis.
SELECT * FROM read_csv_auto('path/to/your/file.csv');
or just using the terminal
$ duckdb -c "SELECT * FROM read_parquet('path/to/your/file.parquet');"
10. 2025-02-05 (Wednesday) - dbt-Fabric, Fireducks
dbt-Fabric
I was looking around for how to integrate dbt with Azure, and I found these resources.
The process looks straightforward:
- Install the adapter in a virtual environment with Python 3.7 and up: pip install dbt-fabric
- Make sure to have the Microsoft ODBC Driver for SQL Server installed
- Add an existing Fabric warehouse
- Set up the dbt profile in the home directory
- Connect to the Azure warehouse, do the authentication
- Check the connections
- Then the dbt project can be built
Aside from Fabric, dbt also has integrations with other Azure data platforms:
- Synapse
- Data Factory
- SQL Server (SQL Server 2017, SQL Server 2019, SQL Server 2022 and Azure SQL Database)
Here is the docs for other dbt core platform connections
Fireducks
There’s another duck in the data space: Fireducks. It’s not related to DuckDB; instead, it allows existing code using pandas to be more performant, and it’s fully compatible with the pandas API.
This means there’s zero learning cost, as all that’s needed is to replace pandas with fireducks and the code should be good to go
# import pandas as pd
import fireducks.pandas as pd
or via terminal (no need to change the import)
python3 -m fireducks.pandas main.py
Whereas for polars, the code would need to be rewritten; polars code is closer to PySpark.
e.g.
import pandas as pd
import polars as pl
from pyspark.sql.functions import col
# load the file ...
#
# Filter rows where 'column1' is greater than 10
# pandas
filtered_df = df[df['column1'] > 10]
# polars
filtered_df = df.filter(pl.col('column1') > 10)
# PySpark
filtered_df = df.filter(col('column1') > 10)
Here’s a comparison of the performance: Pandas vs. FireDucks Performance Comparison. It surpassed both polars and pandas.
9: 2025-02-04 (Tuesday) - Snowflake, Duckdb, Isaac Asimov Books
I’m taking a break from learning dbt, and switched focus to learning about databases: Snowflake, and Duckdb. Also, I took a bit of a break so that I can catch up on reading Isaac Asimov’s series.
Snowflake virtual warehouse
I found this free introduction on Udemy: Snowflake Datawarehouse & Cloud Analytics - Introduction. This may not be the best resource out there, though, as it hasn’t been updated. However, I learned about provisioning a virtual warehouse in Snowflake with the following example commands.
Virtual warehouse
- collection of compute resources (CPUs and allocated memory)
- needed to query data from snowflake, and load data into snowflake
- autoscaling is available in enterprise version but not in standard version
Create a virtual warehouse
CREATE WAREHOUSE TRAINING_WH
WITH
WAREHOUSE_SIZE = XSMALL --size
AUTO_SUSPEND = 60 -- if idle for 60s will be suspended automatically
AUTO_RESUME = TRUE -- if a user executes a query, the warehouse automatically resumes on its own without having to be manually restarted
INITIALLY_SUSPENDED = TRUE
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300
STATEMENT_TIMEOUT_IN_SECONDS = 600;
Alternatively, use the user interface as well
Create database
CREATE DATABASE SALES_DB
DATA_RETENTION_TIME_IN_DAYS = 0
COMMENT = 'Ecommerce sales info';
Create schema
create schema Sales_Data;
create schema Sales_Views;
create schema Sales_Stage;
Use the warehouse
USE WAREHOUSE TRAINING_WH;
USE DATABASE SALES_DB;
USE SCHEMA Sales_Data;
Command to find the current environment
SELECT CURRENT_DATABASE(), CURRENT_SCHEMA(), CURRENT_WAREHOUSE();
Here’s a demo of Snowflake features. I learned that Snowflake has its own data marketplace, and also has a dashboard feature.
Here are other resources for Snowflake
Duckdb overview
Another database that I’ve been hearing about is DuckDB, and I’m honestly very interested in this one as it is lightweight and open-source. It can run on a laptop, and can process GBs of data fast.
It’s a file-based database, which reminds me of SQLite, except that DuckDB is mostly for analytical purposes (OLAP) rather than transactional (OLTP). It utilizes vectorized execution as opposed to tuple-at-a-time execution (row-based databases) or column-at-a-time execution (column-based databases). It sits somewhere between row-based and column-based in that the data is processed by column but in batches at a time. Because of this, it is more memory-efficient than column-at-a-time execution (e.g. pandas).
Here is Gabor Szarnyas’ presentation about Duckdb which talks in detail about Duckdb capabilities.
DuckDB isn’t a one-to-one comparison with Snowflake, though, as it can only scale by memory and is not distributed (nor is it comparable with Apache Spark, for that matter). It also runs locally. A counterpart to this is MotherDuck, which is a cloud data warehousing solution built on top of DuckDB (kind of like dbt Cloud to dbt Core).
As a side note, I was delighted to learn that DuckDB has a SQL command to exclude columns! (Lol I know but you have no idea how cumbersome it is to write all 20+ columns only to exclude a few :p)
example from Duckdb Snippets
// This will select all information about ducks
// except their height and weight
SELECT * EXCLUDE (height, weight) FROM ducks;
Whereas the top Stack Overflow solution for this is
/* Get the data into a temp table */
SELECT * INTO #TempTable
FROM YourTable
/* Drop the columns that are not needed */
ALTER TABLE #TempTable
DROP COLUMN ColumnToDrop
/* Get results and drop temp table */
SELECT * FROM #TempTable
DROP TABLE #TempTable
I know explicitly writing column names is for “contracts” and exclude isn’t very production quality but it would be immensely useful in CTEs where the source columns have already been defined previously.
Okay end of side note :p
Do you think it’s normal to fan over a database? No pun intended. : ))
Three Laws of Robotics
I’m currently reading Isaac Asimov books. I’m on Book 2 (Foundation and Empire) of the Foundation Series. I kind of wanted to read the books before I watch Apple TV’s Foundation but I realized the series is totally different from the books. It was like preparing for an exam only to get entirely out-of-scope questions (which has happened too many times before :p)
Anyway, the Three Laws of Robotics (aka Asimov’s Laws) didn’t originate in the Foundation Books (I just really want to talk about it :p ) but in one of his short stories in the I, Robot collection.
The Three Laws of Robotics state:
- A robot may not injure a human being or, through inaction, allow a human being to come to harm.
- A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
- A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
The Three Laws of Robotics originally came from a science-fiction story in the 1940’s but it’s amazing how forward-looking it is. I’m not really into science-fiction books (I’d rather not mix the two :p) but I’m pulled into Asimov’s world nonetheless.
8: 2025-02-02 (Sun) - dbt
dbt - certifications and finished the course
I’ve finished the course today 🎉.
I’ve also moved the dbt notes into another post to keep the log tidier.
The last section is an interview about the official dbt certifications. I’m still not sure about doing the certification at this point (I kind of want to get more hands-on time with the tool first) but I like what the interviewee said about doing certifications - you learn a lot more about the tool, and faster, compared to just using it every day. For me, I do get a lot of gains from studying for certs. If I didn’t have that stock knowledge in the first place, it would have been a lot harder to think of other, better ways of approaching a problem. Mostly, my qualm about doing certs is how expensive they are! :p
7: 2025-02-01 (Sat) - dbt
dbt - Advanced Power User dbt core, introducing dbt to the company
Today isn’t as technical as the last couple of days. I’ve covered more features of the Power User for dbt core extension powered with AI, and tips on introducing dbt to the company.
The course is wrapping up as well, and I only have one full section left about certifications.
I learned about
- Advanced Power User dbt core
- Introducing dbt to the company
6: 2025-01-31 (Fri) - dbt
dbt - variables, and dagster
Today I’ve covered dbt variables and orchestration with dagster. It was my first time setting up dagster. I actually liked dagster because the integration with dbt is tight. I was a bit overwhelmed though with all the coding at the backend to setup the orchestration. It might get easier if I take a deeper look at it. For now, it seems like a good tool to use with dbt.
5: 2025-01-30 (Thu) - dbt
dbt - great expectations, debugging, and logging
I’ve covered these topics today: dbt-great-expectations, debugging, and logging.
The dbt-great-expectations package, though not really a port of the Python package, contains many tests that are useful for checking the model. I’m glad ge was adapted into dbt as it’s also one of the popular data testing tools in Python.
4: 2025-01-29 (Wed) - dbt
dbt - snapshots, tests, macros, packages, docs, analyses, hooks, exposures.
Today was pretty full. I’ve covered dbt snapshots, tests, macros, third-party packages, documentation, analyses, hooks, and exposures. Not sure how I completed all of these today but these are pretty important components of dbt. I’m amazed about the documentation support in this tool, and snapshot is another feature in which I say “where has this been all my life?” To think that SCD is that straight-forward in dbt.
3: 2025-01-26 (Sun) - dbt
dbt - seeds, and source
Today, I’ve covered dbt seeds, and source. At first I thought - why did they need to relabel these CSV files and raw tables? The terminologies were a bit confusing but I guess I should just get used to these.
2: 2025-01-25 (Sat) - dbt, books, Python package, Copilot
dbt - models, and materialization
I’ve covered dbt models, and materialization. These are pretty core topics in dbt - because dbt is all about models!
Books for LLM, and Polars
Here are some materials I came upon today
LLM Engineer’s Handbook [Amazon] [Github]
- a resource when I get around to exploring LLMs
Polars Cookbook [Amazon]
pandera
pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust. Dataframes contain information that pandera explicitly validates at runtime. - [Docs]
Here’s an example from the docs. It’s interesting because it seems like a lighter version of Great Expectations wherein the data can be further validated using ranges and other conditions. It’s powerful for dataframe validations.
import pandas as pd
import pandera as pa
# data to validate
df = pd.DataFrame({
"column1": [1, 4, 0, 10, 9],
"column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
"column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})
# define schema
schema = pa.DataFrameSchema({
"column1": pa.Column(int, checks=pa.Check.le(10)),
"column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
"column3": pa.Column(str, checks=[
pa.Check.str_startswith("value_"),
# define custom checks as functions that take a series as input and
# outputs a boolean or boolean Series
pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
]),
})
validated_df = schema(df)
print(validated_df)
In Pydantic, something like this can also be used
from pydantic import BaseModel, conint, confloat
class Product(BaseModel):
quantity: conint(ge=1, le=100) # Validates that quantity is between 1 and 100
price: confloat(gt=0, le=1000) # Validates that price is greater than 0 and less than or equal to 1000
product = Product(quantity=10, price=500)
print(product)
Copilot installation, an update
The VS Code plugin that I was trying to install yesterday is working now and I am able to access chat on the side panel as well as see the prompts on screen. Not really sure what fixed it but it could be the reboot of my computer.
1: 2025-01-24 (Fri) - dbt, Copilot
dbt - introduction, and setting up
I’m going through the dbt course in Udemy The Complete dbt (Data Build Tool) Bootcamp: Zero to Hero. I’ve setup a dbt project and created a Snowflake account.
It was between this and dbt Learn platform - I might go back to that for review later. Lecturers worked in Databricks and co-founded dbt Learn so I decided to do this course first - and in the process, subtract from the ever growing number of Udemy courses that I haven’t finished :p
Github Copilot
As an aside, I got distracted because the lecturer’s VS Code has Copilot enabled so I tried to setup mine. The free version is supposed to be one click and a Github authentication away but for some reason it’s buggy in my IDE. Leaving it alone for now.
0: 2024-11-16 (Sat) to 2025-01-08 (Wed) - DataExpert Data Engineering Bootcamp
DE Bootcamp
I finished Zach’s free YouTube Data Engineering bootcamp (DataExpert.io), which started in November last year and will run until February 7 (the deadline was extended from the last day of January).
The topics covered were:
- Dimensional Data Modeling
- Fact Data Modeling
- Apache Spark Fundamentals
- Applying Analytical Patterns
- Real-time pipelines with Flink and Kafka
- Data Visualization and Impact
- Data Pipeline Maintenance
- KPIs and Experimentation
- Data Quality Patterns
Zach Wilson did a good job of explaining the topics (I’m also very impressed with how well he can explain the labs while writing code without missing a beat). The Data Expert community was also an incredible lot, as some of the setup and homeworks were complicated without prior exposure.
It was a challenging 6 weeks of my life with lectures, labs, and homeworks so much so that there was some lingering emptiness when my schedule freed up as I finished the bootcamp. I’m glad I went through it and it’s a good jumping off point for my learning goal this year.
Sharing the link to my certification.