2025 Learning Log

In keeping with my goal to do more learning this year, I’m allotting some time for it outside the daily grind.

My overarching goal is to explore popular data engineering tools such as dbt, and cloud technologies such as Microsoft Fabric and Snowflake. I haven’t worked in Databricks since 2021, so it’ll be a good opportunity to re-learn and catch up with the new developments.

Here are some notes and impressions of how the learning is coming along so far, written in reverse chronological order.

I’ve moved some notes to separate posts.

40. 2025-04-02 (Wednesday) - Apache Kafka

Topic: Spreading messages across partitions, partition Leader and Followers

39. 2025-03-31 (Monday) - Apache Kafka

Topics: Messages, Topics and partitions

38. 2025-03-27 (Thursday) - Apache Kafka

Topic: Kafka Topic

37. 2025-03-25 (Tuesday) - Apache Kafka

Topics: Apache Kafka, Broker, Zookeeper, Zookeeper ensemble, multiple Kafka clusters, and default ports for Zookeeper and broker

36. 2025-03-24 (Monday) - Apache Kafka

Topic: Producing and consuming messages

35. 2025-03-20 (Thursday) - Apache Kafka

I started learning Apache Kafka. I wanted to study Flink actually but since it comes downstream of Kafka, I figured I might as well learn a bit more about Kafka first.

Topics: Introduction to Kafka, and installing and starting the server using Docker Compose and from the binary.

34. 2025-03-19 (Wednesday) - Python

Topic: loguru

33. 2025-03-13 (Thursday) - Python

Topic: PyAutoGUI

32. 2025-03-12 (Wednesday) - Python

Topic: Python collections module

31. 2025-03-11 (Tuesday) - Python __main__, refactoring if-else statement, slice, any, guard clause, function currying

The statement if __name__ == "__main__" ensures that only the intended code runs when the script is executed directly, since importing a module also runs the module’s top-level code.
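
A minimal sketch of the pattern (the function name is just for illustration):

def main() -> None:
    print("Running as a script")

if __name__ == "__main__":
    # Runs only when this file is executed directly,
    # not when it is imported as a module
    main()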

This if-else statement

if x > 2:
    print("b")
else:
    print("a")

Can be written like this

print("b") if x > 2 else print("a")

I can’t agree that the latter is readable, though (or I’m just not used to it).

Using any() on an iterable

numbers = [-1, -2, -4, 0, -3, -7]
has_positives = any(n > 0 for n in numbers)

Writing a guard clause so that if a condition is not met, there’s no need to run the rest of the code
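
For example, a quick sketch (the function and values are made up for illustration):

def process_order(order: dict) -> str:
    # Guard clause: return early when there's nothing to do,
    # so the main logic below doesn't need to be nested in an else block
    if not order.get("items"):
        return "Nothing to process"

    total = sum(item["price"] for item in order["items"])
    return f"Processed order worth {total}"

print(process_order({"items": []}))  # Nothing to process
print(process_order({"items": [{"price": 5}, {"price": 7}]}))  # Processed order worth 12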

slice object

numbers: list[int] = list(range(1, 11))
text: str = "Hello, world!"

rev: slice = slice(None, None, -1)
f_five: slice = slice(None, 5)

print(numbers[rev]) # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(text[rev]) # !dlrow ,olleH
print(text[f_five]) # Hello

Currying creates specialized functions based on a general function

from typing import Callable

def multiply_setup(a: float) -> Callable:
    def multiply(b: float) -> float:
        return a * b

    return multiply

double: Callable = multiply_setup(2)
triple: Callable = multiply_setup(3)

print(double(2)) # 4
print(triple(10)) # 30

30. 2025-03-10 (Monday) - GraphQL

I was looking into resources for GraphQL and found these interesting bits

  1. Microsoft Fabric now has GraphQL API data access layer
  2. The official GraphQL learning resource GraphQL Learn
  3. A Python package to create GraphQL endpoints based on dataclasses, Strawberry (a minimal sketch follows the list of terms below)

Some terms:

  • Schema - defines the structure of the data that can be returned
  • Types - describes what data can be queried from the API
  • Queries - used to retrieve data
  • Mutations - used to modify data
  • Subscriptions - used to retrieve real-time updates
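
To make these terms a bit more concrete, here’s a minimal, hypothetical Strawberry sketch (the type and field names are made up, and it assumes the strawberry-graphql package is installed):

import strawberry

@strawberry.type
class Book:
    title: str
    author: str

@strawberry.type
class Query:
    @strawberry.field
    def books(self) -> list[Book]:
        # A real API would fetch this from a database or another service
        return [Book(title="Foundation", author="Isaac Asimov")]

schema = strawberry.Schema(query=Query)

# Execute a query against the schema
result = schema.execute_sync("{ books { title author } }")
print(result.data)  # {'books': [{'title': 'Foundation', 'author': 'Isaac Asimov'}]}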

29. 2025-03-06 (Thursday) - End of Snowflake course 🎉

Topics: Third party tools, and best practices

28. 2025-03-05 (Wednesday) - Snowflake

Topic: Snowflake Access Management

27. 2025-03-04 (Tuesday) - Snowflake

Topics: Materialized views, and data masking

26. 2025-03-03 (Monday) - Snowflake

Topic: Snowflake streams

25. 2025-02-27 (Thursday) - Snowflake

Topics: Data sampling, and tasks

24. 2025-02-26 (Wednesday) - Snowflake

Topic: Data sharing

23. 2025-02-25 (Tuesday) - Snowflake

Topics: Table types, Zero-copy cloning, Swapping

22. 2025-02-24 (Monday) - Snowflake

Topics: Time travel, and fail safe

21. 2025-02-21 (Friday) - Snowflake

Topic: Snowpipe

20. 2025-02-20 (Thursday) - Snowflake

Topics: Loading data from AWS, Azure, and GCP into Snowflake

19. 2025-02-19 (Wednesday) - Snowflake

Topics: Performance considerations, scaling up/down, scaling out, caching, and cluster keys

18. 2025-02-18 (Tuesday) - Snowflake

Topic: Loading unstructured data into Snowflake

17. 2025-02-17 (Monday) - Snowflake

Topics: Copy options, rejected records, load history

16. 2025-02-13 (Thursday) - Snowflake

Topics: COPY command, transformation, file format object

15. 2025-02-12 (Wednesday) - Snowflake

Topics: Editions, pricing and cost monitoring, roles

14. 2025-02-11 (Tuesday) - Snowflake

Taking a step towards my goals this year, I’ve started a deep dive into Snowflake through this course: Snowflake Masterclass.

Topics: Setup, architecture overview, loading data

13. 2025-02-10 (Monday) - streamlit

I’ve checked out Streamlit, a Python library for creating web apps. Unlike other libraries or frameworks like Django or even Flask, Streamlit is able to spin up a web app fast using simple syntax. It is especially useful for data science and machine learning projects.

It is designed for quickly creating data-driven web applications. I’m not clear if it’s “production-quality”; opinions seem to be divided and depend on requirements or use case.

e.g.

import streamlit as st
import numpy as np
import pandas as pd

st.title('A Sample Streamlit App')

st.markdown('## This is a line chart')

chart_data = pd.DataFrame(
     np.random.randn(20, 3),
     columns=['a', 'b', 'c'])

st.line_chart(chart_data)

st.markdown('## This is a table')

st.dataframe(chart_data)

[streamlit-charts.png: the resulting line chart and table]

More features in the Docs

12. 2025-02-09 (Sunday) - ERD, Mermaid

Today, I reviewed ERDs and revisited Mermaid 🧜‍♀️

11. 2025-02-06 (Thursday) - docker, dbt-duckdb, Duckdb resources

docker

I wanted to check out Docker 🐳 and see if I can create a container for a data pipeline. That’s the goal, but since I don’t use Docker these days, I needed to reacquaint myself with it first.

These are some of the intro resources I found

An example Dockerfile

FROM python:3.9
ADD main.py .
RUN pip install scikit-learn
CMD ["python", "./main.py"]

.dockerignore specifies the files or paths that are excluded from the build context, so they aren’t copied into the image

Ideally, there is only one process per container

docker-compose.yml - for running multiple containers at the same time

version: '3'
services:
    web:
        build: .
        ports:
            - "8080:8080"
    db:
        image: "mysql"
        environment:
            MYSQL_ROOT_PASSWORD: password
        volumes:
            - db-data:/foo

volumes:
    db-data:

docker-compose up to run all the containers together
docker-compose down to shut down all containers

Will revisit this topic in the succeeding days.

dbt-duckdb

Here is the repo for the adapter: dbt-duckdb

Installation should be

  • pip3 install dbt-duckdb

Duckdb resources

Putting these resources here:

I just realized that this feature solves some of my challenging tasks: since DuckDB is able to read a CSV and execute a SQL query on it, there’s no need to open a raw CSV just to check the sum of a column, for example. The computation is in-memory by default too, so there’s no need to persist a database for quick analysis.

SELECT * FROM read_csv_auto('path/to/your/file.csv');

or just using the terminal

$ duckdb -c "SELECT * FROM read_parquet('path/to/your/file.parquet');"
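
The same thing also works from Python with the duckdb package (the file path and column names here are hypothetical):

import duckdb

# In-memory database by default, so nothing is persisted to disk
con = duckdb.connect()

# Query the CSV directly and pull the result into a pandas DataFrame
df = con.execute(
    "SELECT category, SUM(amount) AS total "
    "FROM read_csv_auto('path/to/your/file.csv') "
    "GROUP BY category"
).fetchdf()
print(df)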

10. 2025-02-05 (Wednesday) - dbt-Fabric, Fireducks

dbt-Fabric

I was looking around for how to integrate dbt with Azure, and I found these resources

The process looks straightforward

  • Install the adapter in a virtual environment with Python 3.7 and up: pip install dbt-fabric
  • Make sure the Microsoft ODBC Driver for SQL Server is installed
  • Add an existing Fabric warehouse
  • The dbt profile in the home directory needs to be set up
  • Connect to the Azure warehouse and do the authentication
  • Check the connections
  • Then the dbt project can be built

Aside from Fabric, dbt also has integrations with other Azure data platforms:

Here are the docs for other dbt Core platform connections

Fireducks

There’s another duck in the data space: FireDucks. It’s not related to DuckDB; instead, it makes existing pandas code more performant, and it’s fully compatible with the pandas API.

This means there’s zero learning cost: all that’s needed is to replace pandas with fireducks and the code should be good to go

# import pandas as pd
import fireducks.pandas as pd

or via terminal (no need to change the import)

python3 -m fireducks.pandas main.py

Whereas for polars, the code would need to be rewritten; polars code is closer to PySpark.

e.g.

import pandas as pd
import polars as pl
from pyspark.sql.functions import col

# load the file ...
#
# Filter rows where 'column1' is greater than 10
# pandas
filtered_df = df[df['column1'] > 10]
# polars
filtered_df = df.filter(pl.col('column1') > 10)
# PySpark
filtered_df = df.filter(col('column1') > 10)

Here’s a comparison of the performance: Pandas vs. FireDucks Performance Comparison. In that comparison, FireDucks surpassed both polars and pandas.

9: 2025-02-04 (Tuesday) - Snowflake, Duckdb, Isaac Asimov Books

I’m taking a break from learning dbt and switching focus to databases: Snowflake and DuckDB. I also took a bit of a break to catch up on reading Isaac Asimov’s series.

Snowflake virtual warehouse

I found this free introduction on Udemy: Snowflake Datawarehouse & Cloud Analytics - Introduction. This may not be the best resource out there as it hasn’t been updated, but I learned about provisioning a virtual warehouse in Snowflake with the following example commands.

Virtual warehouse

  • collection of compute resources (CPUs and allocated memory)
  • needed to query data from Snowflake, and load data into Snowflake
  • autoscaling is available in the Enterprise edition but not in the Standard edition

Create a virtual warehouse

CREATE WAREHOUSE TRAINING_WH
WITH
	WAREHOUSE_SIZE = XSMALL -- size
	AUTO_SUSPEND = 60 -- suspend automatically after 60s of idle time
	AUTO_RESUME = TRUE -- resumes on its own when a user runs a query, no need to restart the warehouse manually
	INITIALLY_SUSPENDED = TRUE
	STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300
	STATEMENT_TIMEOUT_IN_SECONDS = 600;

Alternatively, the warehouse can be created through the user interface as well

Create database

CREATE DATABASE SALES_DB
DATA_RETENTION_TIME_IN_DAYS = 0
COMMENT = 'Ecommerce sales info';

Create schema

create schema Sales_Data;
create schema Sales_Views;
create schema Sales_Stage;

Use the warehouse

USE WAREHOUSE TRAINING_WH;
USE DATABASE SALES_DB;
USE SCHEMA Sales_Data;

Command to find the current environment

SELECT CURRENT_DATABASE(), CURRENT_SCHEMA(), CURRENT_WAREHOUSE();

Here’s a demo of Snowflake features. I learned that Snowflake has its own data marketplace, and also has a dashboard feature.

Here are other resources for Snowflake

Duckdb overview

Another database that I’ve been hearing about is DuckDB, and I’m honestly very interested in this one as it is lightweight and open-source. It can run on a laptop and can process GBs of data fast.

It’s a file-based database, which reminds me of SQLite, except that DuckDB is mostly for analytical purposes (OLAP) rather than transactional ones (OLTP). It uses vectorized execution, as opposed to tuple-at-a-time execution (row-based databases) or column-at-a-time execution (column-based databases). It sits somewhere between row-based and column-based in that data is processed by columns, but in batches at a time. Because of this, it is more memory-efficient than column-at-a-time execution (e.g. pandas).

Here is Gabor Szarnyas’ presentation about DuckDB, which talks in detail about DuckDB’s capabilities.

DuckDB isn’t a one-to-one comparison with Snowflake, though, as it scales only with the machine’s memory and is not distributed (nor is it comparable with Apache Spark, for that matter). It also runs locally. A counterpart to this is MotherDuck, which is a cloud data warehousing solution built on top of DuckDB (kind of like dbt Cloud to dbt Core).

As a side note, I was delighted to learn that DuckDB has a SQL command to exclude columns! (Lol I know but you have no idea how cumbersome it is to write all 20+ columns only to exclude a few :p)

example from Duckdb Snippets

-- This will select all information about ducks
-- except their height and weight
SELECT * EXCLUDE (height, weight) FROM ducks;

Whereas the top Stack Overflow solution for this is

/* Get the data into a temp table */
SELECT * INTO #TempTable
FROM YourTable
/* Drop the columns that are not needed */
ALTER TABLE #TempTable
DROP COLUMN ColumnToDrop
/* Get results and drop temp table */
SELECT * FROM #TempTable
DROP TABLE #TempTable

I know explicitly writing column names is for “contracts” and exclude isn’t very production quality but it would be immensely useful in CTEs where the source columns have already been defined previously.

Okay end of side note :p

Do you think it’s normal to fan over a database? No pun intended. : ))

Three Laws of Robotics

I’m currently reading Isaac Asimov books. I’m on Book 2 (Foundation and Empire) of the Foundation Series. I kind of wanted to read the books before I watch Apple TV’s Foundation but I realized the series is totally different from the books. It was like preparing for an exam only to get entirely out-of-scope questions (which has happened too many times before :p)

Spoiler:

Book 1 is logical and clever strategies towards starting a foundation planet, whereas the Apple series is all pew, pew, guns etc. Not Book 2 though, it’s all pew, pew, space battle :p

Anyway, the Three Laws of Robotics (aka Asimov’s Laws) didn’t originate in the Foundation Books (I just really want to talk about it :p ) but in one of his short stories in the I, Robot collection.

Spoiler:

But isn’t Demerzel from Apple TV’s series inspired by this? Was it a cross-over? But didn’t the Robot series and the Foundation series have cross-overs also?

The Three Laws of Robotics state:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

The Three Laws of Robotics originally came from a science-fiction story in the 1940s, but it’s amazing how forward-looking it is. I’m not really into science-fiction books (I’d rather not mix the two :p) but I’m pulled into Asimov’s world nonetheless.

8: 2025-02-02 (Sun) - dbt

dbt - certifications and finished the course

I’ve finished the course today 🎉.

I’ve also moved the dbt notes into another post to keep the log tidier.

The last section is an interview about the official dbt certifications. I’m still not sure about doing the certification at this point (I kind of want to get more hands-on time with the tool first), but I like what the interviewee said about doing certifications - you learn a lot more about the tool, faster, compared to just using it every day. For me, I do get a lot of gains studying for certs. If I didn’t have that stock knowledge in the first place, it would have been a lot harder to think of other, better ways of approaching a problem. Mostly, my qualm about doing certs is how expensive they are! :p

7: 2025-02-01 (Sat) - dbt

dbt - Advanced Power User for dbt Core, introducing dbt to the company

Today isn’t as technical as the last couple of days. I’ve covered more features of the AI-powered Power User for dbt Core extension, and tips on introducing dbt to the company.

The course is wrapping up as well, and I only have one full section left about certifications.

I learned about

  • Advanced Power User for dbt Core
  • Introducing dbt to the company

6: 2025-01-31 (Fri) - dbt

dbt - variables, and dagster

Today I’ve covered dbt variables and orchestration with dagster. It was my first time setting up dagster. I actually liked dagster because the integration with dbt is tight. I was a bit overwhelmed though with all the coding at the backend to setup the orchestration. It might get easier if I take a deeper look at it. For now, it seems like a good tool to use with dbt.

5: 2025-01-30 (Thu) - dbt

dbt - great expectations, debugging, and logging

I’ve covered these topics today: dbt-expectations, debugging, and logging.

The dbt-expectations package, though not really a port of the Great Expectations Python package, contains many tests that are useful for checking models. I’m glad GE was adapted into dbt as it’s also one of the popular data testing tools in Python.

4: 2025-01-29 (Wed) - dbt

dbt - snapshots, tests, macros, packages, docs, analyses, hooks, exposures.

Today was pretty full. I’ve covered dbt snapshots, tests, macros, third-party packages, documentation, analyses, hooks, and exposures. Not sure how I completed all of these today, but these are pretty important components of dbt. I’m amazed by the documentation support in this tool, and snapshots are another feature that makes me say “where has this been all my life?” To think that SCD (slowly changing dimensions) handling is that straightforward in dbt.

3: 2025-01-26 (Sun) - dbt

dbt - seeds, and source

Today, I’ve covered dbt seeds and sources. At first I thought - why did they need to relabel these CSV files and raw tables? The terminology was a bit confusing, but I guess I should just get used to it.

2: 2025-01-25 (Sat) - dbt, books, Python package, Copilot

dbt - models, and materialization

I’ve covered dbt models, and materialization. These are pretty core topics in dbt - because dbt is all about models!

Books for LLM, and Polars

Here are some materials I came upon today

LLM Engineer’s Handbook [Amazon] [Github]

  • a resource when I get around to exploring LLMs

Polars Cookbook [Amazon]

  • Polars is a fast alternative to pandas, especially for large datasets

pandera

pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust. Dataframes contain information that pandera explicitly validates at runtime. - [Docs]

Here’s an example from the docs. It’s interesting because it seems like a lighter version of Great Expectations wherein the data can be further validated using ranges and other conditions. It’s powerful for dataframe validations.

import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)

In Pydantic, something like this can also be used

from pydantic import BaseModel, conint, confloat

class Product(BaseModel):
    quantity: conint(ge=1, le=100)  # Validates that quantity is between 1 and 100
    price: confloat(gt=0, le=1000)   # Validates that price is greater than 0 and less than or equal to 1000

product = Product(quantity=10, price=500)
print(product)

Copilot installation, an update

The VS Code plugin that I was trying to install yesterday is working now and I am able to access chat on the side panel as well as see the prompts on screen. Not really sure what fixed it but it could be the reboot of my computer.

1: 2025-01-24 (Fri) - dbt, Copilot

dbt - introduction, and setting up

I’m going through the dbt course in Udemy The Complete dbt (Data Build Tool) Bootcamp: Zero to Hero. I’ve setup a dbt project and created a Snowflake account.

It was between this and the dbt Learn platform - I might go back to that for review later. The lecturers worked at Databricks and co-founded dbt Learn, so I decided to do this course first - and, in the process, subtract from the ever-growing number of Udemy courses that I haven’t finished :p

Github Copilot

As an aside, I got distracted because the lecturer’s VS Code has Copilot enabled so I tried to setup mine. The free version is supposed to be one click and a Github authentication away but for some reason it’s buggy in my IDE. Leaving it alone for now.

0: 2024-11-16 (Sat) to 2025-01-08 (Wed) - DataExpert Data Engineering Bootcamp

DE Bootcamp

I finished Zach’s free YouTube Data Engineering bootcamp (DataExpert.io), which started November last year and runs until February 7 (the deadline was extended from the last day of January).

The topics covered were:

  • Dimensional Data Modeling
  • Fact Data Modeling
  • Apache Spark Fundamentals
  • Applying Analytical Patterns
  • Real-time pipelines with Flink and Kafka
  • Data Visualization and Impact
  • Data Pipeline Maintenance
  • KPIs and Experimentation
  • Data Quality Patterns

Zach Wilson did a good job of explaining the topics (I’m also very impressed with how well he can explain the labs while writing code without missing a beat). The Data Expert community was also an incredible lot, as some of the setup and homework was complicated without prior exposure.

It was a challenging 6 weeks of my life with lectures, labs, and homework, so much so that there was some lingering emptiness when my schedule freed up as I finished the bootcamp. I’m glad I went through it and it’s a good jumping off point for my learning goal this year.

Sharing the link to my certification.

This post is licensed under CC BY 4.0 by the author.