2025 Learning Log
In keeping with my goal to do more learning this year, I’m allotting some time outside the daily grind for it.
My overarching goal is to explore popular data engineering tools such as dbt, and cloud technologies such as Microsoft Fabric and Snowflake. I haven’t worked in Databricks since 2021, so it’ll be a good opportunity to re-learn and catch up with the new developments.
Here are some notes and impressions of how the learning is coming along so far, written in reverse chronological order.
I’ve moved some notes to separate posts
40. 2025-04-02 (Wednesday) - Apache Kafka
Topic: Spreading messages across partitions, partition Leader and Followers
39. 2025-03-31 (Monday) - Apache Kafka
Topics: Messages, Topics and partitions
38. 2025-03-27 (Thursday) - Apache Kafka
Topic: Kafka Topic
37. 2025-03-25 (Tuesday) - Apache Kafka
Topics: Apache Kafka, Broker, Zookeeper, Zookeeper ensemble, multiple Kafka clusters, and default ports for Zookeeper and Broker
36. 2025-03-24 (Monday) - Apache Kafka
Topic: Producing and consuming messages
35. 2025-03-20 (Thursday) - Apache Kafka
I started learning Apache Kafka. I wanted to study Flink actually but since it comes downstream of Kafka, I figured I might as well learn a bit more about Kafka first.
Topics: Introduction to Kafka, installation, and starting the server using Docker Compose and from the binary.
34. 2025-03-19 (Wednesday) - Python
Topic: loguru
33. 2025-03-13 (Thursday) - Python
Topic: PyAutoGUI
32. 2025-03-12 (Wednesday) - Python
Topic: Python collections module
31. 2025-03-11 (Tuesday) - Python
Topics: __main__, refactoring if-else statements, slice, any, guard clause, function currying
The statement if __name__ == "__main__" ensures that the guarded code runs only when the script is executed directly, since importing a module also executes its top-level code.
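For instance, a minimal sketch (the module and function names are made up):
# my_module.py
def main() -> None:
    print("Running as a script")

if __name__ == "__main__":
    # Runs only when the file is executed directly (python my_module.py),
    # not when it is imported from another module.
    main()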
This if-else statement
if x > 2:
    print("b")
else:
    print("a")
Can be written like this
print("b") if x > 2 else print("a")
I can’t agree that the latter is readable though (or I’m just not used to it)
Using any() on an iterable
numbers = [-1, -2, -4, 0, -3, -7]
has_positives = any(n > 0 for n in numbers)
Writing a guard clause so that if a condition is not met, there’s no need to run the rest of the code.
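A small sketch of the idea (the function and field names are made up):
def process_order(order: dict) -> str:
    # Guard clauses: return early instead of nesting the main logic.
    if not order:
        return "no order"
    if order.get("total", 0) <= 0:
        return "invalid total"
    # The rest of the code runs only when all the guards pass.
    return f"processing order {order.get('id')}"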
slice object
numbers: list[int] = list(range(1, 11))
text: str = "Hello, world!"
rev: slice = slice(None, None, -1)
f_five: slice = slice(None, 5)
print(numbers[rev]) # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(text[rev]) # !dlrow ,olleH
print(text[f_five]) # Hello
Currying creates specialized functions based on a general function
from typing import Callable

def multiply_setup(a: float) -> Callable:
    def multiply(b: float) -> float:
        return a * b
    return multiply

double: Callable = multiply_setup(2)
triple: Callable = multiply_setup(3)

print(double(2))   # 4
print(triple(10))  # 30
30. 2025-03-10 (Monday) - GraphQL
I was looking into resources for GraphQL and found these interesting bits:
- Microsoft Fabric now has a GraphQL API data access layer
- The official GraphQL learning resource GraphQL Learn
- A Python package to create GraphQL endpoints based on dataclasses: Strawberry (a small sketch follows the terms below)
Some terms:
- Schema - defines the structure of the data that can be returned
- Types - describes what data can be queried from the API
- Queries - used to retrieve data
- Mutations - used to modify data
- Subscriptions - used to retrieve real-time updates
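To make the terms concrete, here’s a minimal Strawberry sketch covering just a schema, a type, and a query (the type and field names are made up):
import strawberry

@strawberry.type
class Book:
    title: str
    author: str

@strawberry.type
class Query:
    @strawberry.field
    def books(self) -> list[Book]:
        # The schema defines what can be queried; this resolver returns the data.
        return [Book(title="Foundation", author="Isaac Asimov")]

schema = strawberry.Schema(query=Query)

# Run a query against the schema
result = schema.execute_sync("{ books { title author } }")
print(result.data)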
29. 2025-03-06 (Thursday) - End of Snowflake course 🎉
Topics: Third party tools, and best practices
28. 2025-03-05 (Wednesday) - Snowflake
Topic: Snowflake Access Management
27. 2025-03-04 (Tuesday) - Snowflake
Topics: Materialized views, and data masking
26. 2025-03-03 (Monday) - Snowflake
Topic: Snowflake streams
25. 2025-02-27 (Thursday) - Snowflake
Topics: Data sampling, and tasks
24. 2025-02-26 (Wednesday) - Snowflake
Topic: Data sharing
23. 2025-02-25 (Tuesday) - Snowflake
Topics: Table types, Zero-copy cloning, Swapping
22. 2025-02-24 (Monday) - Snowflake
Topics: Time travel, and fail safe
21. 2025-02-21 (Friday) - Snowflake
Topic: Snowpipe
20. 2025-02-20 (Thursday) - Snowflake
Topics: Loading data from AWS, Azure, and GCP into Snowflake
19. 2025-02-19 (Wednesday) - Snowflake
Topics: Performance considerations, scaling up/down, scaling out, caching, and cluster keys
18. 2025-02-18 (Tuesday) - Snowflake
Topic: Loading unstructured data into Snowflake
17. 2025-02-17 (Monday) - Snowflake
Topics: Copy options, rejected records, load history
16. 2025-02-13 (Thursday) - Snowflake
Topics: COPY command, transformation, file format object
15. 2025-02-12 (Wednesday) - Snowflake
Topics: Editions, pricing and cost monitoring, roles
14. 2025-02-11 (Tuesday) - Snowflake
Taking a step towards my goals this year, I’ve started a deep-dive into Snowflake through this course: Snowflake Masterclass.
Topics: Setup, architecture overview, loading data
13. 2025-02-10 (Monday) - streamlit
I’ve checked out Streamlit, a Python library for creating web apps. Unlike other libraries or frameworks like Django or even Flask, Streamlit is able to spin up a web app fast using simple syntax. It is especially useful for data science and machine learning projects.
It is designed for quickly creating data-driven web applications. I’m not clear on whether it’s “production-quality”; opinions seem to be divided and depend on requirements or use case.
e.g.
import streamlit as st
import numpy as np
import pandas as pd
st.title('A Sample Streamlit App')
st.markdown('## This is a line chart')
chart_data = pd.DataFrame(
    np.random.randn(20, 3),
    columns=['a', 'b', 'c'])
st.line_chart(chart_data)
st.markdown('## This is a table')
st.dataframe(chart_data)

More features in the Docs
12. 2025-02-09 (Sunday) - ERD, Mermaid
Today, I reviewed ERDs and revisited Mermaid 🧜♀️
11. 2025-02-06 (Thursday) - docker, dbt-duckdb, Duckdb resources
docker
I wanted to check out Docker 🐳 and see if I can create a container for a data pipeline. Well, that is the goal, but since I don’t use Docker these days, I needed to reacquaint myself with it first.
These are some of the intros I found
- The intro to Docker I wish I had when I started
- Learn Docker in 7 Easy Steps - Full Beginner’s Tutorial
- Containerize Python Applications with Docker
An example Dockerfile
FROM python:3.9
ADD main.py .
RUN pip install scikit-learn
CMD ["python", "./main.py"]
.dockerignore specifies the files or paths that are excluded from the build context, so they aren’t copied into the image
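For example, a typical .dockerignore for a Python project might look like this (entries are just illustrative):
__pycache__/
*.pyc
.venv/
.git/
.env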
Ideally, there is only one process per container
docker-compose.yml - for running multiple containers at the same time
version: '3'
services:
  web:
    build: .
    ports:
      - "8080:8080"
  db:
    image: "mysql"
    environment:
      MYSQL_ROOT_PASSWORD: password
    volumes:
      - db-data:/foo
volumes:
  db-data:
docker-compose up runs all the containers together, and docker-compose down shuts them all down.
Will revisit this topic in the succeeding days.
dbt-duckdb
Here is the repo for the adapter: dbt-duckdb
Installation should be
pip3 install dbt-duckdb
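After installing, the profile only needs a path to a DuckDB file. A minimal profiles.yml sketch (the project name and file path are made up; this is based on my reading of the adapter’s README, so double-check against it):
my_duckdb_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb   # local database file created by dbt
      threads: 1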
Duckdb resources
Putting these resources here:
- Duckdb tutorial for beginners
- YT Channel @motherduckdb
I just realized that this feature solves some of my challenging tasks: since DuckDB can read a CSV and execute SQL queries on it, there’s no need to open a raw CSV just to check the sum of a column, for example. The computation is in-memory by default too, so there’s no need to persist a database for quick analysis.
SELECT * FROM read_csv_auto('path/to/your/file.csv');
or just using the terminal
$ duckdb -c "SELECT * FROM read_parquet('path/to/your/file.parquet');"
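The same idea works from Python through the duckdb package; a small sketch (the file and column names are made up):
import duckdb

# Query the CSV directly and compute an aggregate, all in-memory
total = duckdb.sql(
    "SELECT SUM(amount) AS total FROM read_csv_auto('path/to/your/file.csv')"
).fetchone()[0]
print(total)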
10. 2025-02-05 (Wednesday) - dbt-Fabric, Fireducks
dbt-Fabric
I was looking around for how to integrate dbt with Azure, and I found these resources.
The process looks straightforward:
- Install the adapter in a virtual environment with Python 3.7 and up
pip install dbt-fabric
- Make sure to have the Microsoft ODBC Driver for SQL Server installed
- Add an existing Fabric warehouse
- The dbt profile in the home directory needs to be set up (a rough sketch follows this list)
- Connect to the Azure warehouse, do the authentication
- Check the connections
- Then the dbt project can be built
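For reference, here’s my assumption of what that profile entry looks like (the field names are from my reading of the dbt-fabric docs and should be verified; the project, host, and database names are made up):
my_fabric_project:
  target: dev
  outputs:
    dev:
      type: fabric
      driver: "ODBC Driver 18 for SQL Server"
      host: your-workspace.datawarehouse.fabric.microsoft.com
      database: your_warehouse
      schema: dbo
      authentication: CLI   # e.g. Azure CLI login; other auth methods exist
      threads: 4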
Aside from Fabric, dbt also has integrations with other Azure data platforms:
- Synapse
- Data Factory
- SQL Server (SQL Server 2017, SQL Server 2019, SQL Server 2022 and Azure SQL Database)
Here are the docs for other dbt Core platform connections
Fireducks
There’s another duck in the data space: Fireducks. It’s not related to DuckDB; instead, it makes existing pandas code more performant, and it’s fully compatible with the pandas API.
This means there’s practically zero learning cost, as all that’s needed is to replace pandas with fireducks and the code should be good to go
# import pandas as pd
import fireducks.pandas as pd
or via terminal (no need to change the import)
python3 -m fireducks.pandas main.py
Whereas for polars, the code would need to be rewritten; polars code is closer to PySpark.
e.g.
import pandas as pd
import polars as pl
from pyspark.sql.functions import col
# load the file ...
#
# Filter rows where 'column1' is greater than 10
# pandas
filtered_df = df[df['column1'] > 10]
# polars
filtered_df = df.filter(pl.col('column1') > 10)
# PySpark
filtered_df = df.filter(col('column1') > 10)
Here’s a comparison of the performance: Pandas vs. FireDucks Performance Comparison. It surpassed both polars and pandas.
9: 2025-02-04 (Tuesday) - Snowflake, Duckdb, Isaac Asimov Books
I’m taking a break from learning dbt, and switched focus to learning about databases: Snowflake, and Duckdb. Also, I took a bit of a break so that I can catch up on reading Isaac Asimov’s series.
Snowflake virtual warehouse
I found this free introduction on Udemy: Snowflake Datawarehouse & Cloud Analytics - Introduction. This may not be the best resource out there, though, as it hasn’t been updated. However, I learned about provisioning a virtual warehouse in Snowflake with the following example commands.
Virtual warehouse
- a collection of compute resources (CPUs and allocated memory)
- needed to query data from Snowflake, and to load data into Snowflake
- autoscaling is available in the Enterprise edition but not in the Standard edition
Create a virtual warehouse
CREATE WAREHOUSE TRAINING_WH
WITH
WAREHOUSE_SIZE = XSMALL -- size
AUTO_SUSPEND = 60 -- suspended automatically if idle for 60s
AUTO_RESUME = TRUE -- resumes automatically when a user executes a query, without having to restart the warehouse manually
INITIALLY_SUSPENDED = TRUE
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300
STATEMENT_TIMEOUT_IN_SECONDS = 600;
Alternatively, the warehouse can be created through the user interface as well.
Create database
CREATE DATABASE SALES_DB
DATA_RETENTION_TIME_IN_DAYS = 0
COMMENT = 'Ecommerce sales info';
Create schema
create schema Sales_Data;
create schema Sales_Views;
create schema Sales_Stage;
Use the warehouse
USE WAREHOUSE TRAINING_WH;
USE DATABASE SALES_DB;
USE SCHEMA Sales_Data;
Command to find the current environment
SELECT CURRENT_DATABASE(), CURRENT_SCHEMA(), CURRENT_WAREHOUSE();
Here’s a demo of Snowflake features. I learned that Snowflake has its own data marketplace, and also has a dashboard feature.
Here are other resources for Snowflake
Duckdb overview
Another database that I’ve been hearing about is DuckDB, and I’m honestly very interested in this one as it is lightweight and open source. It can run on a laptop, and can process GBs of data fast.
It’s a file-based database, which reminds me of SQLite, except that DuckDB is mostly for analytical purposes (OLAP) rather than transactional (OLTP). It uses vectorized execution as opposed to tuple-at-a-time execution (row-based databases) or column-at-a-time execution (column-based databases). It sits somewhere between the two in that data is processed by column, but in batches (vectors) at a time. Because of this, it is more memory-efficient than whole-column execution (e.g. pandas).
Here is Gabor Szarnyas’ presentation about Duckdb which talks in detail about Duckdb capabilities.
DuckDB isn’t a one-to-one comparison with Snowflake, though, as it can only scale by memory and is not distributed (nor is it one with Apache Spark, for that matter). It also runs locally. A counterpart to this is MotherDuck, which is a cloud data warehousing solution built on top of DuckDB (kind of like dbt Cloud to dbt Core).
As a side note, I was delighted to learn that DuckDB has a SQL command to exclude columns! (Lol I know but you have no idea how cumbersome it is to write all 20+ columns only to exclude a few :p)
example from Duckdb Snippets
-- This will select all information about ducks
-- except their height and weight
SELECT * EXCLUDE (height, weight) FROM ducks;
Whereas the top Stack Overflow solution for this is
/* Get the data into a temp table */
SELECT * INTO #TempTable
FROM YourTable
/* Drop the columns that are not needed */
ALTER TABLE #TempTable
DROP COLUMN ColumnToDrop
/* Get results and drop temp table */
SELECT * FROM #TempTable
DROP TABLE #TempTable
I know explicitly writing column names is for “contracts” and EXCLUDE isn’t very production-quality, but it would be immensely useful in CTEs where the source columns have already been defined previously.
Okay end of side note :p
Do you think it’s normal to fan over a database? No pun intended. : ))
Three Laws of Robotics
I’m currently reading Isaac Asimov books. I’m on Book 2 (Foundation and Empire) of the Foundation Series. I kind of wanted to read the books before I watch Apple TV’s Foundation but I realized the series is totally different from the books. It was like preparing for an exam only to get entirely out-of-scope questions (which has happened too many times before :p)
Anyway, the Three Laws of Robotics (aka Asimov’s Laws) didn’t originate in the Foundation books (I just really want to talk about them :p) but in one of the short stories in the I, Robot collection.
The Three Laws of Robotics state:
- A robot may not injure a human being or, through inaction, allow a human being to come to harm.
- A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
- A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
The Three Laws of Robotics originally came from a science-fiction story in the 1940s, but it’s amazing how forward-looking they are. I’m not really into science-fiction books (I’d rather not mix the two :p) but I’m pulled into Asimov’s world nonetheless.
8: 2025-02-02 (Sun) - dbt
dbt - certifications and finished the course
I’ve finished the course today 🎉.
I’ve also moved the dbt notes into another post to keep the log tidier.
The last section is an interview about the official dbt certifications. I’m still not sure about doing the certification at this point (I kind of want to get more hands-on time with the tool first), but I like what the interviewee said about doing certifications - you learn a lot more about the tool, and faster, compared to just using it every day. For me, I do get a lot of gains studying for certs. If I didn’t have that stock knowledge in the first place, it would have been a lot harder to think of other, better ways of approaching a problem. Mostly, my qualm about doing certs is how expensive they are! :p
7: 2025-02-01 (Sat) - dbt
dbt - Advanced Power User for dbt Core, introducing dbt to the company
Today isn’t as technical as the last couple of days. I’ve covered more of the Power User for dbt Core extension’s AI-powered features, and tips on introducing dbt to the company.
The course is wrapping up as well, and I only have one full section left about certifications.
I learned about
- Advanced Power User for dbt Core
- Introducing dbt to the company
6: 2025-01-31 (Fri) - dbt
dbt - variables, and dagster
Today I’ve covered dbt variables and orchestration with Dagster. It was my first time setting up Dagster. I actually liked Dagster because its integration with dbt is tight. I was a bit overwhelmed, though, with all the coding on the backend to set up the orchestration. It might get easier if I take a deeper look at it. For now, it seems like a good tool to use with dbt.
5: 2025-01-30 (Thu) - dbt
dbt - great expectations, debugging, and logging
I’ve covered these topics today: dbt-expectations, debugging, and logging.
The dbt-expectations package, though not really a port of the Python package, contains many tests that are useful for checking models. I’m glad Great Expectations was adapted into dbt, as it’s also one of the most popular data testing tools in Python.
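For illustration, one of its tests can be attached to a model in the schema YAML like this (the model and column names are made up; check the package docs for the full list of tests):
models:
  - name: orders
    columns:
      - name: amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000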
4: 2025-01-29 (Wed) - dbt
dbt - snapshots, tests, macros, packages, docs, analyses, hooks, exposures.
Today was pretty full. I’ve covered dbt snapshots, tests, macros, third-party packages, documentation, analyses, hooks, and exposures. Not sure how I completed all of these today, but these are pretty important components of dbt. I’m amazed by the documentation support in this tool, and snapshots are another feature that makes me say “where has this been all my life?” To think that SCD (slowly changing dimensions) is that straightforward in dbt.
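For my own reference, a snapshot is just a Jinja block around a select; a rough sketch using the timestamp strategy (the source and column names are made up):
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}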
3: 2025-01-26 (Sun) - dbt
dbt - seeds, and source
Today, I’ve covered dbt seeds, and source. At first I thought - why did they need to relabel these CSV files and raw tables? The terminologies were a bit confusing but I guess I should just get used to these.
2: 2025-01-25 (Sat) - dbt, books, Python package, Copilot
dbt - models, and materialization
I’ve covered dbt models and materialization. These are pretty core topics in dbt - because dbt is all about models!
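As a quick reminder, the materialization is just a config at the top of a model file; a tiny sketch (the model and source names are made up):
-- models/stg_orders.sql
{{ config(materialized='view') }}   -- could also be 'table', 'incremental', or 'ephemeral'

select * from {{ source('raw', 'orders') }}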
Books for LLM, and Polars
Here are some materials I came upon today
LLM Engineer’s Handbook [Amazon] [Github]
- a resource when I get around to exploring LLMs
Polars Cookbook [Amazon]
pandera
pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust. Dataframes contain information that pandera explicitly validates at runtime. - [Docs]
Here’s an example from the docs. It’s interesting because it seems like a lighter version of Great Expectations, wherein the data can be further validated using ranges and other conditions. It’s powerful for dataframe validations.
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # output a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)
In Pydantic, something like this can also be used
from pydantic import BaseModel, conint, confloat

class Product(BaseModel):
    quantity: conint(ge=1, le=100)   # Validates that quantity is between 1 and 100
    price: confloat(gt=0, le=1000)   # Validates that price is greater than 0 and less than or equal to 1000

product = Product(quantity=10, price=500)
print(product)
Copilot installation, an update
The VS Code plugin that I was trying to install yesterday is working now, and I’m able to access the chat in the side panel as well as see the suggestions on screen. Not really sure what fixed it, but it could be the reboot of my computer.
1: 2025-01-24 (Fri) - dbt, Copilot
dbt - introduction, and setting up
I’m going through the dbt course in Udemy The Complete dbt (Data Build Tool) Bootcamp: Zero to Hero. I’ve setup a dbt project and created a Snowflake account.
It was between this and the dbt Learn platform - I might go back to that for review later. The lecturers worked at Databricks and co-founded dbt Learn, so I decided to do this course first - and, in the process, subtract from the ever-growing number of Udemy courses that I haven’t finished :p
Github Copilot
As an aside, I got distracted because the lecturer’s VS Code has Copilot enabled, so I tried to set mine up. The free version is supposed to be one click and a GitHub authentication away, but for some reason it’s buggy in my IDE. Leaving it alone for now.
0: 2024-11-16 (Sat) to 2025-01-08 (Wed) - DataExpert Data Engineering Bootcamp
DE Bootcamp
I finished Zach’s free YouTube Data Engineering bootcamp (DataExpert.io), which started in November last year and will run until February 7 (the deadline was extended from the last day of January).
The topics covered were:
- Dimensional Data Modeling
- Fact Data Modeling
- Apache Spark Fundamentals
- Applying Analytical Patterns
- Real-time pipelines with Flink and Kafka
- Data Visualization and Impact
- Data Pipeline Maintenance
- KPIs and Experimentation
- Data Quality Patterns
Zach Wilson did a good job of explaining the topics (I’m also very impressed with how well he can explain the labs while writing code without missing a beat). The DataExpert community was also incredibly helpful, as some of the setup and homework was complicated without prior exposure.
It was a challenging 6 weeks of my life with lectures, labs, and homework - so much so that there was some lingering emptiness when my schedule freed up after I finished the bootcamp. I’m glad I went through it, and it’s a good jumping-off point for my learning goal this year.
Sharing the link to my certification.