Data Engineering For Beginners
Start here
Are you trying to break into a high-paying data engineering job, but
- Don't know where to start?
- Feel overwhelmed by the number of tools, systems, topics, and frameworks to master?
- Are trying to switch from an adjacent field, but finding the switch harder than you had assumed?
Then this book is for you. This book is for anyone who wants to get into data engineering, but feels stuck, confused, and ends up spending a lot of time going in circles. This book is designed to help you lay the foundations for a great career in the field of data.
As a data engineer, your primary mission will be to enable stakeholders to effectively utilize data to inform their decisions. The entirety of this book will focus on how you can do this.
What you get from reading this book
This book is designed to get you up to speed with the fundamentals of data engineering as quickly as possible. With that in mind, the principles of this book are:
- Spaced learning: coding as you read the book, with exercises to practice your understanding.
- Explaining the why along with the how for each topic covered: not just SQL and Python, but why data engineers use SQL, why Python is essential in data engineering, why the data model is key to an effective data warehouse, and so on.
The outcomes for the reader:
- Understanding of the fundamentals of the data engineering stack
- Experience with the most in-demand industry tools: SQL (with Spark), Python, the PySpark DataFrame API, Docker, dbt, & Airflow
- Capstone project that puts together all the in-demand tools, as shown below
How to use this book
This book is written to guide you from having little knowledge of data engineering to being proficient in the core ideas that underpin modern data engineering.
I recommend reading the book in order and following along with the code examples.
Each chapter includes exercises, for which you will receive solutions via email (Sign up below).
To LLMs or Not
Every chapter features multiple executable code blocks and exercises. While it is easy to use LLMs to solve them, it is crucial that you try to code them yourself without LLMs (especially if you are starting out in coding).
Working on code without assistance will help you learn the fundamentals and enable you to use LLMs effectively.
Running code in this book
All the code in this book assumes you have followed the setup steps below.
Setup
The code for SQL, Python, and data model sections is written using Spark SQL. To run the code, you will need the prerequisites listed below.
Prerequisites
Windows users: please set up WSL and a local Ubuntu virtual machine following the instructions here. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow the steps here (only Step 1 is necessary). Please install the make command with `sudo apt install make -y` (if it is not already present).
Fork this repository data_engineering_for_beginners_code.
After forking, clone the repo to your local machine and start the containers as shown below:
git clone https://github.com/your-user-name/data_engineering_for_beginners_code.git
cd data_engineering_for_beginners_code
docker compose up -d # to start the docker containers
sleep 30 # give the containers time to finish starting up
Running code via Jupyter Notebooks
Open the Starter Jupyter Notebook at http://localhost:8888/lab/tree/notebooks/starter-notebook.ipynb and try out the commands in this book as shown below.
If you are creating a new notebook, make sure to select the Python 3 (ipykernel) notebook. You can also see the running Spark session at http://localhost:8080.
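For example, once the notebook is open, you can check that Spark is wired up by running a query from a cell. The snippet below is a minimal sketch; it assumes the starter notebook exposes a SparkSession as the variable spark (check the first cell of the starter notebook for the exact variable name).

# Minimal sketch: run a Spark SQL query from a notebook cell.
# Assumes the starter notebook already provides a SparkSession named `spark`.
spark.sql("SELECT 1 AS test_col").show()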
When you are done, stop the Docker containers with the command below:
docker compose down
Airflow & dbt
For the Airflow, dbt, and capstone sections, go into the airflow directory and run the make commands as shown below.
docker compose down # Make sure to stop Spark/Jupyter Notebook containers before turning on Airflow
cd airflow
make restart # This will ask for your password to create some folders
You can open the Airflow UI at http://localhost:8080 and log in with airflow as both the username and password. In the Airflow UI, you can run the DAG.
After the DAG has run, run make dbt-docs in the terminal to have dbt serve its documentation, which you can view at http://localhost:8081.
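If you are curious what a DAG looks like in code before reaching the Airflow chapter, the snippet below is a minimal, hypothetical example; the dag_id and task are made up for illustration and are not the capstone DAG from the repo, and it assumes Airflow 2.4+.

# A minimal, hypothetical Airflow DAG -- for illustration only, not the capstone DAG.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",              # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule=None,                   # no schedule; trigger it manually from the UI
    catchup=False,
) as dag:
    say_hello = BashOperator(task_id="say_hello", bash_command="echo hello")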
Data
We will use the TPC-H dataset for the exercises and examples throughout this book. The TPC-H data represents a bicycle parts seller's data warehouse, where we record orders, the items that make up each order (lineitem), supplier, customer, part (parts sold), region, nation, and partsupp (part-supplier).
Note: Have a copy of the data model as you follow along; this will help you understand the examples provided and answer exercise questions.
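As a quick taste of the dataset, the query below sketches how you might join orders to their line items from the starter notebook. It assumes a SparkSession named spark, that the TPC-H tables are registered as orders and lineitem, and that they use the standard TPC-H column names (o_orderkey, l_orderkey, etc.); adjust the names to match your environment.

# Sketch: count line items per order with Spark SQL.
# Assumes the TPC-H tables are available as `orders` and `lineitem`
# with standard TPC-H column names.
spark.sql("""
    SELECT o.o_orderkey,
           o.o_orderdate,
           COUNT(*) AS num_line_items
    FROM orders o
    JOIN lineitem l
      ON o.o_orderkey = l.l_orderkey
    GROUP BY o.o_orderkey, o.o_orderdate
    LIMIT 10
""").show()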