Data Engineering For Beginners
Start here
Are you trying to break into a high-paying data engineering job, but
- Don't know where to start?
- Feel overwhelmed by the number of tools, systems, topics, and frameworks to master?
- Are trying to switch from an adjacent field, but finding the switch harder than you had assumed?
Then this book is for you. This book is for anyone who wants to get into data engineering, but feels stuck, confused, and ends up spending a lot of time going in circles. This book is designed to help you lay the foundations for a great career in the field of data.
As a data engineer, your primary mission will be to enable stakeholders to effectively utilize data to inform their decisions. The entirety of this book will focus on how you can do this.
What you get from reading this book
This book is designed to get you up to speed with the fundamentals of data engineering as quickly as possible. With that in mind, the principles of this book are:
- Spaced learning: coding as you read the book, with exercises to practice your understanding.
- Explaining the why along with the how for each topic covered: not just SQL and Python, but why data engineers use SQL, why Python is essential in data engineering, why the data model is key to an effective data warehouse, and so on.
The outcomes for the reader:
- Understanding of the fundamentals of the data engineering stack
- Experience with the most in-demand industry tools: SQL (with Spark), Python, the PySpark DataFrame API, Docker, dbt, & Airflow
- Capstone project that puts together all the in-demand tools, as shown below
How to use this book
This book is written to guide you from having little knowledge of data engineering to being proficient in the core ideas that underpin modern data engineering.
I recommend reading the book in order and following along with the code examples.
Each chapter includes exercises, for which you will receive solutions via email (Sign up below).
To LLMs or Not
Every chapter features multiple executable code blocks and exercises. While it is easy to use LLMs to solve them, it is crucial that you try to code them yourself without LLMs (especially if you are starting out in coding).
Working on code without assistance will help you learn the fundamentals and enable you to use LLMs effectively.
Running code in this book
All the code in this book assumes you have followed the setup steps below.
Setup
The code for SQL, Python, and data model sections is written using Spark SQL. To run the code, you will need the prerequisites listed below.
Prerequisites
Windows users: please set up WSL and a local Ubuntu virtual machine following the instructions here. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow the steps here (only Step 1 is necessary). Please install the make command with `sudo apt install make -y` (if it is not already present).
Fork this repository data_engineering_for_beginners_code.
After forking, clone the repo to your local machine and start the containers as shown below:
git clone https://github.com/your-user-name/data_engineering_for_beginners_code.git
cd data_engineering_for_beginners_code
docker compose up -d # to start the docker containers
sleep 30 # give the containers time to finish starting up
Running code via Jupyter Notebooks
Open the Starter Jupyter Notebook at http://localhost:8888/lab/tree/notebooks/starter-notebook.ipynb and try out the commands in this book as shown below.
If you are creating a new notebook, make sure to select the Python 3 (ipykernel) notebook. You can also see the running Spark session at http://localhost:8080.
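For example, once the notebook is open, you can check that Spark is wired up by running a query from a cell. The snippet below is a minimal sketch; it assumes the starter notebook exposes a SparkSession as the variable spark (check the first cell of the starter notebook for the exact variable name).

# Minimal sketch: run a Spark SQL query from a notebook cell.
# Assumes the starter notebook already provides a SparkSession named `spark`.
spark.sql("SELECT 1 AS test_col").show()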
When you are done, stop the Docker containers with the command below:
docker compose down
Airflow & dbt
For the Airflow, dbt, and capstone sections, go into the airflow directory and run the make commands as shown below.
docker compose down # Make sure to stop Spark/Jupyter Notebook containers before turning on Airflow
cd airflow
make restart # This will ask for your password to create some folders
You can open the Airflow UI at http://localhost:8080 and log in with airflow as both the username and password. In the Airflow UI, you can run the DAG.
After the DAG has run, run make dbt-docs in the terminal to have dbt serve its documentation, which you can view at http://localhost:8081.
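If you are curious what a DAG looks like in code before reaching the Airflow chapter, the snippet below is a minimal, hypothetical example; the dag_id and task are made up for illustration and are not the capstone DAG from the repo, and it assumes Airflow 2.4+.

# A minimal, hypothetical Airflow DAG -- for illustration only, not the capstone DAG.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",              # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule=None,                   # no schedule; trigger it manually from the UI
    catchup=False,
) as dag:
    say_hello = BashOperator(task_id="say_hello", bash_command="echo hello")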
Data
We will use the TPC-H dataset for the exercises and examples throughout this book. The TPC-H data represents a bicycle parts seller's data warehouse, where we record orders, the items that make up each order (lineitem), supplier, customer, part (parts sold), region, nation, and partsupp (part-supplier).
Note: Have a copy of the data model as you follow along; this will help you understand the examples provided and answer exercise questions.
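As a quick taste of the dataset, the query below sketches how you might join orders to their line items from the starter notebook. It assumes a SparkSession named spark, that the TPC-H tables are registered as orders and lineitem, and that they use the standard TPC-H column names (o_orderkey, l_orderkey, etc.); adjust the names to match your environment.

# Sketch: count line items per order with Spark SQL.
# Assumes the TPC-H tables are available as `orders` and `lineitem`
# with standard TPC-H column names.
spark.sql("""
    SELECT o.o_orderkey,
           o.o_orderdate,
           COUNT(*) AS num_line_items
    FROM orders o
    JOIN lineitem l
      ON o.o_orderkey = l.l_orderkey
    GROUP BY o.o_orderkey, o.o_orderdate
    LIMIT 10
""").show()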