13  Capstone Project

Over the past few chapters, we covered:

  1. Data transformation with Spark SQL
  2. Data modeling with dbt
  3. Scheduling and orchestrating with Airflow

In this capstone project, we will go over how you can present your expertise as a data engineer to a potential hiring manager.

The main objectives for this capstone project are:

  1. Understanding how the different components of data engineering work together
  2. Modeling and transforming data in the 3-hop architecture
  3. Clearly explaining what your pipeline is doing, why, and how

Let’s assume we are modeling the TPCH data to create a data mart of customer metrics that the sales team can use to decide which customers to cold-call.
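To make the 3-hop architecture concrete before we start, here is a minimal PySpark sketch of the flow for this scenario. The paths, table names, and column names are illustrative (standard TPCH columns), not the exact models in the repository:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tpch_capstone").getOrCreate()

# Hop 1 (bronze): raw TPCH tables, loaded as-is. Paths are illustrative.
orders_raw = spark.read.parquet("data/raw/orders")
customers_raw = spark.read.parquet("data/raw/customer")

# Hop 2 (silver): cleaned and conformed -- deduplicate, cast types, and
# keep only the columns downstream models need.
orders = orders_raw.dropDuplicates(["o_orderkey"]).select(
    "o_orderkey",
    "o_custkey",
    F.col("o_totalprice").cast("double").alias("o_totalprice"),
)
customers = customers_raw.dropDuplicates(["c_custkey"]).select("c_custkey", "c_name")

# Hop 3 (gold): the sales-facing mart, one row per customer.
customer_metrics = (
    orders.join(customers, orders.o_custkey == customers.c_custkey)
    .groupBy("c_custkey", "c_name")
    .agg(
        F.avg("o_totalprice").alias("avg_order_value"),
        F.count("o_orderkey").alias("num_orders"),
    )
)
customer_metrics.write.mode("overwrite").parquet("data/mart/customer_metrics")

Each hop writes to its own layer, so downstream consumers like the dashboard only ever read the gold-layer mart.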

13.1 Presentation matters

When a hiring manager reviews your project, assume that they will not read the code. Typically, when people look at projects, they browse high-level sections. These include:

  1. Outcome of your project
  2. High-level architecture
  3. Project structure to understand how your code works
  4. Code clarity and cleanliness

We will see how you can address these.

13.2 Run the pipeline and visualize the results

Open the Airflow UI at http://localhost:8080, log in with airflow as both the username and password, and run the DAG as shown below.

[Figure: the DAG in the Airflow UI]
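If you prefer to trigger the run from code instead of clicking through the UI, Airflow 2.x also exposes a stable REST API. A minimal sketch, assuming the default airflow/airflow credentials from the Docker Compose setup and a hypothetical DAG id tpch_analytics (use whatever id your UI shows):

import requests

# Trigger a new DAG run via Airflow's stable REST API.
# The DAG id below is a placeholder -- substitute your own.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/tpch_analytics/dagRuns",
    auth=("airflow", "airflow"),
    json={"conf": {}},
)
resp.raise_for_status()
print(resp.json()["state"])  # "queued" on success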

We use the Python Plotly library to create a simple HTML dashboard, as shown below.

[Figure: the HTML dashboard]
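To give a rough idea of what that script does, here is a minimal sketch, assuming the gold-layer mart has been exported to a local Parquet file; the path and column names are illustrative, and the real implementation is in airflow/tpch_analytics/dashboard.py:

import pandas as pd
import plotly.express as px

# Illustrative path and column names; see dashboard.py for the real logic.
df = pd.read_parquet("data/mart/customer_metrics")
top10 = df.nlargest(10, "avg_order_value")

fig = px.bar(
    top10,
    x="c_name",
    y="avg_order_value",
    title="Top 10 customers by average order value",
)
fig.update_xaxes(categoryorder="total descending")  # tallest bar first
fig.write_html("dashboard.html")  # open this file in any browser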

13.3 Start with the outcome

We are creating data to support the sales team’s customer outreach efforts. For this, we need to surface the customers who are most likely to convert. While this is a complex data science question, a simple approach could be to target the customers with the highest average order value (treating unusually high or low order values as outliers).
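In SQL terms, the metric is just an average of order totals per customer. Here is a sketch in Spark SQL, assuming the raw TPCH orders and customer tables are registered as temp views (names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes "orders" and "customer" are registered as temp views.
top_customers = spark.sql("""
    SELECT c.c_name,
           AVG(o.o_totalprice) AS avg_order_value
    FROM orders o
    JOIN customer c ON o.o_custkey = c.c_custkey
    GROUP BY c.c_name
    ORDER BY avg_order_value DESC
    LIMIT 10
""")
top_customers.show()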

Create a dashboard to show the top 10 customers by average order value as a descending bar chart.

[Figure: top 10 customers by average order value]

Note: The Python script to create the dashboard is available at airflow/tpch_analytics/dashboard.py.

13.4 High-level architecture

The objective of this section is to demonstrate your expertise in:

  1. Designing data pipelines that follow the industry-standard 3-hop architecture
  2. Using industry-standard tools like dbt, Airflow, and Spark
  3. Writing clean code using auto formatters and linters

Our base repository comes with all of these set up and installed for you to copy over and use.

[Figure: capstone architecture]

13.5 Putting it all together with an exercise

Use this Airflow + dbt + Spark setup to bootstrap your own project, as shown below:

cp -r ./data_engineering_for_beginners_code/airflow ./your-project-name
cd your-project-name
# Update README.md with your specifics
git init
git add .
git commit -m 'First Commit'

Create a new GitHub repo (via GitHub’s Create Repo page) with the same name as your project, then follow the steps GitHub shows under “…or push an existing repository from the command line”.

13.6 Exercise: Your Capstone Project

Find a dataset that interests you and showcase an innovative perspective on it. Present the outcome with data.

Read this article to help you identify a problem space and datasets.

Read this article for more information on formatting a project for hiring managers.