Schedulers define when & orchestrators define how to run your data pipelines
Schedulers define when to start your data pipeline; cron and Airflow are examples.
Orchestrators define the order in which the tasks of a data pipeline should run: for example, extract before transform, complex branching logic, and execution across multiple systems such as Spark and Snowflake. Examples of orchestrators include dbt Core and Airflow.
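To make the distinction concrete, below is a minimal, hypothetical Airflow DAG (a sketch assuming a recent Airflow 2.x install, not this project's actual DAG). The schedule argument is the scheduling side, deciding when a run starts, while the task dependency is the orchestration side, deciding the order in which tasks execute.

# A minimal sketch: Airflow as scheduler ("when") and orchestrator ("how").
# The dag_id and callables are placeholders, not part of this project.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def transform():
    print("clean and model the extracted data")

with DAG(
    dag_id="example_extract_then_transform",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # the "when": start a run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # the "how": extract must finish before transform starts
    extract_task >> transform_task

In a real pipeline, the callables would read from and write to systems such as Spark or Snowflake instead of printing.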
Our Airflow, dbt, and capstone project infrastructure is in a separate folder to keep our setup simple. When you are in the project directory, stop any running container as shown below.
data_engineering_for_beginners_code/> docker compose down
data_engineering_for_beginners_code/> cd airflow
data_engineering_for_beginners_code/airflow> make restart
You can open the Airflow UI at http://localhost:8080 and log in with airflow
as both the username and the password. In the Airflow UI, you can run the DAG.
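If you prefer to trigger the DAG programmatically instead of through the UI, below is a hedged sketch that calls the Airflow REST API. It assumes the compose setup enables the API's basic auth backend (as the standard Airflow compose file does) and uses a hypothetical DAG id; replace it with the DAG id you see in the UI.

# Trigger a DAG run over the Airflow REST API instead of clicking in the UI.
# Assumes basic auth is enabled and the default airflow/airflow credentials.
import requests

AIRFLOW_URL = "http://localhost:8080"
DAG_ID = "your_dag_id_here"  # hypothetical; use the real DAG id from the UI

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),
    json={"conf": {}},  # empty run configuration
    timeout=30,
)
response.raise_for_status()
print(response.json())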
After the DAG has run, run make dbt-docs in the terminal to have dbt serve its documentation, which you can view at http://localhost:8081.
You can stop the containers & return to the parent directory as shown below:
make down
cd ..
The Makefile contains a list of shortcuts for lengthy commands. Let's look at our Makefile below.
####################################################################################################################
# Setup containers to run Airflow
docker-spin-up:
	docker compose build && docker compose up airflow-init && docker compose up --build -d

# Create the project folders and make them writable
perms:
	sudo mkdir -p logs plugins temp dags tests data visualization && sudo chmod -R u=rwx,g=rwx,o=rwx logs plugins temp dags tests data visualization tpch_analytics

# Wait 30 seconds for the containers to finish starting up
do-sleep:
	sleep 30

up: perms docker-spin-up do-sleep

down:
	docker compose down

restart: down up

# Open an interactive bash shell inside the scheduler container
sh:
	docker exec -ti scheduler bash

# Serve the dbt docs in the background from the webserver container on port 8081
dbt-docs:
	docker exec -d webserver bash -c "cd /opt/airflow/tpch_analytics && nohup dbt docs serve --host 0.0.0.0 --port 8081 > /tmp/dbt_docs.log 2>&1"
We can see how long, complex commands can be aliased to short make targets, which can then be run as make followed by the target name (e.g., make restart).