%%capture
%%bash
python ./generate_data.py
python ./run_ddl.py1 Read data, Combine tables, & aggregate numbers to understand business performance
1.1 Setup
To run the code, you need to generate the data and load it into Spark tables. Use the script below to do this:
%%sql --show
use prod.db1.2 A Spark catalog can have multiple schemas, & schemas can have multiple tables
In Spark you can have multiple catalogs, each with multiple schemas and each schema with multiple tables.
The hierarchy in modern catalog systems is Catalog → Schema → Table .
%%sql
show catalogs;%%sql
show schemas IN demo;
-- Catalog -> schema%%sql
show schemas IN prod;
-- schema -> namespace%%sql
show tables IN prod.db -- namespace -> Table%%sql --show
select * from prod.db.customer limit 2Note how, when referencing the table name, we use the full path, i.e., schema.table_name. We can skip using the full path of the table if we define which schema to use for the entirety of this session, as shown below.
%%sql --show
use prod.db%%sql
DESCRIBE lineitem%%sql
DESCRIBE extended lineitem1.3 Use SELECT…FROM, LIMIT, WHERE, & ORDER BY to read the required data
The most common use for querying is to read data from our tables. We can do this using a SELECT ... FROM statement, as shown below.
%%sql
-- use * to specify all columns
SELECT
*
FROM
orders
LIMIT
4%%sql
-- use column names to only read data from those columns
SELECT
o_orderkey,
o_totalprice
FROM
orders
LIMIT
4However, running a SELECT ... FROM statement can cause issues when the data set is extensive. If you want to examine the data, use LIMIT n to instruct Trino to retrieve only the first n rows.
We can use the ‘WHERE’ clause to retrieve rows that match specific criteria. We can specify one or more filters within the’ WHERE’ clause. The WHERE clause with more than one filter can use combinations of AND and OR criteria to combine the filter criteria, as shown below.
%%sql
-- all customer rows that have c_nationkey = 20
SELECT
*
FROM
customer
WHERE
c_nationkey = 20
LIMIT
10;%%sql
-- all customer rows that have c_nationkey = 20 and c_acctbal > 1000
SELECT
*
FROM
customer
WHERE
c_nationkey = 20
AND c_acctbal > 1000
LIMIT
10;%%sql
-- all customer rows that have c_nationkey = 20 or c_acctbal > 1000
SELECT
*
FROM
customer
WHERE
c_nationkey = 20
OR c_acctbal > 1000
LIMIT
10;%%sql
-- all customer rows that have (c_nationkey = 20 and c_acctbal > 1000) or rows that have c_nationkey = 11
SELECT
*
FROM
customer
WHERE
(
c_nationkey = 20
AND c_acctbal > 1000
)
OR c_nationkey = 11
LIMIT
10;We can combine multiple filter clauses, as seen above. We have seen examples of equals (=) and greater than (>) conditional operators. There are 6 conditional operators, they are
<Less than>Greater than<=Less than or equal to>=Greater than or equal to=Equal<>and!=both represent Not equal (some DBs only support one of these)
Additionally, for string types, we can make pattern matching with like condition. In a like condition, a _ means any single character, and % means zero or more characters, for example.
%%sql
-- all customer rows where the name has a 381 in it
SELECT
*
FROM
customer
WHERE
c_name LIKE '%381%';%%sql
-- all customer rows where the name ends with a 381
SELECT
*
FROM
customer
WHERE
c_name LIKE '%381';%%sql
-- all customer rows where the name starts with a 381
SELECT
*
FROM
customer
WHERE
c_name LIKE '381%';%%sql
-- all customer rows where the name has a combination of any character and 9 and 1
SELECT
*
FROM
customer
WHERE
c_name LIKE '%_91%';We can also filter for more than one value using IN and NOT IN.
%%sql
-- all customer rows which have nationkey = 10 or nationkey = 20
SELECT
*
FROM
customer
WHERE
c_nationkey IN (10, 20);%%sql
-- all customer rows which have do not have nationkey as 10 or 20
SELECT
*
FROM
customer
WHERE
c_nationkey NOT IN (10, 20);We can get the number of rows in a table using count(*) as shown below.
%%sql
SELECT
COUNT(*)
FROM
customer;
-- 15,000%%sql
SELECT
COUNT(*)
FROM
lineitem;
-- 600,572If we want to get the rows sorted by values in a specific column, we use ORDER BY, for example.
%%sql
-- Will show the first ten customer records with the lowest custkey
-- rows are ordered in ASC order by default
SELECT
*
FROM
orders
ORDER BY
o_custkey
LIMIT
10;%%sql
-- Will show the first ten customer's records with the highest custkey
SELECT
*
FROM
orders
ORDER BY
o_custkey DESC
LIMIT
10;1.4 Combine data from multiple tables using JOINs
We can combine data from multiple tables using joins. When we write a join query, we have a format as shown below.
SELECT
a.*
FROM
table_a a -- LEFT table a
JOIN table_b b -- RIGHT table b
ON a.id = b.idThe table specified first (table_a) is the left table, whereas the table specified second is the right table. When we have multiple tables joined, we consider the joined dataset from the first two tables as the left table and the third table as the right table (The DB optimizes our join for performance).
SELECT
a.*
FROM
table_a a -- LEFT table a
JOIN table_b b -- RIGHT table b
ON a.id = b.id
JOIN table_c c -- LEFT table is the joined data from table_a & table_b, right table is table_c
ON a.c_id = c.idThere are five main types of joins:

1.4.1 1. Inner join (default): Get rows with the same join keys from both tables
%%sql
SELECT
o.o_orderkey,
l.l_orderkey
FROM
orders o
JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY
LIMIT
10;%%sql
SELECT
COUNT(o.o_orderkey) AS order_rows_count,
COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
orders o
JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY;
-- 24613, 24613Note: JOIN defaults to INNER JOIN`.
The output will contain rows from orders and lineitem that match at least one row from the other table with the specified join condition (same orderkey and orderdate within a 5-day window of the ship date).
We can also see that 2,477 rows from the orders and lineitem tables matched.
1.4.2 2. Left outer join (aka left join): Get all rows from the left table and only matching rows from the right table.
%%sql
SELECT
o.o_orderkey,
l.l_orderkey
FROM
orders o
LEFT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY
LIMIT
10;%%sql
SELECT
COUNT(o.o_orderkey) AS order_rows_count,
COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
orders o
LEFT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY;
-- 151933, 24613The output will include all rows from orders and the rows from lineitem that were able to find at least one matching row from the orders table with the specified join condition (same orderkey and orderdate within a 5-day window of the ship date).
We can also see that the number of rows from the orders table is 15,197 & from the lineitem table is 2,477. The number of rows in orders is 15,000, but the join condition produces 15,197 since some orders match with multiple line items.
1.4.3 3. Right outer join (aka right join): Get matching rows from the left and all rows from the right table.
%%sql
SELECT
o.o_orderkey,
l.l_orderkey
FROM
orders o
RIGHT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY
LIMIT
10;%%sql
SELECT
COUNT(o.o_orderkey) AS order_rows_count,
COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
orders o
RIGHT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY;
-- 24613, 600572The output will include the rows from orders that match at least one row from the lineitem table with the specified join condition (same orderkey and orderdate within a 5-day window of the ship date) and all rows from the lineitem table.
We can also see that the number of rows from the orders table is 15,197 & from the lineitem table is 2,477.
1.4.4 4. Full outer join: Get matched and unmatched rows from both tables.
%%sql
SELECT
o.o_orderkey,
l.l_orderkey
FROM
orders o
FULL OUTER JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY
LIMIT
10%%sql
SELECT
COUNT(o.o_orderkey) AS order_rows_count,
COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
orders o
FULL OUTER JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY;
-- 151933, 600572The output will include all rows from orders that match at least one row from the lineitem table with the specified join condition (same orderkey and orderdate within a 5-day window of the ship date) and all rows from the lineitem table.
We can also see that the number of rows from the orders table is 15,197 & from the lineitem table is 2,477.
1.4.5 5. Cross join: Join every row in the left table with every row in the right table
%%sql
SELECT
n.n_name AS nation_c_name,
r.r_name AS region_c_name
FROM
nation n
CROSS JOIN region r;The output will have every row of the nation joined with every row of the region. There are 25 nations and five regions, leading to 125 rows in our result from the cross-join.
There are cases where we need to join a table with itself, known as a SELF-join. Let’s consider an example.
- For every customer order, get the order placed earlier in the same week (Sunday - Saturday, not the previous seven days). Only show customer orders that have at least one such order.
%%sql
SELECT
o1.o_custkey as o1_custkey,
o1.o_totalprice as o1_totalprice,
o1.o_orderdate as o1_orderdate,
o2.o_totalprice as o2_totalprice,
o2.o_orderdate as o2_orderdate
FROM
orders o1
JOIN orders o2 ON o1.o_custkey = o2.o_custkey
AND year(o1.o_orderdate) = year(o2.o_orderdate)
AND weekofyear(o1.o_orderdate) = weekofyear(o2.o_orderdate)
WHERE
o1.o_orderkey != o2.o_orderkey
LIMIT
10;1.5 Combine data from multiple rows into one using GROUP BY
Most analytical queries require calculating metrics that involve combining data from multiple rows. GROUP BY allows us to perform aggregate calculations on data from a set of rows recognized by values of specified column(s).
Let’s look at an example question:
- Create a report that shows the number of orders per orderpriority segment.
%%sql
SELECT
o_orderpriority,
COUNT(*) AS num_orders
FROM
orders
GROUP BY
o_orderpriority;In the above query, we group the data by orderpriority, and the calculation count(*) will be applied to the rows having a specific orderpriority value.
The calculations allowed are typically SUM/MIN/MAX/AVG/COUNT. However, some databases have more complex aggregate functions; check your DB documentation.
1.5.1 Use HAVING to filter based on the aggregates created by GROUP BY
If you want to filter based on the values of an aggregate function from a group by, use the having clause. Note that the having clause should come after the group by clause.
%%sql
SELECT
o_orderpriority,
COUNT(*) AS num_orders
FROM
orders
GROUP BY
o_orderpriority
HAVING
COUNT(*) > 3;1.6 Replicate IF.ELSE logic with CASE statements
We can do conditional logic in the SELECT ... FROM part of our query, as shown below.
%%sql
SELECT
o_orderkey,
o_totalprice,
CASE
WHEN o_totalprice > 100000 THEN 'high'
WHEN o_totalprice BETWEEN 25000
AND 100000 THEN 'medium'
ELSE 'low'
END AS order_price_bucket
FROM
orders;We can see how we display different values depending on the totalprice column. We can also use multiple criteria as our conditional criteria (e.g., totalprice > 100000 AND orderpriority = ‘2-HIGH’).
1.7 Stack tables on top of each other with UNION and UNION ALL, subtract tables with EXCEPT
When we want to combine data from tables by stacking them on top of each other, we use the UNION or UNION ALL operator. UNION removes duplicate rows, and UNION ALL does not remove duplicate rows. Let’s look at an example.
%%sql
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%' -- 25 rows%%sql
-- UNION will remove duplicate rows; the below query will produce 25 rows
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE'%_91%'
UNION
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
UNION
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91'
LIMIT 10;%%sql
-- UNION ALL will not remove duplicate rows; the below query will produce 75 rows
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
UNION ALL
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
UNION ALL
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
LIMIT 10;When we want to retrieve all rows from the first dataset that are not present in the second dataset, we can use EXCEPT.
%%sql
-- EXCEPT will get the rows in the first query result that is not in the second query result, 0 rows
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
EXCEPT
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
LIMIT 10;%%sql
-- The below query will result in 23 rows; the first query has 25 rows, and the second has two rows
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE'%_91%'
EXCEPT
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%191%'
LIMIT 10;1.8 Sub-query: Use a query instead of a table
When we want to use the result of a query as a table in another query, we use subqueries. Let’s consider an example:
- Create a report that shows the nation, how many items it supplied (by suppliers in that nation), and how many items it purchased (by customers in that nation).
%%sql
SELECT
n.n_name AS nation_c_name,
s.quantity AS supplied_items_quantity,
c.quantity AS purchased_items_quantity
FROM
nation n
LEFT JOIN (
SELECT
n.n_nationkey,
SUM(l.l_quantity) AS quantity
FROM
lineitem l
JOIN supplier s ON l.l_suppkey = s.s_suppkey
JOIN nation n ON s.s_nationkey = n.n_nationkey
GROUP BY
n.n_nationkey
) s ON n.n_nationkey = s.n_nationkey
LEFT JOIN (
SELECT
n.n_nationkey,
SUM(l.l_quantity) AS quantity
FROM
lineitem l
JOIN orders o ON l.l_orderkey = o.o_orderkey
JOIN customer c ON o.o_custkey = c.c_custkey
JOIN nation n ON c.c_nationkey = n.n_nationkey
GROUP BY
n.n_nationkey
) c ON n.n_nationkey = c.n_nationkey;In the above query, we can see that there are two sub-queries, one to calculate the quantity supplied by a nation and the other to calculate the quantity purchased by the customers of a nation.
1.9 Change data types (CAST) and handle NULLS (COALESCE)
Every column in a table has a specific data type. The data types fall under one of the following categories.
Numerical: Data types used to store numbers.- Integer: Positive and negative numbers. Different types of Integer, such as tinyint, int, and bigint, allow storage of different ranges of values. Integers cannot have decimal digits.
- Floating: These can have decimal digits but store an approximate value.
- Decimal: These can have decimal digits and store the exact value. The decimal type allows you to specify the scale and precision. Where scale denotes the count of numbers allowed as a whole & precision denotes the count of numbers allowed after the decimal point. E.g., DECIMAL(8,3) allows eight numbers in total, with three allowed after the decimal point.
Boolean: Data types used to store True or False values.String: Data types used to store alphanumeric characters.- Varchar(n): Data type allows storage of a variable character string, with a permitted max length n.
- Char(n): Data type allows storage of a fixed character string. A column of char(n) type adds (length(string) - n) empty spaces to a string that does not have n characters.
Date & time: Data types used to store dates, time, & timestamps(date + time).Objects (STRUCT, ARRAY, MAP, JSON): Data types used to store JSON and ARRAY data.
Some databases have data types that are unique to them as well. We should check the database documents to understand the data types offered.
It is best practice to use the appropriate data type for your columns. We can convert data types using the CAST function, as shown below.
A NULL will be used for that field when a value is not present. In cases where we want to use the first non-NULL value from a list of columns, we use COALESCE as shown below.
Let’s consider the following example. We can see how when l.orderkey is NULL, the DB uses 999999 as the output.
%%sql
SELECT
o.o_orderkey,
o.o_orderdate,
COALESCE(l.l_orderkey, 9999999) AS lineitem_orderkey,
l.l_shipdate
FROM
orders o
LEFT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY
AND l.l_shipdate + INTERVAL '5' DAY
LIMIT
10;1.10 Use these standard inbuilt DB functions for String, Time, and Numeric data manipulation
When processing data, more often than not, we will need to change values in columns; shown below are a few standard functions to be aware of:
String functions- LENGTH is used to calculate the length of a string. E.g.,
SELECT LENGTH('hi');will output 2. - CONCAT combines multiple string columns into one. E.g.,
SELECT CONCAT(clerk, '-', orderpriority) FROM ORDERS LIMIT 5;will concatenate clerk and orderpriority columns with a dash in between them. - SPLIT is used to split a value into an array based on a given delimiter. E.g.,
SELECT SPLIT(clerk, '#') FROM ORDERS LIMIT 5;will output a column with arrays formed by splitting clerk values on#. - SUBSTRING is used to get a sub-string from a value, given the start and length. E.g.,
SELECT clerk, SUBSTRING(clerk, 1, 5) FROM orders LIMIT 5;will get the first five characters of the clerk column. Note that indexing starts from 1 in Spark SQL. - TRIM is used to remove empty spaces to the left and right of the value. E.g.,
SELECT TRIM(' hi ');will outputhiwithout any spaces around it. LTRIM and RTRIM are similar but only remove spaces before and after the string, respectively.
- LENGTH is used to calculate the length of a string. E.g.,
Date and Time functionsAdding and subtracting dates: Is used to add and subtract periods; the format heavily depends on the DB. E.g., In Spark SQL, the query
SELECT DATEDIFF(DATE '2023-11-05', DATE '2022-10-01') AS diff_in_days, MONTHS_BETWEEN(DATE '2023-11-05', DATE '2022-10-01') AS diff_in_months, YEAR(DATE '2023-11-05') - YEAR(DATE '2022-10-01') AS diff_in_years;It will show the difference between the two dates in the specified period. We can also add/subtract an arbitrary period from a date/time column. E.g.,
SELECT DATE_ADD(DATE '2022-11-05', 10);will show the output2022-11-15.string <=> date/time conversions: When we want to change the data type of a string to date/time, we can use the
DATE 'YYYY-MM-DD'orTIMESTAMP 'YYYY-MM-DD HH:mm:SS'functions. But when the data is in a different date/time format such asMM/DD/YYYY, we will need to specify the input structure; we do this usingTO_DATEorTO_TIMESTAMP. E.g.SELECT TO_DATE('11-05-2023', 'MM-dd-yyyy');. We can convert a timestamp/date into a string with the required format usingDATE_FORMAT. E.g.,SELECT DATE_FORMAT(orderdate, 'yyyy-MM-01') AS first_month_date FROM orders LIMIT 5;will map every orderdate to the first of their month.Time frame functions (YEAR/MONTH/DAY): When we want to extract specific periods from a date/time column, we can use these functions. E.g.,
SELECT YEAR(DATE '2023-11-05');will return 2023. Similarly, we have MONTH, DAY, HOUR, MINUTE, etc.
Numeric- ROUND is used to specify the number of digits allowed after the decimal point. E.g.
SELECT ROUND(100.102345, 2);
- ROUND is used to specify the number of digits allowed after the decimal point. E.g.
1.11 Save queries as views for more straightforward reads
When we have large/complex queries that we need to run often, we can save them as views. Views are database objects that operate similarly to tables. The OLAP DB executes the underlying query when we query a view.
Use views to hide query complexities and limit column access (by exposing only specific table columns) for end-users.
For example, we can create a view for the nation-level report from the above section, as shown below.
%%sql
DROP VIEW IF EXISTS nation_supplied_purchased_quantity%%sql
CREATE VIEW nation_supplied_purchased_quantity AS
SELECT
n.n_name AS nation_name,
s.quantity AS supplied_items_quantity,
c.quantity AS purchased_items_quantity
FROM
nation n
LEFT JOIN (
SELECT
n_nationkey as nationkey,
sum(l_quantity) AS quantity
FROM
lineitem l
JOIN supplier s ON l.l_suppkey = s.s_suppkey
JOIN nation n ON s.s_nationkey = n.n_nationkey
GROUP BY
n.n_nationkey
) s ON n.n_nationkey = s.nationkey
LEFT JOIN (
SELECT
n_nationkey as nationkey,
sum(l_quantity) AS quantity
FROM
lineitem l
JOIN orders o ON l.l_orderkey = o.o_orderkey
JOIN customer c ON o.o_custkey = c.c_custkey
JOIN nation n ON c.c_nationkey = n.n_nationkey
GROUP BY
n.n_nationkey
) c ON n.n_nationkey = c.nationkey;%%sql
SELECT
*
FROM
nation_supplied_purchased_quantity;Now the view nation_supplied_purchased_quantity will run the underlying query when used.
1.12 Exercises
- Write a query that shows the number of items returned for each region name
- List the top 10 most selling parts (part name)
- Sellers (name) who have sold at least one of the top 10 selling parts
- Number of items returned for each order price bucket. The definition of order price bucket is shown below.
CASE
WHEN o_totalprice > 100000 THEN 'high'
WHEN o_totalprice BETWEEN 25000 AND 100000 THEN 'medium'
ELSE 'low'
END AS order_price_bucket- Average time (in days) between receiptdate and shipdate for each nation (name)
Here is the data model: 