%%capture
%%bash
/generate_data.py
python ./run_ddl.py python .
1 Read data, Combine tables, & aggregate numbers to understand business performance
1.1 Setup
To run the code, you need to generate the data and load it into Spark tables. Use the script below to do this:
%%sql --show
use prod.db
1.2 A Spark catalog can have multiple schemas, & schemas can have multiple tables
In Spark you can have multiple catalogs, each with multiple schemas and each schema with multiple tables.
The hierarchy in modern catalog systems is Catalog → Schema → Table
.
%%sql
; show catalogs
%%sql
;
show schemas IN demo
-- Catalog -> schema
%%sql
;
show schemas IN prod
-- schema -> namespace
%%sql
-- namespace -> Table show tables IN prod.db
%%sql --show
* from prod.db.customer limit 2 select
Note how, when referencing the table name, we use the full path, i.e., schema.table_name
. We can skip using the full path of the table if we define which schema to use for the entirety of this session, as shown below.
%%sql --show
use prod.db
%%sql
DESCRIBE lineitem
%%sql
DESCRIBE extended lineitem
1.3 Use SELECT…FROM, LIMIT, WHERE, & ORDER BY to read the required data
The most common use for querying is to read data from our tables. We can do this using a SELECT ... FROM
statement, as shown below.
%%sql
-- use * to specify all columns
SELECT*
FROM
orders
LIMIT4
%%sql
-- use column names to only read data from those columns
SELECT
o_orderkey,
o_totalprice
FROM
orders
LIMIT4
However, running a SELECT ... FROM
statement can cause issues when the data set is extensive. If you want to examine the data, use LIMIT n
to instruct Trino to retrieve only the first n rows.
We can use the ‘WHERE’ clause to retrieve rows that match specific criteria. We can specify one or more filters within the’ WHERE’ clause. The WHERE
clause with more than one filter can use combinations of AND
and OR
criteria to combine the filter criteria, as shown below.
%%sql
-- all customer rows that have c_nationkey = 20
SELECT*
FROM
customer
WHERE= 20
c_nationkey
LIMIT10;
%%sql
-- all customer rows that have c_nationkey = 20 and c_acctbal > 1000
SELECT*
FROM
customer
WHERE= 20
c_nationkey > 1000
AND c_acctbal
LIMIT10;
%%sql
-- all customer rows that have c_nationkey = 20 or c_acctbal > 1000
SELECT*
FROM
customer
WHERE= 20
c_nationkey > 1000
OR c_acctbal
LIMIT10;
%%sql
-- all customer rows that have (c_nationkey = 20 and c_acctbal > 1000) or rows that have c_nationkey = 11
SELECT*
FROM
customer
WHERE
(= 20
c_nationkey > 1000
AND c_acctbal
)= 11
OR c_nationkey
LIMIT10;
We can combine multiple filter clauses, as seen above. We have seen examples of equals (=
) and greater than (>
) conditional operators. There are 6 conditional operators, they are
<
Less than>
Greater than<=
Less than or equal to>=
Greater than or equal to=
Equal<>
and!=
both represent Not equal (some DBs only support one of these)
Additionally, for string types, we can make pattern matching with like
condition. In a like
condition, a _
means any single character, and %
means zero or more characters, for example.
%%sql
-- all customer rows where the name has a 381 in it
SELECT*
FROM
customer
WHERE'%381%'; c_name LIKE
%%sql
-- all customer rows where the name ends with a 381
SELECT*
FROM
customer
WHERE'%381'; c_name LIKE
%%sql
-- all customer rows where the name starts with a 381
SELECT*
FROM
customer
WHERE'381%'; c_name LIKE
%%sql
-- all customer rows where the name has a combination of any character and 9 and 1
SELECT*
FROM
customer
WHERE'%_91%'; c_name LIKE
We can also filter for more than one value using IN
and NOT IN
.
%%sql
-- all customer rows which have nationkey = 10 or nationkey = 20
SELECT*
FROM
customer
WHERE10, 20); c_nationkey IN (
%%sql
-- all customer rows which have do not have nationkey as 10 or 20
SELECT*
FROM
customer
WHERE10, 20); c_nationkey NOT IN (
We can get the number of rows in a table using count(*)
as shown below.
%%sql
SELECT*)
COUNT(
FROM;
customer
-- 1500
%%sql
SELECT*)
COUNT(
FROM;
lineitem
-- 60175
If we want to get the rows sorted by values in a specific column, we use ORDER BY
, for example.
%%sql
-- Will show the first ten customer records with the lowest custkey
-- rows are ordered in ASC order by default
SELECT*
FROM
orders
ORDER BY
o_custkey
LIMIT10;
%%sql
-- Will show the first ten customer's records with the highest custkey
SELECT
*
FROM
orders
ORDER BY
o_custkey DESC
LIMIT10;
1.4 Combine data from multiple tables using JOINs
We can combine data from multiple tables using joins. When we write a join query, we have a format as shown below.
SELECT
*
a.FROM
-- LEFT table a
table_a a JOIN table_b b -- RIGHT table b
ON a.id = b.id
The table specified first (table_a) is the left table, whereas the table specified second is the right table. When we have multiple tables joined, we consider the joined dataset from the first two tables as the left table and the third table as the right table (The DB optimizes our join for performance).
SELECT
*
a.FROM
-- LEFT table a
table_a a JOIN table_b b -- RIGHT table b
ON a.id = b.id
JOIN table_c c -- LEFT table is the joined data from table_a & table_b, right table is table_c
ON a.c_id = c.id
There are five main types of joins:
1.4.1 1. Inner join (default): Get rows with the same join keys from both tables
%%sql
SELECT
o.o_orderkey,
l.l_orderkey
FROM
orders o= l.l_orderkey
JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY
AND o.o_orderdate BETWEEN l.l_shipdate
LIMIT10;
%%sql
SELECT
COUNT(o.o_orderkey) AS order_rows_count,
COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
orders o= l.l_orderkey
JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY;
AND o.o_orderdate BETWEEN l.l_shipdate -- 2477, 2477
Note: JOIN
defaults to INNER JOIN`.
The output will contain rows from orders and lineitem that match at least one row from the other table with the specified join condition (same orderkey and orderdate within a 5-day window of the ship date).
We can also see that 2,477 rows from the orders and lineitem tables matched.
1.4.2 2. Left outer join (aka left join): Get all rows from the left table and only matching rows from the right table.
%%sql
SELECT
o.o_orderkey,
l.l_orderkey
FROM
orders o= l.l_orderkey
LEFT JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY
AND o.o_orderdate BETWEEN l.l_shipdate
LIMIT10;
%%sql
SELECT
COUNT(o.o_orderkey) AS order_rows_count,
COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
orders o= l.l_orderkey
LEFT JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY;
AND o.o_orderdate BETWEEN l.l_shipdate -- 15197, 2477
The output will include all rows from orders and the rows from lineitem that were able to find at least one matching row from the orders table with the specified join condition (same orderkey and orderdate within a 5-day window of the ship date).
We can also see that the number of rows from the orders table is 15,197 & from the lineitem table is 2,477. The number of rows in orders is 15,000, but the join condition produces 15,197 since some orders match with multiple line items.
1.4.3 3. Right outer join (aka right join): Get matching rows from the left and all rows from the right table.
%%sql
SELECT
o.o_orderkey,
l.l_orderkey
FROM
orders o= l.l_orderkey
RIGHT JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY
AND o.o_orderdate BETWEEN l.l_shipdate
LIMIT10;
%%sql
SELECT
COUNT(o.o_orderkey) AS order_rows_count,
COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
orders o= l.l_orderkey
RIGHT JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY;
AND o.o_orderdate BETWEEN l.l_shipdate -- 2477, 60175
The output will include the rows from orders that match at least one row from the lineitem table with the specified join condition (same orderkey and orderdate within a 5-day window of the ship date) and all rows from the lineitem table.
We can also see that the number of rows from the orders table is 15,197 & from the lineitem table is 2,477.
1.4.4 4. Full outer join: Get matched and unmatched rows from both tables.
%%sql
SELECT
o.o_orderkey,
l.l_orderkey
FROM
orders o= l.l_orderkey
FULL OUTER JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY
AND o.o_orderdate BETWEEN l.l_shipdate
LIMIT10
%%sql
SELECT
COUNT(o.o_orderkey) AS order_rows_count,
COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
orders o= l.l_orderkey
FULL OUTER JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY AND l.l_shipdate + INTERVAL '5' DAY;
AND o.o_orderdate BETWEEN l.l_shipdate -- 15197, 60175
The output will include all rows from orders that match at least one row from the lineitem table with the specified join condition (same orderkey and orderdate within a 5-day window of the ship date) and all rows from the lineitem table.
We can also see that the number of rows from the orders table is 15,197 & from the lineitem table is 2,477.
1.4.5 5. Cross join: Join every row in the left table with every row in the right table
%%sql
SELECT
n.n_name AS nation_c_name,
r.r_name AS region_c_name
FROM
nation n; CROSS JOIN region r
The output will have every row of the nation joined with every row of the region. There are 25 nations and five regions, leading to 125 rows in our result from the cross-join.
There are cases where we need to join a table with itself, known as a SELF-join. Let’s consider an example.
- For every customer order, get the order placed earlier in the same week (Sunday - Saturday, not the previous seven days). Only show customer orders that have at least one such order.
%%sql
SELECTas o1_custkey,
o1.o_custkey as o1_totalprice,
o1.o_totalprice as o1_orderdate,
o1.o_orderdate as o2_totalprice,
o2.o_totalprice as o2_orderdate
o2.o_orderdate
FROM
orders o1= o2.o_custkey
JOIN orders o2 ON o1.o_custkey = year(o2.o_orderdate)
AND year(o1.o_orderdate) = weekofyear(o2.o_orderdate)
AND weekofyear(o1.o_orderdate)
WHERE!= o2.o_orderkey
o1.o_orderkey
LIMIT10;
1.5 Combine data from multiple rows into one using GROUP BY
Most analytical queries require calculating metrics that involve combining data from multiple rows. GROUP BY
allows us to perform aggregate calculations on data from a set of rows recognized by values of specified column(s).
Let’s look at an example question:
- Create a report that shows the number of orders per orderpriority segment.
%%sql
SELECT
o_orderpriority,*) AS num_orders
COUNT(
FROM
orders
GROUP BY; o_orderpriority
In the above query, we group the data by orderpriority
, and the calculation count(*)
will be applied to the rows having a specific orderpriority
value.
The calculations allowed are typically SUM/MIN/MAX/AVG/COUNT. However, some databases have more complex aggregate functions; check your DB documentation.
1.5.1 Use HAVING to filter based on the aggregates created by GROUP BY
If you want to filter based on the values of an aggregate function from a group by, use the having
clause. Note that the having
clause should come after the group by
clause.
%%sql
SELECT
o_orderpriority,*) AS num_orders
COUNT(
FROM
orders
GROUP BY
o_orderpriority
HAVING*) > 3; COUNT(
1.6 Replicate IF.ELSE logic with CASE statements
We can do conditional logic in the SELECT ... FROM
part of our query, as shown below.
%%sql
SELECT
o_orderkey,
o_totalprice,
CASE> 100000 THEN 'high'
WHEN o_totalprice 25000
WHEN o_totalprice BETWEEN 100000 THEN 'medium'
AND 'low'
ELSE
END AS order_price_bucket
FROM; orders
We can see how we display different values depending on the totalprice
column. We can also use multiple criteria as our conditional criteria (e.g., totalprice > 100000 AND orderpriority = ‘2-HIGH’).
1.7 Stack tables on top of each other with UNION and UNION ALL, subtract tables with EXCEPT
When we want to combine data from tables by stacking them on top of each other, we use the UNION or UNION ALL operator. UNION
removes duplicate rows, and UNION ALL
does not remove duplicate rows. Let’s look at an example.
%%sql
'%_91%' -- 25 rows SELECT c_custkey, c_name FROM customer WHERE c_name LIKE
%%sql
-- UNION will remove duplicate rows; the below query will produce 25 rows
'%_91%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE
UNION'%_91%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE
UNION'%_91'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE 10; LIMIT
%%sql
-- UNION ALL will not remove duplicate rows; the below query will produce 75 rows
'%_91%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE
UNION ALL'%_91%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE
UNION ALL'%_91%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE 10; LIMIT
When we want to retrieve all rows from the first dataset that are not present in the second dataset, we can use EXCEPT
.
%%sql
-- EXCEPT will get the rows in the first query result that is not in the second query result, 0 rows
'%_91%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE
EXCEPT'%_91%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE 10; LIMIT
%%sql
-- The below query will result in 23 rows; the first query has 25 rows, and the second has two rows
'%_91%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE
EXCEPT'%191%'
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE 10; LIMIT
1.8 Sub-query: Use a query instead of a table
When we want to use the result of a query as a table in another query, we use subqueries. Let’s consider an example:
- Create a report that shows the nation, how many items it supplied (by suppliers in that nation), and how many items it purchased (by customers in that nation).
%%sql
SELECT
n.n_name AS nation_c_name,
s.quantity AS supplied_items_quantity,
c.quantity AS purchased_items_quantity
FROM
nation n
LEFT JOIN (
SELECT
n.n_nationkey,
SUM(l.l_quantity) AS quantity
FROM
lineitem l= s.s_suppkey
JOIN supplier s ON l.l_suppkey = n.n_nationkey
JOIN nation n ON s.s_nationkey
GROUP BY
n.n_nationkey= s.n_nationkey
) s ON n.n_nationkey
LEFT JOIN (
SELECT
n.n_nationkey,
SUM(l.l_quantity) AS quantity
FROM
lineitem l= o.o_orderkey
JOIN orders o ON l.l_orderkey = c.c_custkey
JOIN customer c ON o.o_custkey = n.n_nationkey
JOIN nation n ON c.c_nationkey
GROUP BY
n.n_nationkey= c.n_nationkey; ) c ON n.n_nationkey
In the above query, we can see that there are two sub-queries, one to calculate the quantity supplied by a nation and the other to calculate the quantity purchased by the customers of a nation.
1.9 Change data types (CAST) and handle NULLS (COALESCE)
Every column in a table has a specific data type. The data types fall under one of the following categories.
Numerical
: Data types used to store numbers.- Integer: Positive and negative numbers. Different types of Integer, such as tinyint, int, and bigint, allow storage of different ranges of values. Integers cannot have decimal digits.
- Floating: These can have decimal digits but store an approximate value.
- Decimal: These can have decimal digits and store the exact value. The decimal type allows you to specify the scale and precision. Where scale denotes the count of numbers allowed as a whole & precision denotes the count of numbers allowed after the decimal point. E.g., DECIMAL(8,3) allows eight numbers in total, with three allowed after the decimal point.
Boolean
: Data types used to store True or False values.String
: Data types used to store alphanumeric characters.- Varchar(n): Data type allows storage of a variable character string, with a permitted max length n.
- Char(n): Data type allows storage of a fixed character string. A column of char(n) type adds (length(string) - n) empty spaces to a string that does not have n characters.
Date & time
: Data types used to store dates, time, & timestamps(date + time).Objects (STRUCT, ARRAY, MAP, JSON)
: Data types used to store JSON and ARRAY data.
Some databases have data types that are unique to them as well. We should check the database documents to understand the data types offered.
It is best practice to use the appropriate data type for your columns. We can convert data types using the CAST
function, as shown below.
A NULL
will be used for that field when a value is not present. In cases where we want to use the first non-NULL value from a list of columns, we use COALESCE
as shown below.
Let’s consider the following example. We can see how when l.orderkey
is NULL, the DB uses 999999
as the output.
%%sql
SELECT
o.o_orderkey,
o.o_orderdate,9999999) AS lineitem_orderkey,
COALESCE(l.l_orderkey,
l.l_shipdate
FROM
orders o= l.l_orderkey
LEFT JOIN lineitem l ON o.o_orderkey - INTERVAL '5' DAY
AND o.o_orderdate BETWEEN l.l_shipdate + INTERVAL '5' DAY
AND l.l_shipdate
LIMIT10;
1.10 Use these standard inbuilt DB functions for String, Time, and Numeric data manipulation
When processing data, more often than not, we will need to change values in columns; shown below are a few standard functions to be aware of:
String functions
- LENGTH is used to calculate the length of a string. E.g.,
SELECT LENGTH('hi');
will output 2. - CONCAT combines multiple string columns into one. E.g.,
SELECT CONCAT(clerk, '-', orderpriority) FROM ORDERS LIMIT 5;
will concatenate clerk and orderpriority columns with a dash in between them. - SPLIT is used to split a value into an array based on a given delimiter. E.g.,
SELECT SPLIT(clerk, '#') FROM ORDERS LIMIT 5;
will output a column with arrays formed by splitting clerk values on#
. - SUBSTRING is used to get a sub-string from a value, given the start and length. E.g.,
SELECT clerk, SUBSTRING(clerk, 1, 5) FROM orders LIMIT 5;
will get the first five characters of the clerk column. Note that indexing starts from 1 in Spark SQL. - TRIM is used to remove empty spaces to the left and right of the value. E.g.,
SELECT TRIM(' hi ');
will outputhi
without any spaces around it. LTRIM and RTRIM are similar but only remove spaces before and after the string, respectively.
- LENGTH is used to calculate the length of a string. E.g.,
Date and Time functions
Adding and subtracting dates: Is used to add and subtract periods; the format heavily depends on the DB. E.g., In Spark SQL, the query
SELECT DATE '2023-11-05', DATE '2022-10-01') AS diff_in_days, DATEDIFF(MONTHS_BETWEEN(DATE '2023-11-05', DATE '2022-10-01') AS diff_in_months, YEAR(DATE '2023-11-05') - YEAR(DATE '2022-10-01') AS diff_in_years;
It will show the difference between the two dates in the specified period. We can also add/subtract an arbitrary period from a date/time column. E.g.,
SELECT DATE_ADD(DATE '2022-11-05', 10);
will show the output2022-11-15
.string <=> date/time conversions: When we want to change the data type of a string to date/time, we can use the
DATE 'YYYY-MM-DD'
orTIMESTAMP 'YYYY-MM-DD HH:mm:SS'
functions. But when the data is in a different date/time format such asMM/DD/YYYY
, we will need to specify the input structure; we do this usingTO_DATE
orTO_TIMESTAMP
. E.g.SELECT TO_DATE('11-05-2023', 'MM-dd-yyyy');
. We can convert a timestamp/date into a string with the required format usingDATE_FORMAT
. E.g.,SELECT DATE_FORMAT(orderdate, 'yyyy-MM-01') AS first_month_date FROM orders LIMIT 5;
will map every orderdate to the first of their month.Time frame functions (YEAR/MONTH/DAY): When we want to extract specific periods from a date/time column, we can use these functions. E.g.,
SELECT YEAR(DATE '2023-11-05');
will return 2023. Similarly, we have MONTH, DAY, HOUR, MINUTE, etc.
Numeric
- ROUND is used to specify the number of digits allowed after the decimal point. E.g.
SELECT ROUND(100.102345, 2);
- ROUND is used to specify the number of digits allowed after the decimal point. E.g.
1.11 Save queries as views for more straightforward reads
When we have large/complex queries that we need to run often, we can save them as views. Views are database objects that operate similarly to tables. The OLAP DB executes the underlying query when we query a view.
Use views to hide query complexities and limit column access (by exposing only specific table columns) for end-users.
For example, we can create a view for the nation-level report from the above section, as shown below.
%%sql
DROP VIEW IF EXISTS nation_supplied_purchased_quantity
%%sql
CREATE VIEW nation_supplied_purchased_quantity AS
SELECT
n.n_name AS nation_name,
s.quantity AS supplied_items_quantity,
c.quantity AS purchased_items_quantity
FROM
nation n
LEFT JOIN (
SELECTas nationkey,
n_nationkey sum(l_quantity) AS quantity
FROM
lineitem l= s.s_suppkey
JOIN supplier s ON l.l_suppkey = n.n_nationkey
JOIN nation n ON s.s_nationkey
GROUP BY
n.n_nationkey= s.nationkey
) s ON n.n_nationkey
LEFT JOIN (
SELECTas nationkey,
n_nationkey sum(l_quantity) AS quantity
FROM
lineitem l= o.o_orderkey
JOIN orders o ON l.l_orderkey = c.c_custkey
JOIN customer c ON o.o_custkey = n.n_nationkey
JOIN nation n ON c.c_nationkey
GROUP BY
n.n_nationkey= c.nationkey; ) c ON n.n_nationkey
%%sql
SELECT*
FROM; nation_supplied_purchased_quantity
Now the view nation_supplied_purchased_quantity
will run the underlying query when used.
1.12 Exercises
- Write a query that shows the number of items returned for each region name
- List the top 10 most selling parts (part name)
- Sellers (name) who have sold at least one of the top 10 selling parts
- Number of items returned for each order price bucket. The definition of order price bucket is shown below.
CASE
WHEN o_totalprice > 100000 THEN 'high'
WHEN o_totalprice BETWEEN 25000 AND 100000 THEN 'medium'
ELSE 'low'
END AS order_price_bucket
- Average time (in days) between receiptdate and shipdate for each nation (name)
Here is the data model: