These notes are a refresher on SQL basics for readers who already have a basic understanding of SQL.
graph TD
id1[FROM + JOIN] --> WHERE --> id2[GROUP BY] --> HAVING --> SELECT --> id3[ORDER BY] --> LIMIT
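- Double quotes "" around text that contains an apostrophe work only in dialects that accept double-quoted strings (e.g. MySQL in its default mode); Postgres reserves double quotes for identifiers
- How to escape apostrophe (Postgres)
- Macy's → Macy''s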
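The escaping rule can be checked quickly. This is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Postgres (an assumption; both follow the standard rule of doubling the apostrophe inside a single-quoted literal).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Doubling the apostrophe escapes it inside a single-quoted SQL literal.
escaped = conn.execute("SELECT 'Macy''s'").fetchone()[0]

# With bound parameters the driver handles quoting, so no manual escaping is needed.
bound = conn.execute("SELECT ?", ("Macy's",)).fetchone()[0]

print(escaped, bound)
```

When the value comes from application code, the bound-parameter form is usually the safer choice.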
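Jan 1, 2016 - Dec 31, 2016 (on a timestamp column)
BETWEEN '2016-01-01' AND '2017-01-01'
-- a bare date is read as midnight, so an upper bound of '2016-12-31' would drop almost all of Dec 31
-- BETWEEN is inclusive, so this also matches exactly 2017-01-01 00:00:00; >= '2016-01-01' AND < '2017-01-01' avoids that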
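The boundary behaviour of BETWEEN on timestamps can be sketched as follows. This uses Python's built-in `sqlite3` as a stand-in for Postgres. The `orders` rows and the `ids` helper are made up for illustration. SQLite compares ISO timestamp strings as text, so the bounds are written out as full midnight timestamps (Postgres does that conversion implicitly).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, occurred_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2016-01-01 00:00:00"),
        (2, "2016-12-31 15:30:00"),  # afternoon of Dec 31
        (3, "2017-01-01 00:00:00"),  # exactly midnight, already 2017
    ],
)

def ids(where):
    """Hypothetical helper: return the ids matching a WHERE condition."""
    sql = f"SELECT id FROM orders WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

# Upper bound at midnight of Dec 31: the afternoon order is lost.
closed_dec31 = ids("occurred_at BETWEEN '2016-01-01 00:00:00' AND '2016-12-31 00:00:00'")

# Upper bound at midnight of Jan 1: all of 2016, plus the midnight 2017 row.
closed_jan01 = ids("occurred_at BETWEEN '2016-01-01 00:00:00' AND '2017-01-01 00:00:00'")

# Half-open range: exactly the 2016 rows.
half_open = ids("occurred_at >= '2016-01-01 00:00:00' AND occurred_at < '2017-01-01 00:00:00'")

print(closed_dec31, closed_jan01, half_open)
```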
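PK → Every table should have one; unique and non-null for each row
FK → References a PK in another table; values can repeat in the referencing table
1NF → Each record has its own row and each column holds a single (atomic) value → no keys needed yet
2NF → 1NF + every non-key column depends on the whole PK (no partial dependency) → PK only
3NF → 2NF + non-key columns depend only on the key, with links to other tables via FKs → PK/FK (aka no transitive dependency)
4NF → 3NF + no multi-valued dependencies, i.e. independent 1:n facts live in separate tables → PK/FK for every column
- many-to-many → cannot be modeled directly with a single FK column; resolve it with a junction (bridge) table
- Multiple author_id columns for each book in the Books table → X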
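The usual way around the many-to-many problem above is a junction table. A minimal sketch, run through Python's built-in `sqlite3`; the table names (`books`, `authors`, `book_authors`) and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books   (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    -- Junction table: one row per (book, author) pair, two FKs, composite PK.
    CREATE TABLE book_authors (
        book_id   INTEGER REFERENCES books(id),
        author_id INTEGER REFERENCES authors(id),
        PRIMARY KEY (book_id, author_id)
    );
    INSERT INTO books   VALUES (1, 'Book A'), (2, 'Book B');
    INSERT INTO authors VALUES (1, 'Author X'), (2, 'Author Y');
    INSERT INTO book_authors VALUES (1, 1), (1, 2), (2, 2);
""")

# All authors of one book.
authors_of_book_a = [row[0] for row in conn.execute("""
    SELECT a.name
    FROM book_authors ba
    JOIN authors a ON a.id = ba.author_id
    WHERE ba.book_id = 1
    ORDER BY a.name
""")]

# All books of one author, through the same junction table.
books_by_author_y = [row[0] for row in conn.execute("""
    SELECT b.title
    FROM book_authors ba
    JOIN books b ON b.id = ba.book_id
    WHERE ba.author_id = 2
    ORDER BY b.title
""")]

print(authors_of_book_a, books_by_author_y)
```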
6912 rows
-- includes NULLs for rows where sales_id <> 3250
SELECT ...
FROM orders o
LEFT JOIN accounts a ON o.account_id = a.id AND a.sales_id = 3250
134 rows
-- no NULLs because non-matching rows are filtered out after the JOIN
SELECT ...
FROM orders o
LEFT JOIN accounts a ON o.account_id = a.id
WHERE a.sales_id = 3250
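The difference between filtering in ON and filtering in WHERE can be reproduced on a tiny data set. A minimal sketch using Python's built-in `sqlite3`; the three orders and two accounts are made up, and only the shape of the two queries comes from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, sales_id INTEGER);
    CREATE TABLE orders   (id INTEGER, account_id INTEGER);
    INSERT INTO accounts VALUES (1, 3250), (2, 9999);
    INSERT INTO orders   VALUES (10, 1), (11, 2), (12, 1);
""")

# Filter inside ON: every order is kept; non-matching ones get NULL account columns.
on_rows = conn.execute("""
    SELECT o.id, a.id
    FROM orders o
    LEFT JOIN accounts a ON o.account_id = a.id AND a.sales_id = 3250
    ORDER BY o.id
""").fetchall()

# Filter in WHERE: applied after the join, so non-matching orders are removed.
where_rows = conn.execute("""
    SELECT o.id, a.id
    FROM orders o
    LEFT JOIN accounts a ON o.account_id = a.id
    WHERE a.sales_id = 3250
    ORDER BY o.id
""").fetchall()

print(on_rows)
print(where_rows)
```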
COUNT DISTINCT → very expensive
DISTINCT + aggregations → slow
- COUNT
    - Ignores null values during calculation (COUNT(*) still counts every row)
    - Can be applied on any data type
- SUM / AVG
    - Ignores null values during calculation
    - Works only with numeric columns
- MIN / MAX
    - Ignores null values during calculation
    - Works on non-numeric data
    - MIN/MAX (date, string, number)
        - MIN(date) → earliest date
        - MIN(string) → closest to 'a'
        - MAX(date) → latest date
        - MAX(string) → closest to 'z'
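A quick check of how the aggregates treat NULLs, strings and dates. This is a minimal sketch using Python's built-in `sqlite3`; the `items` table is hypothetical, and dates are stored as ISO text (an assumption that makes text order match chronological order in SQLite).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, placed_on TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("apple", "2016-03-01", 10),
        ("zebra", "2015-07-04", None),  # NULL qty
        (None, None, 30),               # NULL name and date
    ],
)

row = conn.execute("""
    SELECT COUNT(*),        -- counts every row
           COUNT(qty),      -- skips the NULL qty
           SUM(qty),
           AVG(qty),        -- averages over the non-NULL values only
           MIN(name), MAX(name),
           MIN(placed_on), MAX(placed_on)
    FROM items
""").fetchone()

print(row)
```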
Column-wise → use aggregation functions
Row-wise → use arithmetic operators e.g. + - * /
- GROUP BY dates → not very useful on raw timestamps (nearly every row ends up in its own group); truncate with DATE_TRUNC first
- Placed between WHERE and ORDER BY
- How to use LIMIT with GROUP BY
- GROUP BY first, LIMIT afterwards
- After you GROUP BY a column, if you end up with more unique values than the LIMIT allows, e.g. you get 15 unique rows but you are using LIMIT 10, then LIMIT cuts 5 rows; which 5 is arbitrary unless you add an ORDER BY
- Order of column names in GROUP BY
- The order of column names in the GROUP BY clause doesn’t matter — the results will be the same regardless
- To keep track of which column we are GROUP-ing and ORDER-ing BY with
- It reduces mistakes that can be made while typing a column name
- Saves time by not having to write entire aggregate functions in GROUP/ORDER BY
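The GROUP BY and LIMIT notes above can be sketched together: group first, then sort on the aggregate's alias so that LIMIT keeps a well-defined set of rows. This uses Python's built-in `sqlite3`; the `sales` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amt INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 5), ("south", 20), ("east", 1), ("west", 7)],
)

# Four groups come out of GROUP BY; ORDER BY on the alias decides which two LIMIT keeps.
top_two = conn.execute("""
    SELECT region, SUM(amt) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
    LIMIT 2
""").fetchall()

print(top_two)
```

Without the ORDER BY, the same query would still return two rows, but nothing would guarantee which two.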
One output row per group → GROUP BY required
SELECT total_amt_usd,
COUNT(1) AS total_count
FROM orders
GROUP BY 1
Append aggregated result to every row → GROUP BY not required
SELECT total_amt_usd,
(SELECT COUNT(1) FROM orders) AS total_count
FROM orders
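Running both queries on a small table shows the difference in output shape. A minimal sketch using Python's built-in `sqlite3`; the three `orders` rows are made up, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (total_amt_usd INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(100,), (100,), (250,)])

# With GROUP BY: rows collapse to one per distinct total_amt_usd, counted per group.
grouped = conn.execute("""
    SELECT total_amt_usd,
           COUNT(1) AS total_count
    FROM orders
    GROUP BY 1
    ORDER BY 1
""").fetchall()

# With a scalar subquery: every row is kept and the table-wide count is appended.
appended = conn.execute("""
    SELECT total_amt_usd,
           (SELECT COUNT(1) FROM orders) AS total_count
    FROM orders
    ORDER BY 1
""").fetchall()

print(grouped)
print(appended)
```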
- Find earliest occurrence
- Find most recent occurrence
Edge case
When finding the earliest or most recent event, you can use MIN() or MAX(). However, ties are an edge case. If multiple orders were placed in exactly the same second or even microsecond (which can easily happen on Amazon on Black Friday or Boxing Day, with millions of orders placed in a single day), a query that returns a single earliest or most recent row shows only one of them and omits the rest, which is incorrect. To find all such orders, use the RANK() OVER (ORDER BY ...) window function and keep all orders with rank = 1.
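The tie case can be sketched as follows, using Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer). The `orders` rows are made up. The rank is computed in a CTE because a window function cannot be filtered directly in WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, occurred_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2016-11-25 00:00:01"),
        (2, "2016-11-25 00:00:01"),  # tie with order 1
        (3, "2016-11-25 00:00:02"),
    ],
)

# Naive approach: returns a single row, so one of the tied orders is dropped.
limit_one = conn.execute(
    "SELECT id FROM orders ORDER BY occurred_at LIMIT 1"
).fetchall()

# RANK() gives every tied row rank 1, so filtering on rank keeps all of them.
earliest = [row[0] for row in conn.execute("""
    WITH ranked AS (
        SELECT id, RANK() OVER (ORDER BY occurred_at) AS rnk
        FROM orders
    )
    SELECT id FROM ranked WHERE rnk = 1 ORDER BY id
""")]

print(limit_one, earliest)
```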
- Cleaner way to filter aggregated data
- Alternative → subqueries
- The HAVING clause cannot reference SELECT aliases (Postgres), because HAVING is evaluated before SELECT
- You need to use the actual column names
- You need to write the entire aggregation operation that you may perform over the column
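HAVING and its subquery alternative can be compared side by side. A minimal sketch using Python's built-in `sqlite3`; the `orders` table is hypothetical. SQLite happens to accept aliases in HAVING, so the full aggregate is spelled out here to match what Postgres requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 60), (1, 70), (2, 30), (3, 200)],
)

# HAVING filters on the aggregate; the aggregate expression is repeated in full.
having_rows = conn.execute("""
    SELECT account_id, SUM(amt) AS total
    FROM orders
    GROUP BY account_id
    HAVING SUM(amt) > 100
    ORDER BY account_id
""").fetchall()

# Alternative: aggregate in a subquery, then filter the alias with a plain WHERE.
subquery_rows = conn.execute("""
    SELECT account_id, total
    FROM (
        SELECT account_id, SUM(amt) AS total
        FROM orders
        GROUP BY account_id
    ) AS t
    WHERE total > 100
    ORDER BY account_id
""").fetchall()

print(having_rows)
```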
- common → day, month, year
- all → microsecond, millisecond, second, minute, hour, day, week, month, quarter, year, decade, century, millennium
- 2015-01-06
- decade → 2010-01-01T…
- century → 2001-01-01T…
- millennium → 2001-01-01T…
- 1995-01-06
- century → 1901…
- millennium → 1001…
- all of DATE_TRUNC functions +
- dow, doy, epoch (seconds since 1970-01-01, can be -ve)
AGE, EXTRACT, OVERLAPS (boolean), CURRENT_DATE, CURRENT_TIMESTAMP (includes time zone; aka NOW())
CASE WHEN std_qty = 0 OR std_usd = 15 THEN 0
ELSE std_qty/100 END AS metric
-- else is optional
- Most often used inside SELECT (it is an expression, so it also works in WHERE, GROUP BY and ORDER BY)
- Can include multiple WHEN branches and one ELSE
- Can convert aggregations shown in rows to columns (almost like PIVOT)
- Lets each WHEN apply its own WHERE-like condition, whereas a query can have only 1 WHERE clause
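The pivot-like use of CASE can be sketched as follows, using Python's built-in `sqlite3`. The `sales` table, its `channel` values and the rows are hypothetical. Each CASE acts as its own condition inside an aggregate, so one pass over the table fills several columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, channel TEXT, amt INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "web", 10),
        ("north", "store", 5),
        ("north", "web", 3),
        ("south", "web", 7),
    ],
)

# One row per region, one column per channel (rows turned into columns).
pivot = conn.execute("""
    SELECT region,
           SUM(CASE WHEN channel = 'web'   THEN amt ELSE 0 END) AS web_total,
           SUM(CASE WHEN channel = 'store' THEN amt ELSE 0 END) AS store_total
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(pivot)
```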
graph TD
A[Subqueries]
A --> B[Placement]
A --> C[Type]
B --> |CTE| D[WITH]
B --> |correlated| E[nested]
B --> |FROM| F[inline]
B --> |value| G[scalar]
C --> H[simple]
C --> |WHERE| I[correlated]
- Nested & inline → often don't need an alias (Postgres before version 16 does require one for a subquery in FROM)
- Scalar
- multiple matches → error
- no match → NULL
- Correlated subqueries → dependent on outer query
- Manipulation / aggregation is done before JOIN
- Can be run independently
- Doesn’t stitch tables using a key (unlike JOIN)
- Query plans are similar to JOIN
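A correlated scalar subquery, including the "no match → NULL" case, can be sketched with Python's built-in `sqlite3`. The `accounts` and `orders` tables are hypothetical. One caveat: SQLite silently takes the first row when a scalar subquery returns several, whereas Postgres raises an error, so the LIMIT 1 here keeps the example portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, name TEXT);
    CREATE TABLE orders   (account_id INTEGER, occurred_at TEXT);
    INSERT INTO accounts VALUES (1, 'A'), (2, 'B');
    INSERT INTO orders   VALUES (1, '2016-01-05'), (1, '2016-03-01');
""")

# The inner query refers to a.id from the outer query, so it is re-evaluated per account.
latest = conn.execute("""
    SELECT a.name,
           (SELECT o.occurred_at
            FROM orders o
            WHERE o.account_id = a.id
            ORDER BY o.occurred_at DESC
            LIMIT 1) AS latest_order
    FROM accounts a
    ORDER BY a.name
""").fetchall()

print(latest)  # account B has no orders, so its value is NULL (None in Python)
```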
CREATE VIEW view_name
AS
SELECT *
FROM table1 t1
JOIN table2 t2 ON t2.id = t1.id
WHERE t1.col = value
- UPDATE may not be possible
- INSERT / DELETE rows of base table
- only if VIEW is created from a single base table (no JOIN)
- Alter VIEW definition
- Yes, can be done on Oracle or IBM DB2
CREATE OR REPLACE VIEW
CAST (column as type)
column::type
Data types → decimal, integer, char, varchar, double, double precision, boolean, timestamp, interval etc.
LEFT / RIGHT (string, number_of_chars)
SUBSTR (string, start, length)
POSITION (substring IN string) → returns the 1st occurrence
→ When a column holds a lot of text
STRPOS (string, substring)
→ When every column has raw text data
REPLACE (string, string_to_be_replaced, string_to_replace_with)
(str1, str2, str3,…)
- Returns the first non-null value in a list
- Can be used to combine columns where only 1 of them has data and the rest are NULL
- Usually indicator columns
- apartment, amenity, exterior → (2bhk, NULL, NULL) → 2bhk
- apartment, amenity, exterior → (NULL, Gym, NULL) → Gym
- apartment, amenity, exterior → (NULL, NULL, exterior) → exterior
- apartment, amenity, exterior → (NULL, Gym, exterior) → Gym
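The four COALESCE cases listed above can be run directly. A minimal sketch using Python's built-in `sqlite3`; the `listings` table name and the `id` column are assumptions added so the rows have a stable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER, apartment TEXT, amenity TEXT, exterior TEXT)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        (1, "2bhk", None, None),
        (2, None, "Gym", None),
        (3, None, None, "exterior"),
        (4, None, "Gym", "exterior"),  # two non-NULL values: the first one wins
    ],
)

# COALESCE scans its arguments left to right and returns the first non-NULL one.
combined = [row[0] for row in conn.execute("""
    SELECT COALESCE(apartment, amenity, exterior)
    FROM listings
    ORDER BY id
""")]

print(combined)
```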
- Use COALESCE
- DROP records
- Impute
- Often using median or mode
- Normalization → standardize or clean up
- Adjusting a varied distribution → z-score transformation
- Retains all records
- Included in SELECT
- Cannot be included in GROUP BY
- No ORDER BY ⇒ ORDER BY 0 ⇒ ORDER BY NULL
If you want to filter by RANK in a WHERE clause, you need to first RANK items in a subquery or using a CTE.
DENSE_RANK() → ranks are not skipped after ties (RANK() leaves a gap after tied rows)
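How ties affect the numbering is easiest to see with both ranking functions side by side. A minimal sketch using Python's built-in `sqlite3` (SQLite 3.25 or newer); the `scores` table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10), ("b", 10), ("c", 20)],
)

# RANK() leaves a gap after the tie (1, 1, 3); DENSE_RANK() does not (1, 1, 2).
ranks = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score) AS rnk,
           DENSE_RANK() OVER (ORDER BY score) AS dense_rnk
    FROM scores
    ORDER BY score, name
""").fetchall()

print(ranks)
```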
SELECT col1,
col2,
RANK() OVER monthly_window AS rank
FROM table
WHERE col1 = [value]
WINDOW monthly_window AS (PARTITION BY col2 ORDER BY col3)
ORDER BY col2
SUM(col) OVER (ORDER BY date)
AVG(col) OVER (ORDER BY date)
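Both expressions produce running values: with an ORDER BY and no explicit frame, the window covers everything from the first row up to the current one. A minimal sketch using Python's built-in `sqlite3` (SQLite 3.25 or newer); the `daily` table and its column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day TEXT, amt INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [("2016-01-01", 10), ("2016-01-02", 20), ("2016-01-03", 30)],
)

# Every input row is kept; each one carries the total and average so far.
running = conn.execute("""
    SELECT day,
           SUM(amt) OVER (ORDER BY day) AS running_total,
           AVG(amt) OVER (ORDER BY day) AS running_avg
    FROM daily
    ORDER BY day
""").fetchall()

print(running)
```

Rows that share the same ORDER BY value are peers under the default frame, so they all receive the same running value.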