My note for Complete SQL and Databases Bootcamp: Zero to Mastery course
Skills: SQL, PostgreSQL, Database, ER Diagram, RDBMS Normalization, Redis
- Relational (SQL)
- Document (MongoDB)
- Key Value (DynamoDB)
- Graph (Neo4j)
- Wide Columnar (e.g., Cassandra)
SELECT *
FROM USERS
- Imperative: how it will happen
  - go line by line of instructions to tell the program exactly what we want it to do
  - e.g., Java, Python
  - more flexible but more complicated
- Declarative: what will happen
  - more abstract; we just say "give me this"
  - simple but less flexible
  - e.g., SQL
  - Python can be both imperative and declarative
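The contrast can be sketched in Python, which supports both styles (the list of numbers is made up for illustration):

```python
# Imperative: describe HOW to compute the total, step by step.
numbers = [2, 4, 6]
total = 0
for n in numbers:      # iterate and accumulate manually
    total += n

# Declarative: state WHAT we want; the language decides how.
declarative_total = sum(numbers)

assert total == declarative_total == 12
```

SQL takes the declarative idea further: a `SELECT` states the desired result set, and the DBMS plans the execution.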
flowchart LR
User-->Computer
Computer-->SQL
SQL-->DBMS
DBMS-->Database
- SQL (Structured Query Language) is an abstraction layer over the DBMS (database management system) and the database
- Each DBMS has its own model
- A way to organize and store data
- e.g., Hierarchical, Networking, Entity-Relationship, Relational ** (most popular), Object Oriented, Flat, Semi-Structured etc.
- Old database model used by IBM in the 60s and 70s
- Not popular anymore due to inefficiencies
- tight coupling (child nodes depend on parent nodes)
- supports only one-to-many relationships
- Example in XML
<Author>
<Mo>
<Name>Mo Binni</Name>
<Country>Canada</Country>
<Book1>
<Released>01/01/1990</Released>
</Book1>
<Book2>
<Released>01/01/1993</Released>
</Book2>
</Mo>
<Andrei>
<Name>Andrei Neagoie</Name>
<Country>Canada</Country>
<Book1>
<Released>01/01/1990</Released>
</Book1>
<Book2>
<Released>01/01/1993</Released>
</Book2>
</Andrei>
</Author>
- expanded on the hierarchical model by allowing many-to-many relationships
- Example in XML
<Author>
<Mo>
<Name>Mo Binni</Name>
<Country>Canada</Country>
<Book1 author="Andrei" relation="co-author" />
<Book2>
<Released>01/01/1993</Released>
</Book2>
</Mo>
<Andrei>
<Name>Andrei Neagoie</Name>
<Country>Canada</Country>
<Book1>
<Released>01/01/1990</Released>
</Book1>
<Book2>
<Released>01/01/1993</Released>
</Book2>
</Andrei>
</Author>
flowchart LR
Author-->Book
- CRUD operations
- Manage data, Secure data, Transaction data
- e.g., Microsoft SQL Server, IBM, MySQL, Oracle, PostgreSQL
- Codd's 12 rules (https://www.w3resource.com/sql/sql-basic/codd-12-rule-relation.php)
- Relation Schema
- Attribute
- Degree
- Cardinality
- Tuple
- Column
- Relation Key
- Domain
- Tables
- Relation Instance
- Example
- Column / Attribute = one column
- Degree = number of columns
- Domain / Constraint = limitation on data type in a column
- dob can store datetime
- sex can store 1 char 'm' or 'f'
- Row / Tuple
- Cardinality = number of rows
- primary key : uniquely identify data
- foreign key : a primary key of a different table
- OLTP (Online Transaction Processing): supports day-to-day operations
- OLAP (Online Analytical Processing): supports data analysis
- DCL (Data Control Language) : GRANT, REVOKE
- DDL (Data Definition Language) : CREATE, ALTER, DROP, RENAME, TRUNCATE, COMMENT
- DQL (Data Query Language) : SELECT
- DML (Data Manipulation Language) : INSERT, UPDATE, DELETE, MERGE, CALL, EXPLAIN PLAN, LOCK TABLE
- Aggregate: operates on many records to produce one value
  - AVG(), COUNT(), MIN(), MAX(), SUM()
- Scalar: operates on each record independently
  - CONCAT()
- WHERE
  - AND, OR, NOT
  - comparison operators: >, <, <=, >=, =, !=
- Order of operations: FROM -> WHERE -> SELECT
- Operator precedence (priority of operators): if operators share the same priority, they are evaluated left to right
- Parentheses
- Multiplication / Division
- Subtraction / Addition
- NOT
- AND
- OR
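The precedence of `AND` over `OR` can be checked quickly. This is a sketch using Python's built-in `sqlite3` module; SQLite follows the same `AND`/`OR` precedence as PostgreSQL here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without parentheses, AND binds tighter than OR:
# a OR b AND c  parses as  a OR (b AND c)
row = conn.execute("SELECT 1=1 OR 1=0 AND 0=1").fetchone()
assert row[0] == 1   # TRUE: evaluated as TRUE OR (FALSE AND FALSE)

# Parentheses change the grouping:
row = conn.execute("SELECT (1=1 OR 1=0) AND 0=1").fetchone()
assert row[0] == 0   # FALSE: (TRUE OR FALSE) AND FALSE
```

When in doubt, add parentheses; they cost nothing and make the intent explicit.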
- NULL represents a missing/unknown value
- Whatever we do with NULL, the result is always NULL
  - 1 = 1 (true), 1 != 1 (false)
  - null = null (null), null <> null (null)
- to filter on null, use IS [NOT] NULL instead of = / !=
SELECT * FROM <table>
WHERE <field> IS [NOT] NULL
- replace null
SELECT COALESCE(<column>, 'Empty') AS column_alias
FROM <table>
SELECT COALESCE(<column1>, <column2>, <column3>, 'Empty') AS column_alias
FROM <table>
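`COALESCE` behaves the same way in SQLite, so it can be demonstrated with Python's built-in `sqlite3` module (the `users` table and its data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, nickname TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Mo", None), ("Andrei", "Andy")])

# COALESCE returns the first non-NULL argument, left to right.
rows = conn.execute(
    "SELECT COALESCE(nickname, name, 'Empty') FROM users"
).fetchall()
assert [r[0] for r in rows] == ["Mo", "Andy"]
```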
- Logical expressions in SQL can be TRUE, FALSE, or UNKNOWN
SELECT <column>
FROM <table>
WHERE <column> BETWEEN x AND y
-- equivalent to
-- WHERE <column> >= X and <column> <= Y
SELECT *
FROM <table>
WHERE <column> IN (value1, value2)
-- equivalent to
-- WHERE <column> = value1 OR <column> = value2
- partial lookup
SELECT first_name FROM employees
WHERE first_name LIKE 'M%'
- 'M%' : string starting with M
- '%' : any number of characters
- '_' : exactly 1 character
- Cast : converting a value from one type to another
  - must cast numbers to text to use them with LIKE
CAST(salary as text);
salary::text
- case insensitive match
name ILIKE 'MO%'
- matches strings starting with MO, Mo, mO, mo
SET TIME ZONE 'UTC'
SHOW TIMEZONE
ALTER USER <username> SET timezone='UTC'
- PostgreSQL uses ISO 8601 (format of date and time)
  - YYYY-MM-DDTHH:MM:SS
  - 2017-08-17T12:47:16+02:00
  - it is 12:47:16 o'clock in the +02:00 time zone
- a format is a way to represent date and time
- a timestamp is a date with time and timezone info
SELECT now()
- Get current date
SELECT now()::date;
SELECT CURRENT_DATE;
- Formatting date
SELECT TO_CHAR(CURRENT_DATE, 'dd/mm/yyyy');
- Date difference
SELECT now() - '1800/01/01';
- To date (cast string to date)
SELECT date '1800/01/01';
- Age
SELECT AGE(date '1800/01/01');
SELECT AGE(date '1992/11/13', date '1800/01/01');
- Extract
SELECT EXTRACT (DAY FROM date '1992/11/13') AS DAY;
SELECT EXTRACT (MONTH FROM date '1992/11/13') AS MONTH;
SELECT EXTRACT (YEAR FROM date '1992/11/13') AS YEAR;
- Rounding date:
year
,month
,week
,day
SELECT DATE_TRUNC ('year', date '1992/11/13');
- Interval
SELECT *
FROM orders
WHERE purchaseDate <= now() - INTERVAL '30 days'
SELECT EXTRACT (
year
FROM INTERVAL '5 years 20 months'
)
- DISTINCT applies to the combination of columns
SELECT DISTINCT <col1>, <col2> FROM <table>;
- ASC is default
SELECT * FROM customers
ORDER BY <column1> [ASC/DESC], <column2> [ASC/DESC]
SELECT * FROM customers
ORDER BY length(first_name) DESC
- inner join (using where)
SELECT a.emp_no,
CONCAT(a.first_name, a.last_name) as "name",
b.salary
FROM employees as a, salaries as b
WHERE a.emp_no = b.emp_no
ORDER BY a.emp_no
- inner join (using join)
SELECT a.emp_no,
CONCAT(a.first_name, a.last_name) as "name",
b.salary
FROM employees as a
[INNER] JOIN salaries as b ON b.emp_no = a.emp_no;
- self join
- happens when a table has a foreign key referencing its own primary key
id | name | startDate | supervisorId |
---|---|---|---|
1 | Mo Binni | 1990/01/13 | 2 |
2 | Andrei Neagoie | 1980/01/23 | 2 |
- self join using where
SELECT a.id, a.name as "employee", b.name as "supervisor name"
FROM employee as a, employee as b
WHERE a.supervisorId = b.id
- self join using inner join
SELECT a.id, a.name as "employee", b.name as "supervisor name"
FROM employee as a
INNER JOIN employee as b
ON a.supervisorId = b.id
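The self join above can be run end to end with Python's built-in `sqlite3` module, using the small `employee` table from the note (SQLite stands in for PostgreSQL here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    id INTEGER PRIMARY KEY, name TEXT, supervisorId INTEGER)""")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Mo Binni", 2), (2, "Andrei Neagoie", 2)])

# Join the table to itself: alias a = the employee, alias b = their supervisor.
rows = conn.execute("""
    SELECT a.name, b.name
    FROM employee AS a
    INNER JOIN employee AS b ON a.supervisorId = b.id
    ORDER BY a.id
""").fetchall()
assert rows == [("Mo Binni", "Andrei Neagoie"),
                ("Andrei Neagoie", "Andrei Neagoie")]
```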
- outer join
  - also get the rows that don't match
- left outer join
SELECT *
FROM <table A> AS a
LEFT [OUTER] JOIN <table b> AS b
ON a.id = b.id
- right outer join
SELECT *
FROM <table A> AS a
RIGHT [OUTER] JOIN <table b> AS b
ON a.id = b.id
- uncommon joins
- USING keyword
SELECT a.emp_no,
CONCAT(a.first_name, a.last_name) as "name",
b.salary
FROM employees as a
INNER JOIN salaries as b USING(emp_no)
-- `USING(emp_no)` is same as `ON b.emp_no = a.emp_no`
- every column not in the GROUP BY clause must be wrapped in an aggregate function
- GROUP BY uses the split-apply-combine strategy
SELECT dept_no, COUNT(emp_no)
FROM dept_emp
GROUP BY dept_no
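The split-apply-combine idea can be verified with Python's built-in `sqlite3` module (the `dept_emp` sample rows are made up; SQLite stands in for PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept_emp (dept_no TEXT, emp_no INTEGER)")
conn.executemany("INSERT INTO dept_emp VALUES (?, ?)",
                 [("d1", 10), ("d1", 11), ("d2", 12)])

# Split rows by dept_no, apply COUNT to each group, combine into one result.
rows = conn.execute("""
    SELECT dept_no, COUNT(emp_no)
    FROM dept_emp
    GROUP BY dept_no
    ORDER BY dept_no
""").fetchall()
assert rows == [("d1", 2), ("d2", 1)]
```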
- Order of operation
flowchart TB
FROM-->WHERE
WHERE-->groupby(GROUP BY)
groupby(GROUP BY)-->HAVING
HAVING-->SELECT
SELECT-->ORDER
- filter on groups (WHERE occurs before GROUP BY; thus, we need HAVING to filter on groups)
SELECT col1, COUNT(col2)
FROM <table>
WHERE col2 > X
GROUP BY col1
HAVING col1 = Y;
- ORDER BY for group
SELECT d.dept_name, COUNT(e.emp_no) AS "# of employees"
FROM employees AS e
INNER JOIN dept_emp AS de ON de.emp_no = e.emp_no
INNER JOIN departments AS d ON de.dept_no = d.dept_no
WHERE e.gender = 'F'
GROUP BY d.dept_name
-- HAVING count(e.emp_no) > 25000
ORDER BY "# of employees" DESC
- UNION
SELECT NULL AS "prod_id", SUM(ol.quantity)
FROM orderlines AS ol
UNION
SELECT prod_id AS "prod_id", sum(ol.quantity)
FROM orderlines AS ol
GROUP BY prod_id
ORDER BY prod_id DESC
LIMIT 5
- GROUPING SETS
SELECT prod_id AS "prod_id", sum(ol.quantity)
FROM orderlines AS ol
GROUP BY
GROUPING SETS (
(),
(prod_id)
)
ORDER BY prod_id DESC
LIMIT 5
- ROLLUP
- generates a grouping set for each prefix of the listed expressions, plus the grand total
SELECT EXTRACT (YEAR FROM orderdate) AS "year",
EXTRACT (MONTH FROM orderdate) AS "month",
EXTRACT (DAY FROM orderdate) AS "day",
SUM(ol.quantity)
FROM orderlines AS ol
GROUP BY
ROLLUP (
EXTRACT (YEAR FROM orderdate),
EXTRACT (MONTH FROM orderdate),
EXTRACT (DAY FROM orderdate)
)
-- apply `HAVING` to reduce number of output
HAVING (EXTRACT (YEAR FROM orderdate) = 2004 OR EXTRACT (YEAR FROM orderdate) IS NULL) AND
(EXTRACT (MONTH FROM orderdate) = 1 OR EXTRACT (MONTH FROM orderdate) IS NULL)
ORDER BY
EXTRACT (YEAR FROM orderdate),
EXTRACT (MONTH FROM orderdate),
EXTRACT (DAY FROM orderdate)
- Window functions create a new column based on functions performed on a subset or "window" of the data
window_function(arg1, arg2, ...) OVER (
[PARTITION BY partition_expression]
[ORDER BY sort_expression [ASC | DESC] [NULLS {FIRST | LAST}]]
)
- PARTITION BY: divide rows into groups to apply the function against (optional)
- ORDER BY: order the results
- ORDER BY also changes the default frame of the window function (making aggregates cumulative)
- When using a frame clause in a window function, we can create a sub-range or frame
Key | Meaning |
---|---|
ROWS or RANGE | Whether you want to use a range or rows as a frame |
PRECEDING | Rows before the current one |
FOLLOWING | Rows after the current one |
UNBOUNDED PRECEDING or FOLLOWING | Returns all rows before or after |
CURRENT ROW | Your current row |
PARTITION BY category ORDER BY price RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
- without ORDER BY, the default frame is usually all partition rows
Function | Purpose |
---|---|
SUM/MIN/MAX/AVG | Get the sum/min/max/avg of all records in the partition |
FIRST_VALUE | Return a value evaluated against the first row within its partition |
LAST_VALUE | Return a value evaluate against the last row within its partition |
NTH_VALUE | Return a value evaluated against the nth row in an ordered partition |
PERCENT_RANK | Return the relative rank of the current row (rank-1) / (total rows-1) |
RANK | Rank the current row within its partition with gaps |
ROW_NUMBER | Number the current row within its partition starting from 1 |
LAG/LEAD | Access values from the previous or next row |
- Return a value evaluated against the first row within its partition
SELECT
prod_id,
price,
category,
-- sort by price and get the first price. thus get the lowest price
FIRST_VALUE(price) OVER(
PARTITION BY category ORDER BY price RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
FROM products
- Cumulative sum: with ORDER BY, the frame runs from the partition start to the current row
SELECT
o.orderid,
o.customerid,
o.netamount,
SUM(o.netamount) OVER(
PARTITION BY o.customerid ORDER BY o.orderid
) as "cum sum"
FROM orders as o
ORDER BY o.customerid
- Number the current row within its partition starting from 1
SELECT
prod_id,
price,
category,
row_number() OVER(PARTITION BY category ORDER BY price) AS "position in category by price"
FROM products
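`ROW_NUMBER` can be tried directly with Python's built-in `sqlite3` module (SQLite supports window functions since 3.25; the `products` sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (prod_id INTEGER, category TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "Comedy", 9.99), (2, "Comedy", 4.99), (3, "Family", 7.99)])

# Number rows within each category, ordered by price.
rows = conn.execute("""
    SELECT prod_id,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY price) AS pos
    FROM products
    ORDER BY prod_id
""").fetchall()
# The cheapest product in each category gets position 1.
assert rows == [(1, 2), (2, 1), (3, 1)]
```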
- conditional select
SELECT a,
CASE
WHEN a=1 THEN 'one'
WHEN a=2 THEN 'two'
ELSE 'other'
END
FROM test;
- conditional filter
SELECT o.orderid,
o.customerid,
o.netamount
FROM orders AS o
WHERE CASE
WHEN o.customerid > 10
THEN o.netamount < 100
ELSE o.netamount > 100
END
ORDER BY o.customerid
- conditional aggregate function
SELECT
SUM(
CASE
WHEN o.netamount < 100
THEN -100
ELSE o.netamount
END
) as "returns",
SUM(o.netamount) as "normal total"
FROM orders AS o
NULLIF(val_1, val_2)
- returns NULL if val_1 = val_2
- Non-materialized views: query gets re-run each time the view is called on
- Materialized views: store the data physically and periodically updates it when tables change
- views are the output of a query
- views act like tables (we can query them)
- non-materialized views use very little space; they only store the definition of the view, not the data
- create view
CREATE VIEW view_name AS query
- replace view
CREATE OR REPLACE VIEW <view_name> AS query
- rename view
ALTER VIEW <view_name> RENAME TO <new_view_name>
- delete view
DROP VIEW [ IF EXISTS ] <view_name>
- current salary
-- CREATE VIEW last_salary_change AS
CREATE OR REPLACE VIEW last_salary_change AS
SELECT e.emp_no,
MAX(s.from_date)
FROM salaries AS s
JOIN employees AS e USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
GROUP BY e.emp_no
ORDER BY e.emp_no;
SELECT * FROM last_salary_change LIMIT 5;
-- use view from above block
SELECT s.emp_no, d.dept_name, s.from_date, s.salary FROM last_salary_change
JOIN salaries AS s USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
WHERE s.from_date = max
ORDER BY s.emp_no
- Index: is a construct to improve querying performance
- it's like a table of contents
- speed up queries
- slow down data insertion and updates
- Single-Column
- Multi-Column
- Unique
- Partial
- Implicit Indexes
- create index
CREATE UNIQUE INDEX <name>
ON <table> (COLUMN1, COLUMN2, ...);
- delete index
DROP INDEX <name>
- index foreign keys
- index primary keys and unique columns
- index columns that often end up in the ORDER BY / WHERE clauses
- Do not add an index just to add an index
- Do not use indexes on small tables
- Do not use on tables that are updated frequently
- Do not use on columns that can contain null values
- Do not use on columns that have large values
- Single-column: the most frequently used column in a query (in the WHERE clause) - retrieving data that satisfies one condition
- Multi-column: the most frequently used columns in a query (in the WHERE clause) - retrieving data that satisfies multiple conditions
- for speed and integrity
CREATE UNIQUE INDEX <name>
on <table> (column1)
- index over a subset of a table
CREATE INDEX <name>
on <table> (column1) WHERE <expression>
- include non-key columns when creating an index (covering index)
  - no need to go to the heap to get the value of column 2
  - if the index is too big, it will not fit in memory and scanning it will be slow
CREATE INDEX <name>
on <table> (<key-column1>) include (<non-key-column2>)
- automatically created by the database:
  - primary key
  - unique key
- create partial index
CREATE INDEX idx_countrycode
ON city (countrycode) WHERE countrycode IN ('TUN', 'BE', 'NL')
EXPLAIN ANALYSE
SELECT "name", district, countrycode FROM city
WHERE countrycode IN ('TUN', 'BE', 'NL')
- PostgreSQL provides several index algorithms
- B-Tree
- HASH
- GIN
- GIST
- Each index type uses a different algorithm
CREATE [UNIQUE] INDEX <name>
ON <table> USING <method> (column1, ...)
- B-Tree
  - default algorithm
  - best for comparisons with <, <=, =, >=, BETWEEN, IN, IS NULL, IS NOT NULL
- Hash
  - can only handle =
- GIN (Generalized Inverted Index)
  - best used when multiple values are stored in a single field
- GIST (Generalized Search Tree)
  - useful for indexing geometric data and full-text search
- subquery: a construct that allows you to build extremely complex queries
- also called: inner query, inner select
- subquery in WHERE clause
SELECT *
FROM <table>
WHERE <column> <condition> (
SELECT <column>
FROM <table>
[WHERE/ GROUP BY/ ORDER BY/ ...]
)
- subquery in SELECT clause
SELECT (
SELECT <column>
FROM <table>
[WHERE/ GROUP BY/ ORDER BY/ ...]
)
FROM <table> AS <name>
- subquery in FROM clause
SELECT *
FROM (
SELECT <column>, <column>, <column>, ...
FROM <table>
[WHERE/ GROUP BY/ ORDER BY/ ...]
) AS <name>
- subquery in HAVING clause
SELECT *
FROM <table> AS <name>
GROUP BY <column>
HAVING (
SELECT <column>
FROM <table>
[WHERE/ GROUP BY/ ORDER BY/ ...]
) > X
- Both subqueries and JOINs combine data from different tables
- subquery
SELECT title, price, (SELECT AVG(price) FROM products) AS "global average price"
FROM products
- Subqueries are queries that could stand alone
- Subqueries can return a single result or a row set
- Subquery results are immediately used
- join
SELECT prod_id, title, price, quan_in_stock
FROM products
JOIN inventory USING(prod_id)
- Join combine rows from one or more tables based on a match condition
- Join can only return a row set
- Joined tables can be used in the outer query
- If a join can express the query, use the join; it usually performs better
- a subquery must be enclosed in parentheses
- it must be placed on the right side of the comparison operator
- it cannot manipulate its results internally (ORDER BY is ignored)
- use single-row operators with single-row subqueries
- a subquery that returns null may not return results
- Single row: return zero or one row
SELECT name, salary
FROM salaries
WHERE salary = (SELECT AVG(salary) FROM salaries)
- Multiple row: return one or more rows
SELECT title, price, category
FROM products
WHERE category IN (
SELECT category FROM categories
WHERE categoryname IN ('Comedy', 'Family', 'Classics')
)
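A multi-row subquery with `IN` can be run with Python's built-in `sqlite3` module (the `categories`/`products` sample rows are made up; SQLite stands in for PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categories (category INTEGER, categoryname TEXT)")
conn.execute("CREATE TABLE products (title TEXT, category INTEGER)")
conn.executemany("INSERT INTO categories VALUES (?, ?)",
                 [(1, "Comedy"), (2, "Family"), (3, "Drama")])
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("A", 1), ("B", 3), ("C", 2)])

# The inner query returns a row set; IN checks membership against it.
rows = conn.execute("""
    SELECT title FROM products
    WHERE category IN (
        SELECT category FROM categories
        WHERE categoryname IN ('Comedy', 'Family')
    )
    ORDER BY title
""").fetchall()
assert [r[0] for r in rows] == ["A", "C"]
```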
- Multiple column
SELECT emp_no, salary, dea.avg AS "Department average salary"
FROM salaries AS s
JOIN dept_emp AS de USING(emp_no)
JOIN (
SELECT dept_no, AVG(salary) FROM salaries AS s2
JOIN dept_emp AS e USING(emp_no)
GROUP BY dept_no
) AS dea USING(dept_no)
WHERE salary > dea.avg
- Correlated: references one or more columns in the outer statement; runs once for each outer row
SELECT emp_no, salary, from_date
FROM salaries AS s
WHERE from_date = (
SELECT max(s2.from_date) AS max
FROM salaries AS s2
WHERE s2.emp_no = s.emp_no
)
ORDER BY emp_no
- Nested : subquery inside subquery
SELECT orderlineid, prod_id, quantity
FROM orderlines
JOIN (
SELECT prod_id
FROM products
WHERE category IN (
SELECT category FROM categories
WHERE categoryname IN ('Comedy', 'Family', 'Classics')
)
) AS limited USING (prod_id)
- EXISTS : checks if the subquery returns any rows
SELECT firstname, lastname, income
FROM customers AS c
WHERE EXISTS (
SELECT * FROM orders AS o
WHERE c.customerid = o.customerid AND totalamount > 400
) AND income > 90000
- IN : checks if the value is equal to any of the rows in the result (null yields null)
SELECT prod_id
FROM products
WHERE category IN (
SELECT category FROM categories
WHERE categoryname IN ('Comedy', 'Family', 'Classics')
)
- NOT IN : checks that the value is not equal to any of the rows in the result (null yields null)
SELECT prod_id
FROM products
WHERE category IN (
SELECT category FROM categories
WHERE categoryname NOT IN ('Comedy', 'Family', 'Classics')
)
- ANY / SOME : checks each row against the operator; if any comparison matches, returns true
SELECT prod_id
FROM products
WHERE category = ANY (
SELECT category FROM categories
WHERE categoryname IN ('Comedy', 'Family', 'Classics')
)
- ALL : checks each row against the operator; if all comparisons match, returns true
SELECT prod_id, title, sales
FROM products
JOIN inventory AS i USING(prod_id)
WHERE i.sales > ALL (
SELECT AVG(sales) FROM inventory
JOIN products AS p1 USING (prod_id)
GROUP BY p1.category
)
- Single Value Comparison : the subquery must return a single row; the comparator is checked against that row
SELECT prod_id
FROM products
WHERE category = (
SELECT category FROM categories
WHERE categoryname IN ('Comedy')
)
- Regular
- Template
- When you set up PostgreSQL, it creates 3 databases:
  - postgres
  - template0
  - template1
- connect to a database
psql -U <user> <database>
- default database name = user name
psql -U postgres
postgres=# \conninfo
- Template0
  - used to create template1
  - never change it
  - backup template
- Template1
  - used to create new databases
CREATE DATABASE name
[ [WITH] [ OWNER [=] user_name ]
[ TEMPLATE [=] template ]
[ ENCODING [=] encoding ]
[ LC_COLLATE [=] lc_collate ]
[ LC_CTYPE [=] lc_ctype ]
[ TABLESPACE [=] tablespace ]
[ CONNECTION LIMIT [=] connlimit ]]
Setting | Default |
---|---|
TEMPLATE | template1 |
ENCODING | UTF8 |
CONNECTION_LIMIT | 100 |
OWNER | Current user |
- create database
CREATE DATABASE <db_name>
- delete database
DROP DATABASE <db_name>
- databases contain many tables, views, etc.
- we may want to organize them in a logical way
- Postgres Schemas
  - like a box to organize tables, views, indexes, etc.
  - the public schema is the default
-- when no schema is specified, the default is public
SELECT * FROM employees
-- is the same as
SELECT * FROM public.employees
- list all schemas
postgres=# \dn
- create schema
CREATE SCHEMA sales;
- allows many users to use one database without interfering (e.g., same table name in different schemas)
- organizes database objects into logical groups to make them more manageable
- 3rd-party applications can be put into separate schemas so they do not collide with the names of other objects
- Creating databases is a restricted action; not everyone is allowed to do it.
- permission management
- Roles: have attributes and privileges
- createdb / nocreatedb
- superuser / nosuperuser
- createrole / nocreaterole
- login / nologin
- password
- creating a role
CREATE ROLE readonly WITH LOGIN ENCRYPTED PASSWORD 'readonly'
- by default, only the creator of the database or a superuser has access to the database objects
- creating a user
CREATE USER user1 WITH ENCRYPTED PASSWORD 'user1'
- Granting privileges
GRANT ALL PRIVILEGES ON <table> TO <user>
GRANT ALL ON ALL TABLES [IN SCHEMA <schema>] TO <user>
GRANT [SELECT, UPDATE, INSERT, ...] ON <table> [IN SCHEMA <schema>] TO <user>
REVOKE [SELECT, UPDATE, INSERT, ...] ON <table> FROM <user>
REVOKE ALL ON ALL TABLES [IN SCHEMA <schema>] FROM <user>
- Principle of least privilege
- Types: Numeric, Arrays, Character, Date/Time, Boolean, UUID, etc.
- a data type is a constraint on the data that can be stored
- TRUE, FALSE, NULL
- Smart conversion:
  - TRUE: 1, yes, y, t, true
  - FALSE: 0, no, n, f, false
- CHAR(N), VARCHAR(N), TEXT
- CHAR(10) : fixed length with space padding
  - e.g., mo········
- VARCHAR(10) : variable length with no padding
- TEXT : unlimited length of text
- Integer:
- Smallint: -32,768 to 32,767
- Int: -2,147,483,648 to 2,147,483,647
- Bigint: -9.2e18 to 9.2e18
- Floating point
- Float4: Single precision (6 digit precision)
- Float8: Double precision (15 digit precision)
- Decimal/Numeric: 131072 digits before decimal point and 16383 digits after decimal point
- Arrays: a group of elements of the same type
CREATE TABLE test_text (
four char(2)[],
eight text[],
big float4[]
);
INSERT INTO test_text VALUES (
ARRAY ['mo', 'mo', 'm', 'd'],
ARRAY ['test', 'long text', 'longer text'],
ARRAY [1.23, 2.11, 3.23, 5.321468864]
);
- Table names must be singular!
- Column names: snake_case, or mixed case such as student_ID
CREATE TABLE <name> (
<col1> TYPE [CONSTRAINT],
table_constraint [CONSTRAINT]
) [INHERITS <existing_table>];
- Temporary tables
- They are a type of table that exist in a special schema, so you cannot define a schema name when declaring a temporary table.
- Reasons to use temporary tables:
- Temporary tables behave just like normal ones
- Postgres will apply less “rules” (logging, transaction locking, etc.) to temporary tables so they execute more quickly
- You get full access rights to the data even if you otherwise wouldn't, so you can test things out.
CREATE TEMPORARY TABLE <name> (<col1>);
Constraint | Meaning |
---|---|
NOT NULL | cannot be null |
PRIMARY KEY | column will be the primary key |
UNIQUE | can only contain unique values(NULL is Unique) |
CHECK | apply a special condition check against the values in the column |
REFERENCES | constrain the values of the column to only be values that exist in the column of another table (Foreign Key) |
Constraint | Meaning |
---|---|
UNIQUE (column_list) | can only contain unique value (NULL is Unique) |
PRIMARY KEY (column_list) | columns that will be the primary key |
CHECK (condition) | a condition to check when inserting or updating |
REFERENCES | Foreign key relationship to column |
- Table constraint is defined at the bottom
- Every column constraint can be written as a table constraint
- BEST PRACTICE: if constraint related to one column, write it as column constraint. if the constraint related to multiple columns, write it as table constraint.
CREATE TABLE student (
student_id UUID DEFAULT uuid_generate_v4(),
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL,
date_of_birth DATE NOT NULL,
CONSTRAINT pk_student_id PRIMARY KEY (student_id)
);
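Constraint enforcement can be demonstrated with Python's built-in `sqlite3` module (SQLite enforces the same NOT NULL, UNIQUE, and CHECK column constraints, though its type system differs from PostgreSQL; the table is a made-up example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE,
        rating INTEGER CHECK (rating > 0 AND rating <= 5)
    )
""")
conn.execute("INSERT INTO student VALUES (1, 'mo@binni.io', 5)")

# Violating a constraint raises an error instead of storing bad data.
try:
    conn.execute("INSERT INTO student VALUES (2, 'x@y.io', 9)")  # fails CHECK
    failed = False
except sqlite3.IntegrityError:
    failed = True
assert failed
```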
- install extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
- get all extension
SELECT * FROM pg_available_extensions;
- UUID (Universally Unique Identifier) generates unique identifiers for primary keys
- Pros
  - unique everywhere
  - easier to shard
  - easier to merge/replicate
  - exposes less information about your system
- Cons
  - larger values to store
  - can have a performance impact
  - more difficult to debug
CREATE DOMAIN Rating SMALLINT
CHECK (VALUE > 0 AND VALUE <= 5);
CREATE TYPE Feedback AS (
student_id UUID,
rating Rating,
feedback TEXT
);
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
ADD COLUMN <col> <type> <constraint>;
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
ALTER COLUMN <name> TYPE <new type> [USING <expression>];
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
RENAME COLUMN <old name> TO <new name>
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
DROP COLUMN <col> [ RESTRICT | CASCADE ]
INSERT INTO student(
first_name,
last_name,
email,
date_of_birth
) VALUES (
'Mo',
'Binni',
'mo@binni.io',
'1992-11-13'::DATE
);
- What need to be backed up
- Full Backup : Backup all the data : less often
- Incremental : Backup since last incremental : often
- Differential : Backup since last full backup : often
- Transaction Log : Backup of the database transactions (real-time snapshot) : most frequent
- Appropriate way to back up (OS, HDD, or only the database)
- How frequently?
- Where to store backups?
- Retention policy (how long to store?)
- Create dump
- Load dump
- Transaction : a unit of instructions
- Transactions keep things consistent
flowchart LR
BEGIN-->ACTIVE
ACTIVE-->id1(PARTIALLY COMMITTED)
id1(PARTIALLY COMMITTED)-->FAILED
ACTIVE-->FAILED
id1(PARTIALLY COMMITTED)-->COMMITTED
COMMITTED-->END
FAILED-->ABORTED
ABORTED-->END
BEGIN;
DELETE FROM employees WHERE emp_no BETWEEN 10000 AND 10005; -- partially commit
SELECT * FROM employees;
ROLLBACK; -- not commit (ABORTED)
BEGIN; -- locking databases
DELETE FROM employees WHERE emp_no BETWEEN 10000 AND 10005; -- partially commit
SELECT * FROM employees;
END; -- commit (COMMITTED)
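The commit/rollback behavior can be reproduced with Python's built-in `sqlite3` module (the `employees` rows are made up; SQLite stands in for PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?)", [(10001,), (10002,)])
conn.commit()

# Delete inside a transaction, then roll back: the row comes back.
conn.execute("DELETE FROM employees WHERE emp_no = 10001")
assert conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0] == 1
conn.rollback()  # like ROLLBACK; -- the delete is undone
assert conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0] == 2
```

`conn.commit()` plays the role of `END;`/`COMMIT;` here; until it runs, other connections do not see the change.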
- To maintain the integrity of a database, all transactions must obey the ACID properties
- Atomicity: either execute entirely or not at all
- Consistency: each transaction should leave the database in a consistent state (COMMIT or ROLLBACK)
- Isolation: transaction should be executed in isolation from other transactions
- Durability: after completion of a transaction, the changes in the database should persist
flowchart TB
p1-->p2("phase 2:\nSystem Analyse")
p2-->p3("phase 3:\nSystem Design")
p3-->p4("phase 4:\nSystem Implementation and Operation")
p4-->p1("phase 1:\nSystem Planning and Selection")
- The goal is robust systems!
- Process implementations: Agile, Waterfall, V-Model, ...
- SDLC Phase 1 : getting information on what needs to be done (scope)
- SDLC Phase 2 : taking the requirements and analyzing whether it can be done on time and on budget
- SDLC Phase 3 : designing the system architecture for all related components: databases, apps, etc.
- SDLC Phase 4 : building the software
- There are more phases: Testing, Maintenance
- Phases 1/2 relate more to business stakeholders and architects at a higher level
- Phases 3/4 are closer to implementation design and software programming (more related to the individual software engineer)
- System design is all about creating structure that can be understood and communicated
- Top-Down
  - start from 0
  - all requirements are gathered up-front
  - optimal choice when creating a new database
- Bottom-Up
  - there is an existing system or specific data in place
  - want to shape a new system around the existing data
  - optimal choice when migrating an existing database
- often we will use a bit of both top-down and bottom-up
- Requirements
  - DriveMe is a driving school where people across the USA can take lessons.
  - Every school has instructors on payroll and an inventory of cars, trucks, and motorcycles for teaching.
  - Goal: become a household name across the USA for learning how to drive.
  - Currently DriveMe has an outdated website and their customer acquisition is mostly word of mouth.
  - They want to start gaining market share through an online presence.
Core Requirements
- There is a vehicle inventory for students to rent
- There are employees at every branch
- There is maintenance for the vehicles
- There is optional exam at the end of your lessons
- You can only take the exam twice; if you fail twice, you must take more lessons.
- Goal: to create a data model based on requirements
- Requirements:
  - high-level requirements
  - user interviews
  - data collection
  - deep understanding
- Method: ER Model
flowchart LR
p3("phase 3:\nSystem Design")-->c1{"How to design?"}
c1-->|Top-Down|c2("ER Modeling")
c1-->|Bottom-Up|c3("???")
- What is an entity?
  - a person, place, or thing
  - has a singular name
  - has an identifier
  - should contain more than one instance of data
- DriveMe Entities: Student, School, Vehicle, Instructor, Maintenance, Exam, Lesson
erDiagram
School ||--|| Instructor : has
School ||--|| Student : has
Instructor ||--|| Lesson : teaches
Student ||--|| Lesson : takes
Student ||--|| Exam : ""
Lesson ||--|| Vehicle : ""
Vehicle ||--|| Maintenance : ""
- Give entities the information they will store
- Must be property of the entity
- Must be atomic (smallest amount of data) e.g., address is not atomic it hold house number, street name, country etc.
- Single/Multivalued (Phone Number)
- Keys
- Relation Schema : header of table
- Relation Instance : all of the rows of the table
- Relation Key : uniquely identifies the row and the relationship
  - Super Key : a combination of attributes that can uniquely identify rows, e.g., id & firstName
  - Candidate Key : the minimal set of attributes that can uniquely identify rows (a candidate key is a subset of a super key)
  - Primary Key : the one selected candidate key
  - Foreign Key : a primary key of a different table
  - Compound Key : a super key that includes a foreign key
  - Composite Key : a super key that does not include a foreign key
  - Surrogate Key : a primary key that is not derived from the data itself (a synthetic, generated key)
  - Alternate Key : a secondary candidate key that has all the properties of a candidate key but is the alternate option
- DriveMe Attributes
erDiagram
%% Entity
School {
attr school_id
attr street_name
attr street_number
attr postal_code
attr state
attr city
}
Instructor {
attr teacher_id
attr first_name
attr last_name
attr date_of_birth
attr hiring_date
attr school_id
}
Student {
attr student_id
attr first_name
attr last_name
attr date_of_birth
attr enrollment_date
attr school_id
}
Exam{
attr student_id
attr teacher_id
attr date_taken
attr passed
attr lesson_id
}
Lesson{
attr lesson_id
attr date_of_enrollment
attr package
attr student_id
}
%% Relationship
School ||--|| Instructor : has
School ||--|| Student : has
Instructor ||--|| Lesson : teaches
Student ||--|| Lesson : takes
Student ||--|| Exam : ""
Lesson ||--|| Vehicle : ""
Vehicle ||--|| Maintenance : ""
- Determine the relationship between entities
- Links 2 entities together:
- 1 to 1
- 1 to many
- many to many
erDiagram
Entity |o--o| Zero-or-One : ""
Entity ||--|| Exactly-One : ""
Entity }o--o{ Zero-or-More : ""
Entity }|--|{ One-or-More : ""
- format
- first line: upper bound
- second line: lower bound
<left-entity> <first-line><second-line>--<second-line><first-line> <right-entity>
- DriveMe Relationship
erDiagram
%% Entity
School {
attr school_id
attr street_name
attr street_number
attr postal_code
attr state
attr city
}
Instructor {
attr teacher_id
attr first_name
attr last_name
attr date_of_birth
attr hiring_date
attr school_id
}
Student {
attr student_id
attr first_name
attr last_name
attr date_of_birth
attr enrollment_date
attr school_id
}
Exam{
attr student_id
attr teacher_id
attr date_taken
attr passed
attr lesson_id
}
Lesson{
attr lesson_id
attr date_of_enrollment
attr package
attr student_id
}
%% Relationship
School ||--|{ Instructor : has
School ||--|{ Student : has
Instructor ||--|{ Exam : ""
Instructor ||--|{ Lesson : teaches
Student ||--|{ Lesson : takes
Student ||--|{ Exam : ""
Lesson ||--|{ Exam : ""
Lesson ||--|| Vehicle : ""
Vehicle ||--o{ Maintenance : ""
- In the relational model, it is practically impossible to store a many-to-many relationship directly
- technically possible, but it leads to more overhead: insert overhead, update overhead, delete overhead, and potential redundancy
- Rule of Thumb: always try to resolve many-to-many relationships
erDiagram
Book }|--|{ Author : ""
- Add intermediate entities (intermediate table)
erDiagram
Book ||--|{ Book_Author : ""
Book_Author }|--|| Author : ""
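The intermediate-table resolution above can be sketched with a hypothetical schema (run here through Python's sqlite3; table and column names are illustrative). The junction table holds one row per book/author pair:

```python
import sqlite3

# Hypothetical schema resolving the Book }|--|{ Author many-to-many
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book   (book_id   INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE book_author (              -- intermediate (junction) table
    book_id   INTEGER REFERENCES book(book_id),
    author_id INTEGER REFERENCES author(author_id),
    PRIMARY KEY (book_id, author_id)    -- one row per book/author pair
);
""")
conn.execute("INSERT INTO book VALUES (1, 'SQL Basics')")
conn.execute("INSERT INTO author VALUES (1, 'Mo Binni'), (2, 'Andrei Neagoie')")
conn.execute("INSERT INTO book_author VALUES (1, 1), (1, 2)")  # co-authored

# All authors of book 1, reached through the junction table
rows = conn.execute("""
    SELECT a.name
    FROM author a
    JOIN book_author ba ON ba.author_id = a.author_id
    WHERE ba.book_id = 1
    ORDER BY a.name
""").fetchall()
print(rows)  # [('Andrei Neagoie',), ('Mo Binni',)]
```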
- Divide entities into logical groups that are related (think schemas)
- This step is needed for distributed design at a global level
- Subject Area Rules:
- All entities must belong to one subject area
- An entity can only belong to one
- You can nest subject areas
- DriveMe Subject Area
- A rich businessman has tons of paintings.
- He wants to build a system to catalog and track where his art is.
- He lends it to museums all across the world.
- He wants to see reservations.
- some constraints:
- a painting can only have one artist
- ask about the system:
- goal?
- track painting reservations for a wealthy man
- stakeholders?
- owner, museums
- step 1: entities
- painting
- reservation
- museum
- artist
- step 2: attributes

Entities | Attributes |
---|---|
Painting | name, creation_date, style |
Reservation | creation_date, date_from, date_to, accepted |
Artist | name, birth_date, email |
Museum | name, address, phone_number, email |

- step 3: relationships
erDiagram
Painting }o--o{ Reservation : ""
Painting }o--|| Artist : ""
Reservation }o--|| Museum : ""
- step 4: solving many to many
erDiagram
Painting ||--o{ Reservation_Detail : ""
Reservation_Detail }o--|| Reservation : ""
Painting }o--|| Artist : ""
Reservation }o--|| Museum : ""
erDiagram
Movie }o--o{ Auditorium : ""
Auditorium }o--|| Theater : ""
- fix many to many
erDiagram
Movie ||--o{ Showing : ""
Showing }o--|| Auditorium : ""
Auditorium }o--|| Theater : ""
- create a data model from specific details, existing systems, legacy systems
- identify the data (attributes)
- group them (entities)
- create a perfect data model without redundancy and anomalies
- caused by an incorrectly structured database
- 3 types:
- update anomalies
- insert anomalies
- delete anomalies
- update anomalies
- ensure that changes apply to all related data. From this table, if the Toronto branch changes its address, we need to update the same value on many rows.
- insert anomalies
- ensure that inserted data is consistent. From this table, if someone inserts customer id 5 with the wrong branch address, it will cause inconsistency.
- delete anomalies
- ensure that we do not lose important data. From this table, if we delete customer id 3, we will lose the data of the Scarborough branch.
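The update and delete anomalies can be seen on a small denormalized table (invented sample data, following the branch example above):

```python
# Denormalized table: the branch address is repeated on every customer row
rows = [
    {"customer_id": 1, "branch": "Toronto",     "branch_address": "1 King St"},
    {"customer_id": 2, "branch": "Toronto",     "branch_address": "1 King St"},
    {"customer_id": 3, "branch": "Scarborough", "branch_address": "9 Main St"},
]

# Update anomaly: changing Toronto's address means updating many rows;
# forgetting one leaves the data inconsistent
rows[0]["branch_address"] = "22 Queen St"   # only one row updated
addresses = {r["branch_address"] for r in rows if r["branch"] == "Toronto"}
print(len(addresses))  # 2 -> two different addresses for the same branch

# Delete anomaly: deleting customer 3 loses all knowledge of Scarborough
rows = [r for r in rows if r["customer_id"] != 3]
print(any(r["branch"] == "Scarborough" for r in rows))  # False
```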
flowchart LR
p3("phase 3:\nSystem Design")-->c1{"How to design?"}
c1-->|Top-Down|c2("ER Modeling")
c1-->|Bottom-Up|c3("Normalization")
- functional dependencies
- normal forms
- a functional dependency shows a relationship between attributes
- a functional dependency exists when a relationship between two attributes allows you to uniquely determine the corresponding attribute's value
- B -> A : A is functionally dependent on B when a value of B determines a value of A
- determinant -> dependent
- branch_id -> branch_address
- student_id -> birth_date
- employee_id -> first_name
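A functional dependency can be checked mechanically against sample rows: `X -> Y` holds if each value of X maps to exactly one value of Y. A small sketch (the helper and sample data are invented, not from the course):

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent holds in the given rows,
    i.e. each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        d = row[determinant]
        if d in seen and seen[d] != row[dependent]:
            return False            # same determinant, different dependent
        seen[d] = row[dependent]
    return True

rows = [
    {"branch_id": 1, "branch_address": "1 King St", "customer": "Ann"},
    {"branch_id": 1, "branch_address": "1 King St", "customer": "Bob"},
    {"branch_id": 2, "branch_address": "9 Main St", "customer": "Cy"},
]
print(holds(rows, "branch_id", "branch_address"))  # True
print(holds(rows, "branch_id", "customer"))        # False
```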
- Normalization happens through a process of running attributes through the normal forms
- 0NF -> 1NF -> 2NF -> 3NF -> BCNF -> 4NF -> 5NF -> 6NF
- each normal form aims to further separate relationships into smaller instances so as to create less redundancy and fewer anomalies!
- BCNF (Boyce-Codd Normal Form) is also called 3.5NF
- 0NF to BCNF are the most common normal forms
- 4NF to 6NF are considered too extreme
- data that is unnormalized has:
- repeating groups of fields
- positional dependence of data
- non-atomic data
- eliminate repeating columns of the same data
- each attribute should contain a single value
- determine a primary key
- example 0NF

color | quantity | price |
---|---|---|
red, green, blue | 20 | 9.99 |
yellow, orange, purple | 10 | 10.99 |
blue, cyan | 15 | 3.99 |
green, magenta | 200 | 15.99 |

- normalization to 1NF

0NF | 1NF |
---|---|
color | table: PRODUCT |
quantity | prod_id <PK> |
price | quantity |
 | price |
 | table: PRODUCT_COLOUR |
 | prod_id <FK> |
 | color |
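The 0NF -> 1NF split above can be sketched in SQL (run here through Python's sqlite3; the table and column names follow the example, the data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    prod_id  INTEGER PRIMARY KEY,
    quantity INTEGER,
    price    REAL
);
CREATE TABLE product_colour (
    prod_id INTEGER REFERENCES product(prod_id),
    colour  TEXT                 -- one atomic value per row
);
""")

# 0NF rows: the colour cell holds a comma-separated list (non-atomic)
unnormalized = [
    (1, "red, green, blue", 20, 9.99),
    (2, "yellow, orange, purple", 10, 10.99),
]
for prod_id, colour_list, qty, price in unnormalized:
    conn.execute("INSERT INTO product VALUES (?, ?, ?)", (prod_id, qty, price))
    for colour in colour_list.split(", "):   # split into atomic values
        conn.execute("INSERT INTO product_colour VALUES (?, ?)", (prod_id, colour))

colours = conn.execute(
    "SELECT colour FROM product_colour WHERE prod_id = 1 ORDER BY colour"
).fetchall()
print(colours)  # [('blue',), ('green',), ('red',)]
```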
- data needs to come from 1NF
- all non-key attributes are fully functionally dependent on the primary key
- example 0NF

Book | Author1 | Author2 | Author3 |
---|---|---|---|
1 | 1 | 2 | 3 |
2 | 2 | 2 | 3 |
3 | 3 | 2 | 1 |

- normalization to 2NF
0NF | 1NF | 2NF |
---|---|---|
book | table: BOOK | table: BOOK |
author | book_id <PK> | book_id <PK> |
title | title | title |
 | table: BOOK_AUTHOR | table: BOOK_AUTHOR |
 | book_id <FK> | book_id <FK> |
 | author_id | author_id <FK> |
 | author_name | |
 | author_address | table: AUTHOR |
 | author_email | author_id <PK> |
 | | author_name |
 | | author_address |
 | | author_email |
- data needs to come from 2NF
- no transitive dependencies
- Transitive Dependency: A is functionally dependent on B, and B is functionally dependent on C; A is then transitively dependent on C via B. Given B -> A and C -> B, thus C ~> A
- normalization to 3NF

0NF | 1NF | 2NF | 3NF |
---|---|---|---|
branch | table: EMPLOYEE | table: EMPLOYEE | table: EMPLOYEE |
first name | first_name | emp_no <PK> | emp_no <PK> |
last name | last_name | first_name | first_name |
title | title | last_name | last_name |
hours | emp_no | title | title |
 | table: BRANCH | table: BRANCH | table: BRANCH |
 | street | branch_no <PK> | branch_no <PK> |
 | street_no | street | street |
 | province | street_no | street_no |
 | postal_code | province | province_id <FK> |
 | branch_no | postal_code | postal_code |
 | emp_no | country | |
 | hours_logged | | table: TIMESHEET |
 | country | table: TIMESHEET | branch_no <FK> |
 | | branch_no <FK> | emp_no <FK> |
 | | emp_no <FK> | hours_logged |
 | | hours_logged | |
 | | | table: PROVINCE |
 | | | province_id <PK> |
 | | | province |
 | | | country |
- thought process on table BRANCH, from 2NF to 3NF:
- branch_no -> province
- province -> country
- branch_no ~> country
- data needs to come from 3NF
- for any dependency A -> B, A should be a super key
- most relations in 3NF are also in BCNF, but not all of them!
- 3NF allows attributes to be part of a candidate key that is not the primary key; BCNF does not
- A relation is not in BCNF if:
- the primary key is a composite key
- there is more than one candidate key
- some attributes have keys in common
- example
student_id | tutor_id | tutor_national_id |
---|---|---|
1 | 999 | 838 383 494 |
2 | 234 | 343 535 352 |
3 | 999 | 838 383 494 |
4 | 1234 | 354 464 234 |
- candidate keys:
- [student_id, tutor_id]
- [student_id, tutor_national_id]
- functional dependencies:
- tutor_id -> tutor_national_id
- tutor_national_id -> tutor_id
- [student_id, tutor_id] -> tutor_national_id
- [student_id, tutor_national_id] -> tutor_id
- normalization to BCNF
student_id | tutor_id |
---|---|
1 | 999 |
2 | 234 |
3 | 999 |
4 | 1234 |
tutor_id | tutor_national_id |
---|---|
999 | 838 383 494 |
234 | 343 535 352 |
1234 | 354 464 234 |
- 4NF and 5NF are not generally used
- may result in over-normalization
- Vertical scalability: more resources in a single machine
- Horizontal scalability: more machines
- split data
- replicate data across different machines
- eventual consistency
- synchronous: wait for all replicas to confirm consistency before responding to the client (slow)
- asynchronous: respond to the client immediately and send updates to the other replicas later (faster)
- replication happens in real time
- backup:
- stores a copy of the entire database
- not done often
- expensive and slow
- Centralized databases: controlled by one organization
- Distributed databases:
- physically distributed across multiple locations
- controlled by many organizations
- ensure that users see only data they are authorized to see
- prevent unauthorized users from accessing the database
- prevent data corruption
- detect and stop malware attacks
- Sanitize input
- format the input to what we expect
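Beyond sanitizing, the standard defense against SQL injection is a parameterized query, where the driver binds input as data rather than splicing it into the SQL text. A minimal sketch with Python's sqlite3 (table name and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_input = "' OR '1'='1"   # classic injection attempt

# UNSAFE (do not do this): string formatting lets input rewrite the query
# conn.execute(f"SELECT * FROM users WHERE name = '{user_input}'")

# SAFE: the ? placeholder binds the input as a plain value
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- the malicious string matched nothing
```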
- Relational
- Pro:
- data integrity (normalization, no duplication/redundancy, no anomalies)
- ACID transactions (consistency guarantees)
- uses SQL for querying (a standard way)
- Con:
- schema: need to decide up front what types of data and which tables will store the data
- harder to scale horizontally
- queries can be slower because related data lives in different tables; in MongoDB, related data is kept in the same document, which leads to faster queries
- NewSQL: relational with horizontal scalability
- e.g. citus, vitess, Google Spanner, CockroachDB
- Elasticsearch:
- document model database
- good for data that we need to search
- especially searching for text, e.g., book titles
- massive blobs of data like video
- PostgreSQL for SQL DB
- MongoDB, Amazon DocumentDB, Firebase for document storage
- Elasticsearch for any sort of text searching
- Redis is an in-memory key-value store
- Amazon S3 for blob storage
- replicate data from the production relational database to something like Hadoop or another type of database that is optimized for big data and analytics
- use a CDN to cache:
- HTML / JavaScript files, so there is no need to go all the way to the server
- cache on the server for:
- API requests
- databases
- memory store
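Server-side caching usually pairs a key-value store with a time-to-live (TTL). A minimal in-memory sketch of that idea (the class and key names are invented for illustration):

```python
import time

class TTLCache:
    """Tiny in-memory cache: entries expire ttl seconds after being set."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}          # key -> (value, expiry timestamp)

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:   # stale -> evict and report a miss
            del self.store[key]
            return None
        return value

cache = TTLCache(ttl=60)
cache.set("api:/users/1", {"id": 1, "name": "john"})  # cache an API response
print(cache.get("api:/users/1"))  # {'id': 1, 'name': 'john'}
```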
- NoSQL, in-memory database
- classification of NoSQL
- Key-Value: Redis
- Document: MongoDB, CouchDB
- Wide Column: Cassandra
- Graph: Neo4j
- in-memory DB -> very fast
- used for short lived data
- small data (due to in-memory)
- Redis is a key-value store
SET <key> <value>
- sets <key> to <value>
GET <key>
- gets the value of <key>
EXISTS <key>
- returns 0 if <key> does not exist, 1 if it does
DEL <key>
- deletes <key>; returns 1 on success, 0 on failure
EXPIRE <key> <seconds>
- sets an expiry of <seconds> on <key>
INCR <key> // value of key = value of key + 1
INCRBY <key> <value> // value of key = value of key + <value>
DECR <key> // value of key = value of key - 1
DECRBY <key> <value> // value of key = value of key - <value>
MSET a 2 b 5
- multiple set: a = 2 and b = 5
MGET a b
- multiple get for a and b
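The string commands above have simple semantics; a toy dict-backed sketch in Python (this models the behavior only, it is not how Redis is implemented):

```python
class MiniRedis:
    """Toy in-memory model of Redis string commands (no expiry)."""

    def __init__(self):
        self.db = {}

    def set(self, key, value):          # SET <key> <value>
        self.db[key] = value

    def get(self, key):                 # GET <key>
        return self.db.get(key)

    def exists(self, key):              # EXISTS <key> -> 0 or 1
        return 1 if key in self.db else 0

    def delete(self, key):              # DEL <key> -> 1 on success, 0 otherwise
        return 1 if self.db.pop(key, None) is not None else 0

    def incrby(self, key, amount=1):    # INCR / INCRBY; missing key starts at 0
        self.db[key] = int(self.db.get(key, 0)) + amount
        return self.db[key]

r = MiniRedis()
r.set("a", 2)
print(r.get("a"))        # 2
print(r.incrby("a", 5))  # 7
print(r.exists("b"))     # 0
```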
- String
- Hashes (Hash table)
- Lists (linked list)
- Set
- Sorted sets
HMSET user id 45 name "john"
// correspond to python command below
user = {'id': 45, 'name': 'john'}
HGET user id
// return '45'
HGET user name
// return 'john'
HGETALL user
// return
// 1) "id"
// 2) "45"
// 3) "name"
// 4) "john"
- fast for inserts at either end
- slow for random access
LPUSH outlist 10
// left push
RPUSH outlist "hello"
// right push
LRANGE <list-name> <start> <end>
LRANGE outlist 0 1
// get elements of the list from start to end (end is inclusive)
// return
// 1) "10"
// 2) "hello"
RPOP outlist
// pop from the right
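Redis lists behave like a double-ended queue; Python's `collections.deque` mirrors the commands above closely:

```python
from collections import deque

outlist = deque()
outlist.appendleft("10")      # LPUSH outlist 10
outlist.append("hello")       # RPUSH outlist "hello"

# LRANGE outlist 0 1 (the end index is inclusive in Redis)
first_two = list(outlist)[0:2]
print(first_two)              # ['10', 'hello']

popped = outlist.pop()        # RPOP outlist
print(popped)                 # 'hello'
```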
- similar to a list but with no duplicate elements
// set add
SADD ourset 1 2 3 4 5
// set get (returns all members)
SMEMBERS ourset
// return 0 or 1 depending on whether the value is in the set
SISMEMBER ourset 1
// sorted set add
ZADD team 1 "Bolts"
ZADD team 50 "Wizards"
ZADD team 40 "Cavaliers"
// sorted set get
ZRANGE team 0 2
// returns members by ascending score
// 1) "Bolts"
// 2) "Cavaliers"
// 3) "Wizards"
ZRANK team "Wizards"
// returns the rank of "Wizards" (starting from 0)
// 2
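Sorted-set behavior can be modeled with a member-to-score dict, ordering members by ascending score for range and rank queries (a toy sketch; the helper names are invented):

```python
# member -> score, mimicking ZADD / ZRANGE / ZRANK
def zadd(zset, score, member):
    zset[member] = score

def zrange(zset, start, stop):
    """Members ordered by ascending score; stop is inclusive, like Redis."""
    ordered = sorted(zset, key=zset.get)
    return ordered[start:stop + 1]

def zrank(zset, member):
    """0-based rank of member in ascending score order."""
    return sorted(zset, key=zset.get).index(member)

team = {}
zadd(team, 1, "Bolts")
zadd(team, 50, "Wizards")
zadd(team, 40, "Cavaliers")
print(zrange(team, 0, 2))      # ['Bolts', 'Cavaliers', 'Wizards']
print(zrank(team, "Wizards"))  # 2
```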