CS3321 Database System Technology

This is the note for CS3321 Database System Technology (数据库技术), a course offered to third-year Computer Science undergraduates at SJTU. It focuses on the mathematical principles of relational normalization and database design theory rather than on practical database usage. If you care more about practical usage, such as writing SQL queries and building database applications, you may consider NIS3351 Database Principle and Security (unavailable) instead. Ironically, NIS3351 has nothing to do with database security.

Some pictures and content in this note are taken from the slides, which can be found on the course homepage or the homepage of Qiang Yin, the lecturer.

Introduction

Database, DBMS and Database System

A database is an organized collection of inter-related data that models some aspects of the real world.

A Database Management System (DBMS) is software that facilitates the creation and use of databases.

A database system is the combination of databases and a DBMS.

DBMS is Complex

Example 1.1 How to find the common elements in two different lists? What if one of them is too large to fit in memory? What if both are?

Solution 1.1 Use hash functions to partition the records of both lists into groups small enough to fit in memory, then compare the corresponding groups pairwise.
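In SQL terms, the task is just a set intersection; the hash-partitioning strategy above is one way an engine may evaluate it when the inputs exceed memory. A minimal sketch, assuming two single-column tables R(x) and S(x):

SELECT x FROM R
INTERSECT
SELECT x FROM S;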

A DBMS must make no assumptions about its data.

  • Numerous corner cases exist.
  • A list can contain duplicates.
  • A single duplicate value may exceed the size of main memory.
  • Multiple ways to process a query but none of them efficiently covers all cases.

Relational Data Model

Key features of the relational data model: data integrity, implementation and durability.

A data model is a collection of concepts/tools for describing the data in a database.

  • A database is a collection of relations, and each relation is an unordered set of tuples (or rows).
  • Each relation has a set of attributes (or columns).
  • Each attribute has a name and a domain, and each tuple has a value for each attribute of the relation.
  • Values are atomic/scalar.

Schema and Instance

An unordered set of attributes forms a relational schema, which specifies the logical structure of the data.

An instance is concrete table content matching a certain schema.

The relationship between the two concepts is like that between a class and an object in C++.

Keys

Suppose $K$ is a set of attributes of a schema $R$. Then $K$ is a superkey if the values of $K$ uniquely identify a tuple in every possible instance of $R$.

A superkey $K$ is a candidate key if $K$ is minimal; in other words, $K$ does not properly contain any other superkey.

A primary key is a designated candidate key of a relation.

A foreign key specifies that a tuple from one relation must map to a tuple in another relation.

Referenced attributes must form a primary key, and no dangling pointers from the attributes of a foreign key are allowed.

Relational Algebra

Relational Algebra is a language for querying relational data based on fundamental relational operations. Each operation takes one or more relations as input and outputs a new relation.

Operations can be composed to make complex queries.

Selection

The selection operation selects tuples that satisfy a given predicate.

The operation is denoted $\sigma_P(R)$, where $R$ is the input relation and $P$ is the selection predicate (condition).

Comparison operators and logical connectives are allowed in the predicate.

Projection

The projection operation produces from an input relation $R$ a new relation that has only some of $R$'s attributes.

The operation is denoted $\Pi_{A_1,\dots,A_k}(R)$, where $R$ is the input relation and $A_1,\dots,A_k$ are attributes of $R$.

Some simple calculations can be done in projection, for instance, $\Pi_{A_1,\ A_2 + A_3}(R)$.

In this operation, duplicate output tuples are removed, since the input and output are both sets.

Removing duplicate results is optional in concrete instances of relational algebra; specific formulations use multisets (bags) instead of sets.
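As an illustration of composing operations (the instructor relation and its attributes are reused from the SQL examples later in this note), the names of instructors earning more than 5000 can be written as

$$\Pi_{name}(\sigma_{salary > 5000}(instructor))$$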

Cartesian product

The Cartesian product of two relations $R$ and $S$, denoted $R \times S$, is the set of all possible combinations of tuples from $R$ and $S$.

We include the relation name as a prefix in the attribute names of a product relation to avoid ambiguity, such as $R.A$ and $S.A$. However, for simplicity, we drop the relation-name prefix for attributes that appear only in $R$ or only in $S$.

Join

The theta join operation combines a selection operation and a Cartesian product operation into a single operation.

Theta join is denoted $R \bowtie_\theta S$, where $\theta$ is referred to as the join condition.

The natural join of $R$ and $S$, denoted $R \bowtie S$, combines the tuples from $R$ and $S$ based on their common attributes.

The operation enforces equality between attributes that share the same name and retains one copy of each pair of identically named attributes.

If $R$ and $S$ have no common attributes, then $R \bowtie S = R \times S$.

Union

The union of $R$ and $S$, denoted $R \cup S$, consists of all tuples that appear in $R$ or $S$. Duplicate tuples are removed by set semantics.

The two relations must have identical schema.

Difference

The difference of $R$ and $S$, denoted $R - S$, consists of all tuples that appear in $R$ but not in $S$.

Renaming

The renaming operation $\rho_X(R)$ changes the name of relation $R$ to $X$.

Users should focus on describing relations instead of optimizing query logistics; relational algebra is a descriptive language.

SQL: Structured Query Language

SQL: Structured Query Language, pronounced as “S-Q-L” or “sequel”.

DDL: Data Definition Language

DML: Data Manipulation Language

  • SQL is insensitive to case and whitespace;
  • Everything from -- to the end of the line is ignored.

SQL DDL

Built-in Data Types

| SQL Data Type | Usage |
| --- | --- |
| char(n) | Fixed-length string of length n |
| varchar(n) | Variable-length string with maximum length n |
| int, smallint | Integer and small integer |
| numeric(p,d) | Fixed-point number with p digits in total, d of which follow the decimal point |
| real, double precision | Single- and double-precision floating point numbers |
| float(n) | Floating point number with precision of at least n digits |

int, smallint, real and double precision are machine-dependent types: their sizes depend on the machine.

Integrity Constraints

primary key (...): The attributes listed in the parentheses form the primary key of the relation.

foreign key (...) references ...: Values of the attributes listed in the parentheses must correspond to the primary key of the referenced table.

Referenced attributes must form a primary key and the referencing attributes form a foreign key. No dangling pointers are allowed.

SQL maintains referential integrity in three ways:

  • Reject: SQL's default option for handling updates. Any update that violates referential integrity is rejected.
  • Set NULL: sets the referencing attributes of child records to NULL if the referenced record is deleted.
  • CASCADE: propagates changes to all referring rows.

not null: For specified attributes, NULL value is not allowed.

Primary keys are not nullable.

CREATE TABLE instructor(
    ID varchar(5),
    name varchar(20) not null,
    dept_name varchar(20),
    salary numeric(8,2),
    primary key (ID),
    foreign key (dept_name) references department
);

UNIQUE Constraint

A UNIQUE (...) statement declares that the attributes listed in the parentheses form a superkey.

CHECK

A CHECK (...) clause specifies a predicate that must be satisfied by every tuple in a relation.
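A minimal sketch, reusing the section relation that appears in the query examples later (its attributes here are assumed for illustration):

CREATE TABLE section(
    course_id varchar(8),
    semester varchar(6) CHECK (semester IN ('Fall', 'Winter', 'Spring', 'Summer')),
    year numeric(4,0) CHECK (year > 1701 AND year < 2100)
);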

Basic Database Modification

Insertion: insert a tuple into a table.

INSERT INTO instructor(ID, name) VALUES ('10222', 'Root');

Deletion: remove tuples satisfying a given condition from a table.

DELETE FROM instructor WHERE name = 'Turing';

DBMS will prevent any update to the database that would violate an integrity constraint.

Basic SQL Queries

A basic SQL query can be expressed by a SELECT-FROM-WHERE statement.

SELECT ID, name FROM instructor WHERE dept_name = 'SCS';

The WHERE clause is optional, and logical connectives such as AND, OR and NOT are allowed in the clause.

The SELECT clause can contain expressions.

SELECT ID, name, salary/12 FROM instructor;

A relation name can be used as a prefix to distinguish attributes with the same name.

SELECT student.name, instructor.name
FROM student, advisor, instructor
WHERE student.ID = advisor.s_ID AND advisor.i_ID = instructor.ID;

SQL uses bag semantics by default, where duplicates are allowed in query results. The keyword DISTINCT should be used to eliminate duplicates explicitly.
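For instance, using the instructor table from above:

SELECT DISTINCT dept_name FROM instructor;
-- Each department name appears once, however many instructors it has.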

NULL

NULL is a special value for unknown or inapplicable values.

NULL is usually ignored in operations. For example, if a column in a 4-row table contains one NULL value, COUNT(A) returns 3 while COUNT(*) returns 4.


Special Rules

All arithmetic operations involving NULL result in NULL, and comparisons with NULL result in UNKNOWN.

Aggregation functions ignore NULL, except COUNT(*), which simply counts rows.

Evaluating any aggregation function except COUNT on an empty bag returns NULL (COUNT returns 0).

However, NULL cannot be used explicitly as an operand; for instance, NULL + 3 is invalid.

SQL uses a special three-valued logic: TRUE is 1, FALSE is 0 and UNKNOWN is 0.5.

x AND y returns the minimum of x and y, and x OR y returns the maximum; NOT x returns 1 - x. For example, UNKNOWN AND TRUE = min(0.5, 1) = UNKNOWN, and NOT UNKNOWN = 1 - 0.5 = UNKNOWN.

Pitfalls of NULL

NULL will break many equivalences.

SELECT * FROM instructor;
SELECT * FROM instructor WHERE salary > 5000 OR salary < 5000;
-- The two queries are not equivalent: rows whose salary is NULL
-- appear only in the first one.

Joins

Join expressions are typically used in FROM clauses.

Theta Join and Natural Join

R JOIN S ON <join_condition> -- theta join
R NATURAL JOIN S             -- natural join

For example:

-- student(ID, name, dept_name, tot_cred)
-- takes(ID, course_id, sec_id, semester, year, grade)

SELECT * FROM student JOIN takes ON student.ID = takes.ID;
SELECT * FROM student, takes WHERE student.ID = takes.ID;
SELECT * FROM student JOIN takes USING (ID); -- syntactic sugar

You can join more than two relations using natural join.

SELECT A_1, ..., A_n FROM R_1 NATURAL JOIN R_2 NATURAL JOIN ... NATURAL JOIN R_k WHERE P;

USING

Attributes with the same name sometimes get equated unexpectedly in a natural join.

One solution to this problem is to use WHERE with a product to avoid joining on unrelated attributes; the other is to use the USING keyword to specify which attribute(s) should be joined on.

SELECT name, title FROM (student NATURAL JOIN takes) JOIN course USING (course_id);

Outer Join Motivation

A left outer join of $R$ and $S$, denoted $R ⟕ S$, includes all rows of the ordinary join plus the dangling rows of $R$ padded with NULLs.

The definitions of right outer join, denoted ⟖, and full outer join, denoted ⟗, are analogous.

SELECT * FROM course NATURAL RIGHT OUTER JOIN prereq;
SELECT * FROM course RIGHT OUTER JOIN prereq ON course.course_id = prereq.course_id;
SELECT * FROM course RIGHT OUTER JOIN prereq USING (course_id);

The difference between an ON condition and a WHERE condition should be noted: with ON, non-matching rows may still appear in the result, padded with NULLs.

ON is not a redundant keyword in SQL.

Subqueries

Nested Subqueries

A subquery is a SELECT-FROM-WHERE expression nested in another query.

A subquery can be nested in FROM, EXISTS, IN, WHERE and WITH clauses.

SELECT DISTINCT course_id
FROM section
WHERE semester = 'Fall' AND year = 2017 AND
      course_id NOT IN (SELECT course_id
                        FROM section
                        WHERE semester = 'Spring' AND year = 2018);

The expression EXISTS(subquery) tests whether the result of the subquery is non-empty.

An attribute refers to the most closely nested relation with that attribute.

ALL means the comparison must hold for all records in the subquery result; SOME means it must hold for at least one.
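For example, using the instructor table and the 'SCS' department from above, the instructors paid more than everyone in SCS:

SELECT name FROM instructor
WHERE salary > ALL (SELECT salary FROM instructor WHERE dept_name = 'SCS');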

A scalar subquery can be used as a value in WHERE, SELECT and HAVING clauses. It is a runtime error if the subquery returns more than one row; NULL is used if it returns no rows.
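A minimal sketch, assuming department(dept_name, ...) is the table referenced by the foreign key earlier:

SELECT dept_name,
       (SELECT COUNT(*) FROM instructor
        WHERE instructor.dept_name = department.dept_name) AS num_instructors
FROM department;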

Common Table Expression

WITH R1(A_1, A_2, ...) AS (subquery_1),
     R2(B_1, B_2, ...) AS (subquery_2),
     ...
SELECT ... FROM ... WHERE ...;

This defines temporary relations that can be used by other relations defined in the same WITH clause as well as by the actual query.

The temporary relations are only valid within the single statement.

CTE with Recursion

RECURSIVE is the keyword that enables recursion in a CTE.

A sketch that computes all reachable vertex pairs, assuming an edge(src, dst) relation:

WITH RECURSIVE ReachableVertices(src, dst) AS (
    SELECT src, dst FROM edge    -- base case: direct edges
    UNION
    SELECT r.src, e.dst          -- recursive step: extend each path by one edge
    FROM ReachableVertices r JOIN edge e ON r.dst = e.src
)
SELECT * FROM ReachableVertices;

Update

UPDATE instructor SET salary = salary*1.03 WHERE salary > 100000;
UPDATE instructor SET salary = salary*1.05 WHERE salary <= 100000;

Or with CASE clause:

UPDATE instructor SET salary = CASE
    WHEN salary <= 100000 THEN salary * 1.05
    ELSE salary * 1.03
END;

Relational Database Design Theory

Functional Dependency Theory

A functional dependency (FD) is of the form $X \to Y$, requiring that the attributes of $X$ functionally determine the attributes of $Y$.


Candidate Keys
A set of attributes $K$ is a candidate key for a relation schema $R$ if:

  1. $K \to R$ (i.e., $K$ functionally determines all attributes in $R$);
  2. No proper subset of $K$ functionally determines all attributes in $R$ (i.e., $K$ is minimal).

A set $F$ logically implies a functional dependency $X \to Y$, denoted $F \models X \to Y$, if every relation that satisfies all functional dependencies in $F$ also satisfies $X \to Y$.


Closure of attributes
The closure of a set of attributes $X$, denoted $X^+_F$, is the set of attributes that can be functionally determined by $X$ using a set of functional dependencies $F$.

The subscript $F$ is often omitted when it is clear from the context.

The closure of attributes can be computed by repeatedly applying the functional dependencies in $F$ until no new attributes can be added to $X^+$.

The time complexity is normally $O(n^2)$, where $n$ is the number of functional dependencies, but it can be optimized to $O(n)$ with proper data structures.
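A small worked example (the FDs here are assumed for illustration): let $F = \{A \to B,\ B \to C,\ CD \to E\}$ and $X = \{A, D\}$. Starting from $X^+ = \{A, D\}$, applying $A \to B$ gives $\{A, B, D\}$, then $B \to C$ gives $\{A, B, C, D\}$, then $CD \to E$ gives $\{A, B, C, D, E\}$; no rule adds anything new, so $\{A, D\}^+ = \{A, B, C, D, E\}$.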


Closure of functional dependencies
The closure of $F$, denoted $F^+$, is the set of all functional dependencies that can be inferred from $F$.

$F^+$ can be computed by repeatedly applying Armstrong's axioms until no new functional dependencies can be added to it.

Armstrong’s Axioms

  1. Reflexivity: If $Y \subseteq X$, then $X \to Y$.
  2. Augmentation: If $X \to Y$, then $XZ \to YZ$ for any attribute set $Z$.
  3. Transitivity: If $X \to Y$ and $Y \to Z$, then $X \to Z$.

Armstrong’s axioms are sound and complete.
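As an example of working with the axioms, the union rule ($X \to Y$ and $X \to Z$ imply $X \to YZ$) can be derived as follows: from $X \to Y$, augmentation with $X$ gives $X \to XY$ (since $XX = X$); from $X \to Z$, augmentation with $Y$ gives $XY \to YZ$; transitivity then yields $X \to YZ$.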


Extraneous Attributes
An attribute is considered extraneous in a functional dependency $X \to Y$ if it can be removed from the dependency without changing $F^+$.

  • $A \in X$ is extraneous in $X \to Y$ if $Y \subseteq (X - \{A\})^+_F$.
  • $B \in Y$ is extraneous in $X \to Y$ if $B \in X^+_{F'}$ where $F' = (F - \{X \to Y\}) \cup \{X \to (Y - \{B\})\}$.

Canonical Cover
A canonical cover $F_c$ for a set of functional dependencies $F$ is a minimal set of functional dependencies such that $F_c^+ = F^+$. In other words, $F_c$ preserves the closure of $F$ but has no extraneous attributes or redundant dependencies.

In a canonical cover, no FD contains an extraneous attribute, and the left-hand side of each FD is unique.

To compute a canonical cover, we can use the following steps:

  1. Remove Extraneous Attributes: For each functional dependency in , remove any extraneous attributes from the left-hand side and right-hand side.
  2. Remove Redundant Dependencies: For each functional dependency in , check if it can be derived from the other dependencies. If it can, remove it from the set.
  3. Resulting Set: The resulting set of functional dependencies is the canonical cover.

Canonical cover is not unique, but the number of functional dependencies in any canonical cover for a given set of functional dependencies is the same.

Algorithm to compute canonical cover

Input:  A set of functional dependencies F
Output: A canonical cover C for F
C := F
// Step 1: Remove extraneous attributes
for each functional dependency X -> Y in C do
    for each attribute A in X do
        if Y ⊆ (X - {A})^+ then
            X := X - {A}
    for each attribute B in Y do
        if B ∈ X^+ computed w.r.t. (C - {X -> Y}) ∪ {X -> (Y - {B})} then
            Y := Y - {B}
// Step 2: Remove redundant dependencies
for each functional dependency X -> Y in C do
    if Y ⊆ X^+ computed w.r.t. C - {X -> Y} then
        C := C - {X -> Y}
return C

Normalization Theory

Decomposition Criteria

Lossless Decomposition

The original relation schema should be reconstructible by joining the decomposed relation schemas.

Suppose $R$ can be decomposed into $R_1$ and $R_2$. If either $R_1 \cap R_2 \to R_1$ or $R_1 \cap R_2 \to R_2$ holds, then the decomposition is lossless.

Proof. Let $r$ be an instance of schema $R$; we show that $r = \Pi_{R_1}(r) \bowtie \Pi_{R_2}(r)$. We prove $\Pi_{R_1}(r) \bowtie \Pi_{R_2}(r) \subseteq r$, since the opposite direction is trivial.

Suppose $t$ is a tuple in $\Pi_{R_1}(r) \bowtie \Pi_{R_2}(r)$. Then there must exist $t_1$ and $t_2$ in $r$ such that $t[R_1] = t_1[R_1]$ and $t[R_2] = t_2[R_2]$. Assume WLOG that $R_1 \cap R_2 \to R_1$ holds. Since $t_1$ and $t_2$ agree on $R_1 \cap R_2$, they are identical on $R_1$. Therefore $t$ agrees with $t_2$ on both $R_1$ and $R_2$, i.e., $t = t_2$, which is in $r$. Q.E.D.


Redundancy and Anomalies Avoidance

Unnecessary redundancy and anomalies should be avoided in a decomposition.


Dependency preservation

Minimize the cost of checking the integrity constraints defined in terms of functional dependencies.

Let $F$ be a set of FD's on a schema $R$ and let $\{R_1, \dots, R_n\}$ be a decomposition of $R$. The restriction of $F$ to $R_i$, denoted $F_i$, is the set of all functional dependencies in $F^+$ that involve only attributes from $R_i$.

A decomposition is dependency preserving if $(F_1 \cup F_2 \cup \cdots \cup F_n)^+ = F^+$, where $F_i$ is the restriction of $F$ to $R_i$.

BCNF decomposition is not always dependency preserving. If you care more about dependency preservation than about redundancy, refer to Third Normal Form.

Boyce-Codd Normal Form (BCNF)

A relation schema $R$ is in Boyce-Codd Normal Form (BCNF) if for every non-trivial functional dependency $X \to Y$ that holds in $R$, $X$ is a superkey of $R$.

A database schema is in BCNF if every relation schema in the database schema is in BCNF.

BCNF Decomposition Algorithm

Input:  A schema R and a set F of FD's
Output: A BCNF decomposition of R
D <- {R};
while there is a relation schema S in D that is not in BCNF do
    choose a non-trivial X -> Y in F^+ with XY ⊆ S such that X is not a superkey for S;
    replace S in D by two relation schemas: S1 = XY and S2 = S - (Y - X);
return D;
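A standard worked example: consider inst_dept(ID, name, salary, dept_name, building, budget) with dept_name → building, budget. Since dept_name is not a superkey, the schema is not in BCNF; one iteration of the algorithm takes $X$ = {dept_name} and $Y$ = {building, budget}, replacing the schema with $S_1$ = (dept_name, building, budget) and $S_2$ = (ID, name, salary, dept_name), both of which are in BCNF.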

Third Normal Form (3NF)

A relation schema $R$ is in Third Normal Form (3NF) if every FD $X \to Y$ in $F^+$ satisfies at least one of the following conditions:

  • $X \to Y$ is trivial;
  • $X$ is a superkey of $R$;
  • each attribute in $Y - X$ is part of a candidate key for $R$.

Third Normal Form (3NF) Decomposition Algorithm

Input:  A schema R and a set F of FD's
Output: A 3NF decomposition of R that preserves F
C <- a canonical cover for F;
D <- {};
for each X -> Y in C do
    add relation schema XY to D;
if none of the relation schemas in D contains a candidate key for R then
    add a relation schema that forms a candidate key for R to D;
remove redundant relation schemas from D;
return D;

Let $F$ be a set of FD's that holds on a schema $R$ and let $\{R_1, \dots, R_n\}$ be a decomposition of $R$. Furthermore, assume that for every FD $X \to Y$ in $F$ there is some $R_i$ that contains both $X$ and $Y$, and that at least one schema in the decomposition contains a candidate key of $R$. Then the decomposition is dependency preserving and lossless.

For other normal forms such as 1NF, 2NF and 4NF, refer to the notes of NIS3351: Database Principle and Security (unavailable).

Storage

Why should the DBMS manage the buffer pool itself instead of letting the OS do it?

Volatile storage loses its contents when power is switched off; examples are DRAM and CPU caches. It can be randomly accessed, and bytes are individually addressable.

Non-volatile storage retains its contents when power is switched off; examples are magnetic disks and flash memory. It can only be accessed in blocks, and sequential access is preferred.

Disk-oriented DBMSs are all about reducing disk I/O.

  • Sequential I/O is much cheaper than random I/O.
  • Cache blocks from non-volatile storage to volatile storage.

Storage Hierarchy

Storage Structures

In a DBMS, data is stored in files, each of which consists of a collection of pages. Each page holds a collection of tuples.

  • Heap file: Tuples are placed arbitrarily in pages.
  • Sorted file: Tuples are sorted on some attribute(s).
  • Index file: A data structure that improves the speed of data retrieval operations on a database table, such as B+ tree and hash index.

Format of Page

A database page is a fixed-size block of data that is given a unique page number as identifier. Each page has a header that contains number of slots/tuples, free space, data checksum and transaction visibility.

A heap file is an unordered collection of pages where tuples are placed in arbitrary order. It requires meta-data to track existing pages and find free space.

Heap File Via Linked List

  • Linked List: A header page at start stores the HEAD of the data page list and free page list.
  • Page Directory: A special directory page is used to track data pages. Each directory entry records the number of free slots in its corresponding data page.

Heap File Via Page Directory

Page Structure: How are tuples organized within a page?

Slotted Page Structure

The slotted page structure is used in most DBMSs. In a slotted page, a slot array maps slot entries to tuple offsets, and the page header keeps track of the free and occupied slots. This structure allows for efficient insertion and deletion of tuples, as the DBMS can easily find free slots and reuse them.

The header tracks number of slots, number of used slots, free space and an array of slot offsets.

When deleting a tuple, the corresponding slot is marked as free, but the space is not reclaimed until a compaction operation is performed.

Record Formats

In fixed-length format, each attribute has a fixed size and the size of each tuple is the sum of the sizes of its attributes. This format allows for efficient access to individual attributes, as their offsets can be easily calculated.

The system catalog is a table that stores metadata about other tables. In fixed-length format, the system catalog stores the size of each attribute for each table. NULL values are handled by using a bitmap at the beginning of the tuple.

In variable-length format, we move all variable-length attributes to the end of the tuple and use an array of offsets to locate them.

Buffer Pool Management

A buffer pool is a memory region organized as an array of fixed-size pages. Each array entry is called a frame which can hold one page.

mmap is another way to manage the buffer pool: it maps disk pages directly to memory addresses. However, it gives the DBMS less flexibility and control than traditional buffer pool management.

Buffer Pool Metadata

The page table tracks which pages are currently in the buffer pool and maps page numbers to frame numbers.

Each frame has a pin or reference counter used for eviction decisions and a dirty bit which indicates whether the page has been modified.

Page Replacement Policies

When a new page is requested and the buffer pool is full, a page replacement policy is used to decide which page to evict. The final goal is to minimize cache misses.

  • LRU (Least Recently Used): Evict the page that has not been used for the longest time.
  • CLOCK: A more efficient approximation of LRU that uses a circular list and a reference bit.

Page replacement policies are usually specialized for different access patterns.

Storage Layout

For one logical layout, there are several possible physical layouts, such as row-oriented, column-oriented and hybrid layouts. Each layout has its own advantages and disadvantages depending on the access patterns of the workload.

Storage Layouts

Different Workload Patterns

  • OLTP (Online Transaction Processing): Workloads that involve a large number of short online transactions such as insert, update and delete. These workloads typically require low latency and high concurrency. As a result, row-oriented layout is often preferred for OLTP workloads.
  • OLAP (Online Analytical Processing): Workloads that involve complex queries for data analysis. These workloads typically require high throughput and can tolerate higher latency. As a result, column-oriented layout is often preferred for OLAP workloads.
  • HTAP (Hybrid Transactional/Analytical Processing): Workloads that combine both OLTP and OLAP characteristics. These workloads require a balance between low latency for transactions and high throughput for analytical queries.

ETL, standing for Extract, Transform, Load, is the process of moving data from OLTP systems into OLAP systems for analysis.

Cache Behavior Analysis

If the size of a cache line is smaller than the size of a tuple, each tuple access may cause multiple cache misses. This can significantly impact performance, especially for workloads with large tuples or high access rates.

PAX: Partition Attributes Across

PAX, proposed in 2002, maximizes inter-tuple spatial locality (values of the same attribute are stored together) at the cost of intra-tuple spatial locality.

PAX Storage Layout

Values of the same attribute are grouped into the same minipage within a page. Fixed-length attribute values are stored in F-minipages, while variable-length attribute values are stored in V-minipages.

When accessing a subset of attributes, only the relevant minipages need to be loaded into cache, reducing cache misses and improving performance.

PAX also enables SIMD processing for faster analytical queries.

Columnar Storage

Column-oriented storage stores each attribute in a separate column file. It is optimized for OLAP: better cache performance and fewer I/Os when accessing a subset of attributes, while also enabling high compression ratios and reduced storage space.

Data Organization of Apache Parquet

As a page stores a collection of values of a single column, each page can be compressed/encoded independently. Dictionary encoding and run-length encoding (RLE) are commonly used compression techniques.

RLE Dictionary Encoding

Indexing

Sequential scan and index scan are two common methods for accessing data in a database.

Index Basics

Index Data Structure

An index consists of a search key and a pointer to the actual data record.

An index file consists of records called index entries and is usually much smaller than the actual data file.

This lecture focuses on ordered indexes.

Dense Index: A dense index contains an index entry for every search key value in the data file. This allows for faster searches but requires more storage space.

It is possible for one index entry to point to multiple records in the data file if there are duplicate search key values.

Sparse Index: A sparse index contains an index entry for only some of the search key values in the data file. This reduces storage space but can result in slower searches.

A sparse index is only applicable to sorted data files.

Clustering Index: A clustering index is an index that determines the physical order of data records in a table.

Non-Clustered Index: A non-clustered index is an index that does not affect the physical order of data records in a table. Instead, it creates a separate data structure that points to the actual data records.

A non-clustered index is also known as a secondary index which is always dense.

Index in SQL

CREATE INDEX index_name ON table_name (attribute1, attribute2, ...);
DROP INDEX index_name;

A B-tree index is created on the primary key by default, and a unique index is created for UNIQUE constraints.

Concurrency Control in B-Tree

For the concept of a B-tree, refer to CS0501 Data Structure.

Taking a lock on the root node of the tree for every I/O is too time-consuming. To improve the performance of concurrent operations on B-trees, latches are used to protect nodes during modifications. The latch crabbing protocol keeps the tree both safe and concurrent.

Latch

A latch is a lightweight, short-term lock for in-memory structures; it protects pages or nodes in the buffer and index managers and ensures physical (not transactional) consistency. It is implemented using spinlocks or mutexes.

A latch only allows multiple shared holders or a single exclusive holder.


The latch crabbing protocol enables concurrent access and modification of B-trees by acquiring and releasing latches in a specific order during tree traversal and modification:

  • Latch parent node
  • Latch child node
  • If child is safe, release parent latch

A node is considered safe if an update will not trigger a split or merge. In insertion it means that the node has enough free space, and in deletion it means that the node stays above the minimum occupancy after deletion.

Instead of locking the entire tree during an operation, latch crabbing allows threads to acquire and release latches on individual nodes as they traverse the tree. This allows multiple threads to work on different parts of the tree simultaneously, improving concurrency and performance.

Query Processing

Query Processing Overview


Parsing and Translation

The DBMS checks the syntax and semantics of the SQL statement and builds a parse tree representing its structure. The parse tree is analyzed and then translated into a logical plan: a tree of relational algebra operators that captures the query's semantics.


Query Optimization

Then, the DBMS transforms the initial logical plan into an optimized physical plan in two main phases:

  • Logical Optimization: Rewrites the logical plan into an equivalent but more efficient logical plan using RA rules.
  • Physical Optimization: Translates the optimized logical plan into a physical plan that specifies the actual algorithms and data structures to be used for query execution.

Query Execution

The DBMS executes the optimized physical plan to produce the final query result.

Common query execution models:

  • Materialization Model: Each operator fully computes and stores its result in temporary tables before passing it to the next operator.
  • Iterator (Volcano) Model: Operators are implemented as iterators that produce one tuple at a time, allowing for pipelined execution and reduced memory usage. Each operator implements three methods: open(), next(), and close()(optional).

The materialization model is good for OLTP workloads, since such queries touch only a few records at a time, but it is not good for OLAP.

Algorithms for Relational Operations

Notations

  • Table $R$: $M$ pages, $m$ tuples
  • Table $S$: $N$ pages, $n$ tuples
  • Number of available buffer pool pages: $B$
  • Cost metric: number of disk I/Os

Sorting

External sorting is required when data cannot fit in memory.

A divide-and-conquer algorithm called external merge sort is commonly used for external sorting. Recall that we have $B$ buffer pages available.

In the sorting phase, read $B$ pages of data at a time, sort them in memory using an efficient in-memory sorting algorithm (e.g., quicksort or heapsort), and write the sorted pages back to disk as a sorted run. This process is repeated until all data has been processed, resulting in $\lceil N/B \rceil$ sorted runs.

In the merge phase, we repeatedly merge $B-1$ sorted runs at a time, using $B-1$ buffer pages for input and 1 buffer page for output. This process continues until only one sorted run remains.

Cost Analysis: The number of passes is $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$, and each pass reads and writes every page once, so the total cost of external merge sort is $2N \cdot (1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil)$.
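A worked instance (the numbers are assumed for illustration): with $N = 108$ pages and $B = 5$ buffer pages, the sorting phase produces $\lceil 108/5 \rceil = 22$ runs; merging 4 runs at a time takes $\lceil \log_4 22 \rceil = 3$ more passes, so there are 4 passes in total and the cost is $2 \times 108 \times 4 = 864$ I/Os.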

By the same idea, we can implement sort-based duplicate elimination; the cost is the same as external merge sort.

As union, intersection and difference in SQL require duplicate elimination by default, they can be implemented using sort-based algorithms with cost similar to external merge sort.

The cost of sort-based aggregation is the same as that of external merge sort.


Join

The cost of naive nested loop join is $M + m \cdot N$, compared with block nested loop join, whose cost is $M + \lceil M/(B-2) \rceil \cdot N$.

The cost of index nested loop join is $M + m \cdot C$, where $C$ is the cost of accessing the index, usually 2~4 disk I/Os. If both $R$ and $S$ support index lookup, it is better to pick the smaller one as the outer relation.

Sort-merge join requires both relations to be sorted on the join attributes; if not, we need to sort them first. In most cases the merge cost is $M + N$ (plus the sorting cost), but in the worst case, when all join keys are equal, it can be as high as $M \cdot N$.

Hash join is applicable to equality joins and natural joins. (Build phase) Select a table and build a hash table in memory on its join attribute(s). (Probe phase) For each tuple in the other table, use the hash table to find matching tuples.

The cost of this kind of join is $M + N$ if the hash table fits in memory; the buffer pool requirement is therefore $B > \min(M, N)$, roughly the size of the smaller relation.

If neither relation fits in memory, we can use Grace hash join. Both relations are partitioned using a hash function on the join attribute(s); each partition of one relation is then joined with the corresponding partition of the other using an in-memory hash join. The cost of Grace hash join is $3(M + N)$.
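A worked comparison (the numbers are assumed for illustration): let $M = 1000$, $m = 100{,}000$, $N = 500$, and 0.1 ms per I/O. Naive nested loop join costs $1000 + 100{,}000 \times 500 \approx 5 \times 10^7$ I/Os (about 1.4 hours), while Grace hash join costs $3 \times (1000 + 500) = 4500$ I/Os (about 0.45 seconds).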

Hash-based algorithms can likewise implement duplicate elimination, union, intersection, difference and group-by aggregation.

Query Optimization

SQL is declarative: users specify what tuples they want, and the query optimizer searches for and picks the best query plan. The cost difference between query plans for the same query can be huge. This section takes System R's optimizer as an example.

Rule-based Query Rewriting (RBQ)

Push down selection: $\sigma_\theta(R \bowtie S) = \sigma_\theta(R) \bowtie S$ when $\theta$ involves only attributes of $R$. By this rule, we can reduce the size of intermediate results in most conditions.

Push down projection: $\Pi_L(\sigma_\theta(R)) = \Pi_L(\sigma_\theta(\Pi_{L \cup L'}(R)))$, where $L'$ is the set of attributes that appear in $\theta$. By this rule, we can reduce the size of intermediate results by removing unnecessary attributes early.

Similarly, $\Pi_L(R \bowtie_\theta S) = \Pi_L(\Pi_{L_1}(R) \bowtie_\theta \Pi_{L_2}(S))$, where $L_1$ consists of the attributes of $R$ that are either in $L$ or appear in $\theta$, and vice versa for $L_2$.

Visually, these optimizations push selection and projection operators down the query plan tree and prune unnecessary attributes early.

Cost-based Query Optimization (CBQ)

RBQ does not consider cost. CBQ estimates the cost of different query plans and picks the cheapest one.

Cost Estimation

The cost of a query plan is estimated by summing the cost of each operator in the plan, and each operator's cost can be modeled as a linear function of the size of its input.

The DBMS stores statistics about relations in the system catalog to help estimate the sizes of intermediate results. Catalogs are updated periodically.

Selection with equality predicates assumes a uniform distribution of values. For $\sigma_{A=v}(R)$: if $A$ is a key, the estimated size is 1; otherwise it is $n_R / V(A, R)$, where $n_R$ is the number of tuples in $R$ and $V(A, R)$ is the number of distinct values of attribute $A$ in $R$.

For conjunctive predicates such as $\sigma_{\theta_1 \wedge \theta_2}(R)$, the selectivity factor is estimated as $s_1 \cdot s_2$, assuming the predicates are independent and the attributes involved are not keys.

For negated predicates such as $\sigma_{\neg\theta}(R)$, the selectivity is $1 - s(\theta)$.

For disjunctive predicates such as $\sigma_{\theta_1 \vee \theta_2}(R)$, the selectivity is estimated as $s_1 + s_2 - s_1 \cdot s_2$ by the inclusion-exclusion principle.

For range predicates such as $\sigma_{A \leq v}(R)$, the selectivity is estimated as $(v - \min(A, R)) / (\max(A, R) - \min(A, R))$, assuming a uniform distribution. If $v < \min(A, R)$, the size is 0. Conditions such as $A \geq v$ can be estimated symmetrically.

For join size estimation, for instance $R \bowtie S$ on a common attribute $A$, take $n_R \cdot n_S / \max(V(A, R), V(A, S))$ as the estimated size of the join result.
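A small worked instance (the numbers are assumed): if $n_R = 10{,}000$, $n_S = 5{,}000$, $V(A, R) = 100$ and $V(A, S) = 50$, the estimated join size is $10{,}000 \times 5{,}000 / \max(100, 50) = 500{,}000$ tuples.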

This estimation is very rough and may be quite inaccurate in practice; skewness is one of the major reasons, and the independence assumption may also fail in many cases.

Histograms in the catalog can improve selectivity estimation by dividing values into buckets with roughly equal numbers of tuples.

In principle, we enumerate all possible physical plans and pick the cheapest one. The practical goal in this process is not to find the optimal plan but to avoid bad ones. Take the search for optimal join orders as an example.

For $n$ relations, there are $\frac{(2(n-1))!}{(n-1)!}$ different join orders. Exhaustive search is impractical for large $n$.

System R reduces the search space by considering only left-deep trees, which enable the generation of fully pipelined plans under the volcano model.

A dynamic programming algorithm is used to find the optimal left-deep join tree in a bottom-up manner. Firstly, we should find the optimal access path for each single relation. Then, we can build optimal plans for joining two relations based on the optimal single-relation plans. This process continues until we build the optimal plan for joining all relations.

Thereby, the time complexity of the dynamic program can be expressed as $O(n \cdot 2^n)$.

However, this algorithm does not directly handle the interesting orders problem, which can be solved by enlarging the state space to keep track of interesting orders.

Worst-Case Optimal Join Algorithms

This chapter builds on the triangle query $Q(A, B, C) = R(A, B) \bowtie S(B, C) \bowtie T(A, C)$.

Binary Join Plan

We assume in this query that $R$, $S$ and $T$ all have size $N$. In the worst case, any binary join plan takes $\Omega(N^2)$ time.

Delay the Computation (DC)

The key idea of DC is to delay the computation of intermediate results as much as possible and process tuples only when they can contribute to the final output.

for each a in π_A(R) do
    for each b in π_B(σ_{A=a}(R) ⋈ S) do
        for each c in π_C(σ_{B=b}(S) ⋈ T) do
            if (a,c) in π_{A,C}(σ_{C=c}(T)) then
                output (a,b,c)

The time complexity of DC is $O(N^{3/2})$ for any instance.

AGM Bound and WCOJ Algorithms

Consider a framework of general join queries $Q$ over a set of relations $\{R_1, \dots, R_n\}$ and a set of attributes $\mathcal{A}$. Each relation $R_i$ is defined over a subset of attributes $\mathrm{attr}(R_i) \subseteq \mathcal{A}$.

The hypergraph $\mathcal{H}(Q) = (V, E)$ associated with $Q$ is defined as follows:

  • The set of vertices $V = \mathcal{A}$.
  • The set of hyperedges $E = \{\mathrm{attr}(R_i) \mid 1 \leq i \leq n\}$.

A fractional edge cover of a hypergraph $\mathcal{H} = (V, E)$ is a set of non-negative weights $\{x_e\}_{e \in E}$ such that for every vertex $v$, the sum of the weights of the edges incident to $v$ is at least 1, i.e., $\sum_{e \in E : v \in e} x_e \geq 1$.

The minimum fractional edge cover number of $\mathcal{H}$ is $\rho^*(\mathcal{H}) = \min \sum_{e \in E} x_e$ over all fractional edge covers.

The AGM bound states that the size of the join result is bounded by $\prod_{e \in E} |R_e|^{x_e}$ for any fractional edge cover $\{x_e\}$ of the hypergraph $\mathcal{H}(Q)$. In particular, if all input relations have size $N$, the size of the join result is $O(N^{\rho^*})$, where $\rho^* = \rho^*(\mathcal{H}(Q))$. It can be proved that the AGM bound is tight.
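For the triangle query, each attribute (vertex) is covered by exactly two of the three hyperedges, so assigning weight $x_e = 1/2$ to each edge is a fractional edge cover with total weight $\rho^* = 3/2$; the AGM bound is therefore $O(N^{3/2})$, which matches the running time of DC above.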

Generic Join Algorithm

function Generic-Join(Attrs, Relations, Partial-Tuple)
    if Attrs is empty then
        output Partial-Tuple
        return
    A := choose an attribute from Attrs
    for each value a in ⋂_{R ∈ Relations, A ∈ attr(R)} π_A(σ_{Partial-Tuple}(R)) do
        New-Relations := {}
        for each R in Relations do
            if A in attr(R) then
                New-Relations := New-Relations ∪ {σ_{A=a}(R)}
            else
                New-Relations := New-Relations ∪ {R}
        Generic-Join(Attrs - {A}, New-Relations, Partial-Tuple ∪ {(A,a)})