CMU 15721

Lecture Note 10. Multi-Way Join Algorithms

Binary joins perform poorly when a join’s output is larger than its inputs, because intermediate results blow up. Worst-case optimal joins (WCOJ) perform a join by examining one attribute at a time instead of one relation at a time. The algorithm’s worst-case runtime meets a known lower bound on the worst-case runtime of any join algorithm.

WCOJ has the best runtime of all join algorithms when the query and data represent the worst possible scenario. These joins will become more common because the SQL:2023 standard includes property graph query extensions (see Peter Eisentraut’s overview). For example, finding a path in a graph involves a huge number of self-joins.
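To make the attribute-at-a-time idea concrete, here is a minimal Python sketch of a WCOJ-style evaluation of the triangle query Q(a, b, c) = R(a, b) ⋈ S(b, c) ⋈ T(a, c). The relation encoding and the naive set intersections are illustrative only, not a production trie-based implementation (e.g., Leapfrog Triejoin):

```python
# Attribute-at-a-time join for the triangle query, assuming each relation
# is a set of pairs. Illustrative sketch, not a real WCOJ implementation.

def triangle_wcoj(R, S, T):
    """Enumerate triangles one attribute at a time (a, then b, then c)."""
    results = []
    # Candidate values for attribute a: intersection of R's and T's first columns.
    a_vals = {a for a, _ in R} & {a for a, _ in T}
    for a in a_vals:
        # Candidate b: values paired with this a in R, intersected with S's first column.
        b_vals = {b for a2, b in R if a2 == a} & {b for b, _ in S}
        for b in b_vals:
            # Candidate c: intersection of S(b, ·) and T(a, ·).
            c_vals = {c for b2, c in S if b2 == b} & {c for a2, c in T if a2 == a}
            for c in c_vals:
                results.append((a, b, c))
    return results

R = {(1, 2), (1, 3)}
S = {(2, 3), (3, 1)}
T = {(1, 3)}
print(triangle_wcoj(R, S, T))  # [(1, 2, 3)]
```

Note how the nesting is over attributes a, b, c rather than over relations, so no intermediate binary-join result is ever materialized.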

Adopting Worst-Case Optimal Joins in Relational Database Systems (M. Freitag, et al., VLDB 2020) surveys WCOJ algorithms and provides an optimized implementation. No TL;DR here.

Lecture Note 09. Hash Join Algorithms

Hash join is one of the most important operators in an OLAP DBMS, although it is not the dominant cost. This lecture focuses almost entirely on hash join (partitioned vs. non-partitioned), since (1) nested-loop join is always the worst choice for OLAP and (2) sort-merge join is slower most of the time.

The design goals of hash join on modern hardware are (1) minimize synchronization between threads or cores and (2) minimize memory access cost, especially for remote NUMA access.

Hash Join Phase #1: Partition (optional)

Approach #1: Non-Blocking Partitioning

Approach #2: Blocking Partitioning (Radix Hash Join)

  1. Scan each input and build a histogram of tuple counts per partition (using the radix bits of the hash key).
  2. Compute a prefix sum over the histogram to determine each partition’s output offset.
  3. Scatter the tuples into their partitions; recurse on additional radix bits if the partitions are still too large for cache.
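The three steps can be sketched as a single-threaded radix partitioning pass. This is a hedged toy version: a real radix join runs the histogram/scatter per thread, partitions on hash bits rather than raw keys, and may recurse over more bits.

```python
# A minimal single-threaded sketch of radix partitioning (phase #1 of a radix
# hash join), assuming integer keys; partition on the low `bits` of the key.

def radix_partition(keys, bits=2):
    fanout = 1 << bits
    mask = fanout - 1
    # Step 1: histogram -- count how many keys fall into each partition.
    hist = [0] * fanout
    for k in keys:
        hist[k & mask] += 1
    # Step 2: prefix sum -- turn counts into starting write offsets.
    offsets = [0] * fanout
    for i in range(1, fanout):
        offsets[i] = offsets[i - 1] + hist[i - 1]
    # Step 3: scatter -- copy each key to its partition's next free slot.
    out = [None] * len(keys)
    cursor = offsets[:]
    for k in keys:
        p = k & mask
        out[cursor[p]] = k
        cursor[p] += 1
    return out, offsets

keys = [7, 4, 1, 6, 3, 0, 5, 2]
out, offsets = radix_partition(keys, bits=2)
print(out)      # keys grouped by low 2 bits: [4, 0, 1, 5, 6, 2, 7, 3]
print(offsets)  # [0, 2, 4, 6]
```

Because the histogram fixes every write location in advance, the scatter needs no synchronization on the output array when parallelized across threads.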


Hash Join Phase #2: Build

Hash tables design decisions:

  1. Hash Function: CRC32, CRC64, xxHash, …
  2. Hashing Scheme: Chained / Linear / Robin Hood Hashing / Hopscotch Hashing / Cuckoo Hashing

Hash table content trade-offs:

Hash Join Phase #3: Probe

Use a Bloom filter to skip probes that cannot match (as we do everywhere).
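Putting the build and probe phases together, here is a hedged sketch of a chained hash-table join with a Bloom filter in front of the probe; the hash functions and filter size are illustrative, not tuned:

```python
# Toy hash join: build a chained hash table plus a Bloom filter, then probe.
# Sizes and hash functions are illustrative only.

class BloomFilter:
    def __init__(self, nbits=1024):
        self.nbits = nbits
        self.bits = 0
    def _hashes(self, key):
        h = hash(key)
        return (h % self.nbits, (h // self.nbits) % self.nbits)
    def add(self, key):
        for i in self._hashes(key):
            self.bits |= 1 << i
    def might_contain(self, key):
        return all(self.bits >> i & 1 for i in self._hashes(key))

def hash_join(build_side, probe_side):
    # Build phase: chained hash table keyed on the join key, plus a Bloom filter.
    table, bloom = {}, BloomFilter()
    for key, payload in build_side:
        table.setdefault(key, []).append(payload)
        bloom.add(key)
    # Probe phase: check the Bloom filter first to skip keys that cannot match.
    out = []
    for key, payload in probe_side:
        if not bloom.might_contain(key):
            continue  # definite miss, no hash-table lookup needed
        for match in table.get(key, []):
            out.append((key, match, payload))
    return out

build = [(1, "a"), (2, "b")]
probe = [(2, "x"), (3, "y")]
print(hash_join(build, probe))  # [(2, 'b', 'x')]
```

The filter pays off when most probe keys miss: a few bit tests replace a cache-missing hash-table lookup.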


Reading Review Don’t Hold My Data Hostage: A Case for Client Protocol Redesign (M. Raasveldt, et al., VLDB 2017)

Lecture Note 07. Code Generation & Compilation

Code Specialization

Generate task-specific code for CPU-intensive tasks that have a similar execution pattern across different inputs, including:

Approach #1: Transpilation

Debugging is relatively easy, but compilation takes a relatively long time.

Approach #2: JIT Compilation

HyPer does aggressive operator fusion within pipelines and compiles queries in-memory into native code using the LLVM toolkit.
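The operator-fusion idea can be illustrated with a toy compiler that, instead of interpreting a plan operator by operator, generates one fused loop for the whole pipeline. This sketch "compiles" to Python source and `exec()`s it; a real system like HyPer emits LLVM IR, but the shape of the generated code is the same:

```python
# Toy query compilation in the HyPer style: scan, filter, and projection are
# fused into a single generated loop. Names and snippets are illustrative.

def compile_pipeline(predicate_src, projection_src):
    src = f"""
def pipeline(table):
    out = []
    for row in table:          # scan
        if {predicate_src}:    # filter, fused into the same loop
            out.append({projection_src})  # projection, also fused
    return out
"""
    ns = {}
    exec(src, ns)          # stand-in for JIT-compiling to native code
    return ns["pipeline"]

q = compile_pipeline("row['a'] > 10", "row['a'] + row['b']")
table = [{"a": 5, "b": 1}, {"a": 20, "b": 2}, {"a": 11, "b": 3}]
print(q(table))  # [22, 14]
```

The point of fusion is that a tuple stays in registers across the whole pipeline instead of being materialized between operators.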

Reading Review Efficiently Compiling Efficient Query Plans for Modern Hardware (T. Neumann, VLDB 2011)

Reading Review Make the Most out of Your SIMD Investments: Counter Control Flow Divergence in Compiled Query Pipelines (H. Lang, et al., VLDB Journal 2020)

Lecture Note 06. Vectorized Query Execution

Implementation Approaches

Implementing an algorithm using SIMD is still mostly a manual process.

Vectorization Fundamentals

Vectorized DBMS Algorithms

Takeaway: AVX-512 is not always faster than AVX2 because of clock-speed throttling.
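As a concrete example of a vectorized DBMS algorithm, here is a scalar Python simulation of SIMD selection: a predicate is evaluated on a fixed-width "vector" of lanes at once, producing a bitmask the way AVX2/AVX-512 compare instructions do, and the mask is then turned into a selection vector. The lane width and names are illustrative:

```python
# Scalar simulation of SIMD predicate evaluation with a bitmask per vector.

LANES = 8

def simd_select_gt(column, threshold):
    """Return indices of values > threshold, processing LANES values per step."""
    selected = []
    for base in range(0, len(column), LANES):
        chunk = column[base:base + LANES]
        # One "compare" across all lanes -> bitmask (bit i set if lane i matches).
        mask = 0
        for i, v in enumerate(chunk):
            mask |= (v > threshold) << i
        # Convert the bitmask into a selection vector of matching indices.
        while mask:
            i = (mask & -mask).bit_length() - 1  # index of lowest set bit
            selected.append(base + i)
            mask &= mask - 1                     # clear that bit
    return selected

col = [3, 9, 1, 15, 7, 12, 0, 8, 30, 2]
print(simd_select_gt(col, 7))  # [1, 3, 5, 7, 8]
```

In real SIMD code the inner per-lane loop is a single compare instruction, and the mask-to-indices step is often a table lookup or a compress instruction.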

Lecture Note 05. Query Execution & Processing II

This lecture is a quick overview of more design considerations when building an execution engine.

Parallel Execution

Operator Output

Internal Representation

The DBMS cannot use the on-disk file format (Parquet or ORC) for execution directly. Instead, it converts the data into an in-memory internal representation and then propagates the data through the query plan.

Apache Arrow is the best choice for internal data representation (Andy’s opinion). It’s a self-describing, language-agnostic in-memory columnar data format, which is optimized for cache-efficient and vectorized execution engines.

Expression Evaluation

Instead of traversing the expression tree for each tuple, Velox uses JIT compilation to evaluate the expression directly.

Adaptive Execution

Adaptive execution is a technique that allows the DBMS to modify a query’s plan and expression trees during execution, based on information gathered at runtime.
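One such adaptation can be sketched as runtime reordering of AND-ed filter predicates so that the most selective one runs first. The statistics and resort policy below are illustrative assumptions, not Velox’s actual code:

```python
# Hedged sketch of adaptive conjunct reordering: track each predicate's
# observed pass rate and periodically resort so cheap rejections come first.

class AdaptiveConjunct:
    def __init__(self, predicates):
        # For each predicate, track how many rows it saw and how many passed.
        self.preds = [{"fn": p, "seen": 1, "passed": 1} for p in predicates]

    def __call__(self, row):
        for p in self.preds:
            p["seen"] += 1
            if not p["fn"](row):
                return False          # short-circuit on the first failure
            p["passed"] += 1
        return True

    def adapt(self):
        # Resort: lowest observed pass rate (most selective) goes first.
        self.preds.sort(key=lambda p: p["passed"] / p["seen"])

f = AdaptiveConjunct([lambda r: r % 2 == 0, lambda r: r < 10])
rows = list(range(100))
out = [r for r in rows if f(r)]
f.adapt()   # after this batch, `r < 10` (more selective) is checked first
print(out)  # [0, 2, 4, 6, 8]
```

Because the conjuncts commute, reordering changes only the cost, never the result.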

Tricks used in Velox:

Reading Review Velox: Meta’s Unified Execution Engine (P. Pedreira, et al., VLDB 2022)

I’ll claim my privilege to skip the reading review for the MonetDB/X100 paper, since Andy said most OLAP DBMSs today follow that paper’s guidelines (no surprise; this paper is as old as me).

Lecture Note 04. Query Execution & Processing I

DBMS engineering is an orchestration of a bunch of optimizations, and no single one is more important than the others.

Overview optimization goals:

MonetDB/X100 Analysis

Modern CPUs organize instructions into pipeline stages and support multiple pipelines. The problems for a DBMS are (1) instruction dependencies and (2) branch misprediction.

Example solutions:

Processing Models

Processing model = control flow + data flow.

Plan Processing Direction

Filter Representation

Vectorized / bottom-up execution is almost always the better way to execute OLAP queries.

Lecture Note 03. Data Formats & Encoding II

I tried to read the papers before class, but they were hard to understand (FastLanes in particular). I decided to watch the video first, only to find that this lecture is mainly about those 3 papers… 😅 So this note will simply cover the main ideas.

Nested Data Representation (from last class)

  1. Shredding
  2. Length + Presence

Critiques of Existing Formats

The Ideas from Papers

The ideas from the 3 papers to solve these challenges:

Lesson Learned

Logical-physical data independence allows hacking on the physical representation/computation without changing the logical interfaces.

Data parallelism via SIMD is an important tool.

Lecture Note 02. Data Formats & Encoding I

OLAP workloads tend to do sequential scans on large segments of read-only data, while OLTP workloads mostly access individual tuples.

Sequential scan optimizations:

Storage Models

  1. NSM (N-ary Storage Model)
    • stores tuples contiguously
    • ideal for OLTP workloads (accessing individual entities / insert-heavy)
    • fixed page size
    • used by most OLTP DBMSs (PG, MySQL, Oracle)
  2. DSM (Decomposition Storage Model)
    • stores each attribute contiguously
    • ideal for OLAP (queries touch only a subset of attributes)
    • fixed-length offsets over embedded tuple IDs (how to convert variable-length data into fixed-length? dictionary compression over padding)
  3. PAX (Partition Attributes Across)
    • hybrid: row groups + column chunks
    • faster processing + spatial locality

Persistent Data Format

Design decisions from Apache Parquet / Apache ORC:

Lessons learned:

Reading Review An Empirical Evaluation of Columnar Storage Formats (X. Zeng, et al., VLDB 2023)

Reading Review The Composable Data Management System Manifesto (P. Pedreira, et al., VLDB 2023)

Lecture Note 01. Modern Analytical Database Systems


  1. Data Cube (’90s)

    DBMSs maintain pre-computed aggregations to speed up queries.

  2. Data Warehouses (’00s)

    • Monolithic DBMSs
    • Shared-nothing architectures
    • Column-oriented data storage
    • ETL data from OLTP databases into the OLAP data warehouse
  3. Shared-Disk Engines (’10s)

    • Shared-disk architectures
    • Third-party distributed storage instead of custom storage manager
    • OLTP databases store data in an object store (with a catalog manager); the OLAP query engine fetches data from it
  4. Lakehouse Systems (’20s)

    • SQL as well as non-SQL
    • decoupling data storage from DBMS
    • Unstructured / semi-structured data


(Distributed) Architecture

Architecture Overview

Distributed OLAP query execution is roughly the same as on a single node. For each operator, the DBMS considers where to fetch the input from and where to send the output.

Reading Review Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics (M. Armbrust, et al., CIDR 2021)