Why does Postgres time out on TPC-H Q17 and Q20?

TL;DR: The Postgres optimizer lacks subquery unnesting.

In recent years, I’ve been working on boosting the analytical capabilities of OLTP databases like Postgres. The most popular approach is embedding a DuckDB instance. To explain why this is worthwhile, I keep referring to the following TPC-H benchmark from the DuckDB blog. Notice vanilla Postgres times out on Q17 and Q20.

query	duckdb	duckdb/postgres	postgres
1	0.03	0.74	1.12
2	0.01	0.20	0.18
3	0.02	0.55	0.21
4	0.03	0.52	0.11
5	0.02	0.70	0.13
6	0.01	0.24	0.21
7	0.04	0.56	0.20
8	0.02	0.74	0.18
9	0.05	1.34	0.61
10	0.04	0.41	0.35
11	0.01	0.15	0.07
12	0.01	0.27	0.36
13	0.04	0.18	0.32
14	0.01	0.19	0.21
15	0.03	0.36	0.46
16	0.03	0.09	0.12
17	0.05	0.75	>60.00
18	0.08	0.97	1.05
19	0.03	0.32	0.31
20	0.05	0.37	>60.00
21	0.09	1.53	0.35
22	0.03	0.15	0.15

So what’s going on with these two queries? Let’s dig into Q17.

SELECT
    sum(l_extendedprice) / 7.0 AS avg_yearly
FROM
    lineitem,
    part
WHERE
    p_partkey = l_partkey
    AND p_brand = 'Brand#23'
    AND p_container = 'MED BOX'
    AND l_quantity < (
        SELECT
            0.2 * avg(l_quantity)
        FROM
            lineitem
        WHERE
            l_partkey = p_partkey);

The problem is the correlated subquery, which refers to p_partkey from the outer query. In Postgres, this subquery (SubPlan 1 in the plan below) is evaluated as a join filter: it runs once for every row that passes the hash join, and each run scans lineitem again. Extremely inefficient.

Aggregate  (cost=24.54..24.55 rows=1 width=32)
  ->  Hash Join  (cost=12.26..24.54 rows=1 width=18)
        Hash Cond: (lineitem.l_partkey = part.p_partkey)
        Join Filter: (lineitem.l_quantity < (SubPlan 1))
        ->  Seq Scan on lineitem  (cost=0.00..11.80 rows=180 width=40)
        ->  Hash  (cost=12.25..12.25 rows=1 width=4)
              ->  Seq Scan on part  (cost=0.00..12.25 rows=1 width=4)
                    Filter: ((p_brand = 'Brand#23'::bpchar) AND (p_container = 'MED BOX'::bpchar))
        SubPlan 1
          ->  Aggregate  (cost=12.25..12.27 rows=1 width=32)
                ->  Seq Scan on lineitem lineitem_1  (cost=0.00..12.25 rows=1 width=18)
                      Filter: (l_partkey = part.p_partkey)
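
To make the cost concrete, here is a minimal Python sketch of what that plan does. It is a simplified model, not Postgres internals: the `LineItem` class, the `q17_with_subplan` function, and the toy rows are all made up for illustration. The point is the counter, which shows that every joined row triggers another full pass over lineitem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    partkey: int
    quantity: float
    extendedprice: float


# Hypothetical toy data: parts 1 and 2 match the brand/container filter, part 3 does not.
LINEITEM = [
    LineItem(1, 1.0, 70.0), LineItem(1, 10.0, 700.0), LineItem(1, 19.0, 1330.0),
    LineItem(2, 2.0, 140.0), LineItem(2, 20.0, 1400.0), LineItem(2, 50.0, 3500.0),
    LineItem(3, 1.0, 35.0), LineItem(3, 40.0, 1400.0),
]


def q17_with_subplan(lineitem, qualifying_partkeys):
    """Model the plan above; return (avg_yearly, lineitem rows read by the subplan)."""
    rows_rescanned = 0
    total = 0.0
    for row in lineitem:  # outer Seq Scan on lineitem
        if row.partkey not in qualifying_partkeys:  # Hash Cond against filtered part
            continue
        # SubPlan 1: a fresh pass over lineitem for this joined row.
        quantities = []
        for inner in lineitem:
            rows_rescanned += 1
            if inner.partkey == row.partkey:
                quantities.append(inner.quantity)
        threshold = 0.2 * sum(quantities) / len(quantities)
        if row.quantity < threshold:  # Join Filter
            total += row.extendedprice
    return total / 7.0, rows_rescanned


avg_yearly, rows_rescanned = q17_with_subplan(LINEITEM, {1, 2})
print(avg_yearly, rows_rescanned)  # 30.0 48
```

Six rows survive the join and each one rescans all eight lineitem rows, hence 48 reads. Scale both numbers up to real TPC-H sizes and the timeout is no surprise.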

DuckDB handles this differently by unnesting correlated subqueries. It replaces the correlated subquery with a join against a grouped aggregate, so lineitem is no longer rescanned once per outer row.

DuckDB Q17 plan
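
You can see the idea by doing the rewrite by hand. The sketch below is my own manual rewrite of Q17, not the plan DuckDB actually produces: it computes the per-part threshold once with GROUP BY and joins it back. I run both versions on SQLite (via Python's built-in sqlite3 module) over a made-up toy dataset, only to check that they agree; the `agg`, `agg_partkey`, and `threshold` names are mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (p_partkey INTEGER PRIMARY KEY, p_brand TEXT, p_container TEXT);
    CREATE TABLE lineitem (l_partkey INTEGER, l_quantity REAL, l_extendedprice REAL);
    INSERT INTO part VALUES
        (1, 'Brand#23', 'MED BOX'), (2, 'Brand#23', 'MED BOX'), (3, 'Brand#11', 'LG CASE');
    INSERT INTO lineitem VALUES
        (1, 1, 70.0), (1, 10, 700.0), (1, 19, 1330.0),
        (2, 2, 140.0), (2, 20, 1400.0), (2, 50, 3500.0),
        (3, 1, 35.0), (3, 40, 1400.0);
""")

# Q17 as written in the benchmark: the subquery depends on the outer p_partkey.
CORRELATED = """
    SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
    FROM lineitem, part
    WHERE p_partkey = l_partkey
      AND p_brand = 'Brand#23'
      AND p_container = 'MED BOX'
      AND l_quantity < (
          SELECT 0.2 * avg(l_quantity)
          FROM lineitem
          WHERE l_partkey = p_partkey)
"""

# Hand-unnested version: aggregate lineitem once per part key, then join.
UNNESTED = """
    SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
    FROM lineitem
    JOIN part ON p_partkey = l_partkey
    JOIN (
        SELECT l_partkey AS agg_partkey, 0.2 * avg(l_quantity) AS threshold
        FROM lineitem
        GROUP BY l_partkey
    ) AS agg ON agg_partkey = p_partkey
    WHERE p_brand = 'Brand#23'
      AND p_container = 'MED BOX'
      AND l_quantity < threshold
"""

correlated_result = conn.execute(CORRELATED).fetchone()[0]
unnested_result = conn.execute(UNNESTED).fetchone()[0]
print(correlated_result, unnested_result)  # 30.0 30.0
```

The rewrite is safe here because the outer join on l_partkey guarantees that every qualifying part has at least one lineitem row, so the grouped aggregate never misses a key. A general unnesting algorithm has to handle the cases where that does not hold.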

This optimization technique originates from the paper Unnesting Arbitrary Queries (2015). As a follow-up, A Formalization of Top-Down Unnesting (2024) provides a formal proof of correctness for the unnesting approach presented in the 2015 paper and extends it to a top-down algorithm. Today, (almost) every modern OLAP database and engine implements this optimization.

Q20 suffers from the same issue in Postgres, so I'll skip it here.

Let’s circle back to the benchmark. You might’ve noticed that while DuckDB beats Postgres on all queries, the DuckDB-Postgres connector (duckdb_pg) doesn’t always beat vanilla Postgres. The same thing happens with pg_duckdb, the DuckDB-in-Postgres extension (the names are so confusing 😂).

The performance gap comes from all the modern techniques DuckDB uses: vectorized execution, morsel-driven parallelism, columnar storage, you name it. But when the two query plans are of similar complexity, the execution speedup gets eaten by the overhead of converting data from Postgres's row format to DuckDB's columnar format. For complex queries like Q17 and Q20 though, the conversion cost is worth it for a much better query plan.
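
As a rough picture of where that conversion overhead comes from, here is a toy Python sketch. It is not how duckdb_pg or pg_duckdb actually move data; `rows_to_columns` and the sample rows are hypothetical. It only shows that pivoting rows into columns touches every value once before any columnar operator can start.

```python
def rows_to_columns(rows, column_names):
    """Pivot row tuples into one list per column (a toy stand-in for a columnar batch)."""
    columns = {name: [] for name in column_names}
    for row in rows:  # every value is visited once during the conversion
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


# Hypothetical rows as a row store would hand them over: one tuple per record.
rows = [(1, 1.0, 70.0), (1, 10.0, 700.0), (2, 2.0, 140.0)]
columns = rows_to_columns(rows, ("l_partkey", "l_quantity", "l_extendedprice"))

# Once the data is columnar, an aggregate only needs to read a single column.
total_price = sum(columns["l_extendedprice"])
print(total_price)  # 910.0
```

For a cheap query, that extra pass can cost as much as the query itself. For Q17, it is noise next to the rescans it helps avoid.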