# 70.2. Multivariate Statistics Examples

## 70.2.1. Functional Dependencies

Multivariate correlation can be demonstrated with a very simple data set - a table with two columns, both containing the same values:

CREATE TABLE t (a INT, b INT); INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i); ANALYZE t;

As explained in
Section 14.2
, the planner can determine
cardinality of
```
t
```

using the number of pages and
rows obtained from
```
pg_class
```

:

SELECT relpages, reltuples FROM pg_class WHERE relname = 't'; relpages | reltuples ----------+----------- 45 | 10000

The data distribution is very simple; there are only 100 distinct values in each column, uniformly distributed.

The following example shows the result of estimating a
```
WHERE
```

condition on the
```
a
```

column:

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1; QUERY PLAN ------------------------------------------------------------------------------- Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1) Filter: (a = 1) Rows Removed by Filter: 9900

The planner examines the condition and determines the selectivity
of this clause to be 1%. By comparing this estimate and the actual
number of rows, we see that the estimate is very accurate
(in fact exact, as the table is very small). Changing the
```
WHERE
```

condition to use the
```
b
```

column, an
identical plan is generated. But observe what happens if we apply the same
condition on both columns, combining them with
```
AND
```

:

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1; QUERY PLAN ----------------------------------------------------------------------------- Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1) Filter: ((a = 1) AND (b = 1)) Rows Removed by Filter: 9900

The planner estimates the selectivity for each condition individually, arriving at the same 1% estimates as above. Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher.

This problem can be fixed by creating a statistics object that
directs
```
ANALYZE
```

to calculate functional-dependency
multivariate statistics on the two columns:

CREATE STATISTICS stts (dependencies) ON a, b FROM t; ANALYZE t; EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1; QUERY PLAN ------------------------------------------------------------------------------- Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1) Filter: ((a = 1) AND (b = 1)) Rows Removed by Filter: 9900

## 70.2.2. Multivariate N-Distinct Counts

A similar problem occurs with estimation of the cardinality of sets of
multiple columns, such as the number of groups that would be generated by
a
```
GROUP BY
```

clause. When
```
GROUP BY
```

lists a single column, the n-distinct estimate (which is visible as the
estimated number of rows returned by the HashAggregate node) is very
accurate:

EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a; QUERY PLAN ----------------------------------------------------------------------------------------- HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1) Group Key: a -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)

But without multivariate statistics, the estimate for the number of
groups in a query with two columns in
```
GROUP BY
```

, as
in the following example, is off by an order of magnitude:

EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b; QUERY PLAN -------------------------------------------------------------------------------------------- HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1) Group Key: a, b -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)

By redefining the statistics object to include n-distinct counts for the two columns, the estimate is much improved:

DROP STATISTICS stts; CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t; ANALYZE t; EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b; QUERY PLAN -------------------------------------------------------------------------------------------- HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1) Group Key: a, b -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)