Index-Only Scans
PostgreSQL 9.6.2 Documentation | |||
---|---|---|---|
Prev | Up | Chapter 11. Indexes | Next |
All indexes in PostgreSQL are secondary indexes, meaning that each index is stored separately from the table's main data area (which is called the table's heap in PostgreSQL terminology). This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap. Furthermore, while the index entries that match a given indexable WHERE condition are usually close together in the index, the table rows they reference might be anywhere in the heap. The heap-access portion of an index scan thus involves a lot of random access into the heap, which can be slow, particularly on traditional rotating media. (As described in Section 11.5 , bitmap scans try to alleviate this cost by doing the heap accesses in sorted order, but that only goes so far.)
To solve this performance problem, PostgreSQL supports index-only scans , which can answer queries from an index alone without any heap access. The basic idea is to return values directly out of each index entry instead of consulting the associated heap entry. There are two fundamental restrictions on when this method can be used:
-
The index type must support index-only scans. B-tree indexes always do. GiST and SP-GiST indexes support index-only scans for some operator classes but not others. Other index types have no support. The underlying requirement is that the index must physically store, or else be able to reconstruct, the original data value for each index entry. As a counterexample, GIN indexes cannot support index-only scans because each index entry typically holds only part of the original data value.
-
The query must reference only columns stored in the index. For example, given an index on columns x and y of a table that also has a column z , these queries could use index-only scans:
SELECT x, y FROM tab WHERE x = 'key'; SELECT x FROM tab WHERE x = 'key' AND y < 42;
but these queries could not:
SELECT x, z FROM tab WHERE x = 'key'; SELECT x FROM tab WHERE x = 'key' AND z < 42;
(Expression indexes and partial indexes complicate this rule, as discussed below.)
If these two fundamental requirements are met, then all the data values required by the query are available from the index, so an index-only scan is physically possible. But there is an additional requirement for any table scan in PostgreSQL : it must verify that each retrieved row be "visible" to the query's MVCC snapshot, as discussed in Chapter 13 . Visibility information is not stored in index entries, only in heap entries; so at first glance it would seem that every row retrieval would require a heap access anyway. And this is indeed the case, if the table row has been modified recently. However, for seldom-changing data there is a way around this problem. PostgreSQL tracks, for each page in a table's heap, whether all rows stored in that page are old enough to be visible to all current and future transactions. This information is stored in a bit in the table's visibility map . An index-only scan, after finding a candidate index entry, checks the visibility map bit for the corresponding heap page. If it's set, the row is known visible and so the data can be returned with no further work. If it's not set, the heap entry must be visited to find out whether it's visible, so no performance advantage is gained over a standard index scan. Even in the successful case, this approach trades visibility map accesses for heap accesses; but since the visibility map is four orders of magnitude smaller than the heap it describes, far less physical I/O is needed to access it. In most situations the visibility map remains cached in memory all the time.
In short, while an index-only scan is possible given the two fundamental requirements, it will be a win only if a significant fraction of the table's heap pages have their all-visible map bits set. But tables in which a large fraction of the rows are unchanging are common enough to make this type of scan very useful in practice.
To make effective use of the index-only scan feature, you might choose to create indexes in which only the leading columns are meant to match WHERE clauses, while the trailing columns hold "payload" data to be returned by a query. For example, if you commonly run queries like
SELECT y FROM tab WHERE x = 'key';
the traditional approach to speeding up such queries would be to create an index on x only. However, an index on (x, y) would offer the possibility of implementing this query as an index-only scan. As previously discussed, such an index would be larger and hence more expensive than an index on x alone, so this is attractive only if the table is known to be mostly static. Note it's important that the index be declared on (x, y) not (y, x) , as for most index types (particularly B-trees) searches that do not constrain the leading index columns are not very efficient.
In principle, index-only scans can be used with expression indexes. For example, given an index on f(x) where x is a table column, it should be possible to execute
SELECT f(x) FROM tab WHERE f(x) < 1;
as an index-only scan; and this is very attractive if f() is an expensive-to-compute function. However, PostgreSQL 's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index. In this example, x is not needed except in the context f(x) , but the planner does not notice that and concludes that an index-only scan is not possible. If an index-only scan seems sufficiently worthwhile, this can be worked around by declaring the index to be on (f(x), x) , where the second column is not expected to be used in practice but is just there to convince the planner that an index-only scan is possible. An additional caveat, if the goal is to avoid recalculating f(x) , is that the planner won't necessarily match uses of f(x) that aren't in indexable WHERE clauses to the index column. It will usually get this right in simple queries such as shown above, but not in queries that involve joins. These deficiencies may be remedied in future versions of PostgreSQL .
Partial indexes also have interesting interactions with index-only scans. Consider the partial index shown in Example 11-3 :
CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target) WHERE success;
In principle, we could do an index-only scan on this index to satisfy a query like
SELECT target FROM tests WHERE subject = 'some-subject' AND success;
But there's a problem: the WHERE clause refers to success which is not available as a result column of the index. Nonetheless, an index-only scan is possible because the plan does not need to recheck that part of the WHERE clause at run time: all entries found in the index necessarily have success = true so this need not be explicitly checked in the plan. PostgreSQL versions 9.6 and later will recognize such cases and allow index-only scans to be generated, but older versions will not.