12.2. Tables and Indexes
The examples in the previous section illustrated full text matching using simple constant strings. This section shows how to search table data, optionally using indexes.
12.2.1. Searching a Table
It is possible to do a full text search without an index. A simple query
to print the
title
of each row that contains the word
friend
in its
body
field is:
SELECT title FROM pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
This will also find related words such as
friends
and
friendly
, since all these are reduced to the same
normalized lexeme.
The query above specifies that the
english
configuration
is to be used to parse and normalize the strings. Alternatively we
could omit the configuration parameters:
SELECT title FROM pgweb WHERE to_tsvector(body) @@ to_tsquery('friend');
This query will use the configuration set by default_text_search_config .
A more complex example is to
select the ten most recent documents that contain
create
and
table
in the
title
or
body
:
SELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;
For clarity we omitted the
coalesce
function calls
which would be needed to find rows that contain
NULL
in one of the two fields.
Although these queries will work without an index, most applications will find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index.
12.2.2. Creating Indexes
We can create a GIN index ( Section 12.9 ) to speed up text searches:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
Notice that the 2-argument version of
to_tsvector
is
used. Only text search functions that specify a configuration name can
be used in expression indexes (
Section 11.7
).
This is because the index contents must be unaffected by
default_text_search_config
. If they were affected, the
index contents might be inconsistent because different entries could
contain
tsvector
s that were created with different text search
configurations, and there would be no way to guess which was which. It
would be impossible to dump and restore such an index correctly.
Because the two-argument version of
to_tsvector
was
used in the index above, only a query reference that uses the 2-argument
version of
to_tsvector
with the same configuration
name will use that index. That is,
WHERE
to_tsvector('english', body) @@ 'a & b'
can use the index,
but
WHERE to_tsvector(body) @@ 'a & b'
cannot.
This ensures that an index will be used only with the same configuration
used to create the index entries.
It is possible to set up more complex expression indexes wherein the configuration name is specified by another column, e.g.:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body));
where
config_name
is a column in the
pgweb
table. This allows mixed configurations in the same index while
recording which configuration was used for each index entry. This
would be useful, for example, if the document collection contained
documents in different languages. Again,
queries that are meant to use the index must be phrased to match, e.g.,
WHERE to_tsvector(config_name, body) @@ 'a & b'
.
Indexes can even concatenate columns:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body));
Another approach is to create a separate
tsvector
column
to hold the output of
to_tsvector
. This example is a
concatenation of
title
and
body
,
using
coalesce
to ensure that one field will still be
indexed when the other is
NULL
:
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector; UPDATE pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
Then we create a GIN index to speed up the search:
CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);
Now we are ready to perform a fast full text search:
SELECT title FROM pgweb WHERE textsearchable_index_col @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;
When using a separate column to store the
tsvector
representation,
it is necessary to create a trigger to keep the
tsvector
column current anytime
title
or
body
changes.
Section 12.4.3
explains how to do that.
One advantage of the separate-column approach over an expression index
is that it is not necessary to explicitly specify the text search
configuration in queries in order to make use of the index. As shown
in the example above, the query can depend on
default_text_search_config
. Another advantage is that
searches will be faster, since it will not be necessary to redo the
to_tsvector
calls to verify index matches. (This is more
important when using a GiST index than a GIN index; see
Section 12.9
.) The expression-index approach is
simpler to set up, however, and it requires less disk space since the
tsvector
representation is not stored explicitly.