Testing and Debugging Text Search
PostgreSQL 9.6.2 Documentation | |||
---|---|---|---|
Prev | Up | Chapter 12. Full Text Search | Next |
The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.
12.8.1. Configuration Testing
The function
ts_debug
allows easy testing of a
text search configuration.
ts_debug([ config regconfig, ] document text,
OUT alias text,
OUT description text,
OUT token text,
OUT dictionaries regdictionary[],
OUT dictionary regdictionary,
OUT lexemes text[])
returns setof record
ts_debug
displays information about every token of
document
as produced by the
parser and processed by the configured dictionaries. It uses the
configuration specified by
config
,
or
default_text_search_config
if that argument is
omitted.
ts_debug
returns one row for each token identified in the text
by the parser. The columns returned are
-
alias text - short name of the token type
-
description text - description of the token type
-
token text - text of the token
-
dictionaries regdictionary[] - the dictionaries selected by the configuration for this token type
-
dictionary regdictionary - the dictionary that recognized the token, or NULL if none did
-
lexemes text[] - the lexeme(s) produced by the dictionary that recognized the token, or NULL if none did; an empty array ( {} ) means it was recognized as a stop word
Here is a simple example:
SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-------+----------------+--------------+--------- asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat} blank | Space symbols | | {} | | asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat} blank | Space symbols | | {} | | asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat} blank | Space symbols | | {} | | asciiword | Word, all ASCII | on | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat} blank | Space symbols | | {} | | blank | Space symbols | - | {} | | asciiword | Word, all ASCII | it | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate} blank | Space symbols | | {} | | asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat} blank | Space symbols | | {} | | asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat}
For a more extensive demonstration, we first create a public.english configuration and Ispell dictionary for the English language:
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english ); CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, DictFile = english, AffFile = english, StopWords = english ); ALTER TEXT SEARCH CONFIGURATION public.english ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
SELECT * FROM ts_debug('public.english','The Brightest supernovaes'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-------------+-------------------------------+----------------+------------- asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright} blank | Space symbols | | {} | | asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova}
In this example, the word Brightest was recognized by the parser as an ASCII word (alias asciiword ). For this token type the dictionary list is english_ispell and english_stem . The word was recognized by english_ispell , which reduced it to the noun bright . The word supernovaes is unknown to the english_ispell dictionary so it was passed to the next dictionary, and, fortunately, was recognized (in fact, english_stem is a Snowball dictionary which recognizes everything; that is why it was placed at the end of the dictionary list).
The word The was recognized by the english_ispell dictionary as a stop word ( Section 12.6.1 ) and will not be indexed. The spaces are discarded too, since the configuration provides no dictionaries at all for them.
You can reduce the width of the output by explicitly specifying which columns you want to see:
SELECT alias, token, dictionary, lexemes FROM ts_debug('public.english','The Brightest supernovaes'); alias | token | dictionary | lexemes -----------+-------------+----------------+------------- asciiword | The | english_ispell | {} blank | | | asciiword | Brightest | english_ispell | {bright} blank | | | asciiword | supernovaes | english_stem | {supernova}
12.8.2. Parser Testing
The following functions allow direct testing of a text search parser.
ts_parse(parser_name text, document text, OUT tokid integer, OUT token text) returns setof record ts_parse(parser_oid oid, document text, OUT tokid integer, OUT token text) returns setof record
ts_parse
parses the given
document
and returns a series of records, one for each token produced by
parsing. Each record includes a
tokid
showing the
assigned token type and a
token
which is the text of the
token. For example:
SELECT * FROM ts_parse('default', '123 - a number'); tokid | token -------+-------- 22 | 123 12 | 12 | - 1 | a 12 | 1 | number
ts_token_type(parser_name text, OUT tokid integer, OUT alias text, OUT description text) returns setof record ts_token_type(parser_oid oid, OUT tokid integer, OUT alias text, OUT description text) returns setof record
ts_token_type
returns a table which describes each type of
token the specified parser can recognize. For each token type, the table
gives the integer
tokid
that the parser uses to label a
token of that type, the
alias
that names the token type
in configuration commands, and a short
description
. For
example:
SELECT * FROM ts_token_type('default'); tokid | alias | description -------+-----------------+------------------------------------------ 1 | asciiword | Word, all ASCII 2 | word | Word, all letters 3 | numword | Word, letters and digits 4 | email | Email address 5 | url | URL 6 | host | Host 7 | sfloat | Scientific notation 8 | version | Version number 9 | hword_numpart | Hyphenated word part, letters and digits 10 | hword_part | Hyphenated word part, all letters 11 | hword_asciipart | Hyphenated word part, all ASCII 12 | blank | Space symbols 13 | tag | XML tag 14 | protocol | Protocol head 15 | numhword | Hyphenated word, letters and digits 16 | asciihword | Hyphenated word, all ASCII 17 | hword | Hyphenated word, all letters 18 | url_path | URL path 19 | file | File or path name 20 | float | Decimal notation 21 | int | Signed integer 22 | uint | Unsigned integer 23 | entity | XML entity
12.8.3. Dictionary Testing
The
ts_lexize
function facilitates dictionary testing.
ts_lexize(dict regdictionary, token text) returns text[]
ts_lexize
returns an array of lexemes if the input
token
is known to the dictionary,
or an empty array if the token
is known to the dictionary but it is a stop word, or
NULL
if it is an unknown word.
Examples:
SELECT ts_lexize('english_stem', 'stars'); ts_lexize ----------- {star} SELECT ts_lexize('english_stem', 'a'); ts_lexize ----------- {}
Note: The
ts_lexize
function expects a single token , not text. Here is a case where this can be confusing:SELECT ts_lexize('thesaurus_astro','supernovae stars') is null; ?column? ---------- tThe thesaurus dictionary thesaurus_astro does know the phrase supernovae stars , but
ts_lexize
fails since it does not parse the input text but treats it as a single token. Useplainto_tsquery
orto_tsvector
to test thesaurus dictionaries, for example:SELECT plainto_tsquery('supernovae stars'); plainto_tsquery ----------------- 'sn'