8.11. Text Search Types
  
   PostgreSQL
  
  provides two data types that
    are designed to support full text search, which is the activity of
    searching through a collection of natural-language
  
   documents
  
  to locate those that best match a
  
   query
  
  .
    The
  
   tsvector
  
  type represents a document in a form optimized
    for text search; the
  
   tsquery
  
  type similarly represents
    a text query.
  
   Chapter 12
  
  provides a detailed explanation of this
    facility, and
  
   Section 9.13
  
  summarizes the
    related functions and operators.
 
      8.11.1. 
      
       tsvector
      
     
    
   A
   
    tsvector
   
   value is a sorted list of distinct
   
    lexemes
   
   , which are words that have been
   
    normalized
   
   to merge different variants of the same word
     (see
   
    Chapter 12
   
   for details).  Sorting and
     duplicate-elimination are done automatically during input, as shown in
     this example:
  
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
  To represent lexemes containing whitespace or punctuation, surround them with quotes:
SELECT $$the lexeme '    ' contains spaces$$::tsvector;
                 tsvector                  
-------------------------------------------
 '    ' 'contains' 'lexeme' 'spaces' 'the'
  (We use dollar-quoted string literals in this example and the next one to avoid the confusion of having to double quote marks within the literals.) Embedded quotes and backslashes must be doubled:
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector                    
------------------------------------------------
 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
  Optionally, integer positions can be attached to lexemes:
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                  tsvector
------------------------------------------------------------------- ------------
 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
  A position normally indicates the source word's location in the document. Positional information can be used for proximity ranking . Position values can range from 1 to 16383; larger numbers are silently set to 16383. Duplicate positions for the same lexeme are discarded.
   Lexemes that have positions can further be labeled with a
   
    weight
   
   , which can be
   
    A
   
   ,
   
    B
   
   ,
   
    C
   
   , or
   
    D
   
   .
   
    D
   
   is the default and hence is not shown on output:
  
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector          
----------------------------
 'a':1A 'cat':5 'fat':2B,4C
  Weights are typically used to reflect document structure, for example by marking title words differently from body words. Text search ranking functions can assign different priorities to the different weight markers.
   It is important to understand that the
   
    tsvector
   
   type itself does not perform any word
     normalization; it assumes the words it is given are normalized
     appropriately for the application.  For example,
  
SELECT 'The Fat Rats'::tsvector;
      tsvector      
--------------------
 'Fat' 'Rats' 'The'
  
   For most English-text-searching applications the above words would
     be considered non-normalized, but
   
    tsvector
   
   doesn't care.
     Raw document text should usually be passed through
   
    to_tsvector
   
   to normalize the words appropriately
     for searching:
  
SELECT to_tsvector('english', 'The Fat Rats');
   to_tsvector   
-----------------
 'fat':2 'rat':3
  Again, see Chapter 12 for more detail.
      8.11.2. 
      
       tsquery
      
     
    
   A
   
    tsquery
   
   value stores lexemes that are to be
     searched for, and can combine them using the Boolean operators
   
    &
   
   (AND),
   
    |
   
   (OR), and
   
    !
   
   (NOT), as well as the phrase search operator
   
    <->
   
   (FOLLOWED BY).  There is also a variant
   
    <
    
     
   of the FOLLOWED BY
     operator, where
   
    
      N
     
    
    >
   
     N
    
   
   is an integer constant that
     specifies the distance between the two lexemes being searched
     for.
   
    <->
   
   is equivalent to
   
    <1>
   
   .
  
   Parentheses can be used to enforce grouping of these operators.
     In the absence of parentheses,
   
    !
   
   (NOT) binds most tightly,
   
    <->
   
   (FOLLOWED BY) next most tightly, then
   
    &
   
   (AND), with
   
    |
   
   (OR) binding
     the least tightly.
  
Here are some examples:
SELECT 'fat & rat'::tsquery;
    tsquery    
---------------
 'fat' & 'rat'
SELECT 'fat & (rat | cat)'::tsquery;
          tsquery          
---------------------------
 'fat' & ( 'rat' | 'cat' )
SELECT 'fat & rat & ! cat'::tsquery;
        tsquery         
------------------------
 'fat' & 'rat' & !'cat'
  
   Optionally, lexemes in a
   
    tsquery
   
   can be labeled with
     one or more weight letters, which restricts them to match only
   
    tsvector
   
   lexemes with one of those weights:
  
SELECT 'fat:ab & cat'::tsquery;
    tsquery
------------------
 'fat':AB & 'cat'
  
   Also, lexemes in a
   
    tsquery
   
   can be labeled with
   
    *
   
   to specify prefix matching:
  
SELECT 'super:*'::tsquery; tsquery ----------- 'super':*
   This query will match any word in a
   
    tsvector
   
   that begins
     with
   
    "
    
     super
    
    "
   
   .
  
   Quoting rules for lexemes are the same as described previously for
     lexemes in
   
    tsvector
   
   ; and, as with
   
    tsvector
   
   ,
     any required normalization of words must be done before converting
     to the
   
    tsquery
   
   type.  The
   
    to_tsquery
   
   function is convenient for performing such normalization:
  
SELECT to_tsquery('Fat:ab & Cats');
    to_tsquery    
------------------
 'fat':AB & 'cat'
  
   Note that
   
    to_tsquery
   
   will process prefixes in the same way
     as other words, which means this comparison returns true:
  
SELECT to_tsvector( 'postgraduate' ) @@ to_tsquery( 'postgres:*' ); ?column? ---------- t
   because
   
    postgres
   
   gets stemmed to
   
    postgr
   
   :
  
SELECT to_tsvector( 'postgraduate' ), to_tsquery( 'postgres:*' ); to_tsvector | to_tsquery ---------------+------------ 'postgradu':1 | 'postgr':*
   which will match the stemmed form of
   
    postgraduate
   
   .