rules table
Name
rules table — The rules table contains a set of rules that map address input sequence tokens to a standardized output sequence. A rule is defined as a set of input tokens followed by -1 (terminator), followed by a set of output tokens, followed by -1, followed by a number denoting the kind of rule, followed by the ranking of the rule.
Description
A rules table must have at least the following columns, though you are allowed to add more for your own uses.
- id
 - 
     
Primary key of table
 - rule
 - 
     
Text field denoting the rule. Details at PAGC Address Standardizer Rule records.
A rule consists of a set of non-negative integers representing input tokens, terminated by a -1, followed by an equal number of non-negative integers representing postal attributes, terminated by a -1, followed by an integer representing a rule type, followed by an integer representing the rank of the rule. The rules are ranked from 0 (lowest) to 17 (highest).
So for example the rule
2 0 2 22 3 -1 5 5 6 7 3 -1 2 6
maps the input token sequence TYPE NUMBER TYPE DIRECT QUALIF to the output sequence STREET STREET SUFTYP SUFDIR QUALIF. The rule is an ARC_C rule of rank 6.
Numbers for corresponding output tokens are listed in stdaddr.
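The rule layout described above can be unpacked mechanically. The following is a minimal Python sketch, not part of the address standardizer itself: the names `Rule` and `parse_rule` are hypothetical, and the rule type names are the ones listed under Rule Types and Rank on this page.

```python
from typing import List, NamedTuple

# Rule type numbers as listed under "Rule Types and Rank".
RULE_TYPES = {0: "MACRO_C", 1: "MICRO_C", 2: "ARC_C", 3: "CIVIC_C", 4: "EXTRA_C"}


class Rule(NamedTuple):  # hypothetical container, for illustration only
    input_tokens: List[int]
    output_tokens: List[int]
    rule_type: str
    rank: int


def parse_rule(text: str) -> Rule:
    """Split a rule string into inputs, outputs, rule type and rank."""
    nums = [int(part) for part in text.split()]
    first = nums.index(-1)               # terminator after the input tokens
    second = nums.index(-1, first + 1)   # terminator after the output tokens
    inputs = nums[:first]
    outputs = nums[first + 1:second]
    tail = nums[second + 1:]
    if len(tail) != 2:
        raise ValueError("expected a rule type and a rank after the second -1")
    if len(inputs) != len(outputs):
        raise ValueError("input and output token counts differ")
    type_number, rank = tail
    if type_number not in RULE_TYPES:
        raise ValueError(f"unknown rule type {type_number}")
    if not 0 <= rank <= 17:
        raise ValueError("rank must be between 0 and 17")
    return Rule(inputs, outputs, RULE_TYPES[type_number], rank)


rule = parse_rule("2 0 2 22 3 -1 5 5 6 7 3 -1 2 6")
print(rule.rule_type, rule.rank)  # prints: ARC_C 6
```

The sketch rejects a rule whose input and output counts differ, following the "equal number" wording above.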
 
Input Tokens
Each rule starts with a set of input tokens followed by a terminator -1. Valid input tokens excerpted from PAGC Input Tokens are as follows:
  
Form-Based Input Tokens
- AMPERS
 - 
     
(13). The ampersand (&) is frequently used to abbreviate the word "and".
 - DASH
 - 
     
(9). A punctuation character.
 - DOUBLE
 - 
     
(21). A sequence of two letters. Often used as identifiers.
 - FRACT
 - 
     
(25). Fractions are sometimes used in civic numbers or unit numbers.
 - MIXED
 - 
     
(23). An alphanumeric string that contains both letters and digits. Used for identifiers.
 - NUMBER
 - 
     
(0). A string of digits.
 - ORD
 - 
     
(15). Representations such as First or 1st. Often used in street names.
 - SINGLE
 - 
     
(18). A single letter.
 - WORD
 - 
     
(1). A word is a string of letters of arbitrary length. A single letter can be both a SINGLE and a WORD.
 
Function-based Input Tokens
- BOXH
 - 
     
(14). Words used to denote post office boxes. For example Box or PO Box .
 - BUILDH
 - 
     
(19). Words used to denote buildings or building complexes, usually as a prefix. For example: Tower in Tower 7A .
 - BUILDT
 - 
     
(24). Words and abbreviations used to denote buildings or building complexes, usually as a suffix. For example: Shopping Centre .
 - DIRECT
 - 
     
(22). Words used to denote directions, for example North .
 - MILE
 - 
     
(20). Words used to denote milepost addresses.
 - ROAD
 - 
     
(6). Words and abbreviations used to denote highways and roads. For example: the Interstate in Interstate 5.
 - RR
 - 
     
(8). Words and abbreviations used to denote rural routes. For example: RR.
 - TYPE
 - 
     
(2). Words and abbreviations used to denote street types. For example: ST or AVE.
 - UNITH
 - 
     
(16). Words and abbreviations used to denote internal subaddresses. For example: APT or UNIT.
 
Postal Type Input Tokens
- QUINT
 - 
     
(28). A 5 digit number. Identifies a ZIP Code.
 - QUAD
 - 
     
(29). A 4 digit number. Identifies ZIP4.
 - PCH
 - 
     
(27). A 3 character sequence of letter number letter. Identifies an FSA, the first 3 characters of a Canadian postal code.
 - PCT
 - 
     
(26). A 3 character sequence of number letter number. Identifies an LDU, the last 3 characters of a Canadian postal code.
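When reading rules by eye, it can help to translate input token numbers back to names. The following is a minimal Python sketch that uses only numbers given on this page; `INPUT_TOKENS` and `decode_inputs` are hypothetical names. QUALIF (3) comes from the example rule earlier on this page rather than from the lists above, and STOPWORD (7) is described under Stopwords.

```python
# Input token numbers as listed on this page (lookup table for illustration).
INPUT_TOKENS = {
    # Form-based
    13: "AMPERS", 9: "DASH", 21: "DOUBLE", 25: "FRACT", 23: "MIXED",
    0: "NUMBER", 15: "ORD", 18: "SINGLE", 1: "WORD",
    # Function-based
    14: "BOXH", 19: "BUILDH", 24: "BUILDT", 22: "DIRECT", 20: "MILE",
    6: "ROAD", 8: "RR", 2: "TYPE", 16: "UNITH",
    # Postal type
    28: "QUINT", 29: "QUAD", 27: "PCH", 26: "PCT",
    # Taken from the example rule and the Stopwords section
    3: "QUALIF", 7: "STOPWORD",
}


def decode_inputs(numbers):
    """Return token names; numbers not in the table are shown as UNKNOWN(n)."""
    return [INPUT_TOKENS.get(n, f"UNKNOWN({n})") for n in numbers]


print(decode_inputs([2, 0, 2, 22, 3]))
# prints: ['TYPE', 'NUMBER', 'TYPE', 'DIRECT', 'QUALIF']
```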
 
Stopwords
STOPWORDs combine with WORDs. In rules, a string of multiple WORDs and STOPWORDs is represented by a single WORD token.
- STOPWORD
 - 
     
(7). A word with low lexical significance, that can be omitted in parsing. For example: THE .
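One way to picture how STOPWORDs combine with WORDs is as a collapse step over a token sequence. The following Python sketch is only one reading of the statement above, not the standardizer's actual algorithm: `collapse_words` is a hypothetical helper, and it assumes that any run of two or more consecutive WORD (1) or STOPWORD (7) tokens becomes a single WORD while a lone token is left unchanged.

```python
from itertools import groupby

WORD, STOPWORD = 1, 7  # token numbers from the lists on this page


def collapse_words(tokens):
    """Collapse each run of 2+ WORD/STOPWORD tokens into one WORD (assumed behavior)."""
    collapsed = []
    for is_wordlike, group in groupby(tokens, key=lambda t: t in (WORD, STOPWORD)):
        run = list(group)
        if is_wordlike and len(run) > 1:
            collapsed.append(WORD)   # the whole run is represented by one WORD
        else:
            collapsed.extend(run)    # other tokens pass through untouched
    return collapsed


# Hypothetical token sequence STOPWORD WORD WORD TYPE
print(collapse_words([7, 1, 1, 2]))  # prints: [1, 2]
```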
 
Output Tokens
   After the first -1 (terminator) come the output tokens in order, followed by another terminator -1. Numbers for corresponding output tokens are listed in stdaddr. Which output tokens are allowed depends on the kind of rule. Output tokens valid for each rule type are listed in the section called “Rule Types and Rank”.
  
Rule Types and Rank
The final part of the rule is the rule type which is denoted by one of the following, followed by a rule rank. The rules are ranked from 0 (lowest) to 17 (highest).
MACRO_C
(token number = "0"). The class of rules for parsing MACRO clauses such as PLACE STATE ZIP.
MACRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
- CITY
 - 
     
(token number "10"). Example "Albany"
 - STATE
 - 
     
(token number "11"). Example "NY"
 - NATION
 - 
     
(token number "12"). This attribute is not used in most reference files. Example "USA"
 - POSTAL
 - 
     
(token number "13"). (SADS elements "ZIP CODE" , "PLUS 4" ). This attribute is used for both the US Zip and the Canadian Postal Codes.
 
MICRO_C
(token number = "1"). The class of rules for parsing full MICRO clauses (such as House, street, sufdir, predir, pretyp, suftype, qualif), i.e. ARC_C plus CIVIC_C. These rules are not used in the build phase.
MICRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
- HOUSE
 - 
     
is text (token number 1): This is the street number on a street. Example: 75 in 75 State Street.
 - predir
 - 
     
is text (token number 2): STREET NAME PRE-DIRECTIONAL such as North, South, East, West etc.
 - qual
 - 
     
is text (token number 3): STREET NAME PRE-MODIFIER. Example: OLD in 3715 OLD HIGHWAY 99.
 - pretype
 - 
     
is text (token number 4): STREET PREFIX TYPE
 - street
 - 
     
is text (token number 5): STREET NAME
 - suftype
 - 
     
is text (token number 6): STREET POST TYPE e.g. St, Ave, Cir. A street type following the root street name. Example: STREET in 75 State Street.
 - sufdir
 - 
     
is text (token number 7): STREET POST-DIRECTIONAL. A directional modifier that follows the street name. Example: WEST in 3715 TENTH AVENUE WEST.
ARC_C
(token number = "2"). The class of rules for parsing MICRO clauses, excluding the HOUSE attribute. As such, it uses the same set of output tokens as MICRO_C minus the HOUSE token.
CIVIC_C
(token number = "3"). The class of rules for parsing the HOUSE attribute.
EXTRA_C
(token number = "4"). The class of rules for parsing EXTRA attributes, that is, attributes excluded from geocoding. These rules are not used in the build phase.
EXTRA_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
- BLDNG
 - 
     
(token number 0): Unparsed building identifiers and types.
 - BOXH
 - 
     
(token number 14): The BOX in BOX 3B
 - BOXT
 - 
     
(token number 15): The 3B in BOX 3B
 - RR
 - 
     
(token number 8): The RR in RR 7
 - UNITH
 - 
     
(token number 16): The APT in APT 3B
 - UNITT
 - 
     
(token number 17): The 3B in APT 3B
 - UNKNWN
 - 
     
(token number 9): An otherwise unclassified output.
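The per-type output token lists above lend themselves to a simple consistency check on a rule. The following Python sketch is an illustration only: `ALLOWED_OUTPUTS` and `invalid_outputs` are hypothetical names, the sets are transcribed from the lists on this page, and the CIVIC_C set (HOUSE only) is an assumption inferred from its one-line description rather than something this page states explicitly.

```python
# Output token numbers permitted per rule type, transcribed from this page.
ALLOWED_OUTPUTS = {
    0: {10, 11, 12, 13},             # MACRO_C: CITY, STATE, NATION, POSTAL
    1: {1, 2, 3, 4, 5, 6, 7},        # MICRO_C: HOUSE through sufdir
    2: {2, 3, 4, 5, 6, 7},           # ARC_C: MICRO_C minus HOUSE
    3: {1},                          # CIVIC_C: HOUSE only (assumption)
    4: {0, 8, 9, 14, 15, 16, 17},    # EXTRA_C: BLDNG, RR, UNKNWN, BOXH, BOXT, UNITH, UNITT
}


def invalid_outputs(output_tokens, rule_type):
    """Return the output tokens that are not listed for the given rule type."""
    allowed = ALLOWED_OUTPUTS[rule_type]
    return [t for t in output_tokens if t not in allowed]


# Outputs of the example rule "2 0 2 22 3 -1 5 5 6 7 3 -1 2 6" (an ARC_C rule).
print(invalid_outputs([5, 5, 6, 7, 3], 2))  # prints: []
# HOUSE (1) is not listed for ARC_C.
print(invalid_outputs([1, 5, 6], 2))        # prints: [1]
```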