rules table
Name
rules table - The rules table contains a set of rules that map a sequence of address input tokens to a standardized output sequence. A rule is defined as a set of input tokens followed by -1 (terminator), followed by a set of output tokens, followed by -1, followed by a number denoting the kind of rule, followed by the ranking of the rule.
Description
A rules table must have at least the following columns, though you are allowed to add more for your own uses.
- id
- 
     Primary key of table 
- rule
- 
     Text field denoting the rule. Details at PAGC Address Standardizer Rule records. A rule consists of a set of non-negative integers representing input tokens, terminated by a -1, followed by an equal number of non-negative integers representing postal attributes, terminated by a -1, followed by an integer representing a rule type, followed by an integer representing the rank of the rule. The rules are ranked from 0 (lowest) to 17 (highest). So for example, the rule 2 0 2 22 3 -1 5 5 6 7 3 -1 2 6 maps the sequence of input tokens TYPE NUMBER TYPE DIRECT QUALIF to the output sequence STREET STREET SUFTYP SUFDIR QUALIF. The rule is an ARC_C rule of rank 6. Numbers for corresponding output tokens are listed in stdaddr.
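The layout described above can be sketched as a small parser. This is only an illustration of the record format, not the standardizer's own code; the names `Rule` and `parse_rule` are hypothetical and do not come from PostGIS or PAGC.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical container for one parsed rule record."""
    input_tokens: List[int]
    output_tokens: List[int]
    rule_type: int
    rank: int


def parse_rule(text: str) -> Rule:
    """Split a rule string into inputs, outputs, rule type and rank."""
    numbers = [int(part) for part in text.split()]
    # list.index raises ValueError if a terminator is missing.
    first = numbers.index(-1)
    second = numbers.index(-1, first + 1)
    inputs = numbers[:first]
    outputs = numbers[first + 1:second]
    tail = numbers[second + 1:]
    if len(tail) != 2:
        raise ValueError("expected a rule type and a rank after the second -1")
    if len(inputs) != len(outputs):
        raise ValueError("input and output token counts must be equal")
    if any(token < 0 for token in inputs + outputs):
        raise ValueError("tokens must be non-negative integers")
    rule_type, rank = tail
    if not 0 <= rank <= 17:
        raise ValueError("rank must be between 0 and 17")
    return Rule(inputs, outputs, rule_type, rank)


# The example rule from the text: an ARC_C rule (type 2) of rank 6.
rule = parse_rule("2 0 2 22 3 -1 5 5 6 7 3 -1 2 6")
print(rule)
```

Checking the token counts and the rank range up front catches malformed records before they are loaded into a rules table.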
Input Tokens
   Each rule starts with a set of input tokens followed by a terminator -1. Valid input tokens, excerpted from PAGC Input Tokens, are as follows:
  
Form-Based Input Tokens
- AMPERS
- 
     (13). The ampersand (&) is frequently used to abbreviate the word "and". 
- DASH
- 
     (9). A punctuation character. 
- DOUBLE
- 
     (21). A sequence of two letters. Often used as identifiers. 
- FRACT
- 
     (25). Fractions are sometimes used in civic numbers or unit numbers. 
- MIXED
- 
     (23). An alphanumeric string that contains both letters and digits. Used for identifiers. 
- NUMBER
- 
     (0). A string of digits. 
- ORD
- 
     (15). Representations such as First or 1st. Often used in street names. 
- SINGLE
- 
     (18). A single letter. 
- WORD
- 
     (1). A word is a string of letters of arbitrary length. A single letter can be both a SINGLE and a WORD. 
Function-based Input Tokens
- BOXH
- 
     (14). Words used to denote post office boxes. For example Box or PO Box . 
- BUILDH
- 
     (19). Words used to denote buildings or building complexes, usually as a prefix. For example: Tower in Tower 7A . 
- BUILDT
- 
     (24). Words and abbreviations used to denote buildings or building complexes, usually as a suffix. For example: Shopping Centre . 
- DIRECT
- 
     (22). Words used to denote directions, for example North . 
- MILE
- 
     (20). Words used to denote milepost addresses. 
- ROAD
- 
     (6). Words and abbreviations used to denote highways and roads. For example: the Interstate in Interstate 5.
- RR
- 
     (8). Words and abbreviations used to denote rural routes. For example: RR.
- TYPE
- 
     (2). Words and abbreviations used to denote street types. For example: ST or AVE.
- UNITH
- 
     (16). Words and abbreviations used to denote internal subaddresses. For example: APT or UNIT.
Postal Type Input Tokens
- QUINT
- 
     (28). A 5 digit number. Identifies a ZIP Code.
- QUAD
- 
     (29). A 4 digit number. Identifies ZIP4. 
- PCH
- 
     (27). A 3 character sequence of letter number letter. Identifies an FSA, the first 3 characters of a Canadian postal code. 
- PCT
- 
     (26). A 3 character sequence of number letter number. Identifies an LDU, the last 3 characters of a Canadian postal code. 
Stopwords
STOPWORDs combine with WORDs. In rules, a string of multiple WORDs and STOPWORDs is represented by a single WORD token.
- STOPWORD
- 
     (7). A word with low lexical significance, that can be omitted in parsing. For example: THE . 
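To illustrate how WORD and STOPWORD runs appear to a rule, the sketch below collapses any run of two or more consecutive WORD (1) and STOPWORD (7) tokens into a single WORD. The function name `collapse_words` is hypothetical, and the handling of a lone STOPWORD (left unchanged here) is an assumption; the text above only covers runs of multiple tokens. This is not the standardizer's implementation.

```python
from itertools import groupby
from typing import List

WORD = 1      # input token number for WORD
STOPWORD = 7  # input token number for STOPWORD


def collapse_words(tokens: List[int]) -> List[int]:
    """Replace each run of 2+ WORD/STOPWORD tokens with one WORD token."""
    collapsed: List[int] = []
    for is_wordlike, run in groupby(tokens, key=lambda t: t in (WORD, STOPWORD)):
        run_tokens = list(run)
        if is_wordlike and len(run_tokens) > 1:
            collapsed.append(WORD)
        else:
            # Assumption: single tokens and non-word tokens pass through as-is.
            collapsed.extend(run_tokens)
    return collapsed


# NUMBER STOPWORD WORD WORD TYPE is seen by a rule as NUMBER WORD TYPE.
print(collapse_words([0, 7, 1, 1, 2]))
```

A rule written for an address such as "12 THE OLD MILL ST" would therefore list one WORD token for the street name rather than one token per word.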
Output Tokens
   After the first -1 (terminator) come the output tokens, in order, followed by a terminator -1. Numbers for corresponding output tokens are listed in stdaddr. Which output tokens are allowed depends on the kind of rule. Output tokens valid for each rule type are listed in the section called “Rule Types and Rank”.
  
Rule Types and Rank
The final part of the rule is the rule type, which is denoted by one of the following numbers, followed by a rule rank. The rules are ranked from 0 (lowest) to 17 (highest).
MACRO_C
(token number = "0"). The class of rules for parsing MACRO clauses such as PLACE STATE ZIP.
MACRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
- CITY
- 
     (token number "10"). Example "Albany" 
- STATE
- 
     (token number "11"). Example "NY" 
- NATION
- 
     (token number "12"). This attribute is not used in most reference files. Example "USA" 
- POSTAL
- 
     (token number "13"). (SADS elements "ZIP CODE" , "PLUS 4" ). This attribute is used for both the US Zip and the Canadian Postal Codes. 
MICRO_C
(token number = "1"). The class of rules for parsing full MICRO clauses (such as house, street, sufdir, predir, pretyp, suftype, qualif), i.e. ARC_C plus CIVIC_C. These rules are not used in the build phase.
MICRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
- HOUSE
- 
     is text (token number 1): This is the street number on a street. Example: 75 in 75 State Street.
- predir
- 
     is text (token number 2): STREET NAME PRE-DIRECTIONAL such as North, South, East, West etc.
- qual
- 
     is text (token number 3): STREET NAME PRE-MODIFIER. Example: OLD in 3715 OLD HIGHWAY 99.
- pretype
- 
     is text (token number 4): STREET PREFIX TYPE
- street
- 
     is text (token number 5): STREET NAME
- suftype
- 
     is text (token number 6): STREET POST TYPE, e.g. St, Ave, Cir. A street type following the root street name. Example: STREET in 75 State Street.
- sufdir
- 
     is text (token number 7): STREET POST-DIRECTIONAL. A directional modifier that follows the street name. Example: WEST in 3715 TENTH AVENUE WEST.
ARC_C
(token number = "2"). The class of rules for parsing MICRO clauses, excluding the HOUSE attribute. As such, it uses the same set of output tokens as MICRO_C, minus the HOUSE token.
CIVIC_C
(token number = "3"). The class of rules for parsing the HOUSE attribute.
EXTRA_C
(token number = "4"). The class of rules for parsing EXTRA attributes, that is, attributes excluded from geocoding. These rules are not used in the build phase.
EXTRA_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
- BLDNG
- 
     (token number 0): Unparsed building identifiers and types.
- BOXH
- 
     (token number 14): The BOX in BOX 3B
- BOXT
- 
     (token number 15): The 3B in BOX 3B
- RR
- 
     (token number 8): The RR in RR 7
- UNITH
- 
     (token number 16): The APT in APT 3B
- UNITT
- 
     (token number 17): The 3B in APT 3B
- UNKNWN
- 
     (token number 9): An otherwise unclassified output.
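The output token sets listed for each rule type can be turned into a simple consistency check. The sketch below is an illustration only: `ALLOWED_OUTPUTS` and `invalid_outputs` are hypothetical names, and the set used for CIVIC_C (HOUSE only) is an assumption inferred from its description, since no explicit token list is given for it here.

```python
from typing import Dict, List, Set

# Rule type numbers as listed above.
MACRO_C, MICRO_C, ARC_C, CIVIC_C, EXTRA_C = 0, 1, 2, 3, 4

# HOUSE, predir, qual, pretype, street, suftype, sufdir
MICRO_TOKENS: Set[int] = {1, 2, 3, 4, 5, 6, 7}

ALLOWED_OUTPUTS: Dict[int, Set[int]] = {
    MACRO_C: {10, 11, 12, 13},              # CITY, STATE, NATION, POSTAL
    MICRO_C: MICRO_TOKENS,
    ARC_C: MICRO_TOKENS - {1},              # MICRO_C minus HOUSE
    CIVIC_C: {1},                           # assumption: HOUSE only
    EXTRA_C: {0, 8, 9, 14, 15, 16, 17},     # BLDNG, RR, UNKNWN, BOXH, BOXT, UNITH, UNITT
}


def invalid_outputs(output_tokens: List[int], rule_type: int) -> List[int]:
    """Return the output tokens that are not listed for the given rule type."""
    allowed = ALLOWED_OUTPUTS[rule_type]
    return [token for token in output_tokens if token not in allowed]


# Outputs of the ARC_C example rule 2 0 2 22 3 -1 5 5 6 7 3 -1 2 6.
print(invalid_outputs([5, 5, 6, 7, 3], ARC_C))
```

Running such a check over custom rules before inserting them can flag, for instance, an ARC_C rule that tries to emit the HOUSE token.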