rules table
Name
rules table — The rules table contains a set of rules that map address input sequence tokens to a standardized output sequence. A rule is defined as a set of input tokens followed by -1 (terminator), followed by a set of output tokens, followed by -1, followed by a number denoting the kind of rule, followed by the ranking of the rule.
Description
A rules table must have at least the following columns, though you are allowed to add more for your own uses.
-
id -
Primary key of table
-
rule -
Text field denoting the rule. Details at PAGC Address Standardizer Rule records.
A rule consists of a set of non-negative integers representing input tokens, terminated by a -1, followed by an equal number of non-negative integers representing postal attributes, terminated by a -1, followed by an integer representing a rule type, followed by an integer representing the rank of the rule. The rules are ranked from 0 (lowest) to 17 (highest).
For example, the rule
2 0 2 22 3 -1 5 5 6 7 3 -1 2 6 maps the sequence of input tokens TYPE NUMBER TYPE DIRECT QUALIF to the output sequence STREET STREET SUFTYP SUFDIR QUALIF. The rule is an ARC_C rule of rank 6. Numbers for corresponding output tokens are listed in stdaddr.
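The layout described above can be sketched as a small parser. This is an illustration in Python only, not the standardizer's own code. The names `Rule` and `parse_rule` are hypothetical and were chosen for this example.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical container for the four parts of a rule."""
    input_tokens: List[int]
    output_tokens: List[int]
    rule_type: int
    rank: int


def parse_rule(text: str) -> Rule:
    """Split a rule string into input tokens, output tokens, type and rank."""
    numbers = [int(part) for part in text.split()]
    # The first -1 ends the input tokens, the second -1 ends the output tokens.
    first = numbers.index(-1)
    second = numbers.index(-1, first + 1)
    input_tokens = numbers[:first]
    output_tokens = numbers[first + 1:second]
    tail = numbers[second + 1:]
    if len(tail) != 2:
        raise ValueError("expected a rule type and a rank after the second -1")
    if len(input_tokens) != len(output_tokens):
        raise ValueError("input and output token counts differ")
    rule_type, rank = tail
    return Rule(input_tokens, output_tokens, rule_type, rank)


rule = parse_rule("2 0 2 22 3 -1 5 5 6 7 3 -1 2 6")
print(rule.input_tokens)   # [2, 0, 2, 22, 3]
print(rule.output_tokens)  # [5, 5, 6, 7, 3]
print(rule.rule_type, rule.rank)  # 2 6
```

The length check reflects the statement that a rule carries an equal number of input and output tokens.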
Input Tokens
Each rule starts with a set of input tokens followed by the terminator -1. Valid input tokens, excerpted from PAGC Input Tokens, are as follows:
Form-Based Input Tokens
-
AMPERS -
(13). The ampersand (&) is frequently used to abbreviate the word "and".
-
DASH -
(9). A punctuation character.
-
DOUBLE -
(21). A sequence of two letters. Often used as identifiers.
-
FRACT -
(25). Fractions are sometimes used in civic numbers or unit numbers.
-
MIXED -
(23). An alphanumeric string that contains both letters and digits. Used for identifiers.
-
NUMBER -
(0). A string of digits.
-
ORD -
(15). Representations such as First or 1st. Often used in street names.
-
SINGLE -
(18). A single letter.
-
WORD -
(1). A word is a string of letters of arbitrary length. A single letter can be both a SINGLE and a WORD.
Function-based Input Tokens
-
BOXH -
(14). Words used to denote post office boxes. For example: Box or PO Box.
-
BUILDH -
(19). Words used to denote buildings or building complexes, usually as a prefix. For example: Tower in Tower 7A.
-
BUILDT -
(24). Words and abbreviations used to denote buildings or building complexes, usually as a suffix. For example: Shopping Centre.
-
DIRECT -
(22). Words used to denote directions. For example: North.
-
MILE -
(20). Words used to denote milepost addresses.
-
ROAD -
(6). Words and abbreviations used to denote highways and roads. For example: the Interstate in Interstate 5.
-
RR -
(8). Words and abbreviations used to denote rural routes. For example: RR.
-
TYPE -
(2). Words and abbreviations used to denote street types. For example: ST or AVE.
-
UNITH -
(16). Words and abbreviations used to denote internal subaddresses. For example: APT or UNIT.
Postal Type Input Tokens
-
QUINT -
(28). A 5 digit number. Identifies a ZIP Code.
-
QUAD -
(29). A 4 digit number. Identifies ZIP4.
-
PCH -
(27). A 3 character sequence of letter number letter. Identifies an FSA, the first 3 characters of a Canadian postal code.
-
PCT -
(26). A 3 character sequence of number letter number. Identifies an LDU, the last 3 characters of a Canadian postal code.
Stopwords
STOPWORDs combine with WORDs. In rules, a string of multiple WORDs and STOPWORDs is represented by a single WORD token.
-
STOPWORD -
(7). A word with low lexical significance that can be omitted in parsing. For example: THE.
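When writing rules by hand it can help to work with token names and translate them to numbers at the end. The following Python sketch copies the numbers from the lists above into a lookup table. `INPUT_TOKENS` and `encode_input` are hypothetical names for this example and are not part of the address standardizer.

```python
# Input token numbers as listed above (form-based, function-based,
# postal type, and stopword tokens).
INPUT_TOKENS = {
    "NUMBER": 0, "WORD": 1, "TYPE": 2, "ROAD": 6, "STOPWORD": 7,
    "RR": 8, "DASH": 9, "AMPERS": 13, "BOXH": 14, "ORD": 15,
    "UNITH": 16, "SINGLE": 18, "BUILDH": 19, "MILE": 20, "DOUBLE": 21,
    "DIRECT": 22, "MIXED": 23, "BUILDT": 24, "FRACT": 25, "PCT": 26,
    "PCH": 27, "QUINT": 28, "QUAD": 29,
}


def encode_input(names):
    """Return the input half of a rule as text, including the -1 terminator."""
    numbers = [INPUT_TOKENS[name] for name in names]
    return " ".join(str(n) for n in numbers + [-1])


print(encode_input(["NUMBER", "WORD", "TYPE"]))  # 0 1 2 -1
```

An unknown name raises `KeyError`, which is a deliberate choice here so that typos in token names surface immediately.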
Output Tokens
After the first -1 (terminator) come the output tokens in their order, followed by the terminator -1. Numbers for corresponding output tokens are listed in stdaddr. Which output tokens are allowed depends on the kind of rule. Output tokens valid for each rule type are listed in the section called “Rule Types and Rank”.
Rule Types and Rank
The final part of the rule is the rule type which is denoted by one of the following, followed by a rule rank. The rules are ranked from 0 (lowest) to 17 (highest).
MACRO_C
(token number = "0"). The class of rules for parsing MACRO clauses such as PLACE STATE ZIP.
MACRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
-
CITY -
(token number "10"). Example "Albany"
-
STATE -
(token number "11"). Example "NY"
-
NATION -
(token number "12"). This attribute is not used in most reference files. Example "USA"
-
POSTAL -
(token number "13"). (SADS elements "ZIP CODE" , "PLUS 4" ). This attribute is used for both the US Zip and the Canadian Postal Codes.
MICRO_C
(token number = "1"). The class of rules for parsing full MICRO clauses (such as House, street, sufdir, predir, pretyp, suftype, qualif), i.e. ARC_C plus CIVIC_C. These rules are not used in the build phase.
MICRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
-
HOUSE -
is text (token number 1): This is the street number on a street. Example 75 in 75 State Street.
-
predir -
is text (token number 2): STREET NAME PRE-DIRECTIONAL such as North, South, East, West etc.
-
qual -
is text (token number 3): STREET NAME PRE-MODIFIER. Example OLD in 3715 OLD HIGHWAY 99.
-
pretype -
is text (token number 4): STREET PREFIX TYPE
-
street -
is text (token number 5): STREET NAME
-
suftype -
is text (token number 6): STREET POST TYPE e.g. St, Ave, Cir. A street type following the root street name. Example STREET in 75 State Street.
-
sufdir -
is text (token number 7): STREET POST-DIRECTIONAL. A directional modifier that follows the street name. Example WEST in 3715 TENTH AVENUE WEST.
ARC_C
(token number = "2"). The class of rules for parsing MICRO clauses, excluding the HOUSE attribute. As such, it uses the same set of output tokens as MICRO_C minus the HOUSE token.
CIVIC_C
(token number = "3"). The class of rules for parsing the HOUSE attribute.
EXTRA_C
(token number = "4"). The class of rules for parsing EXTRA attributes, that is, attributes excluded from geocoding. These rules are not used in the build phase.
EXTRA_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--):
-
BLDNG -
(token number 0): Unparsed building identifiers and types.
-
BOXH -
(token number 14): The BOX in BOX 3B
-
BOXT -
(token number 15): The 3B in BOX 3B
-
RR -
(token number 8): The RR in RR 7
-
UNITH -
(token number 16): The APT in APT 3B
-
UNITT -
(token number 17): The 3B in APT 3B
-
UNKNWN -
(token number 9): An otherwise unclassified output.
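The relationship between rule types, allowed output tokens, and rank can be summarized as a simple consistency check. The Python sketch below is one reading of the lists above, not the standardizer's own validation. `RULE_TYPES`, `VALID_OUTPUTS`, and `check_rule` are hypothetical names. The set for CIVIC_C is an assumption inferred from the statement that it parses the HOUSE attribute, and the set for ARC_C is the MICRO_C set minus HOUSE.

```python
RULE_TYPES = {0: "MACRO_C", 1: "MICRO_C", 2: "ARC_C", 3: "CIVIC_C", 4: "EXTRA_C"}

# Output token numbers per rule type, copied from the lists above.
VALID_OUTPUTS = {
    0: {10, 11, 12, 13},          # CITY, STATE, NATION, POSTAL
    1: {1, 2, 3, 4, 5, 6, 7},     # HOUSE plus the street attributes
    2: {2, 3, 4, 5, 6, 7},        # MICRO_C minus HOUSE
    3: {1},                       # assumption: HOUSE only
    4: {0, 8, 9, 14, 15, 16, 17}, # BLDNG, RR, UNKNWN, BOXH, BOXT, UNITH, UNITT
}


def check_rule(output_tokens, rule_type, rank):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not 0 <= rank <= 17:
        problems.append(f"rank {rank} is outside 0..17")
    if rule_type not in RULE_TYPES:
        problems.append(f"unknown rule type {rule_type}")
        return problems
    unexpected = sorted(set(output_tokens) - VALID_OUTPUTS[rule_type])
    if unexpected:
        name = RULE_TYPES[rule_type]
        problems.append(f"output tokens {unexpected} are not listed for {name}")
    return problems


# The ARC_C example rule of rank 6 from the Description passes the check.
print(check_rule([5, 5, 6, 7, 3], rule_type=2, rank=6))  # []
```

Existing rule files may use combinations that this sketch does not anticipate, so treat a reported problem as a prompt to look at the rule rather than as proof that it is invalid.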