rules table
Name
rules table — The rules table contains a set of rules that map address input sequence tokens to a standardized output sequence. A rule is defined as a set of input tokens followed by -1 (terminator), followed by a set of output tokens, followed by -1, followed by a number denoting the kind of rule, followed by the ranking of the rule.
Description
A rules table must have at least the following columns, though you are allowed to add more for your own uses.
-
id
-
Primary key of table
-
rule
-
Text field denoting the rule. Details at PAGC Address Standardizer Rule records.
A rule consists of a set of non-negative integers representing input tokens, terminated by a -1, followed by an equal number of non-negative integers representing postal attributes, terminated by a -1, followed by an integer representing a rule type, followed by an integer representing the rank of the rule. The rules are ranked from 0 (lowest) to 17 (highest).
So for example the rule
2 0 2 22 3 -1 5 5 6 7 3 -1 2 6
maps the sequence of input tokens TYPE NUMBER TYPE DIRECT QUALIF to the output sequence STREET STREET SUFTYP SUFDIR QUALIF. The rule is an ARC_C rule of rank 6. Numbers for corresponding output tokens are listed in stdaddr.
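The layout described above can be illustrated by splitting a rule string at its two -1 terminators. This is a minimal sketch, not part of the address standardizer itself: the function name `parse_rule` is hypothetical, and it assumes the rule is stored as whitespace-separated integers, as in the example.

```python
def parse_rule(rule_text):
    """Split a rule string into (input_tokens, output_tokens, rule_type, rank).

    Hypothetical helper for illustration; assumes whitespace-separated integers.
    """
    numbers = [int(part) for part in rule_text.split()]

    # The first -1 ends the input tokens, the second -1 ends the output tokens.
    first_end = numbers.index(-1)
    second_end = numbers.index(-1, first_end + 1)

    input_tokens = numbers[:first_end]
    output_tokens = numbers[first_end + 1:second_end]
    tail = numbers[second_end + 1:]

    if len(tail) != 2:
        raise ValueError("expected a rule type and a rank after the second -1")
    if len(input_tokens) != len(output_tokens):
        raise ValueError("input and output sequences must have the same length")

    rule_type, rank = tail
    return input_tokens, output_tokens, rule_type, rank


print(parse_rule("2 0 2 22 3 -1 5 5 6 7 3 -1 2 6"))
# ([2, 0, 2, 22, 3], [5, 5, 6, 7, 3], 2, 6)
```

Here the rule type 2 and rank 6 correspond to the ARC_C rule of rank 6 from the example.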
Input Tokens
Each rule starts with a set of input tokens followed by a terminator -1. Valid input tokens, excerpted from PAGC Input Tokens, are as follows:
Form-Based Input Tokens
-
AMPERS
-
(13). The ampersand (&) is frequently used to abbreviate the word "and".
-
DASH
-
(9). A punctuation character.
-
DOUBLE
-
(21). A sequence of two letters. Often used as identifiers.
-
FRACT
-
(25). Fractions are sometimes used in civic numbers or unit numbers.
-
MIXED
-
(23). An alphanumeric string that contains both letters and digits. Used for identifiers.
-
NUMBER
-
(0). A string of digits.
-
ORD
-
(15). Representations such as First or 1st. Often used in street names.
-
SINGLE
-
(18). A single letter.
-
WORD
-
(1). A word is a string of letters of arbitrary length. A single letter can be both a SINGLE and a WORD.
Function-based Input Tokens
-
BOXH
-
(14). Words used to denote post office boxes. For example: Box or PO Box.
-
BUILDH
-
(19). Words used to denote buildings or building complexes, usually as a prefix. For example: Tower in Tower 7A.
-
BUILDT
-
(24). Words and abbreviations used to denote buildings or building complexes, usually as a suffix. For example: Shopping Centre.
-
DIRECT
-
(22). Words used to denote directions. For example: North.
-
MILE
-
(20). Words used to denote milepost addresses.
-
ROAD
-
(6). Words and abbreviations used to denote highways and roads. For example: Interstate in Interstate 5.
-
RR
-
(8). Words and abbreviations used to denote rural routes. For example: RR.
-
TYPE
-
(2). Words and abbreviations used to denote street types. For example: ST or AVE.
-
UNITH
-
(16). Words and abbreviations used to denote internal subaddresses. For example: APT or UNIT.
Postal Type Input Tokens
-
QUINT
-
(28). A 5 digit number. Identifies a ZIP Code.
-
QUAD
-
(29). A 4 digit number. Identifies ZIP4.
-
PCH
-
(27). A 3 character sequence of letter number letter. Identifies an FSA, the first 3 characters of a Canadian postal code.
-
PCT
-
(26). A 3 character sequence of number letter number. Identifies an LDU, the last 3 characters of a Canadian postal code.
Stopwords
STOPWORDs combine with WORDs. In rules, a string of multiple WORDs and STOPWORDs is represented by a single WORD token.
-
STOPWORD
-
(7). A word with low lexical significance that can be omitted in parsing. For example: THE.
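To make the token numbers above easier to read back, the following sketch maps them to their names. It is illustrative only: `INPUT_TOKENS` and `name_input_tokens` are hypothetical names, the table contains only the tokens listed on this page, and any other number is reported as unlisted rather than guessed.

```python
# Input token numbers as listed on this page (form-based, function-based,
# postal type, and stopword tokens). 18 is SINGLE, a single letter.
INPUT_TOKENS = {
    0: "NUMBER", 1: "WORD", 2: "TYPE", 6: "ROAD", 7: "STOPWORD", 8: "RR",
    9: "DASH", 13: "AMPERS", 14: "BOXH", 15: "ORD", 16: "UNITH",
    18: "SINGLE", 19: "BUILDH", 20: "MILE", 21: "DOUBLE", 22: "DIRECT",
    23: "MIXED", 24: "BUILDT", 25: "FRACT", 26: "PCT", 27: "PCH",
    28: "QUINT", 29: "QUAD",
}


def name_input_tokens(token_numbers):
    """Return the token name for each number, flagging numbers not listed here."""
    return [INPUT_TOKENS.get(n, f"UNLISTED({n})") for n in token_numbers]


# A hypothetical input sequence: NUMBER DIRECT WORD TYPE.
print(name_input_tokens([0, 22, 1, 2]))
# ['NUMBER', 'DIRECT', 'WORD', 'TYPE']
```

Because the list on this page is an excerpt, the fallback label keeps unlisted numbers visible instead of silently dropping them.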
Output Tokens
After the first -1 (terminator) come the output tokens in order, followed by a second terminator -1. Numbers for corresponding output tokens are listed in stdaddr. Which output tokens are allowed depends on the kind of rule. Output tokens valid for each rule type are listed in the section called “Rule Types and Rank”.
Rule Types and Rank
The final part of the rule is the rule type which is denoted by one of the following, followed by a rule rank. The rules are ranked from 0 (lowest) to 17 (highest).
MACRO_C
(token number = "0"). The class of rules for parsing MACRO clauses such as PLACE STATE ZIP.
MACRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--).
-
CITY
-
(token number "10"). Example "Albany"
-
STATE
-
(token number "11"). Example "NY"
-
NATION
-
(token number "12"). This attribute is not used in most reference files. Example "USA"
-
POSTAL
-
(token number "13"). (SADS elements "ZIP CODE", "PLUS 4"). This attribute is used for both US ZIP Codes and Canadian postal codes.
MICRO_C
(token number = "1"). The class of rules for parsing full MICRO clauses (such as House, street, sufdir, predir, pretyp, suftype, qualif), i.e. ARC_C plus CIVIC_C. These rules are not used in the build phase.
MICRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--).
-
HOUSE
-
is text (token number 1): This is the street number on a street. Example: 75 in 75 State Street.
-
predir
-
is text (token number 2): STREET NAME PRE-DIRECTIONAL such as North, South, East, West etc.
-
qual
-
is text (token number 3): STREET NAME PRE-MODIFIER. Example: OLD in 3715 OLD HIGHWAY 99.
-
pretype
-
is text (token number 4): STREET PREFIX TYPE
-
street
-
is text (token number 5): STREET NAME
-
suftype
-
is text (token number 6): STREET POST TYPE e.g. St, Ave, Cir. A street type following the root street name. Example: STREET in 75 State Street.
-
sufdir
-
is text (token number 7): STREET POST-DIRECTIONAL. A directional modifier that follows the street name. Example: WEST in 3715 TENTH AVENUE WEST.
ARC_C
(token number = "2"). The class of rules for parsing MICRO clauses, excluding the HOUSE attribute. As such, it uses the same set of output tokens as MICRO_C minus the HOUSE token.
CIVIC_C
(token number = "3"). The class of rules for parsing the HOUSE attribute.
EXTRA_C
(token number = "4"). The class of rules for parsing EXTRA attributes, i.e. attributes excluded from geocoding. These rules are not used in the build phase.
EXTRA_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--).
-
BLDNG
-
(token number 0): Unparsed building identifiers and types.
-
BOXH
-
(token number 14): The BOX in BOX 3B
-
BOXT
-
(token number 15): The 3B in BOX 3B
-
RR
-
(token number 8): The RR in RR 7
-
UNITH
-
(token number 16): The APT in APT 3B
-
UNITT
-
(token number 17): The 3B in APT 3B
-
UNKNWN
-
(token number 9): An otherwise unclassified output.
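The rule types, their output tokens, and the rank range described in this section can be combined into a simple consistency check. The sketch below is illustrative only: `RULE_TYPES`, `ALLOWED_OUTPUTS`, and `check_rule` are hypothetical names, the allowed sets are transcribed from the lists on this page, and treating HOUSE (1) as the only CIVIC_C output is an assumption drawn from its one-line description.

```python
RULE_TYPES = {0: "MACRO_C", 1: "MICRO_C", 2: "ARC_C", 3: "CIVIC_C", 4: "EXTRA_C"}

# Output token numbers per rule type, as listed in this section.
ALLOWED_OUTPUTS = {
    "MACRO_C": {10, 11, 12, 13},           # CITY, STATE, NATION, POSTAL
    "MICRO_C": {1, 2, 3, 4, 5, 6, 7},      # HOUSE through sufdir
    "ARC_C": {2, 3, 4, 5, 6, 7},           # MICRO_C minus HOUSE
    "CIVIC_C": {1},                        # assumption: HOUSE only
    "EXTRA_C": {0, 8, 9, 14, 15, 16, 17},  # BLDNG, RR, UNKNWN, BOXH, BOXT, UNITH, UNITT
}


def check_rule(output_tokens, rule_type, rank):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    type_name = RULE_TYPES.get(rule_type)
    if type_name is None:
        problems.append(f"unknown rule type {rule_type}")
    else:
        unexpected = [t for t in output_tokens if t not in ALLOWED_OUTPUTS[type_name]]
        if unexpected:
            problems.append(f"output tokens {unexpected} not listed for {type_name}")
    if not 0 <= rank <= 17:
        problems.append(f"rank {rank} is outside 0 (lowest) to 17 (highest)")
    return problems


# The example rule from the top of this page: ARC_C, rank 6.
print(check_rule([5, 5, 6, 7, 3], rule_type=2, rank=6))
# []
```

A check like this only reflects what this page lists; the PAGC documentation remains the reference for which combinations the standardizer actually accepts.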