Chapter 12. Address Standardizer
This is a fork of the PAGC standardizer (original code for this portion was PAGC PostgreSQL Address Standardizer).
The address standardizer is a single-line address parser that takes an input address and normalizes it based on a set of rules stored in a rules table, with the help of lex and gaz tables.
The code is built into a single PostgreSQL extension library called address_standardizer, which can be installed with:
CREATE EXTENSION address_standardizer;
In addition to the address_standardizer extension, a sample data extension called address_standardizer_data_us is built, which contains gaz, lex, and rules tables for US data. This extension can be installed via:
CREATE EXTENSION address_standardizer_data_us;
The code for this extension can be found in the PostGIS extensions/address_standardizer directory and is currently self-contained.
For installation instructions refer to: Section 2.8, “Installing and Using the address standardizer”.
The parser works from right to left, looking first at the macro elements (postcode, state/province, city) and then at the micro elements to determine whether it is dealing with a house number and street, an intersection, or a landmark. It currently does not look for a country code or name, but that could be introduced in the future.
- Country code: Assumed to be US or CA, based on whether the postcode is a US or Canadian one, or whether the state/province is a US or Canadian one; otherwise US.
- Postcode/zipcode: These are recognized using Perl-compatible regular expressions. These regexes are currently in parseaddress-api.c and are relatively simple to change if needed.
- State/province: These are recognized using Perl-compatible regular expressions. These regexes are currently in parseaddress-api.c but could be moved into include files in the future for easier maintenance.
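As a rough illustration of the macro and micro split described above, the query below runs parse_address on a single-line address and selects some of the parts it returns. It assumes the address_standardizer extension is already installed; the address literal is a made-up sample.

```sql
-- Assumes CREATE EXTENSION address_standardizer; has already been run.
-- The address literal is an illustrative sample only.
SELECT num, street, city, state, zip, zipplus, country
FROM parse_address('1 Devonshire Place, Boston, MA 02109-1234');
```

The city, state, zip, and zipplus columns correspond to the macro elements that are picked off from the right-hand end of the string, while num and street hold the micro elements.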
- stdaddr: A composite type that consists of the elements of an address. This is the return type for the standardize_address function.
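Because stdaddr is a composite type, its individual elements can be read with the usual PostgreSQL field-access syntax. The sketch below assumes that both address_standardizer and address_standardizer_data_us are installed (so that the us_lex, us_gaz, and us_rules tables exist) and uses a made-up sample address.

```sql
-- Assumes address_standardizer and address_standardizer_data_us are installed.
-- The address literal is an illustrative sample only.
SELECT (s).house_num, (s).name, (s).suftype,
       (s).city, (s).state, (s).postcode
FROM (
    SELECT standardize_address('us_lex', 'us_gaz', 'us_rules',
                               '1 Devonshire Place, Boston, MA 02109') AS s
) AS t;
```

Using (s).* instead of the explicit field list would expand every element of the composite value into its own column.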
This section lists the PostgreSQL table formats used by the address_standardizer for normalizing addresses. Note that these tables do not need to be named the same as what is referenced here. For example, you can have different lex, gaz, and rules tables for each country or for your custom geocoder. The names of these tables are passed into the address standardizer functions.
The packaged extension address_standardizer_data_us contains data for standardizing US addresses.
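Since the table names are plain text arguments, a customized table can be swapped in without touching the packaged data. The following is a minimal sketch, assuming address_standardizer_data_us is installed; my_lex is a hypothetical table name chosen for this example, and the address is a made-up sample.

```sql
-- my_lex is a hypothetical name; it starts out as a copy of the packaged US lex table.
CREATE TABLE my_lex AS
SELECT * FROM us_lex;

-- Pass the custom table name in place of 'us_lex'; gaz and rules stay as packaged.
SELECT *
FROM standardize_address('my_lex', 'us_gaz', 'us_rules',
                         '1 Devonshire Place, Boston, MA 02109');
```

Rows could then be added to or changed in my_lex to adjust how particular words are classified, leaving us_lex untouched.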
- rules table: The rules table contains a set of rules that map address input sequence tokens to a standardized output sequence. A rule is defined as a set of input tokens, followed by -1 (terminator), followed by a set of output tokens, followed by -1, followed by a number denoting the kind of rule, followed by the ranking of the rule.
- lex table: A lex table is used to classify alphanumeric input and associate that input with (a) input tokens (see the section called “Input Tokens”) and (b) standardized representations.
- gaz table: A gaz table is used to standardize place names and associate that input with (a) input tokens (see the section called “Input Tokens”) and (b) standardized representations.
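The rule layout described above (input tokens, -1, output tokens, -1, rule kind, rank) can be taken apart with ordinary array functions. The query below is a sketch in plain PostgreSQL (9.5 or later, for array_position) and needs no extension tables; the rule string is an illustrative value, and the column aliases are names chosen for this example.

```sql
-- Split an illustrative rule string at its two -1 terminators.
WITH r AS (
    SELECT string_to_array('2 0 2 22 3 -1 5 5 6 7 3 -1 2 6', ' ')::int[] AS t
),
p AS (
    SELECT t,
           array_position(t, -1) AS end_in               -- first -1 terminator
    FROM r
),
q AS (
    SELECT t, end_in,
           array_position(t, -1, end_in + 1) AS end_out  -- second -1 terminator
    FROM p
)
SELECT t[1 : end_in - 1]            AS input_tokens,
       t[end_in + 1 : end_out - 1]  AS output_tokens,
       t[end_out + 1]               AS rule_type,
       t[end_out + 2]               AS rule_rank
FROM q;
```

For this string the query yields input tokens {2,0,2,22,3}, output tokens {5,5,6,7,3}, rule type 2, and rank 6. The same expressions could presumably be applied to the rule text stored in a rules table, assuming it follows the same space-separated layout.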
- parse_address: Takes a one-line address and breaks it into parts.
- standardize_address: Returns a stdaddr form of an input address, utilizing lex, gaz, and rules tables.
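The two functions can also be combined: parse_address separates the street-level part from the city, state, and postcode, and those two pieces are then handed to the five-argument form of standardize_address as its micro and macro inputs. This is a sketch under the assumption that both extensions are installed; the address is a made-up sample.

```sql
-- Assumes address_standardizer and address_standardizer_data_us are installed.
SELECT s.*
FROM parse_address('1 Devonshire Place, Boston, MA 02109') AS p
CROSS JOIN LATERAL standardize_address(
    'us_lex', 'us_gaz', 'us_rules',
    p.address1,                                   -- micro: street-level part
    format('%s, %s %s', p.city, p.state, p.zip)   -- macro: city, state, postcode
) AS s;
```

Pre-splitting the input this way can help when the single-line form is ambiguous, since the standardizer no longer has to work out where the street ends and the city begins.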