Chapter 12. Address Standardizer

This is a fork of the PAGC standardizer (the original code for this portion was the PAGC PostgreSQL Address Standardizer).

The address standardizer is a single-line address parser that takes an input address and normalizes it based on a set of rules stored in a rules table, together with helper lex and gaz tables.

The code is built into a single PostgreSQL extension library called address_standardizer, which can be installed with: CREATE EXTENSION address_standardizer; A sample data extension called address_standardizer_data_us is also built; it contains gaz, lex, and rules tables for US data and can be installed with: CREATE EXTENSION address_standardizer_data_us;
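The two statements above can be run together in one session. The sketch below assumes both extensions are available on the server and that the connected role is allowed to create extensions; the final catalog query is an addition for checking the result and is not part of the extension itself.

```sql
-- Install the standardizer and the sample US data.
-- IF NOT EXISTS makes the script safe to re-run.
CREATE EXTENSION IF NOT EXISTS address_standardizer;
CREATE EXTENSION IF NOT EXISTS address_standardizer_data_us;

-- Confirm what was installed, using the standard pg_extension catalog.
SELECT extname, extversion
FROM pg_extension
WHERE extname LIKE 'address_standardizer%'
ORDER BY extname;
```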

The code for this extension can be found in the PostGIS source tree under extensions/address_standardizer and is currently self-contained.

For installation instructions, refer to Section 2.8, “Installing and Using the address standardizer”.

12.1. How the Parser Works

The parser works from right to left, looking first at the macro elements (postcode, state/province, city) and then at the micro elements to determine whether it is dealing with a house number and street, an intersection, or a landmark. It currently does not look for a country code or name, but that could be introduced in the future.

Country code

Assumed to be US or CA, based on whether the postcode is a US or Canadian one, then on whether the state/province is a US or Canadian one; otherwise US is assumed.

Postcode/zipcode

These are recognized using Perl-compatible regular expressions. These regexes are currently in parseaddress-api.c and are relatively simple to change if needed.

State/province

These are recognized using Perl-compatible regular expressions. These regexes are currently in parseaddress-api.c but could be moved into include files in the future for easier maintenance.
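As a rough illustration of the macro elements described above, the parse_address function listed later in this chapter returns them as separate output columns. The sample address is made up for illustration, and the column names used here (num, street, city, state, zip, zipplus, country) are assumed from the function's documented output record rather than stated in this section.

```sql
-- Split a one-line address into its parts.
-- The macro elements (city, state, zip, country) are found at the
-- right-hand end of the string; num and street are the micro elements.
SELECT num, street, city, state, zip, zipplus, country
FROM parse_address('1 Devonshire Place, Boston, MA 02109-1234');
```

Running the same query against a Canadian address is a quick way to see how the country is inferred from the postcode or province.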

12.2. Address Standardizer Types

Abstract

This section lists the PostgreSQL data types installed by the Address Standardizer extension. Note that we describe the casting behavior of these types, which is very important, especially when designing your own functions.

stdaddr — A composite type that consists of the elements of an address. This is the return type for the standardize_address function.
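To see which elements the composite type carries on a given installation, the system catalogs can be queried directly. This is a generic PostgreSQL catalog lookup rather than anything provided by the extension, and it assumes address_standardizer is already installed so that the stdaddr type exists.

```sql
-- List the attributes of the stdaddr composite type in declaration order.
SELECT a.attname AS field_name,
       format_type(a.atttypid, a.atttypmod) AS data_type
FROM pg_type t
JOIN pg_attribute a ON a.attrelid = t.typrelid
WHERE t.typname = 'stdaddr'
  AND a.attnum > 0          -- skip system columns
  AND NOT a.attisdropped    -- skip dropped attributes
ORDER BY a.attnum;
```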

12.3. Address Standardizer Tables

Abstract

This section lists the PostgreSQL table formats used by the address_standardizer for normalizing addresses. Note that these tables do not need to have the same names as those referenced here. For example, you can have different lex, gaz, and rules tables for each country, or for your custom geocoder. The names of these tables are passed into the address standardizer functions.

The packaged extension address_standardizer_data_us contains data for standardizing US addresses.

rules table — The rules table contains a set of rules that map a sequence of address input tokens to a standardized output sequence. A rule is defined as a set of input tokens followed by -1 (terminator), followed by a set of output tokens, followed by -1, followed by a number denoting the kind of rule, followed by the ranking of the rule.
lex table — A lex table is used to classify alphanumeric input and associate that input with (a) input tokens (see the section called “Input Tokens”) and (b) standardized representations.
gaz table — A gaz table is used to standardize place names and associate that input with (a) input tokens (see the section called “Input Tokens”) and (b) standardized representations.
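A quick way to get a feel for these formats is to look at the sample data. The queries below are a sketch that assumes address_standardizer_data_us is installed and that its tables are named us_rules, us_lex, and us_gaz, with columns id and rule for the rules table and id, seq, word, stdword, and token for the lex and gaz tables. Those names are not stated in this section, so adjust them to match your installation.

```sql
-- A few rules: each is a space-separated string of input tokens, -1,
-- output tokens, -1, rule type, and rank.
SELECT id, rule
FROM us_rules
ORDER BY id
LIMIT 5;

-- How one abbreviation is classified in the lexicon.
-- The word value 'ST' is only an illustrative lookup key.
SELECT seq, word, stdword, token
FROM us_lex
WHERE upper(word) = 'ST'
ORDER BY seq;

-- A few gazetteer entries for place names.
SELECT seq, word, stdword, token
FROM us_gaz
ORDER BY id
LIMIT 5;
```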

12.4. Address Standardizer Functions

parse_address — Takes a one-line address and breaks it into parts.
standardize_address — Returns an stdaddr form of an input address, utilizing lex, gaz, and rules tables.
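A minimal usage sketch for standardize_address follows. It assumes the sample US tables are installed under the names us_lex, us_gaz, and us_rules (names not given in this section), and it selects a handful of stdaddr fields whose names are assumed from the type's definition. The address itself is only an example.

```sql
-- One-line form: table names first, then the full address.
SELECT house_num, name, suftype, unit, city, state, postcode
FROM standardize_address(
       'us_lex', 'us_gaz', 'us_rules',
       'One Devonshire Place, PH 301, Boston, MA 02109');

-- Two-part form: the micro part (street address) and the macro part
-- (city, state, postcode) are passed separately.
SELECT house_num, name, suftype, unit, city, state, postcode
FROM standardize_address(
       'us_lex', 'us_gaz', 'us_rules',
       'One Devonshire Place, PH 301',
       'Boston, MA 02109');
```

Because the table names are passed as arguments, the same calls should work with custom lex, gaz, and rules tables for another country or geocoder.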