PostGIS Odyssey Part 1

For a current (unreleased) project, I was starting to wander into the crazy world of maps. I started out with a very simple model of zip codes represented as points, but I had reached the limits of that approach being effective. I needed polygon data that gave a "truer" representation of what it means to be a zip code.

U.S. Census Data

My online search led me to the U.S. Census web site, where there is a bewildering array of available data, but with a little perseverence, I found this page: http://www2.census.gov/geo/tiger/TIGER2010/ZCTA5/2010/. This page has zip files with state-by-state ZCTA shape files. As I discovered, ZCTA is not the same as Zip Code data. The short version is this: There are more zip codes than ZCTAs, and ZCTAs (defined by the U.S. Census) are really an approximation of zip codes (defined by the U.S. Post Office). As a concrete example, the ZCTA file for Georgia has 735 zip code records compared to 960 actual zip codes.

Geography Project

I decided to stick with the individual state files after some memory-related mishaps trying to work directly with the overall U.S. data file. There are 57 files in all with a compressed size of 527.9 MB. That is not a trivial amount of data. Rather than burden my project directly with all of this information, I decided to create an application that would allow me to experiment with this data before deciding the best way to integrate it into my actual project.

Each zip file (e.g., tl_2010_01_zcta510.zip) consists of several files:

  • tl_2010_01_zcta510.dbf
  • tl_2010_01_zcta510.prj
  • tl_2010_01_zcta510.shp
  • tl_2010_01_zcta510.shp.xml
  • tl_2010_01_zcta510.shx

Together, these files define the geometry of ZCTAs in Alabama. Now, my interest is in creating PostGIS representations of this data. Because we live in a Unix world, I was not surprised to discover a handy command line tool called shp2pgsql. This tool converts a suite of shape files into PostGIS-ready records.

shp2pgsql -a -D -s 4269 /path/to/tl_2010_01_zcta510.shp zip_code > sql/data/alabama.sql

Key features of this statement:

  • -s (short for SRID, or spatial reference identifier) is the projection you want to use; census data is all based on 4269 (NAD 83, a 1983 datum appropriate for parts of North America)
  • -D causes the output to be in pg_dump format (it uses the COPY approach for much faster data loading than a series of INSERT statements)
  • zip_code is the name of the table to generate the SQL for (technically, it is the schema, so you might say "public.zip_code" to be more precise)
  • > precedes the name of the SQL file to be generated
  • -a means only append the data to the table (as opposed to -c where the script also creates the relevant table)

Creating the Database

psql template1 postgres
CREATE USER geographer CREATEDB PASSWORD 'magellan';
\q
createdb -U geographer --encoding UNICODE --template template0 geography
psql -q -U postgres -c "CREATE EXTENSION postgis;" -d geography
psql -q -U postgres -c "ALTER TABLE spatial_ref_sys OWNER TO geographer;" -d geography
psql -q -U geographer -f sql/schema.sql -d geography

Listing: schema.sql

CREATE TABLE zip_code (
    gid serial,
    statefp10 varchar(2),
    zcta5ce10 varchar(5),
    geoid10 varchar(7),
    classfp10 varchar(2),
    mtfcc10 varchar(5),
    funcstat10 varchar(1),
    aland10 float8,
    awater10 float8,
    intptlat10 varchar(11),
    intptlon10 varchar(12),
    partflg10 varchar(1),
    geom geometry(multipolygon,4269)
);
ALTER TABLE zip_code ADD PRIMARY KEY (gid);

I built a shell script to build the database, convert all the state shape files to SQL loaders, and then load the PostGIS SQL. When that was done, I had a database that was roughly 660 MB in size. That dwarfs the size of the rest of the data that will comprise my primary project. My next blog entry will explore what options I have for dealing with this potential problem.