At work we have a copy of the zip-codes.com Business database, which I tend to reference somewhat frequently due to the nature of projects I work on.
Today I needed a list of postal codes from major metro areas which would be used to drive a test data generator. The data generator has access to a simpler, non-commercial zip code list, and can do radius searches, but that's about it. I needed one code per metro area, and it worked out best to have the one closest to the center of the metro areas.
What I ended up with was a script to select the ZIP code closest to the approximate center for each of the N most populous CBSAs.
Some notes:
PrimaryRecord ='P'
: As it comes from the data provider, the data is not fully de-normalized. A ZIPcode could have multiple records, when it is part of a place that has multiple names. The 'P' record indicates the primary record for a given ZIP.- This query uses the faster Pythagorean distance, which works ok over small distances and lower latitudes. The great circle distance would be better.
WITH
topPopulationAreas AS (
SELECT TOP 50 CBSA_Name,
avg(Latitude) AS lat,
avg(Longitude) AS lon
FROM zipcode
WHERE PrimaryRecord = 'P' AND CBSA_Name != '' AND Population > 0
GROUP BY CBSA_Name
ORDER BY sum(Population) DESC
),
zipDistance AS (
SELECT DISTINCT ZipCode,
zc.CBSA_Name,
sqrt(square(lat - latitude) + square(lon - longitude)) AS dist
FROM zipcode zc
JOIN topPopulationAreas iq ON iq.CBSA_Name = zc.CBSA_Name
),
shortest AS (
SELECT CBSA_Name,
min(dist) AS mindist
FROM zipDistance
GROUP BY CBSA_Name
)
SELECT min(zd.ZipCode), zd.CBSA_Name, dist, Latitude, Longitude
FROM zipDistance zd
INNER JOIN shortest ON zd.dist = shortest.mindist AND zd.CBSA_Name = shortest.CBSA_Name
INNER JOIN zipcode z ON z.ZipCode = zd.ZipCode AND z.PrimaryRecord = 'P'
GROUP BY zd.CBSA_Name, dist, Latitude, Longitude