4ccc4 4ccc4 - 1 year ago 45
SQL Question

finding people with possible incorrectly spelled cities where zip codes match

I am trying to create a report that will return a list of people whose cities most likely need to be corrected.

I was thinking of comparing the data against other data within the table to leverage the assumption that most of the cities are spelled correctly. Take Albuquerque, for example. We have records for many of the zip codes, but the city isn't always spelled correctly.

I can't figure out my next step.

Here's what I have started with:

SELECT city, zip_5_digits, COUNT(*) AS "COUNT"
FROM people

INNER JOIN addresses
ON addresses.people_id = people.id
AND city LIKE 'Albu%que'

GROUP BY city, zip_5_digits

Doing this results in

Albuqureque 87108 1
Albuquerque 87108 238
Albuqerque 87109 1
Albuquerque 87109 34

What I'd like to do is, for each row, find the maximum records where the zip code matches but the city does not match. If there is no match, I want to return that record, and I'll use this to return people's id and names, since I most likely need to correct the name of the city for those people who have it mis-spelled.

Answer Source

This is hard, because some "cities" have very few residents. And, some zip codes might just have a small part of a city.

I would recommend two rules:

  • Look at zip codes that have at least a certain number of people -- say 100.
  • Look at cities in the zip code that have less than some number -- say 5.

There are candidates for misspellings:

FROM (SELECT city, zip_5_digits, COUNT(*) AS cnt,
             MAX(COUNT(*)) OVER (PARTITION BY zip_5_digits) as max_cnt,
             SUM(COUNT(*)) OVER (PARTITION BY zip_5_digits) as sum_cnt
      FROM people p, INNER JOIN
           addresses a
           ON a.people_id = p.id
      GROUP BY city, zip_5_digits
     ) pa
WHERE sum_cnt >= 100 AND cnt <= 5;