I am trying to create a report that will return a list of people whose cities most likely need to be corrected.
I was thinking of comparing the data against other data within the table to leverage the assumption that most of the cities are spelled correctly. Take Albuquerque, for example. We have records for many of the zip codes, but the city isn't always spelled correctly.
I can't figure out my next step.
Here's what I have started with:
SELECT city, zip_5_digits, COUNT(*) AS "COUNT"
INNER JOIN addresses
ON addresses.people_id = people.id
AND city LIKE 'Albu%que'
GROUP BY city, zip_5_digits
Albuqureque 87108 1
Albuquerque 87108 238
Albuqerque 87109 1
Albuquerque 87109 34
This is hard, because some "cities" have very few residents. And, some zip codes might just have a small part of a city.
I would recommend two rules:
There are candidates for misspellings:
SELECT pa.* FROM (SELECT city, zip_5_digits, COUNT(*) AS cnt, MAX(COUNT(*)) OVER (PARTITION BY zip_5_digits) as max_cnt, SUM(COUNT(*)) OVER (PARTITION BY zip_5_digits) as sum_cnt FROM people p, INNER JOIN addresses a ON a.people_id = p.id GROUP BY city, zip_5_digits ) pa WHERE sum_cnt >= 100 AND cnt <= 5;