Paul Stoner - 22 days ago
Python Question

pandas how to eliminate duplicate rows before they occur

I have a dataframe consisting of the state name and city name. However, the city names are not simply Pittsburgh, Philadelphia, etc. The city name may contain what I call prestige names. Here is a small sample:

State         RegionName
Pennsylvania  California (California Uni...
Pennsylvania  Carlisle (Dickinson College)
Pennsylvania  Cecil B. Moore, Philadelphia, also...
...
Pennsylvania  University City, Philadelphia (Drexel Universi...


I need to clean up this data by removing the parenthetical information and the like. My question is this: both Cecil B. Moore and University City are parts of Philadelphia. If I rename these values, then I will have two rows of Pennsylvania Philadelphia in my data set, which I don't want.

So, from a data science perspective, is it acceptable for me to simply delete one of these rows and rename the RegionName value in the other? Or is there some way, in pandas, to "combine" these rows after cleanup and renaming?

This data will eventually be married to housing values by state and region name (city).

Thank you

Answer

Just ingest all of the rows, then, once the RegionName values have been cleaned up and renamed, use .drop_duplicates() to remove the duplicate rows from the data frame.
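
As a rough sketch of that workflow: the sample rows below are simplified stand-ins for the truncated listing in the question, and the neighborhood_to_city mapping is a hypothetical lookup you would have to build yourself. The regular expression assumes the parenthetical always sits at the end of the value.

```python
import pandas as pd

# Hypothetical sample rows, simplified from the question's truncated listing.
df = pd.DataFrame(
    {
        "State": ["Pennsylvania"] * 3,
        "RegionName": [
            "Carlisle (Dickinson College)",
            "Cecil B. Moore, Philadelphia",
            "University City, Philadelphia (some university)",
        ],
    }
)

# Remove a trailing parenthetical such as " (Dickinson College)".
df["RegionName"] = df["RegionName"].str.replace(r"\s*\(.*\)\s*$", "", regex=True)

# Hypothetical mapping from neighborhood names to their parent city.
neighborhood_to_city = {
    "Cecil B. Moore, Philadelphia": "Philadelphia",
    "University City, Philadelphia": "Philadelphia",
}
df["RegionName"] = df["RegionName"].replace(neighborhood_to_city)

# Both Philadelphia rows are now identical, so only the first one is kept.
cleaned = df.drop_duplicates(subset=["State", "RegionName"]).reset_index(drop=True)
print(cleaned)
```

This works as long as the duplicate rows carry no other information. If each row also held its own values that needed to be kept, a groupby on State and RegionName with an aggregation would probably be the better way to "combine" them.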