I'm scraping some data on college basketball teams from ESPN's BPI page (http://www.espn.com/mens-college-basketball/bpi/_/view/resume) to store in a pandas dataframe. When I read the HTML table into a dataframe, the abbreviated school name is appended to the full school name. E.g., I have several strings that look like this: "North CarolinaUNC".
How can I remove the UNC from the end of the string? I tried the below regex to match characters at the end of strings:
name = "North CarolinaUNC"
name = re.sub(r"\z[A-Z]","", name)
Use $ to match the end of the string, and a positive lookbehind to check that the uppercase letters come directly after a lowercase letter:
import re
name = "North CarolinaUNC"
name = re.sub(r"(?<=[a-z])[A-Z]+$", "", name)
This gives "North Carolina", as expected.
With that expression, "North Carolina UNC" stays unmodified: the uppercase letters are at the end of the string, but they follow a space rather than a lowercase letter.
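To apply the same idea to more than one string, here is a minimal sketch that wraps the expression in a helper. The names strip_abbreviation and ABBREV_SUFFIX and the sample values are my own illustrations, not taken from the ESPN page:

```python
import re

# A trailing run of capital letters that directly follows a lowercase letter.
ABBREV_SUFFIX = re.compile(r"(?<=[a-z])[A-Z]+$")


def strip_abbreviation(name: str) -> str:
    """Remove an abbreviation glued to the end of a school name."""
    return ABBREV_SUFFIX.sub("", name)


# Hypothetical sample values for illustration only.
names = ["North CarolinaUNC", "North Carolina UNC", "DukeDUKE"]
cleaned = [strip_abbreviation(n) for n in names]
print(cleaned)  # ['North Carolina', 'North Carolina UNC', 'Duke']
```

In pandas, the same pattern should work on a whole column with something like df["Team"].str.replace(r"(?<=[a-z])[A-Z]+$", "", regex=True), where "Team" is an assumed column name. One caveat: a name made up only of capitals, such as a hypothetical "TCUTCU", is left unchanged because no lowercase letter precedes the trailing capitals.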