BlackHat BlackHat - 7 months ago 19
Python Question

Extract specific text from dataframe by separator count and position

Learning regular expressions and stumbled into a bit of a wall.
I have the following dataframe:

item_data=pandas.DataFrame({'item':['001','002','003'],
'description':['Fishing,Hooks,12-inch','Fishing,Lines','Fish Eggs']})


For each description, I want to be extract everything prior to the second comma ",". If there is no comma, then the original description is retained

Results should look like this:

item_data=pandas.DataFrame({'item':['001','002','003'],
'description':['Fishing,Hooks,12-inch','Fishing,Lines','Fish Eggs'],
'new_description':['Fishing,Hooks','Fishing,Lines', 'Fish Eggs']})


Any pointers would be much appreciated.

Thanks.

Answer

Using a regexp...

re.sub("^([^,]*,[^,]*),.*$", "\\1", x)

meaning is

  • ^ start of string
  • ( start capture
  • [^,] anything but a comma
  • * zero or more times
  • , a comma
  • [^,] anything but a comma
  • * zero or more times
  • ) end of capture
  • , another comma
  • .* anything
  • $ end of string

Replacing with the content of group 1 (\1) drops whatever is present after the second comma

Comments