BlackHat BlackHat - 5 months ago 11
Python Question

Extract specific text from dataframe by separator count and position

Learning regular expressions and stumbled into a bit of a wall.
I have the following dataframe:

'description':['Fishing,Hooks,12-inch','Fishing,Lines','Fish Eggs']})

For each description, I want to be extract everything prior to the second comma ",". If there is no comma, then the original description is retained

Results should look like this:

'description':['Fishing,Hooks,12-inch','Fishing,Lines','Fish Eggs'],
'new_description':['Fishing,Hooks','Fishing,Lines', 'Fish Eggs']})

Any pointers would be much appreciated.



Using a regexp...

re.sub("^([^,]*,[^,]*),.*$", "\\1", x)

meaning is

  • ^ start of string
  • ( start capture
  • [^,] anything but a comma
  • * zero or more times
  • , a comma
  • [^,] anything but a comma
  • * zero or more times
  • ) end of capture
  • , another comma
  • .* anything
  • $ end of string

Replacing with the content of group 1 (\1) drops whatever is present after the second comma