xxxvinxxx xxxvinxxx - 2 months ago 22
Python Question

Parse pandas df column with regex extracting substrings

I have a pandas df containing a column composed of text like:

String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_text::some_text


I can see that:


  1. The start of the text always contains the first string I want to extract

  2. Each of the remaining strings sits between a ";" and the next "::"



I want to create a new column containing:

String1, String2, String3, String4


All separated by commas, but still in the same column.

How should I approach this problem?

Thanks for your help

Answer

I would apply a lambda function that does exactly this: split on ";", split each piece on "::" and keep the first element, then join the results back together with ", ":

df['new_col'] = df['old_col'].apply(lambda s: ", ".join(t.split("::")[0] for t in s.split(";")))
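To try this out, here is a minimal runnable sketch. The sample DataFrame, its second row, and the helper name leading_labels are made up for illustration; only the column names old_col and new_col come from the line above.

```python
import pandas as pd

# Hypothetical sample data in the format described in the question.
df = pd.DataFrame({
    "old_col": [
        "String1::some_text::some_text;String2::some_text::;"
        "String3::some_text::some_text;String4::some_text::some_text",
        "A::x::y;B::z::",
    ]
})

def leading_labels(s):
    # Split into records on ";", then keep the text before the first "::" of each record.
    return ", ".join(t.split("::")[0] for t in s.split(";"))

df["new_col"] = df["old_col"].apply(leading_labels)
print(df["new_col"].tolist())
# ['String1, String2, String3, String4', 'A, B']
```

This assumes every cell is a string; a missing value (NaN) in old_col would need to be handled before calling split.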

You could also avoid splitting on "::", since stopping before the first ":" is enough. Note that t.index(":") raises a ValueError if a piece contains no colon at all:

df['new_col'] = df['old_col'].apply(lambda s: ", ".join(t[:t.index(":")] for t in s.split(";")))
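
Since the question asks about regex, the same extraction can probably be written with the standard re module as well. This is only a sketch: the pattern, the name LABEL_RE, and the helper extract_labels are my own choices, not something from the question.

```python
import re

# Capture the text that follows the start of the string or a ";"
# and runs up to the next "::".
LABEL_RE = re.compile(r"(?:^|;)([^:;]*)::")

def extract_labels(s):
    # findall returns the captured group for every match, in order.
    return ", ".join(LABEL_RE.findall(s))

sample = (
    "String1::some_text::some_text;String2::some_text::;"
    "String3::some_text::some_text;String4::some_text::some_text"
)
print(extract_labels(sample))
# String1, String2, String3, String4
```

It could then be applied with df['new_col'] = df['old_col'].apply(extract_labels). Unlike the split-based versions, a piece that contains no "::" is silently skipped rather than kept or raising an error, so pick whichever behaviour suits your data.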