I have a CSV file from a database that I've converted into a pandas DataFrame and am trying to clean up. One of the issues is that multiple values have been entered into single cells and need to be split into separate rows. The complicating factor is that there are string comments (which also contain commas) that need to be kept intact. The problem is illustrated in the example below, in Series form.
What I have:
Index | values
0 | 2.54,3.563
1 | bad design, right?
What I want:
Index | level_0 | values
0 | 0 | 2.54
1 | 0 | 3.563
2 | 1 | bad design, right?
import pandas as pd
# Example Series:
data = pd.Series(["2.54,3.56", "3.24,5.864", "bad design, right?"], name="values")
# Split cells with multiple entries into separate rows
split_data = data.str.split(r'[,]\b').apply(pd.Series)
# Stack the results and pull out the index into a column (which is sample number in my case)
split_data = split_data.stack().reset_index(0)
split_data = split_data.reset_index(drop=True)
I would be inclined to use a lookahead; how you do so depends on your expected data.
One option is a negative lookahead meaning "a comma that is not followed by whitespace". This is preferable if you are sure that all comments containing commas have whitespace after the comma, and you would want to treat "red,green" as something to split.
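A sketch of that negative lookahead against your example Series (I've used Series.explode in place of the apply/stack steps for brevity, but the regex is the important part):

```python
import pandas as pd

data = pd.Series(["2.54,3.56", "3.24,5.864", "bad design, right?"], name="values")

# r",(?!\s)" matches a comma NOT followed by whitespace,
# so the comment "bad design, right?" stays intact.
split_data = data.str.split(r",(?!\s)").explode().reset_index()
print(split_data)
```

The `index` column here plays the role of your `level_0` column: it records which original row each value came from.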
Another option is a positive lookahead for something that looks like a valid value; your example values were numbers, so, for instance, you could split only on a comma that is followed by a digit.
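A sketch of the positive-lookahead variant; note that it also protects a comment whose comma has no trailing space, which the negative lookahead would split:

```python
import pandas as pd

data = pd.Series(["2.54,3.56", "red,green is bad"], name="values")

# r",(?=\d)" matches only a comma directly followed by a digit,
# so even the space-free comma in "red,green is bad" is kept.
split_data = data.str.split(r",(?=\d)").explode().reset_index()
print(split_data)
```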
Regular expressions are very powerful, but honestly, I am not sure this approach will serve you well if this is a long-term problem. Getting most cases right in a one-time migration should be fine, but longer term I would try to solve the problem before the data gets to this point. Anyway, here's Debuggex's Python regex cheat sheet, in case it is useful to you: https://www.debuggex.com/cheatsheet/regex/python