Michal Holub Michal Holub - 3 months ago 10
Python Question

Can pandas dataframe have dtype of list?

I'm new to Pandas, I process a dataset, where one of the columns is string with pipe (

|
) separated values. Now I have a task to remove any text in this |-separated field that's not fulfilling certain criteria.

My naive approach is to iterate the dataframe row by row and explode the field into list and validate this way. Then write the modified row back to the original dataframe. See this metasample:

for index, row in dataframe.iterrows():
fixed = [x[:29] for x in row['field'].split('|')]
dataframe.loc[index, 'field'] = "|".join(fixed)


Is there a better, and more importantly faster way to do this?

Answer

IIUC you can use:

dataframe = pd.DataFrame({'field':['aasd|bbuu|cccc|ddde|e','ffff|gggg|hhhh|i|j','cccc|u|k'],
                          'G':[4,5,6]})

print (dataframe)
   G                  field
0  4  aasd|bbuu|cccc|ddde|e
1  5     ffff|gggg|hhhh|i|j
2  6               cccc|u|k


print (dataframe.field.str.split('|', expand=True)
                          .stack()
                          .str[:2] #change to 29
                          .groupby(level=0)
                          .apply('|'.join))

0    aa|bb|cc|dd|e
1     ff|gg|hh|i|j
2           cc|u|k
dtype: object

Another solution via list comprehension:

dataframe['new'] = pd.Series([[x[:2] for x in y] for y in dataframe.field.str.split('|')], 
                             index=dataframe.index)
                     .apply('|'.join)
print (dataframe)
   G                  field            new
0  4  aasd|bbuu|cccc|ddde|e  aa|bb|cc|dd|e
1  5     ffff|gggg|hhhh|i|j   ff|gg|hh|i|j
2  6               cccc|u|k         cc|u|k

dataframe = pd.DataFrame({'field':['aasd|bbuu|cc|ddde|e','ffff|gggg|hhhh|i|j','cccc|u|k'],
                   'G':[4,5,6]})

print (dataframe)
   G                field
0  4  aasd|bbuu|cc|ddde|e
1  5   ffff|gggg|hhhh|i|j
2  6             cccc|u|k

If need filter all values with values longer as 2:

s = dataframe.field.str.split('|', expand=True).stack()
print (s)
0  0    aasd
   1    bbuu
   2      cc
   3    ddde
   4       e
1  0    ffff
   1    gggg
   2    hhhh
   3       i
   4       j
2  0    cccc
   1       u
   2       k
dtype: object

dataframe['new'] = s[s.str.len() < 3].groupby(level=0).apply('|'.join)
print (dataframe)

   G                field   new
0  4  aasd|bbuu|cc|ddde|e  cc|e
1  5   ffff|gggg|hhhh|i|j   i|j
2  6             cccc|u|k   u|k

Another solution:

dataframe['new'] = pd.Series([[x for x in y if len(x) < 3] for y in dataframe.field.str.split('|')], 
                              index=dataframe.index)
                     .apply('|'.join)
print (dataframe)
   G                field   new
0  4  aasd|bbuu|cc|ddde|e  cc|e
1  5   ffff|gggg|hhhh|i|j   i|j
2  6             cccc|u|k   u|k