Patthebug Patthebug - 6 months ago 30
Python Question

Using split to populate a column in Pandas dataframe

I'm working with text data and I would simply like to populate a new column based on an existing column.

Example: the column

sourceEncodedID
may have values like
a.b.c
and I'd like to extract only the second part of the string,
b
, if there's a second part available. Here are some example values:

sourceEncodedID Branch
a.b.c b
c.r.d r
a a
p p


To achieve this, I have come up with the following code:

for i in range(0,20350):
if len(str(artifacts.sourceEncodedID[i]).split('.')) > 1:
artifacts['branch'][i] = str(artifacts.sourceEncodedID[i]).split('.')[1]
else:
artifacts['branch'][i] = str(artifacts.sourceEncodedID[i])


There are only 20k rows in the dataframe and yet this piece of code take minutes to execute, before never finishing and rendering my browser unreponsive (I'm using
ipython notebook
). I would have thought this would run in a couple of seconds.

Is there something obviously silly in this code that I'm unable to catch? How do I fix it?

Answer

UPDATE:

In [68]: x['new'] = x.sourceEncodedID

In [69]: x
Out[69]:
  sourceEncodedID Branch    new
0           a.b.c      b  a.b.c
1           c.r.d      r  c.r.d
2               a      a      a
3               p      p      p

In [70]: x.ix[x.sourceEncodedID.str.contains('\.'), 'new'] = x.sourceEncodedID.str.split('\.', expand=True)[1]

In [71]: x
Out[71]:
  sourceEncodedID Branch new
0           a.b.c      b   b
1           c.r.d      r   r
2               a      a   a
3               p      p   p

when working with pandas data frames first always try to find vectorized solution. And only if it's absolutely not possible then double-check it and only after that try loop through approach, as it will be orders of magnitude slower.

OLD answer:

try this:

In [61]: x.sourceEncodedID.str.split('\.', expand=True)[1]
Out[61]:
0       b
1       r
2    None
3    None
Name: 1, dtype: object
Comments