r0xette r0xette - 1 month ago 10
Python Question

Extract values from pandas stream

I have very weird data coming via curl into my pandas dataframe. What I would like to do is extract values out of the column as described below. Can someone guide me how to extract the info?

cc = pd.read_csv(cc_curl)
print(cc['srv_id'])

srv_id
------
TicketID 14593_ServiceID 104731
ServiceID
TicketID 14595_ServiceID 104732
TicketID 14609_ServiceID 0
TicketID 0_ServiceID 178282



  1. Extract 5 digit ticket id and 6 digit service id.

  2. Extract nothing since there is no ticketID and service ID is blank.

  3. Extract 5 digit ticket id and 6 digit service id.

  4. Extract 5 digit ticket id only and service id should be blank since it is 0.

  5. Extract 6 digit service id only and leave ticket ID blank since it is 0.



Desired output

srv_id
------
14593 104731

14595 104732
14609
178282

Answer

If you want to extract this information into two new columns, you can do it this way:

import numpy as np
import pandas as pd

In [22]: df[['TicketID','ServiceID']] = (
    ...:   df.srv_id.str.extract(r'TicketID\s+(\d+).*?ServiceID\s+(\d+)', expand=True)
    ...:     .replace(r'\b0\b', np.nan, regex=True)
    ...: )
    ...:

In [23]: df
Out[23]:
                            srv_id TicketID ServiceID
0  TicketID 14593_ServiceID 104731    14593    104731
1                       ServiceID       NaN       NaN
2  TicketID 14595_ServiceID 104732    14595    104732
3       TicketID 14609_ServiceID 0    14609       NaN
4      TicketID 0_ServiceID 178282      NaN    178282

If you want to replace your string with extracted numbers:

In [161]: df['new_srv_id'] = \
              df.srv_id.replace([r'[^\d{5,}]+', r'\s*\b0\b\s*'], [' ', ''], regex=True)

In [162]: df
Out[162]:
                            srv_id     new_srv_id
0  TicketID 14593_ServiceID 104731   14593 104731
1                       ServiceID
2  TicketID 14595_ServiceID 104732   14595 104732
3       TicketID 14609_ServiceID 0          14609
4      TicketID 0_ServiceID 178282         178282