CodeLearner CodeLearner - 2 months ago 51
Python Question

Extracting strings from a pandas dataframe

I'm here again hoping to find a solution to my coding nightmare. I have a dictionary

term_dict
with list of terms as keys and term category as values. And a dataframe
data
with ID and Notes columns. The task is to find matching terms in the
data.Notes
using the
term_dict
for every
data.ID
record.

term_dict{
Ibuprofen 800mg : Drug
Hip Replacement Surgery : Treatment
Tylenol AM : Drug
Mild Dislocation : Treatment
Advil : Drug
Fractured Tibia : Treatment
Quinone : Drug
Fever : Treatment
Penicillin 250mg : Drug
Histerectomy : Treatment
Surgical removal of bunion : Treatment
Therapy : Treatment
Bunion : Treatment
Hospita X : Location
mg : Dosage
stop : Exclusion
}

data:
ID Notes
604 Take 2 tablets of advil & 3 caps of pen
250mg twice daily
602 Stop pen but cont. with advil
as needed for the fracture
210 2 tabs of Tyl 3x daily for 5 days
607 nan
700 surgery scheduled for 01/01/2017
515 nan
019 Call my office if bunion pain persist
after 3 days
604 f/up appt. @Hospital X


So far, this is my code:

lists = []
for s in data['Notes']:
cleanNotes = " " + " ".join(re.split(r'[^a-z 0-9]|[w/]',s.lower())) + " "
for k, v in term_dict.items():
k = " %s "%k
if k in cleanNotes and v != exclusion:
if k in cleanNotes and v == 'drug':
lists.append(k)
data['Drug'] = ':'.join(str(lists))
elif k in cleanNotes and v == 'location':
lists.append(k)
data['Location'] = ' '.join(str(lists))
elif k in cleanNotes and v == 'treatment':
lists.append(k)
data['Treatment'] = ':'.join(str(lists))
elif k in cleanNotes and v == 'dosage':
lists.append(k)
data['Dosage'] = ':'.join(str(lists))
else:
for s in data.Notes:
matches = list(datefinder.find_dates(s.lower()))
data['Date'] = ', '.join([str(dates) for dates in matches])


....and my output is not what's expected because the code just populates the new columns of the dataframe with matches from he last record of the dataframe:

data:
ID Notes Drug Dosage Location Treatment Date
604 Take 2 tablets of advil & 3 caps of pen advil Hospital X
250mg twice daily
602 Stop pen but cont. with advil advil Hospital X
as needed for the fracture
210 2 tabs of Tyl 3x daily for 5 days advil
607 nan advil
700 surgery scheduled for 01/01/2017 advil
515 nan advil
019 Call my office if bunion pain persist advil
after 3 days
604 f/up appt. @Hospital X. cont w/advil advil Hospital X


***But expected Output:

data:
ID Notes Drug Dosage Location Treatment Date
604 Take 2 tablets of advil & 3 caps of pen advil:penicilin 0:250mg
250mg twice daily
602 Stop pen but cont. with advil advil fracture
as needed for the fracture
210 2 tabs of Tyl 3x daily for 5 days Tylenol
607 nan
700 surgery scheduled for 01/01/2017 surgery 01/01/2017
515 nan
019 Call my office if bunion pain persist bunion
after 3 days
604 f/up appt. @Hospital X. cont w/advil advil Hospital X


I'd be more than grateful if I can get this duplication fixed. Thanks!

Answer Source

The essence of your error is this. You assign every element of that column to the same value:

In [114]: import pandas as pd

In [115]: df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))

In [116]: df.head()
Out[116]: 
          A         B         C         D
0 -0.896291 -0.277551  0.926559  0.522212
1 -0.265559 -1.300435 -0.079514 -1.083569
2 -0.534509  0.298264 -1.361829  0.750666
3  0.318937 -0.407164  0.080020  0.499435
4 -0.161574 -1.012471  0.631092  1.368540

In [117]: df['NewCol'] = 'something here'

In [119]: df.head()
Out[119]: 
          A         B         C         D          NewCol
0 -0.896291 -0.277551  0.926559  0.522212  something here
1 -0.265559 -1.300435 -0.079514 -1.083569  something here
2 -0.534509  0.298264 -1.361829  0.750666  something here
3  0.318937 -0.407164  0.080020  0.499435  something here
4 -0.161574 -1.012471  0.631092  1.368540  something here

To fix this, what you can do is create empty columns up front, like this:

In [120]: df = pd.DataFrame(np.random.randn(50, 1), columns=['Notes'])

In [121]: df['Drug'] = ""
     ...: df['Location'] = ""
     ...: df['Treatment'] = ""
     ...: df['Dosage'] = ""
     ...: 

In [122]: df.head()
Out[122]: 
      Notes Drug Location Treatment Dosage
0  0.325993                               
1 -0.561066                               
2  0.555040                               
3  0.001332                               
4  0.400009  

When looping over notes, use an enumerated loop:

for i, s in enumerate(data['Notes']):

Then, when the need arises, just set that cell appropriately:

df.set_value(i, 'Drug', 'advil')