mik.ferrucci - 9 months ago
Python Question

Pandas: create a new column from a substring of another column, only where a string-length condition holds (fast way)

I am working with a large df (nearly 2 million rows) and need to create a new column from an existing one. The task seems easy: the starting column, called "PTCODICEFISCALE", contains a string of either 11 or 16 characters, with no other possibilities and no NaN.
The new column I have to create ("COGNOME") must contain the first 3 characters of "PTCODICEFISCALE" ONLY IF the length of "PTCODICEFISCALE" in that row is 16; when the length is 11, the new column should contain nothing, which I think means "NaN".

I have tried this:

csv.loc[len(csv['PTCODICEFISCALE']) == 16, 'COGNOME'] = csv.loc[csv.PTCODICEFISCALE.str[:3]]

This raises the following error:

ValueError: cannot index with vector containing NA / NaN values

I don't understand this, because I am sure there are no NA/NaN values in the "PTCODICEFISCALE" column.

Any help? Thanks!

P.S.: "csv" is the name of the DataFrame

Answer

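len(csv['PTCODICEFISCALE']) returns the number of rows in the column, not the length of each string, so it cannot serve as a row-wise condition. I think you need numpy.where with a condition built from str.len():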

csv['COGNOME'] = np.where(csv.PTCODICEFISCALE.str.len() == 16, csv.PTCODICEFISCALE.str[:3], np.nan)
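To make the difference between len() and str.len() concrete, here is a minimal sketch. The frame `df` and its three codes are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
df = pd.DataFrame(
    {'PTCODICEFISCALE': ['ABCDEF12G34H567I', '01234567891', 'LMNOPQ98R76S543T']}
)

# len() on a Series counts rows: one integer for the whole column.
n_rows = len(df['PTCODICEFISCALE'])

# .str.len() measures each string: one integer per row.
lengths = df['PTCODICEFISCALE'].str.len()

# Comparing the per-row lengths gives the boolean mask the task needs.
mask = lengths == 16

print(n_rows)           # 3
print(lengths.tolist()) # [16, 11, 16]
print(mask.tolist())    # [True, False, True]
```

So `len(...) == 16` is a single True/False for the whole column, while `.str.len() == 16` has one entry per row and can be passed to np.where or .loc.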


import numpy as np
import pandas as pd

csv = pd.DataFrame({'PTCODICEFISCALE':['0123456789123456','1','01234567891234']})
print(csv)
    PTCODICEFISCALE
0  0123456789123456
1                 1
2    01234567891234

csv['COGNOME'] = np.where(csv.PTCODICEFISCALE.str.len() == 16, csv.PTCODICEFISCALE.str[:3], np.nan)

print(csv)
    PTCODICEFISCALE COGNOME
0  0123456789123456     012
1                 1     NaN
2    01234567891234     NaN
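If you would rather stay within pandas, two alternatives should give the same result: Series.where, or a .loc assignment that uses the same boolean mask on both sides (closer to what the question attempted). This is only a sketch: `df`, its sample codes and the `COGNOME_LOC` column name are made up for illustration, and I have not timed either option against numpy.where on 2 million rows.

```python
import pandas as pd

# Hypothetical sample data, not the asker's real frame.
df = pd.DataFrame(
    {'PTCODICEFISCALE': ['ABCDEF12G34H567I', '01234567891', 'LMNOPQ98R76S543T']}
)

# One boolean per row: True where the code is 16 characters long.
mask = df['PTCODICEFISCALE'].str.len() == 16

# Option 1: Series.where keeps values where the mask is True
# and puts a missing value (NaN) everywhere else.
df['COGNOME'] = df['PTCODICEFISCALE'].str[:3].where(mask)

# Option 2: assign through .loc, selecting the same rows on both sides.
# Rows outside the mask are left as NaN in the newly created column.
# 'COGNOME_LOC' is an illustrative name so both results can be compared.
df.loc[mask, 'COGNOME_LOC'] = df.loc[mask, 'PTCODICEFISCALE'].str[:3]

print(df)
```

In both options the mask is a per-row boolean Series, which is what the original `len(...) == 16` was missing.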