chickensoup - 20 days ago
Python Question

Slicing a dataframe by columns, excluding columns in a blacklist

I want to get a subset of a dataframe, ignoring some label columns (I am implementing K-means on data that also carries classification columns). So I want a way to get rid of those labels and use only the remaining columns.
Here is the dataframe:

         28        29        30        31        32  Phase  clusterIndex
0  0.007871  0.004631  0.000963  0.000092  0.000438      D             0
1  0.003459  0.000730  0.000332  0.000012  0.000433      D             0
2  0.003261  0.002412  0.000852  0.000042  0.000202      D             0
3  0.001358  0.000313  0.000611  0.000029  0.000596      D             1
4  0.001713  0.000203  0.000069  0.000038  0.000069      D             1
5  0.001656  0.000041  0.000048  0.000221  0.000045      D             1
6  0.001348  0.000023  0.000107  0.000316  0.000109      D             1
7  0.001544  0.000194  0.000138  0.000829  0.000138      D             1
8  0.000359  0.000469  0.000278  0.000290  0.000279      D             1
9  0.000397  0.000351  0.000232  0.000449  0.000230      D             1


I just want to remove 'Phase' and 'clusterIndex' and get a new dataframe to process.

Answer

You can use a list comprehension, which keeps the original column order:

blacklisted = ['Phase', 'clusterIndex']

cols = [col for col in df.columns if col not in blacklisted]
print (cols)
['28', '29', '30', '31', '32']

print (df[cols])
         28        29        30        31        32
0  0.007871  0.004631  0.000963  0.000092  0.000438
1  0.003459  0.000730  0.000332  0.000012  0.000433
2  0.003261  0.002412  0.000852  0.000042  0.000202
3  0.001358  0.000313  0.000611  0.000029  0.000596
4  0.001713  0.000203  0.000069  0.000038  0.000069
5  0.001656  0.000041  0.000048  0.000221  0.000045
6  0.001348  0.000023  0.000107  0.000316  0.000109
7  0.001544  0.000194  0.000138  0.000829  0.000138
8  0.000359  0.000469  0.000278  0.000290  0.000279
9  0.000397  0.000351  0.000232  0.000449  0.000230
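
For anyone who wants to run this without the original data, here is a minimal self-contained sketch. The frame below is a hypothetical reconstruction of a few rows and columns from the question, and the column labels are assumed to be strings. It also shows a boolean-mask variant with Index.isin, which is not part of the original answer but should select the same columns.

```python
import pandas as pd

# Hypothetical reconstruction of part of the question's frame
# (assumption: column labels are strings).
df = pd.DataFrame({
    '28': [0.007871, 0.003459, 0.003261],
    '29': [0.004631, 0.000730, 0.002412],
    'Phase': ['D', 'D', 'D'],
    'clusterIndex': [0, 0, 0],
})

blacklisted = {'Phase', 'clusterIndex'}  # a set makes membership tests cheap

# Keep every column that is not blacklisted, in the original order.
cols = [col for col in df.columns if col not in blacklisted]
features = df[cols]

# Variant: build a boolean mask over the columns and select with .loc.
mask = ~df.columns.isin(blacklisted)
features_mask = df.loc[:, mask]
```

Both selections return a new dataframe and leave df itself unchanged.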

Or Index.difference (note that by default it returns the labels sorted, so the column order may change):

blacklisted = ['Phase', 'clusterIndex']

cols = df.columns.difference(blacklisted)
print (cols)
Index(['28', '29', '30', '31', '32'], dtype='object')

print (df[cols])
         28        29        30        31        32
0  0.007871  0.004631  0.000963  0.000092  0.000438
1  0.003459  0.000730  0.000332  0.000012  0.000433
2  0.003261  0.002412  0.000852  0.000042  0.000202
3  0.001358  0.000313  0.000611  0.000029  0.000596
4  0.001713  0.000203  0.000069  0.000038  0.000069
5  0.001656  0.000041  0.000048  0.000221  0.000045
6  0.001348  0.000023  0.000107  0.000316  0.000109
7  0.001544  0.000194  0.000138  0.000829  0.000138
8  0.000359  0.000469  0.000278  0.000290  0.000279
9  0.000397  0.000351  0.000232  0.000449  0.000230
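
In the question's data the columns happen to be in sorted order already, so the sorting done by difference is invisible. A small sketch of what happens otherwise, using a hypothetical frame with columns 'b', 'a' and 'Phase' (these names are made up for illustration). It assumes a pandas version recent enough to support the sort parameter of Index.difference.

```python
import pandas as pd

# Hypothetical frame whose columns are deliberately not in sorted order.
df = pd.DataFrame({'b': [1, 2], 'a': [3, 4], 'Phase': ['D', 'D']})
blacklisted = ['Phase']

# Default behaviour: the remaining labels come back sorted.
sorted_cols = df.columns.difference(blacklisted)

# sort=False skips the sorting step, so the labels keep their original order.
original_order_cols = df.columns.difference(blacklisted, sort=False)

print(list(sorted_cols))
print(list(original_order_cols))
```

If column order matters for later processing, the list comprehension or sort=False is probably the safer choice.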

NumPy solution with numpy.setdiff1d, which returns the sorted unique labels that are not in the blacklist:

import numpy as np

blacklisted = ['Phase', 'clusterIndex']

cols = np.setdiff1d(df.columns, blacklisted)
print (cols)
['28' '29' '30' '31' '32']

print (df[cols])
         28        29        30        31        32
0  0.007871  0.004631  0.000963  0.000092  0.000438
1  0.003459  0.000730  0.000332  0.000012  0.000433
2  0.003261  0.002412  0.000852  0.000042  0.000202
3  0.001358  0.000313  0.000611  0.000029  0.000596
4  0.001713  0.000203  0.000069  0.000038  0.000069
5  0.001656  0.000041  0.000048  0.000221  0.000045
6  0.001348  0.000023  0.000107  0.000316  0.000109
7  0.001544  0.000194  0.000138  0.000829  0.000138
8  0.000359  0.000469  0.000278  0.000290  0.000279
9  0.000397  0.000351  0.000232  0.000449  0.000230

Solution with DataFrame.drop, which returns a new dataframe without the listed columns:

blacklisted = ['Phase', 'clusterIndex']
print (df.drop(blacklisted, axis=1))
         28        29        30        31        32
0  0.007871  0.004631  0.000963  0.000092  0.000438
1  0.003459  0.000730  0.000332  0.000012  0.000433
2  0.003261  0.002412  0.000852  0.000042  0.000202
3  0.001358  0.000313  0.000611  0.000029  0.000596
4  0.001713  0.000203  0.000069  0.000038  0.000069
5  0.001656  0.000041  0.000048  0.000221  0.000045
6  0.001348  0.000023  0.000107  0.000316  0.000109
7  0.001544  0.000194  0.000138  0.000829  0.000138
8  0.000359  0.000469  0.000278  0.000290  0.000279
9  0.000397  0.000351  0.000232  0.000449  0.000230
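
A short sketch of two drop options that may be useful here: the columns= keyword, which is equivalent to passing axis=1, and errors='ignore', which skips blacklisted labels that are not present instead of raising. The frame and the 'notPresent' label below are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical two-row frame modelled on the question's data.
df = pd.DataFrame({
    '28': [0.007871, 0.003459],
    'Phase': ['D', 'D'],
    'clusterIndex': [0, 0],
})

# 'notPresent' is a made-up label that does not exist in df.
blacklisted = ['Phase', 'clusterIndex', 'notPresent']

# Without errors='ignore' the missing label would raise a KeyError.
features = df.drop(columns=blacklisted, errors='ignore')

print(features.columns.tolist())
```

drop does not modify df unless inplace=True is passed, so the label columns remain available in the original frame.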