W R - 7 months ago
Python Question

Python find common columns between Pandas and NumPy and extract Pandas column names

I have a Pandas DataFrame (A) and a NumPy array (B), shown below:

A =
M N C D E F
0.882367 0.207342 0.959867 0.332126 0.031896 0.055734
0.640045 0.411328 0.794058 0.090374 0.629194 0.219321
0.423837 0.304872 0.370467 0.651361 0.017515 0.252440
0.865555 0.692180 0.790405 0.334760 0.863329 0.759971
0.843106 0.261376 0.385936 0.289840 0.063487 0.164913
0.881428 0.257026 0.139775 0.988289 0.953948 0.870969
0.862520 0.446840 0.754147 0.461149 0.607048 0.760438
0.839595 0.486050 0.012903 0.716871 0.155938 0.370666
0.663964 0.675242 0.066046 0.263634 0.242453 0.963562
0.761090 0.501848 0.896033 0.710318 0.581952 0.392896

B =
[[ 0.20734235 0.33212606 0.03189633]
[ 0.41132799 0.09037417 0.6291936 ]
[ 0.30487215 0.65136057 0.01751531]
[ 0.69217974 0.3347596 0.86332925]
[ 0.26137593 0.28984018 0.06348744]
[ 0.25702646 0.98828911 0.95394809]
[ 0.44684032 0.46114941 0.60704784]
[ 0.4860496 0.71687057 0.15593771]
[ 0.67524202 0.26363435 0.24245288]
[ 0.50184753 0.71031779 0.58195151]]


The shape of A is (10, 6) and the shape of B is (10, 3). The columns in the NumPy array B are a subset of the columns in the Pandas DataFrame A. I do not know ahead of time how the subset is formed.

How can I find a list of the names of the columns in the Pandas DataFrame (A) that are also in the NumPy array (B)? The expected output should be the list ['N','D','E'].

EDIT: To create A and B as above, this code will work:

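import numpy as np
import pandas as pd

# 10x6 DataFrame of random values with columns M, N, C, D, E, F
A = pd.DataFrame(np.random.rand(10, 6), columns=list('MNCDEF'))
# Columns N, D, E (positions 1, 3, 4) as a plain NumPy array
y = A.iloc[:, [1, 3, 4]].values
B = y.view('float64')
B[:] = y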


Note: using this method, the numbers you get will not be identical to my data. However, for the purposes of this example it should be sufficient.

Answer

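You can use .duplicated(). First combine B and A column-wise with pd.concat() and transpose, so that each original column becomes a row. duplicated() then flags every row that repeats an earlier one on exact values; because B comes first, the flagged rows are the columns of A that also appear in B. Select them and transpose back (to get only the header names, take .columns of the result):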

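B = pd.DataFrame(B)
# Put B's columns ahead of A's, then transpose so each column becomes a row
df = pd.concat([B, A], axis=1).T
# Rows that repeat an earlier row are the columns of A that also appear in B
df[df.duplicated()].T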

          N         D         E
0  0.220376  0.275217  0.029644
1  0.751950  0.170162  0.996459
2  0.597565  0.440468  0.239183
3  0.775364  0.476966  0.056998
4  0.747164  0.654597  0.056527
5  0.825209  0.699910  0.374902
6  0.800624  0.837948  0.833588
7  0.420070  0.102400  0.204857
8  0.628885  0.345432  0.172771
9  0.002239  0.868492  0.225563
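
Since the question asks for a list of names rather than the matching data, the following is a minimal sketch of two ways to get it. The first reuses the duplicated() idea from the answer and reads the names off the transposed index. The second is not from the answer: it compares every column of A with every column of B using np.isclose, which may help if B was produced by a separate computation and only matches A up to rounding. The seed, the variable names names and names_tol, and the default np.isclose tolerances are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild example data in the same way as the question
# (the fixed seed is an assumption, added only for repeatability)
rng = np.random.default_rng(0)
A = pd.DataFrame(rng.random((10, 6)), columns=list('MNCDEF'))
B = A.iloc[:, [1, 3, 4]].to_numpy()

# Approach 1: exact match via duplicated(), keeping only the names.
# After the transpose, the original column labels live in df.index.
df = pd.concat([pd.DataFrame(B), A], axis=1).T
names = df.index[df.duplicated().to_numpy()].tolist()
print(names)      # ['N', 'D', 'E']

# Approach 2: tolerance-based match via broadcasting.
a = A.to_numpy()                                   # shape (10, 6)
# close[i, j] is True when column i of A matches column j of B in every row
close = np.isclose(a[:, :, None], B[:, None, :]).all(axis=0)   # shape (6, 3)
names_tol = A.columns[close.any(axis=1)].tolist()
print(names_tol)  # ['N', 'D', 'E']
```

Both versions return the names in the column order of A. Note that the duplicated() route would also flag a column of A that merely repeats another column of A, so the broadcasting version is the safer choice if A may contain duplicate columns.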