Cole Robertson - 19 days ago
Python Question

Python Pandas subset column x values based on unique values in column y

I have a dataframe ("df") equivalent to:

Cat Data
x 0.112
x 0.112
y 0.223
y 0.223
z 0.112
z 0.112


In other words, I have a category column and a data column. The data values do not vary within a category, but they may repeat between categories (e.g. categories 'x' and 'z' both have the value 0.112). This means that I need to select one data point from each category, rather than just subsetting on unique values of "Data".

The way I've done it is like this:

aLst = []
bLst = []
for i in df.index:
    if df.loc[i,'Cat'] not in aLst:
        aLst += [df.loc[i,'Cat']]
        bLst += [i]

new_series = pd.Series(df.loc[bLst,'Data'])


Then I can do whatever I want with it. But the problem is this just seems like a clunky, un-pythonic way of doing things. Any suggestions?

Answer

I think you need drop_duplicates:

#by column Cat
print (df.drop_duplicates(['Cat']))
  Cat   Data
0   x  0.112
2   y  0.223
4   z  0.112

Or:

#by columns Cat and Data
print (df.drop_duplicates(['Cat','Data']))
  Cat   Data
0   x  0.112
2   y  0.223
4   z  0.112
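
If you want the Series from the question rather than the whole frame, something like this should work. It is only a minimal sketch: the frame is rebuilt by hand from the sample in the question, and by_cat is just an illustrative name. The groupby variant is an alternative I'm adding, in case an index of category labels is more convenient than the original row labels.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed to match the real df)
df = pd.DataFrame({
    'Cat': ['x', 'x', 'y', 'y', 'z', 'z'],
    'Data': [0.112, 0.112, 0.223, 0.223, 0.112, 0.112],
})

# Keep the first row of each category, then select the Data column.
# The result keeps the original row labels (0, 2, 4), like the loop version.
new_series = df.drop_duplicates('Cat')['Data']

# Alternative: one value per category, indexed by the category label
by_cat = df.groupby('Cat')['Data'].first()

print(new_series)
print(by_cat)
```

Both drop_duplicates calls above return the same rows here only because Data never varies within a category. If it did, the ['Cat','Data'] version would keep one row per distinct (Cat, Data) pair, while the ['Cat'] version would still keep exactly one row per category.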