M.Schreiber M.Schreiber - 4 days ago 6
Python Question

How to access values in another column of a panda dataframe

Let's say I have a dataframe :

d = pd.DataFrame({'Salary' : pd.Series([1, 20000, 5, 1000, 3000],
index = ['Joe', 'Steph', 'Jared', 'Oliver', 'Gaby']),
'Sex' : pd.Series([0, 1, 0, 0, 1],
index=['Joe', 'Steph', 'Jared', 'Oliver', 'Gaby'])})

Salary Sex
Joe 1 0
Steph 20000 1
Jared 5 0
Oliver 7000 0
Gaby 3000 1


I write a function which takes in a column name as an argument, calculates the interquartile range of the values and returns the number of outliers based on that. If I would also like that function to return the number of females that have salaries that are outliers, how do I access the 'sex' column in order to check the corresponding 'sex' value of an outlier salary?

Here is my outlier function:

def iqr_outliers(num_df, column):
nan_count = 0
for value in column:
if value == 'NaN':
nan_count += 1
cleaned_column = [x for x in column if str(x) != "NaN"]
iqr = np.subtract(*np.percentile(cleaned_column, [75, 25]))
upper = np.percentile(cleaned_column, 75) + 1.5 * iqr
lower = np.percentile(cleaned_column, 25) - 1.5 * iqr
outliers = []
lows = 0
highs = 0
fem_outliers= 0
for value in cleaned_column:
if value < lower:
lows += 1
outliers.append(value)
elif value > upper:
highs += 1
outliers.append(value)
return ({"Number of low outliers": lows, "Number of high outliers": highs, "Number of NaNs": nan_count})


Somewhere in those if statements I want to see to check the value of 'sex' for the same row, but I really don't know how to access that.

Answer

Note, you can calculate inter-quartile range with percentile:

In [21]: d
Out[21]:
        Salary  Sex
Joe          1    0
Steph    20000    1
Jared        5    0
Oliver    1000    0
Gaby      3000    1

In [22]: iqr = d.Salary.quantile([.25,.75]).values

In [23]: iqr
Out[23]: array([    5.,  3000.])

Then, you can use elementwise boolean operations:

In [24]: (d.Salary < iqr[0]) | (d.Salary > iqr[1])
Out[24]:
Joe        True
Steph      True
Jared     False
Oliver    False
Gaby      False
Name: Salary, dtype: bool

And finally, you can use the result of that for selection on your entire dataframe:

In [26]: d[(d.Salary < iqr[0]) | (d.Salary > iqr[1])]
Out[26]:
       Salary  Sex
Joe         1    0
Steph   20000    1

Or something to that effect. I do not recall the specifics of Tukey outliers off hand. But it should be easily tractable with the approach illustrated above.

Ok, here is what you need using the Tukey test:

In [40]: IQR = iqr[1] - iqr[0]

In [41]: upper = 1.5*IQR+iqr[1]

In [42]: lower = iqr[0] - 1.5*IQR

In [43]: (d.Salary < lower) | (d.Salary > upper)
Out[43]:
Joe       False
Steph      True
Jared     False
Oliver    False
Gaby      False
Name: Salary, dtype: bool

In [44]: d[(d.Salary < lower) | (d.Salary > upper)]
Out[44]:
       Salary  Sex
Steph   20000    1

To get the number of females, you can just use sum

In [46]: d[(d.Salary < lower) | (d.Salary > upper)]['Sex'].sum()
Out[46]: 1