Ken Ken - 19 days ago 41
Python Question

Pandas dataframe searching by 3 columns and criteria

I have a pandas

dataframe
of census data from the united states. Columns include
STNAME
(state name)
CTYNAME
(county name) and many columns of different pieces of data collected in census reports by year.

I am looking for the top 3 most populous states, only including the top 3 most populous counties per states.

here is my current code which straightens up the original dataframe a bit and only includes what is relevant to the problem:

def answer():
census_df50 = census_df[census_df['SUMLEV'] == 50]
columns = ['STNAME', 'CTYNAME', 'CENSUS2010POP']
c = census_df50[columns]
return c
print(answer())


and here is an example of the dataframe printed in the terminal:

STNAME CTYNAME CENSUS2010POP
1 Alabama Autauga County 54571
2 Alabama Baldwin County 182265
3 Alabama Barbour County 27457
4 Alabama Bibb County 22915
5 Alabama Blount County 57322
6 Alabama Bullock County 10914
7 Alabama Butler County 20947
8 Alabama Calhoun County 118572


it is listed alphabetically by state then county, so this only shows the census data for alabama by county name, but there are over 3000 lines in the dataframe (one for each county, states have multiple entries)

My approach would be to write a function find the top 3 counties in
CTYNAME
, based on their
CENSUS2016POP
value (an integer representing population in 2010), for each state in column
STNAME
. Then have the function return the names as strings of the top 3 states by this criteria. Pretty lost on how to accomplish this, however.

i am pretty sure i should be trying to use a combination of these functions
.groupby()
,
.set_index()
and
.nlargest()
.

any help would be appreciated!

Answer

Since you didn't provide any sample data, here's some:

STNAME,CTYNAME,POPULATION
A,A1,100
A,A2,20
A,A3,30
A,A4,40
B,B1,10
B,B2,2
B,B3,30
B,B4,40
C,C1,100
C,C2,20
C,C3,300
C,C4,40
D,D1,10
D,D2,20
D,D3,30
D,D4,40

In [1]: df = pd.read_clipboard(sep=',')

Here's a way to achieve the desired result, maybe there's simpler but I couldn't reduce it further:

In [2]: df.ix[df['STNAME'].isin(df.groupby('STNAME')['POPULATION'].sum().nlargest(3).index)].groupby(['STNAME','CTYNAME']).sum()['POPULATION'].groupby(level=0, group_keys=False).nlargest(3)

To decompose:

# Find the first STNAME Groups
In [3]: largest_states = df.groupby('STNAME')['POPULATION'].sum().nlargest(3).index
        largest_states

Out[3]: Index(['C', 'A', 'D'], dtype='object', name='STNAME')

Filter the df on these:

In [4]: df2 = df.ix[df['STNAME'].isin(df.groupby('STNAME')['POPULATION'].sum().nlargest(3).index)]

Then find the first three within:

In [5]: df2.groupby(['STNAME','CTYNAME']).sum()['POPULATION'].groupby(level=0, group_keys=False).nlargest(3)

Out[5]:
STNAME  CTYNAME
A       A1         100
        A4          40
        A3          30
C       C3         300
        C1         100
        C4          40
D       D4          40
        D3          30
        D2          20
Name: POPULATION, dtype: int64

Update:

If you want it to be sorted by state population, do it in this order:

In[6]: df2 = df.groupby(['STNAME','CTYNAME']).sum()['POPULATION'].groupby(level=0, group_keys=False).nlargest(3)

       df2.ix[df.groupby('STNAME')['POPULATION'].sum().nlargest(3).index]

Out[6]:
STNAME  CTYNAME
C       C3         300
        C1         100
        C4          40
A       A1         100
        A4          40
        A3          30
D       D4          40
        D3          30
        D2          20
Name: POPULATION, dtype: int64