Donbeo Donbeo - 16 days ago 5
Python Question

pandas group by 3 variables but sum over 2 of them

I have a data frame :

In [59]: df.head()
Out[59]:

src-subnet dst-subnet bytes Start
0 ATL Internet 180 1477252800
1 ATL ATL 206 1477252800
2 ATL Private-10.0.0.0 17160 1477252800
3 SAT Internet 15768 1477252800
4 AWS-CloudFront ATL 1592 1477252800


for each value of start I want the sum of the bytes that are transmitted between each possible pair of
src-subnet
and
dst-subnet
.

I wrote the following:

g_start = df.groupby('Start')
g_start_subnet = g_start.apply(lambda x: x.groupby(['src-subnet', 'dst-subnet']).sum())


and I obtained:

```

In [60]: g_start_subnet
Out[60]:

bytes
Start src-subnet dst-subnet
1476896400 ATL ATL 5190647907
AWS 67442
AWS-CloudFront 3523866
AWS-EC2 10231320
Broadcast-255.255.255.255 550470
COL 25459487
Corvil-CNE 5130
DEN 68247258
DFB 49930518
DFW 109329393
HOU 43601191
IND 97444445
Internal Network 2295849
Internet 72555057
JES 64789642
LAF 30442304
LOU 27372802
MGO 307340
Multicast-224.0.0.0 25339730
NOL 49526023
Private-10.0.0.0 814280925
Private-172.16.0.0 9348122
Private-192.168.0.0 8907521
SAT 43590577
SCI 2473674
WMSApp 74085307
WMSDB 9604726
AWS ATL 78309
AWS-CloudFront ATL 10026686
HOU 9285
... ...
1477497600 Private-10.0.0.0 Internet 90581028
Link-169.254.0.0 276
Private-10.0.0.0 159103776
Private-192.168.0.0 60724723
WMSApp 1100
Private-100.64.0.0 ATL 1304731
Private-192.168.0.0 ATL 77031
Internet 926
Multicast-224.0.0.0 238220
Private-10.0.0.0 19320
Private-192.168.0.0 393982
SAT ATL 18278815
AWS-EC2 28380
Internet 42773386
Private-10.0.0.0 8030664
Private-172.16.0.0 230389
Private-192.168.0.0 5938773
SCI ATL 2055407
Internet 900
Multicast-224.0.0.0 256
Solace-Servers Broadcast-255.255.255.255 153438
WMSApp ATL 95861882
AWS-EC2 360
DFB 3066
DFW 9000
Internet 3330
Multicast-224.0.0.0 31292117
Private-10.0.0.0 1022
WMSDB ATL 8678562
Internet 2160


How can I know remove the multiindex and obtain a matrix like dataframe?

The final result should have the columns
Start, bytes, src-subnet, dst-subnet

Answer

I think you can use:

g_start_subnet = g_start_subnet.reset_index()

But maybe better is use:

print (df.groupby(['Start', 'src-subnet', 'dst-subnet'], as_index=False).sum())
Comments