Jithin Jithin - 1 month ago 8
Python Question

Merge data based on some specific columns, pandas

Assume I have two data frames, t1h and t2h. I want to merge them in such a way that, for a specific list of columns, rows whose values match are combined by adding up the contents of the remaining columns.

t1h

timestamp ip domain http_status \
0 1475740500.0 192.168.1.1 example.com 200
1 1475740500.0 192.168.1.1 example.com 200
2 1475740500.0 192.168.1.1 example.com 201
3 1475740500.0 192.168.1.1 example.com 201
4 1475740500.0 192.168.1.1 example.com 202

test b_count b_sum test_count test_sum data1 \
0 False 46 24742949931480 46 9.250 0
1 True 48 28151237474796 48 9.040 0
2 False 36 21702308613722 36 7.896 0
3 True 24 13112423049120 24 5.602 0
4 False 62 29948023487954 62 12.648 0

data2
0 0
1 0
2 0
3 0
4 0


t2h

timestamp ip domain http_status \
0 1475740500.0 192.168.1.1 example.com 200
1 1475740500.0 192.168.1.1 example.com 200
2 1475740500.0 192.168.1.1 example.com 201
3 1475740500.0 192.168.1.1 example.com 201
4 1475740500.0 192.168.1.1 example.com 202

test b_count b_sum test_count test_sum data1 \
0 False 44 22349502626302 44 9.410 0
1 True 32 16859760597754 32 5.988 0
2 False 46 23478212117794 46 8.972 0
3 True 36 20956236750016 36 7.124 0
4 False 54 35255787384306 54 9.898 0

data2
0 0
1 0
2 0
3 0
4 0


Merging based on the column list below gives me this output:

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

pd.merge(t1h, t2h, on=groupby_fields)

timestamp ip domain http_status \
0 1475740500.0 192.168.1.1 example.com 200
1 1475740500.0 192.168.1.1 example.com 200
2 1475740500.0 192.168.1.1 example.com 201
3 1475740500.0 192.168.1.1 example.com 201
4 1475740500.0 192.168.1.1 example.com 202

test b_count_x b_sum_x test_count_x test_sum_x \
0 False 46 24742949931480 46 9.250
1 True 48 28151237474796 48 9.040
2 False 36 21702308613722 36 7.896
3 True 24 13112423049120 24 5.602
4 False 62 29948023487954 62 12.648

data1_x data2_x b_count_y b_sum_y \
0 0 0 44 22349502626302
1 0 0 32 16859760597754
2 0 0 46 23478212117794
3 0 0 36 20956236750016
4 0 0 54 35255787384306

test_count_y test_sum_y data1_y data2_y
0 44 9.410 0 0
1 32 5.988 0 0
2 46 8.972 0 0
3 36 7.124 0 0
4 54 9.898 0 0


Instead, I want the output to look like this.

Note: apart from the columns in groupby_fields, every other column is of type int or float.

timestamp ip domain http_status \
0 1475740500.0 192.168.1.1 example.com 200
1 1475740500.0 192.168.1.1 example.com 200
2 1475740500.0 192.168.1.1 example.com 201
3 1475740500.0 192.168.1.1 example.com 201
4 1475740500.0 192.168.1.1 example.com 202

test b_count b_sum test_count test_sum \
0 False 90 47092452557782 90 18.660
1 True 80 45010998072550 80 15.028
2 False 82 45180520731516 82 16.868
3 True 60 34068659799136 60 12.726
4 False 116 65203810872260 116 22.546

data1 data2 \
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0


Please let me know how I can achieve this efficiently.

Answer

This is a good use case for the groupby.agg() function.

Assuming that t1h and t2h already exist and have the same column names:

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

df = t1h.append(t2h, ignore_index = True)

agg_dict = {'b_count':'sum',
        'b_sum':'sum',
        'test_count':'sum',
        'test_sum':'sum',
        'data1':'sum',
        'data2':'sum'}

df.groupby(groupby_fields).agg(agg_dict).reset_index()
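A self-contained sketch of the same idea, using made-up miniature versions of t1h and t2h (just the first two key groups from the question). Note that since every aggregation here is a sum, groupby(...).sum() works too, and pd.concat replaces DataFrame.append, which was removed in pandas 2.0:

```python
import pandas as pd

# Hypothetical miniature t1h/t2h mirroring the question's first two rows.
t1h = pd.DataFrame({
    'timestamp': [1475740500.0, 1475740500.0],
    'ip': ['192.168.1.1', '192.168.1.1'],
    'domain': ['example.com', 'example.com'],
    'http_status': [200, 200],
    'test': [False, True],
    'b_count': [46, 48],
    'b_sum': [24742949931480, 28151237474796],
    'test_count': [46, 48],
    'test_sum': [9.250, 9.040],
    'data1': [0, 0],
    'data2': [0, 0],
})
t2h = pd.DataFrame({
    'timestamp': [1475740500.0, 1475740500.0],
    'ip': ['192.168.1.1', '192.168.1.1'],
    'domain': ['example.com', 'example.com'],
    'http_status': [200, 200],
    'test': [False, True],
    'b_count': [44, 32],
    'b_sum': [22349502626302, 16859760597754],
    'test_count': [44, 32],
    'test_sum': [9.410, 5.988],
    'data1': [0, 0],
    'data2': [0, 0],
})

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

# Stack both frames, then collapse rows sharing the key columns by summing.
df = pd.concat([t1h, t2h], ignore_index=True)
result = df.groupby(groupby_fields, as_index=False).sum()

print(result[['test', 'b_count', 'b_sum', 'test_count', 'test_sum']])
```

This produces b_count 90 / 80 and test_sum 18.660 / 15.028 for the False / True groups, matching the desired output in the question.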