Provisional.Modulation Provisional.Modulation - 4 months ago 26
Python Question

Python - Remove duplicate pandas data frames from dictionary

I have a dictionary containing pandas data frames that have the same column names, and I'd like to remove duplicate data frames with identical values and row ids.

Let's assume this is my dictionary of data frames:

>>> dd[0]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[1]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[2]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[3]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[4]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[5]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00
>>> dd[6]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00


I want the result to look like this:

>>> dd[0]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[3]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[5]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00


This is my code leading up to this above-mentioned example:

# Load data as pandas data frame
data = pd.read_csv("website.txt", names = ["Time", "Origin", `"Destination"])`
data["Time"] = pd.to_datetime(data["Time"], infer_datetime_format=True)
# Reverse data frame by index to loop backwards
data = data.reindex(index=df.index[::-1])
dd = {}
for i, e in reverse.iterrows():
dd[i] = data[ (data['Time'] > e['Time']-pd.Timedelta('4 hours')) & (data['Time'] < e['Time'] + pd.Timedelta('4 hours'))]


Original Text:

{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}

Answer

One liner

{k: v.unstack() for k, v in pd.DataFrame({k: v.stack() for k, v in dd.iteritems()}).T.drop_duplicates().iterrows()}

Explained version

# iterate through key, value pairs of dictionary,
# stacking each dataframe into a series so that we
# can pass the resulting dataframe into the pd.DataFrame constructor.
df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
# Each column is now one key, value pair from the original dictionary
# Transpose and drop duplicates
df2 = df1.T.drop_duplicates()
# reverse the original stacking and convert back to dictionary
# we could have used df2.T.iteritems() but df2.iterrows() took
# one fewer operations and fewer characters to type.
dd_ = {k: v.unstack() for k, v in df2.iterrows()}

for k, v in dd_.iteritems():
    print 'key {}:'.format(k)
    print v
    print '-' * 10

key 0:
   a  b
0  1  2
1  3  4
----------
key 2:
   a  b
0  2  3
1  4  5
----------

Setup to get same results as me (copy and paste this)

from StringIO import StringIO
import pandas as pd

text0 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2             Boston           Los Angeles  2016-03-28 06:00:00"""


text1 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2             Boston           Los Angeles  2016-03-28 06:00:00"""

text2 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2           Los Angeles             Boston  2016-03-28 06:00:00"""

dd = {}

dd[0] = pd.read_csv(StringIO(text0), sep='\s{2,}', index_col=0, engine='python')
dd[0].Time = pd.to_datetime(dd[0].Time)

dd[1] = pd.read_csv(StringIO(text1), sep='\s{2,}', index_col=0, engine='python')
dd[1].Time = pd.to_datetime(dd[1].Time)

dd[2] = pd.read_csv(StringIO(text2), sep='\s{2,}', index_col=0, engine='python')
dd[2].Time = pd.to_datetime(dd[2].Time)

# Then run solutions above:

df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
df2 = df1.T.drop_duplicates()
dd_ = {k: v.unstack() for k, v in df2.iterrows()}

for k, v in dd_.iteritems():
    print 'key {}:'.format(k)
    print v
    print '-' * 10

You should get this:

key 0:
     Origin  Destination                 Time
0  New York       Boston  2016-03-28 02:00:00
1  New York  Los Angeles  2016-03-28 04:00:00
2    Boston  Los Angeles  2016-03-28 06:00:00
----------
key 2:
        Origin  Destination                 Time
0     New York       Boston  2016-03-28 02:00:00
1     New York  Los Angeles  2016-03-28 04:00:00
2  Los Angeles       Boston  2016-03-28 06:00:00
----------

Version

import sys
import pandas as pd
print sys.version
print pd.__version__

2.7.11 |Anaconda custom (x86_64)| (default, Dec  6 2015, 18:57:58) 
[GCC 4.2.1 (Apple Inc. build 5577)]
0.18.1