JaWi JaWi - 9 months ago 49
Python Question

Two Dataframes, one with more columns than the other -> Subtract and Combine

Ok, I know the title may be a little bit confusing, but I will try to explain this in detail:

I use Python 3.5.2:

I got two .csv files that I read via pandas and convert into two separate dataframes. The first dataframe (coming from XYZ.csv) looks like this:

ip community OL123
. IK753

The second (export.csv) just has the "ip" column.

Now what I want to do:

I want to compare the two dataframes and as a result get a third dataframe (or list) that contains all ip-addresses that are in the first dataframe but not in the other WITH their correlating community. So far, I managed to compare the two and get a proper result, as long as the second dataframe also contains the communities. I manually inserted those communites into the second export.csv, unfortunately I cannot automate this and that is why I need this to work without the second dataframe containing the communities.

This is my code:

def compare_csvs():
timestamp = time.strftime("%Y-%m-%d")

# Reads XYZ.csv and creates list that contains all ip addresses in integer format.
A = pd.read_csv("XYZ.csv", index_col=False, header=0)
ips1 = A.ip.tolist()
comu1 = A.ro_community.tolist()
AIP = []
for element1 in ips1:
IPACOM1 = zip(AIP,comu1)

# Reads export.csv and creates list that contains all ip addresses in integer format.
B = pd.read_csv("export" + timestamp + ".csv", index_col=False, header=0)
ips2 = B.ip.tolist()
comu2 = B.ro_community.tolist()
BIP = []
for element2 in ips2:
IPACOM2 = zip(BIP,comu2)

# Creates a set that contains all ip addresses (in integer format) that exist inside the XYZ.csv but not the export.csv.
DeltaInt = OrderedSet(IPACOM1)-OrderedSet(IPACOM2)
List = list(DeltaInt)
UnzippedIP = []
UnzippedCommunity = []
UnzippedIP, UnzippedCommunity = zip(*List)

# Puts all the elements of the DeltaInt set inside a list and also changes the integers back to readable IPv4-addresses.
DeltaIP = []
for element3 in UnzippedIP:

IPandCommunity = zip(DeltaIP,UnzippedCommunity)

Now all I need is something that can compare the two lists I created and keep the "community" with the "ip" it is assigned to. I tried a whole lot but I can't seem to get anything to work. Maybe I am just having a problem with the logic here, all help is appreciated!

Also, excuse the code mess, I just threw all that together and will clean it up once the code actually works.

Answer Source

Here is some dummy data to play with:

This is df:

ip              community        OL123        ACLSH        OKUAJ1     IK753

df = pd.read_clipboard()

This is export.csv:

s_export = pd.Series(s_export = pd.Series(name='ip', data=['','', ''])


Name: ip, dtype: object

To select the ones that aren't in export, we can simply use boolean indexing using isin():

# ~ means 'not', so here that's "find df.ip that is NOT in s_export"
# Store result in a dataframe
df_exclude = df[~df.ip.isin(s_export)]

         ip community
0     OL123
2    OKUAJ1