Frame Frame - 2 years ago 143
Python Question

Get rows that are not in set (like SQL Except)

What I'm trying to do is delete several rows of an Excel-Files (with pandas) and then save the File without those rows to .xlsx (with pyexcelerate module).

I'm aware that I can remove rows of a data frame by dropping them (I already got that to work). But I have read in several posts that when there are many (in my case > 5000) rows that should be deleted it's much faster to just get the indexes of the "to delete" rows from the data frame and then slice the data frame (just as a SQL Except statement for example would do).
Unfortunately I can't get it to work, even though I've tried several methods.

Here are my "source posts":

Slice Pandas dataframe by labels that are not in a list - Answer from User ASGM

How to drop a list of rows from Pandas dataframe? - Answer from User Dennis Golomazov

And here is a part of the function, that should delete the rows and save the created file:

for index, cell in enumerate(wb_in[header_xlsx]):
if str(cell) in delete_set:
set_to_delete.append(index)
print str(cell) + " deleted from set: " + str(len(set_to_delete))
wb_out = Workbook()
data_out = wb_in.loc[set(wb_in.index) - set(set_to_delete)]
ws_out = wb_out.new_sheet('Main', data=data_out)
wb_out.save(file_path + filename + "_2.xlsx")


Here is an example of the data frame:

sku product_group name \
0 ABAAb00610-23.0 ABA1 Anti-Involucrin [SY5]
1 ABAAb00610-10.0 ABA1 Anti-Involucrin [SY5]
2 ABAAb00610-1.1 ABA1 Anti-EpCAM [AUA1]
3 ABAAb00609-23.0 ABA1 Anti-EpCAM [AUA1]
4 ABAAb00609-10.0 ABA1 Anti-EpCAM [AUA1]
5 ABAAb00609-1.1 ABA1 Anti-EpCAM [AUA1]
6 ABAAb00608-23.0 ABA1 Anti-EpCAM [AUA1]
7 ABAAb00608-10.0 ABA1 Anti-EpCAM [AUA1]
8 ABAAb00608-3.3 ABA1 Anti-EpCAM [AUA1]
9 ABAAb00608-3.0 ABA1 Anti-EpCAM [AUA1]


Delete_set is a set that contains only skus (e.g.: ABAAb00608-3.3 or ABAAb00609-1.1).

Btw: I have tried many solution suggestions!

Thanks in advance!

Answer Source

Use pd.Series.isin:

df = df[~df.sku.isin(delete_set)]

print(df)
               sku product_group                   name
0  ABAAb00610-23.0          ABA1  Anti-Involucrin [SY5]
1  ABAAb00610-10.0          ABA1  Anti-Involucrin [SY5]
2   ABAAb00610-1.1          ABA1      Anti-EpCAM [AUA1]
3  ABAAb00609-23.0          ABA1      Anti-EpCAM [AUA1]
4  ABAAb00609-10.0          ABA1      Anti-EpCAM [AUA1]
5   ABAAb00609-1.1          ABA1      Anti-EpCAM [AUA1]
6  ABAAb00608-23.0          ABA1      Anti-EpCAM [AUA1]
7  ABAAb00608-10.0          ABA1      Anti-EpCAM [AUA1]
8   ABAAb00608-3.3          ABA1      Anti-EpCAM [AUA1]
9   ABAAb00608-3.0          ABA1      Anti-EpCAM [AUA1]

print(delete_set)
('ABAAb00608-3.3', 'ABAAb00609-1.1')

m = ~df.sku.isin(delete_set)
print(m) 
0     True
1     True
2     True
3     True
4     True
5    False
6     True
7     True
8    False
9     True
Name: sku, dtype: bool

print(df[m])
               sku product_group                   name
0  ABAAb00610-23.0          ABA1  Anti-Involucrin [SY5]
1  ABAAb00610-10.0          ABA1  Anti-Involucrin [SY5]
2   ABAAb00610-1.1          ABA1      Anti-EpCAM [AUA1]
3  ABAAb00609-23.0          ABA1      Anti-EpCAM [AUA1]
4  ABAAb00609-10.0          ABA1      Anti-EpCAM [AUA1]
6  ABAAb00608-23.0          ABA1      Anti-EpCAM [AUA1]
7  ABAAb00608-10.0          ABA1      Anti-EpCAM [AUA1]
9   ABAAb00608-3.0          ABA1      Anti-EpCAM [AUA1]
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download