What I'm trying to do is delete several rows of an Excel-Files (with pandas) and then save the File without those rows to .xlsx (with pyexcelerate module).
I'm aware that I can remove rows of a data frame by dropping them (I already got that to work). But I have read in several posts that when there are many (in my case > 5000) rows that should be deleted it's much faster to just get the indexes of the "to delete" rows from the data frame and then slice the data frame (just as a SQL Except statement for example would do).
Unfortunately I can't get it to work, even though I've tried several methods.
Here are my "source posts":
Slice Pandas dataframe by labels that are not in a list - Answer from User ASGM
How to drop a list of rows from Pandas dataframe? - Answer from User Dennis Golomazov
And here is a part of the function, that should delete the rows and save the created file:
for index, cell in enumerate(wb_in[header_xlsx]):
if str(cell) in delete_set:
set_to_delete.append(index)
print str(cell) + " deleted from set: " + str(len(set_to_delete))
wb_out = Workbook()
data_out = wb_in.loc[set(wb_in.index) - set(set_to_delete)]
ws_out = wb_out.new_sheet('Main', data=data_out)
wb_out.save(file_path + filename + "_2.xlsx")
sku product_group name \
0 ABAAb00610-23.0 ABA1 Anti-Involucrin [SY5]
1 ABAAb00610-10.0 ABA1 Anti-Involucrin [SY5]
2 ABAAb00610-1.1 ABA1 Anti-EpCAM [AUA1]
3 ABAAb00609-23.0 ABA1 Anti-EpCAM [AUA1]
4 ABAAb00609-10.0 ABA1 Anti-EpCAM [AUA1]
5 ABAAb00609-1.1 ABA1 Anti-EpCAM [AUA1]
6 ABAAb00608-23.0 ABA1 Anti-EpCAM [AUA1]
7 ABAAb00608-10.0 ABA1 Anti-EpCAM [AUA1]
8 ABAAb00608-3.3 ABA1 Anti-EpCAM [AUA1]
9 ABAAb00608-3.0 ABA1 Anti-EpCAM [AUA1]
Use pd.Series.isin
:
df = df[~df.sku.isin(delete_set)]
print(df)
sku product_group name
0 ABAAb00610-23.0 ABA1 Anti-Involucrin [SY5]
1 ABAAb00610-10.0 ABA1 Anti-Involucrin [SY5]
2 ABAAb00610-1.1 ABA1 Anti-EpCAM [AUA1]
3 ABAAb00609-23.0 ABA1 Anti-EpCAM [AUA1]
4 ABAAb00609-10.0 ABA1 Anti-EpCAM [AUA1]
5 ABAAb00609-1.1 ABA1 Anti-EpCAM [AUA1]
6 ABAAb00608-23.0 ABA1 Anti-EpCAM [AUA1]
7 ABAAb00608-10.0 ABA1 Anti-EpCAM [AUA1]
8 ABAAb00608-3.3 ABA1 Anti-EpCAM [AUA1]
9 ABAAb00608-3.0 ABA1 Anti-EpCAM [AUA1]
print(delete_set)
('ABAAb00608-3.3', 'ABAAb00609-1.1')
m = ~df.sku.isin(delete_set)
print(m)
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 False
9 True
Name: sku, dtype: bool
print(df[m])
sku product_group name
0 ABAAb00610-23.0 ABA1 Anti-Involucrin [SY5]
1 ABAAb00610-10.0 ABA1 Anti-Involucrin [SY5]
2 ABAAb00610-1.1 ABA1 Anti-EpCAM [AUA1]
3 ABAAb00609-23.0 ABA1 Anti-EpCAM [AUA1]
4 ABAAb00609-10.0 ABA1 Anti-EpCAM [AUA1]
6 ABAAb00608-23.0 ABA1 Anti-EpCAM [AUA1]
7 ABAAb00608-10.0 ABA1 Anti-EpCAM [AUA1]
9 ABAAb00608-3.0 ABA1 Anti-EpCAM [AUA1]