kbball kbball -4 years ago 147
Python Question

Optimize data conversion program to avoid memory error

I am very new to Python and have idea how to optimize this program in order to avoid a memory error.

I am trying to read data from two workbooks: raw_data and mapping. I want to convert the raw_data into a new spreadsheet using the mapping document to transform the data. So I load in the workbooks, create data dictionaries out of the mapping data and start converting. However, I run into a memory error.

Is there any way to optimize the code below to avoid this error?

import openpyxl
from openpyxl.utils import get_column_letter

mapping = openpyxl.load_workbook(r'C:...\mapping.xlsx') #load mapping doc
wb = openpyxl.load_workbook(r'C:...\raw_data.xlsx') #load raw data

sheet = wb.active #look at the active sheet in the raw data file
user_map_raw = mapping.get_sheet_by_name('User ID Mapping') #for user ids
item_map_raw = mapping.get_sheet_by_name('Item ID Mapping') #for item ids
...other mappings here

def load(sheet):

user_dict = {}
print "creating user dictionary..."
for row in range(1, user_map_raw.max_row+1):
old_name = user_map_raw['A' + str(row)].value #old user name
new_name = user_map_raw['B' + str(row)].value #new user name
user_dict[old_name] = new_name #old name is key for the new name

item_dict = {}
print "creating item id dictionary..."
for row in range(1, item_map_raw.max_row+1):
old_item = item_map_raw['A' + str(row)].value #old item id
new_item = item_map_raw['B' + str(row)].value #new item id
item_dict[old_item] = new_item #old item id is key for new item id

raw = [] #empty list to store data before writing to new file
for row in range(2, sheet.max_row+1): #loop thru raw data and map
print "loading row %s" % row
user_ID = user_dict[sheet['A' + str(row)].value]
item_type = sheet['B' + str(row)].value
item_ID = item_dict[sheet['C' + str(row)].value]
...other transformations here
add = [user_ID, item_type, item_ID, ...]
raw.append(add) #add transformed data to list

new = openpyxl.Workbook() #create new workbook
output = new.active #select the active sheet
for i in range(len(raw)): #loop through transformed data list
"print writing row %s" %i
for j in range(len(raw[i])): #write to new sheet
output[get_column_letter(j+1) + str(i+1)] = raw[i][j]


Answer Source

The primary optimization would be to avoid loading the entire primary workbook into memory as well as avoiding storing the whole result in memory before writing. There are write_only and read_only modes for openpyxl workbooks that can save a lot of memory at runtime by implementing optimized representations with reduced features and support for iterators. Since you are writing a new file instead of editing in place these modes can make a big difference.

wb = openpyxl.load_workbook(r'C:...\raw_data.xlsx', read_only=True) 
sheet = wb.active

# mapping related code...

from openpyxl.writer.write_only import WriteOnlyCell
wb = openpyxl.Workbook(write_only=True) #create new workbook
ws = new.create_sheet()

for row in sheet.iter_rows(row_offset=1):
    for i, cell in enumerate(row):
        if i = 0: #A
            user_ID = WriteOnlyCell(ws, user_dict[cell.value])
        elif i = 1: #B
            item_type = WriteOnlyCell(ws, cell.value)
        elif i = 2: #C
            item_ID = WriteOnlyCell(ws, item_dict[cell.value])
    ws.append([user_ID, item_type, item_ID])


Have to iterate over the cells since it's a generator so can't use subscript. Seems clunky, but I'm tired.

For a small savings, if you are using Python 2.x, whenever you use the range function a list is created in memory that's as big as your range, which if you have a very large spreadsheet, could fill up your RAM. In your case you can probably use xrange which generates each iteration dynamically to save some memory.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download