egodial egodial - 3 months ago 20
Python Question

Python Find highest row in a given column

I'm quite new in stackoverflow and quite recently learnt some basic Python. This is the first time I'm using openpyxl. Before I used xlrd and xlsxwriter and I did manage to make some useful programs. But right now I need a .xlsx reader&writer.

There is a File which I need to read and edit with data already stored in the code. Let's suppose the .xlsx has five columns with data: A, B, C, D, E. In column A, I've over 1000 rows with data. On Column D, I've 150 rows with data.

Basically, I want the program to find the last row with data on a given column (say D). Then, write the stored variable

data
in the next available row (last row + 1) in column D.

The problem is that I can't use
ws.get_highest_row()
because it returns the row 1000 on column A.

Basically, so far this is all I've got:

data = 'xxx'
from openpyxl import load_workbook
wb = load_workbook('book.xlsx', use_iterators=True)
ws = wb.get_sheet_by_name('Sheet1')
last_row = ws.get_highest_row()


Obviously this doesn't work at all.
last_row
returns 1000.

Answer

Here's how to do it using Pandas.

It's easy to get the last non-null row in Pandas using last_valid_index.

There might be a better way to write the resulting DataFrame to your xlsx file but, according to the docs, this very dumb way is actually how it's done in openpyxl.

Let's say you're starting with this simple worksheet:

Original worksheet

Let's say we want to put xxx into column C:

import openpyxl as xl
import pandas as pd

wb = xl.load_workbook('deleteme.xlsx')
ws = wb.get_sheet_by_name('Sheet1')
df = pd.read_excel('deleteme.xlsx')

def replace_first_null(df, col_name, value):
    """
    Replace the first null value in DataFrame df.`col_name`
    with `value`.
    """
    return_df = df.copy()
    idx = list(df.index)
    last_valid = df[col_name].last_valid_index()
    last_valid_row_number = idx.index(last_valid)
    # This next line has mixed number and string indexing
    # but it should be ok, since df is coming from an
    # Excel sheet and should have a consecutive index
    return_df.loc[last_valid_row_number + 1, col_name] = value
    return return_df

def write_df_to_worksheet(ws, df):
    """
    Write the values in df to the worksheet ws in place
    """
    for i, col in enumerate(replaced):
        for j, val in enumerate(replaced[col]):
            if not pd.isnull(val):
                # Python is zero indexed, so add one
                # (plus an extra one to take account
                #  of the header row!)
                ws.cell(row=j + 2, column=i + 1).value = val

# Here's the actual replacing happening
replaced = replace_first_null(df, 'C', 'xxx')
write_df_to_worksheet(ws, df)
wb.save('changed.xlsx')

which results in:

Edited Excel file