Niche Content for No One

Notes from migrating from XLRD to OpenPyXL

Background

I recently needed to migrate a service from using XLRD to OpenPyXL. The service was old and needed to support .xlsx file types, while XLRD has since dropped support for non-.xls file types. There were a few issues I ran into while trying to use OpenPyXL as a drop-in replacement, so I wanted to put them in one place for reference.

Implementation

The “standard” way to use OpenPyXL is to use pandas. Due to organizational requirements, I could not use pandas and instead had to use OpenPyXL directly.

Shared Context

Each of these issues was encountered with a workbook imported with the following code:

import openpyxl

book = openpyxl.load_workbook(filename="path/to/file")
sheet = book.worksheets[0]

Issues 1 & 2: 1-indexing and Value Access

The worksheet itself is 1-indexed along its rows. To access the first row of a sheet, you would use

row = sheet[1]

Further, the cells in each row are cell objects instead of raw values. To get the raw values,

row_values = [cell.value for cell in row]

Another difference between XLRD and OpenPyXL is the default value in cells. For XLRD, it’s the empty string. For OpenPyXL, it’s None. For my use case, I had to modify the list comprehension to be

row_values = [cell.value for cell if cell.value is not None else "" in row]

Issue 3: Performance

The service I had inherited looped over the sheet using the index to access each row. For cell-by-cell validation across ~10,000 rows, the execution time went from ~2 minutes to timing out after finishing ~15% in 15 minutes. I did not extrapolate to 60 minutes because the execution time grew for each row, growing by ~10x by ~15%. I did not profile it much further, but this had the hallmarks of file access overhead.

After a few hours trying to find the cause of the problem, the solution was to refactor the code to iterate over the sheet object instead of accessing the rows by index.

# Bad
for i in range(1, 10000):
    perform_validation(sheet[i])

# Good
for row in sheet:
    perform_validation(row)

This still resulted in a performance penalty over XLRD, but execution time only rose to 4 minutes as opposed to multiple hours.