Notes from migrating from XLRD to OpenPyXL
Background
I recently needed to migrate a service from using XLRD
to OpenPyXL. The service was old and needed to support .xlsx
file types,
while XLRD has since dropped support for non-.xls
file types. There were
a few issues I ran into while trying to use OpenPyXL as a drop-in replacement, so I wanted to put them in one place for
reference.
Implementation
The “standard” way to use OpenPyXL is to use pandas. Due to organizational requirements, I could not use pandas and instead had to use OpenPyXL directly.
Shared Context
Each of these issues was encountered with a workbook imported with the following code:
import openpyxl
book = openpyxl.load_workbook(filename="path/to/file")
sheet = book.worksheets[0]
Issues 1 & 2: 1-indexing and Value Access
The worksheet itself is 1-indexed along its rows. To access the first row of a sheet, you would use
row = sheet[1]
Further, the cells in each row are cell objects instead of raw values. To get the raw values,
row_values = [cell.value for cell in row]
Another difference between XLRD and OpenPyXL is the default value in cells. For XLRD, it’s the empty string. For
OpenPyXL, it’s None
. For my use case, I had to modify the list comprehension to be
row_values = [cell.value for cell if cell.value is not None else "" in row]
Issue 3: Performance
The service I had inherited looped over the sheet using the index to access each row. For cell-by-cell validation across ~10,000 rows, the execution time went from ~2 minutes to timing out after finishing ~15% in 15 minutes. I did not extrapolate to 60 minutes because the execution time grew for each row, growing by ~10x by ~15%. I did not profile it much further, but this had the hallmarks of file access overhead.
After a few hours trying to find the cause of the problem, the solution was to refactor the code to iterate over the sheet object instead of accessing the rows by index.
# Bad
for i in range(1, 10000):
perform_validation(sheet[i])
# Good
for row in sheet:
perform_validation(row)
This still resulted in a performance penalty over XLRD, but execution time only rose to 4 minutes as opposed to multiple hours.