Reproducible segfault reading python-written xlsx #928

Open
earthshrink opened this issue Dec 19, 2024 · 3 comments

@earthshrink

Synopsis

sc-im segfaults on .xlsx files generated or updated with the openpyxl Python library. The generated .xlsx files are valid in the sense that Gnumeric, LibreOffice and Google Sheets all read them without issue.

Steps to reproduce

Python test script

#!/usr/bin/env python
"""sc-im xlsx segfault test"""

import pandas as pd

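# range(1, 2) yields a single index label (1), so the frame starts with one row.
foo = pd.DataFrame(index=range(1, 2), columns=['X', 'Y'])
foo['X'] = foo.index

foo.at[1, 'Y'] = 0
# Label 2 does not exist yet, so this assignment appends a new row;
# its 'X' value is left as NaN.
foo.at[2, 'Y'] = 1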

print(foo)

with pd.ExcelWriter("foo.xlsx", engine="openpyxl") as writer:
    foo.to_excel(writer, sheet_name="Crashme", index=False)

# end

Execution

Save and run the Python test script. Verify that the generated file foo.xlsx loads in Gnumeric, LibreOffice and Google Sheets.
Run: sc-im --quit_afterload --nocurses foo.xlsx. This segfaults.
When run under gdb (or with a C stacktrace library enabled via -ldynamic in the Makefile), the following stack trace is produced:

0 /usr/src/sc-im/src/sc-im(stacktrace+0x33) [0x55f13f1439d8]
1 /usr/lib64/libc.so.6(+0x3c3f0) [0x7f8787dca3f0]
2 /usr/src/sc-im/src/sc-im(get_sheet_data+0x5f4) [0x55f13f15737b]
3 /usr/src/sc-im/src/sc-im(open_xlsx+0x87b) [0x55f13f158644]
4 /usr/src/sc-im/src/sc-im(readfile+0x4ae) [0x55f13f121207]
5 /usr/src/sc-im/src/sc-im(load_tbl+0x135) [0x55f13f126984]
6 /usr/src/sc-im/src/sc-im(load_file+0x1da) [0x55f13f126814]
7 /usr/src/sc-im/src/sc-im(main+0x2fd) [0x55f13f142cba]
8 /usr/lib64/libc.so.6(+0x26470) [0x7f8787db4470]
9 /usr/lib64/libc.so.6(__libc_start_main+0x89) [0x7f8787db4529]
10 /usr/src/sc-im/src/sc-im(_start+0x25) [0x55f13f119d15]
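The crash check in the Execution steps could be scripted roughly as follows. This is a minimal sketch, not part of the original report: the helper names `run_scim` and `died_from_segfault` are hypothetical, and it assumes a POSIX system with `sc-im` on the PATH. The command-line flags are the ones quoted above. If a build installs its own SIGSEGV handler (the `stacktrace` frame at the top of the trace suggests that may be the case for stacktrace-enabled builds), the exit status may differ from what this check expects.

```python
import signal
import subprocess


def run_scim(path, binary="sc-im"):
    """Run sc-im non-interactively on `path` and return its return code.

    `binary` is an assumption: adjust it if sc-im is not on the PATH.
    """
    proc = subprocess.run(
        [binary, "--quit_afterload", "--nocurses", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


def died_from_segfault(returncode):
    """True if a subprocess return code means the child was killed by SIGSEGV.

    On POSIX, subprocess reports death by signal N as the return code -N.
    """
    return returncode == -signal.SIGSEGV
```

With these helpers, `died_from_segfault(run_scim("foo.xlsx"))` would be expected to be true for the file from the test script and false for the ssconvert output, assuming the default signal behaviour.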

Workaround

Gnumeric's ssconvert appears to rewrite the generated file in a form that sc-im can read without segfaulting.
Run: ssconvert foo.xlsx bar.xlsx.
With this test file, ssconvert emits a warning, but bar.xlsx opens in sc-im without a segfault. I do not get this warning with some larger spreadsheets I was working with, but the behaviour is the same: a segfault on the generated or updated file, and no segfault after ssconvert.

@rusq

rusq commented Dec 28, 2024

I did a little digging out of curiosity, and it looks like it happens here:

if (get_conf_int("xlsx_readformulas") &&
        // dont handle shared formulas right now
        ! (xmlHasProp(child_node->xmlChildrenNode, (xmlChar *) "t") &&
        ! strcmp((shared = (char *) xmlGetProp(child_node->xmlChildrenNode, (xmlChar *) "t")), "shared"))
   )

The problem is that the xlsx generated by pandas lacks the "t=" attribute. If the file is opened and saved by Excel and the two XMLs are compared, the "t=" attribute appears.

So, on a pandas-generated xlsx, xmlGetProp returns NULL and strcmp crashes on it.

I can't compile on my machine to verify tho.
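One way to compare the two XMLs without unzipping by hand is to list, for each cell, the `t` attribute of its first child element. The sketch below is a rough Python analogue of the `child_node->xmlChildrenNode` lookup in the snippet above, not sc-im's actual logic. The function names are hypothetical, and the member path `xl/worksheets/sheet1.xml` is an assumption about where the first worksheet is stored. Note that ElementTree only sees element children, whereas libxml2's child list can also contain text nodes.

```python
import xml.etree.ElementTree as ET
import zipfile

# Main SpreadsheetML namespace used by worksheet parts.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def read_sheet(path, member="xl/worksheets/sheet1.xml"):
    """Return the raw XML bytes of one worksheet part (member path assumed)."""
    with zipfile.ZipFile(path) as zf:
        return zf.read(member)


def first_child_t(sheet_xml):
    """List (cell_ref, first_child_tag, first_child_t_attr) for each <c>.

    Cells without any child element are skipped. The third item is None
    when the first child carries no "t" attribute.
    """
    root = ET.fromstring(sheet_xml)
    rows = []
    for cell in root.iter("{%s}c" % MAIN_NS):
        children = list(cell)
        if not children:
            continue
        first = children[0]
        tag = first.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        rows.append((cell.get("r"), tag, first.get("t")))
    return rows
```

Running `first_child_t(read_sheet("foo.xlsx"))` on the pandas-written file and on a copy re-saved by another program would show where the `t` attribute differs.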

@earthshrink
Author

@rusq - thanks! Walking the test case through gdb uncovered a null child node, albeit at an earlier line in the code. The fix also seems to work on the larger production file that led me to this issue.
PR #931.
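To see which cells might lead to a null child node, one could list the `<c>` elements that have no child elements at all. This is only a sketch under a stated assumption: that the null child node found in gdb corresponds to such an empty `<c>` element, which this thread does not confirm. The function name `childless_cells` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by worksheet parts.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def childless_cells(sheet_xml):
    """Return the references (e.g. "A3") of <c> elements with no children."""
    root = ET.fromstring(sheet_xml)
    return [
        cell.get("r")
        for cell in root.iter("{%s}c" % MAIN_NS)
        if len(cell) == 0  # len() of an Element counts its child elements
    ]
```

The input is the raw XML of one worksheet part, for example the bytes read from the xlsx archive with `zipfile`.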

@andmarti1424
Owner

Thank you @earthshrink. Will check this out.
