bug/get_delimiter() fails when file read stops in the middle of a line #2643

Closed
Sammi-Smith opened this issue Mar 13, 2024 · 3 comments · Fixed by #2998
Labels
bug Something isn't working

Comments

@Sammi-Smith

Describe the bug
When ingesting CSV files, partitioning sometimes fails with "Error("Could not determine delimiter")". This happens only for some CSV files; others work as expected. The error is raised inside the get_delimiter() function.

To Reproduce

PMID35839768_Correlation_matrix.csv

Code snippet, using the above attached file:

from unstructured.partition.csv import get_delimiter
get_delimiter(file_path = "PMID35839768_Correlation_matrix.csv")

Output:

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
Cell In[9], line 1
----> 1 get_delimiter(file_path = "PMID35839768_Correlation_matrix.csv")

File /usr/local/lib/python3.11/site-packages/unstructured/partition/csv.py:124, in get_delimiter(file_path, file)
    121     with open(file_path) as f:
    122         data = f.read(num_bytes)
--> 124 return sniffer.sniff(data, delimiters=[",", ";"]).delimiter

File /usr/local/lib/python3.11/csv.py:187, in Sniffer.sniff(self, sample, delimiters)
    183     delimiter, skipinitialspace = self._guess_delimiter(sample,
    184                                                         delimiters)
    186 if not delimiter:
--> 187     raise Error("Could not determine delimiter")
    189 class dialect(Dialect):
    190     _name = "sniffed"

Error: Could not determine delimiter

Expected behavior
The function returns the delimiter, which is ',' for this file.

Screenshots
Not applicable.

Environment Info
Python 3.11.8
unstructured 0.12.5

Additional context
After looking into this issue for a bit, I found this similar issue for another Python module: Textualize/rich-cli#54 (comment)

Further down that same thread, another comment (Textualize/rich-cli#54 (comment)) points out that the example in the official Python csv.Sniffer docs has the same issue. That example may be the source of this bug, since the implementation in unstructured is nearly identical.

Here is a code snippet I used to fix the issue by reading whole lines instead of truncating the read mid-line. The same change should be applied to both .read() calls in the get_delimiter() function: each should be changed to readlines().

import csv

sniffer = csv.Sniffer()
max_bytes = 8192
with open("PMID35839768_Correlation_matrix.csv") as f:
    # readlines(hint) returns a list of whole lines, stopping once the total size of the lines read so far exceeds the hint
    line_strs = f.readlines(max_bytes)
    data = "".join(line_strs)
sniffer.sniff(data, delimiters=[",", ";"]).delimiter

Output:
','
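
To make the difference between the two calls concrete, here is a minimal sketch on synthetic data (not the attached file). The row contents and the `limit` value are made up for illustration, and the sketch only shows where each call stops reading; it does not claim that csv.Sniffer fails on a sample this small.

```python
import io

# Synthetic stand-in for a CSV file: three identical 17-character rows.
row = "alpha,beta,gamma\n"
text = row * 3
limit = 20  # deliberately smaller than two full rows

# read(limit) stops after exactly `limit` characters, even in the middle of a line.
truncated = io.StringIO(text).read(limit)

# readlines(limit) returns whole lines and stops once their total size exceeds `limit`.
whole_lines = io.StringIO(text).readlines(limit)
sample = "".join(whole_lines)

print(repr(truncated))  # ends with a partial second row
print(repr(sample))     # ends on a line boundary
```

The readlines() sample can overshoot the limit by up to one line, which is the trade-off for never handing the sniffer a partial row.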

@Sammi-Smith Sammi-Smith added the bug Something isn't working label Mar 13, 2024
@Sammi-Smith
Author

@awalker4 it looks like this is the commit where the bug was introduced:
d594c06

Tagging you since you would probably be the best one to make this minor fix! :)

@awalker4
Contributor

Sorry for the delay! This does look like a good fix. I can try to get to it soon, but if you have a chance to make a PR that would be a huge help 🙏

@Sammi-Smith
Author

In addition, this portion of the code sometimes fails with a UnicodeDecodeError when the file opened with open() is read. I'd recommend passing errors='ignore' to open() so that the delimiter can still be determined instead of erroring out because of a stray character that can't be decoded.

import csv

sniffer = csv.Sniffer()
max_bytes = 8192
# errors='ignore' drops characters that cannot be decoded instead of raising UnicodeDecodeError
with open("PMID35839768_Correlation_matrix.csv", errors='ignore') as f:
    # readlines(hint) returns a list of whole lines, stopping once the total size of the lines read so far exceeds the hint
    line_strs = f.readlines(max_bytes)
    data = "".join(line_strs)
sniffer.sniff(data, delimiters=[",", ";"]).delimiter
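
A small self-contained check of this suggestion, as a sketch: the file contents below are hypothetical (a tiny comma-separated file with one byte, 0xFF, that is not valid UTF-8), and the encoding is pinned to UTF-8 so the behavior does not depend on the platform default.

```python
import csv
import os
import tempfile

# Hypothetical sample: comma-separated rows with one invalid UTF-8 byte (0xFF).
raw = b"name,score\nalice,1\nbob\xff,2\ncarol,3\n"

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Strict decoding (the default) raises when the invalid byte is read.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="ignore" silently drops the undecodable byte, so sniffing can proceed.
    with open(path, encoding="utf-8", errors="ignore") as f:
        data = "".join(f.readlines(8192))
    delimiter = csv.Sniffer().sniff(data, delimiters=[",", ";"]).delimiter
finally:
    os.remove(path)

print(strict_failed, repr(delimiter))
```

Dropping bytes is acceptable here only because the sample is used for delimiter detection; whether the same policy is appropriate when parsing the actual data is a separate decision.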

@scanny scanny self-assigned this May 5, 2024
github-merge-queue bot pushed a commit that referenced this issue May 10, 2024
**Summary**
The CSV delimiter-sniffer requires whole lines to properly detect the
delimiter character. Limiting bytes read produced partial lines when
lines were very long. Limit bytes but read whole lines.

Fixes #2643.
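
For readers who want to see the "limit bytes but read whole lines" idea in one place, here is a rough sketch. It is not the code merged for this fix: the function name `sniff_delimiter`, its parameters, and the choice of UTF-8 with errors="ignore" are assumptions made for illustration, loosely following the file_path/file signature visible in the traceback above.

```python
import csv
import io
from typing import IO, Optional


def sniff_delimiter(
    file_path: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    max_bytes: int = 8192,
) -> str:
    """Hypothetical helper: sniff ',' or ';' from a size-limited sample of whole lines."""
    if file is not None:
        # Binary file-like object: read whole lines up to roughly max_bytes, then rewind.
        file.seek(0)
        lines = [b.decode("utf-8", errors="ignore") for b in file.readlines(max_bytes)]
        file.seek(0)
    elif file_path is not None:
        with open(file_path, encoding="utf-8", errors="ignore") as f:
            lines = f.readlines(max_bytes)
    else:
        raise ValueError("either file_path or file must be provided")
    return csv.Sniffer().sniff("".join(lines), delimiters=[",", ";"]).delimiter


# Usage on synthetic data: long semicolon-separated rows and a small byte limit,
# so a plain read(max_bytes) would have stopped in the middle of the second row.
long_row = ";".join(str(i) for i in range(50)) + "\n"
sample_file = io.BytesIO((long_row * 5).encode("utf-8"))
result = sniff_delimiter(file=sample_file, max_bytes=200)
print(repr(result))
```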