Describe the bug
When ingesting CSV files, partitioning sometimes fails with Error("Could not determine delimiter"). This happens only for some CSV files; others work as expected. The error is raised from the get_delimiter() function.
To Reproduce
Provide a code snippet that reproduces the issue.
from unstructured.partition.csv import get_delimiter
get_delimiter(file_path = "PMID35839768_Correlation_matrix.csv")
Output:
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
Cell In[9], line 1
----> 1 get_delimiter(file_path = "PMID35839768_Correlation_matrix.csv")

File /usr/local/lib/python3.11/site-packages/unstructured/partition/csv.py:124, in get_delimiter(file_path, file)
    121 with open(file_path) as f:
    122     data = f.read(num_bytes)
--> 124 return sniffer.sniff(data, delimiters=[",", ";"]).delimiter

File /usr/local/lib/python3.11/csv.py:187, in Sniffer.sniff(self, sample, delimiters)
    183 delimiter, skipinitialspace = self._guess_delimiter(sample,
    184                                                     delimiters)
    186 if not delimiter:
--> 187     raise Error("Could not determine delimiter")
    189 class dialect(Dialect):
    190     _name = "sniffed"

Error: Could not determine delimiter
Expected behavior
The function returns the delimiter, which is ',' for this file.
Screenshots
Not applicable.
Environment Info
Python 3.11.8
unstructured 0.12.5
Additional context
After looking into this issue for a bit, I found this similar issue for another Python module: Textualize/rich-cli#54 (comment)
Scrolling down further on that same issue thread, I found another comment (Textualize/rich-cli#54 (comment)) that mentions that the example on the official Python csv.Sniffer docs also has the same issue, which may be the source of this bug, since the implementation in unstructured is nearly identical.
Here is a code snippet I used to fix the issue by reading whole lines instead of truncating the read mid-line. The same change should be applied to both .read() calls in get_delimiter(): each should be changed to .readlines().
import csv
sniffer = csv.Sniffer()
max_bytes = 8192
with open("PMID35839768_Correlation_matrix.csv") as f:
    # readlines(hint) returns a list of whole lines, stopping once the
    # total size of the lines read so far exceeds max_bytes
    line_strs = f.readlines(max_bytes)
data = "".join(line_strs)
sniffer.sniff(data, delimiters=[",", ";"]).delimiter
Output: ','
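Applied to both the file-path branch and the file-object branch, the change might look roughly like the sketch below. The helper name sniff_delimiter, its parameters, the 8192-byte default, and the UTF-8 decoding of binary streams are assumptions for illustration, not the actual unstructured implementation.

```python
import csv
from typing import IO, Optional


def sniff_delimiter(
    file_path: Optional[str] = None,
    file: Optional[IO] = None,
    num_bytes: int = 8192,
) -> str:
    """Hypothetical helper: sniff the delimiter from whole lines only."""
    if file_path is not None:
        with open(file_path) as f:
            # readlines(hint) returns complete lines and stops once their
            # total size exceeds the hint, so the sample never ends mid-line.
            lines = f.readlines(num_bytes)
    elif file is not None:
        file.seek(0)
        lines = file.readlines(num_bytes)
        file.seek(0)  # leave the stream rewound for the caller
    else:
        raise ValueError("either file_path or file must be provided")

    # A binary stream yields bytes; decode so csv.Sniffer receives text.
    # Assumption: UTF-8 is an acceptable encoding for the sample.
    parts = [
        ln.decode("utf-8", errors="ignore") if isinstance(ln, bytes) else ln
        for ln in lines
    ]
    sample = "".join(parts)
    return csv.Sniffer().sniff(sample, delimiters=[",", ";"]).delimiter
```

Because the last line is always read to its end, the sample can exceed num_bytes by up to one line, which seems an acceptable trade for never handing the sniffer a partial row.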
In addition, this portion of the code sometimes fails with a UnicodeDecodeError while reading the file. I'd recommend passing errors='ignore' to open() so that the delimiter can still be determined, instead of failing because of a stray character that can't be decoded.
import csv
sniffer = csv.Sniffer()
max_bytes = 8192
with open("PMID35839768_Correlation_matrix.csv", errors='ignore') as f:
    # readlines(hint) returns a list of whole lines, stopping once the
    # total size of the lines read so far exceeds max_bytes
    line_strs = f.readlines(max_bytes)
data = "".join(line_strs)
sniffer.sniff(data, delimiters=[",", ";"]).delimiter
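To show the effect of errors='ignore' without the attached file, here is a small self-contained sketch. The sample bytes, the temporary file, and the explicit encoding="utf-8" are assumptions made for the demonstration; the original snippet relies on the platform default encoding.

```python
import csv
import os
import tempfile

# Write a small CSV containing one byte (0xFF) that is invalid in UTF-8.
raw = b"name,value\ncaf\xff,1\ntea,2\n"
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(raw)

try:
    # Strict decoding (the default error handler) raises on the bad byte.
    try:
        with open(tmp.name, encoding="utf-8") as f:
            f.readlines(8192)
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="ignore" drops the undecodable byte, so sniffing can proceed.
    with open(tmp.name, encoding="utf-8", errors="ignore") as f:
        sample = "".join(f.readlines(8192))
    delimiter = csv.Sniffer().sniff(sample, delimiters=[",", ";"]).delimiter
finally:
    os.remove(tmp.name)

print(strict_failed, repr(delimiter))
```

Note that errors='ignore' silently discards the offending bytes; that is harmless when the sample is only used to guess the delimiter, but it would lose data if the same handle were used to read the content.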
**Summary**
The CSV delimiter-sniffer requires whole lines to properly detect the
delimiter character. Limiting bytes read produced partial lines when
lines were very long. Limit bytes but read whole lines.
Fixes #2643.
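The difference between the two reads can be seen on a tiny in-memory example. This is only an illustration with made-up data and a 20-character limit, not code from the fix itself.

```python
import io

# Three 15-character lines; a 20-character budget falls inside the second line.
text = "col1,col2,col3\n" * 3

# read(n) stops after exactly n characters, cutting the second row short.
truncated = io.StringIO(text).read(20)

# readlines(hint) keeps reading to the end of the line that crosses the hint.
whole = "".join(io.StringIO(text).readlines(20))

print(repr(truncated))
print(repr(whole))
```

The truncated sample ends with a partial row holding fewer delimiters than the rows before it, which is the kind of inconsistency that can keep the sniffer from settling on a delimiter when rows are very long and only a few fit in the sample.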