[Bug]: Issues with Unicode characters in UTF-8 text content #4137

Open
1 task done
mqudsi opened this issue Dec 19, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@mqudsi

mqudsi commented Dec 19, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

RAGFlow workspace code commit ID

N/A

RAGFlow image version

5fb9136

Other environment information

Actual behavior

After adding a UTF-8 document containing Unicode symbols to the knowledge base, I've noticed that RAGFlow always generates text fragments in which those symbols are corrupted (apparently rendered as Latin-1).

image

Expected behavior

image

image

Steps to reproduce

Upload a .txt file, saved as UTF-8, that contains the following:

> “sample text,”

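For a byte-exact test file, the sample above can be generated with a short Python sketch. It assumes the leading `>` is only quote formatting and not part of the file, and that the file ends with a single newline; the function name `write_sample` and the default path are made up for illustration.

```python
from pathlib import Path

# The sample line with curly quotes (U+201C and U+201D).
# Assumption: the file holds just this one line plus a trailing newline.
SAMPLE = "\u201csample text,\u201d\n"


def write_sample(path: str = "sample.txt") -> bytes:
    """Write SAMPLE as UTF-8 without a BOM and return the bytes written."""
    data = SAMPLE.encode("utf-8")
    Path(path).write_bytes(data)
    return data
```

Writing bytes directly avoids any dependence on the platform's default text encoding.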
Additional information

No response

mqudsi added the bug label Dec 19, 2024
@KevinHuSh
Collaborator

Which LLM did you choose?

@mqudsi
Author

mqudsi commented Dec 20, 2024

ChatGPT 4o

@mqudsi
Author

mqudsi commented Dec 22, 2024

I don't think it's an issue with the LLM integration; here is the same problem displayed in the vector search results on the "Search" page rather than the "Chat" page:

image

@mqudsi
Author

mqudsi commented Dec 22, 2024

It seems you are converting to Windows-1252 (a legacy encoding) at some point and then emitting it as if it were UTF-8, because I can correct it with iconv:

image
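For anyone who wants to check this without the screenshot, here is a minimal Python sketch of the round trip. It assumes the corruption comes from UTF-8 bytes being decoded as Windows-1252 (Python's `cp1252` codec) and then written back out as UTF-8, which is one common way this pattern arises; it is not taken from RAGFlow's code, and the names `garble` and `repair` are made up for illustration. The exact iconv command is in the screenshot and is not reproduced here.

```python
def garble(text: str) -> str:
    """Simulate the corruption: misread the UTF-8 bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo it: map the characters back to bytes, then decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    # Only the opening quote (U+201C) is used here. The closing quote's
    # UTF-8 bytes end in 0x9D, which Python's cp1252 codec leaves
    # undefined, so garble() would raise UnicodeDecodeError on it.
    original = "\u201csample text"
    broken = garble(original)
    print(broken)
    print(repair(broken) == original)
```

With this input, `garble` turns the opening quote into the three characters `â€œ`, and `repair` restores the original string. If the sketch matches what RAGFlow is doing, the fix is probably to decode the uploaded bytes as UTF-8 in the first place rather than to repair them afterwards.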
