Issue with Large-Scale Document Embedding in H2O GPT #1926

Open
BhoomikaMuralidhara opened this issue Jan 22, 2025 · 0 comments
Hi everyone,

A mode was created specifically for an email folder in H2O GPT, where all documents are .docx. An issue has been observed when embedding a large number of documents into this mode.

Here’s what happens:

When embedding a small number of documents (e.g., around 100 or fewer), everything works fine: all documents are successfully added to the database, and new ones can be added without any problems.
However, when embedding a large number of documents (e.g., around 13,000 .docx files), only a portion of the documents (approximately 4,000) appears in the database. After that, adding new documents becomes impossible.
This issue seems specific to the email folder mode. Since all documents are .docx, it doesn’t appear to be related to missing libraries.
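One way to narrow this down would be to ingest the folder in fixed-size batches and note the first batch that fails, which would show whether the cutoff near 4,000 documents is tied to a particular file or to the total volume. The following is only a minimal sketch using the Python standard library: `find_first_failing_batch` and the `ingest` callback are hypothetical names, and `ingest` stands in for whatever upload or embedding call is actually used (it is not an H2O GPT API).

```python
from pathlib import Path
from typing import Callable, Iterator, List, Optional


def batched(paths: List[Path], size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of `paths`, each with at most `size` items."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def find_first_failing_batch(
    folder: str,
    ingest: Callable[[List[Path]], None],  # hypothetical: wraps the real embedding step
    batch_size: int = 500,
) -> Optional[int]:
    """Ingest all .docx files under `folder` batch by batch.

    Returns the zero-based index of the first batch whose ingest call
    raises an exception, or None if every batch succeeds.
    """
    files = sorted(Path(folder).rglob("*.docx"))
    for index, batch in enumerate(batched(files, batch_size)):
        try:
            ingest(batch)
        except Exception as exc:
            # Report where ingestion stopped so that batch can be inspected.
            print(f"batch {index} failed ({batch[0].name} .. {batch[-1].name}): {exc}")
            return index
    return None
```

If the same batch fails no matter what order the batches run in, a specific file is the likely cause; if the failure always appears after roughly the same total number of documents, a size or memory limit seems more plausible.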

Could this behavior be related to:

A database size limit?
Memory constraints?
A misconfiguration in the mode or embedding setup?

Any insights or suggestions for resolving this would be greatly appreciated.

Thanks in advance!