Add support for docx, json, csv etc. #246

Open: spacepirate0001 wants to merge 2 commits into main

spacepirate0001

Added support for multiple document types: json, html, txt, docx and csv.

@ViniciusTheCoder

To do the HTML ingest, do I just need to paste the entire HTML file into the docs folder and run ingest?

@spacepirate0001 (Author)

> To do the HTML ingest, do I just need to paste the entire HTML file into the docs folder and run ingest?

Correct! Once it's approved and merged.

@kevinbaroro

Hi, I had an issue when ingesting a CSV file. The error says "Column text not found in CSV file".
I have this CSV data:

Artifact Type, Primary Text, Name, Description, Owner
MyRequirementType, "The vehicle must have two wheels.", "Vehicle wheels", "This requirement defines the rules for vehicles", "Joe Blogs"

Is there any reason why? Thank you.

@kevinbaroro

SOLVED!!!

The error went away after changing this code in ingest-data.ts:

From
'.csv': (path) => new CSVLoader(path, "text")
To
'.csv': (path) => new CSVLoader(path)
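
For readers wondering why dropping the second argument helps: the error message suggests that `CSVLoader(path, "text")` looks for a column literally named `text`, which the sample file above does not have, while `CSVLoader(path)` keeps every column of each row. The snippet below is only a minimal sketch of that behaviour under this assumption. `Row` and `rowsToTexts` are hypothetical names made up for illustration; they are not LangChain's code or API.

```typescript
// Hypothetical stand-in, NOT LangChain's implementation: it mimics a CSV
// loader that either extracts one named column or keeps the whole row.
type Row = Record<string, string>;

function rowsToTexts(rows: Row[], column?: string): string[] {
  if (column !== undefined) {
    // Single-column mode: the requested column must exist in the data.
    if (rows.length > 0 && !(column in rows[0])) {
      throw new Error(`Column ${column} not found in CSV file.`);
    }
    return rows.map((row) => row[column]);
  }
  // Whole-row mode: one "key: value" line per column.
  return rows.map((row) =>
    Object.entries(row)
      .map(([key, value]) => `${key}: ${value}`)
      .join("\n")
  );
}

// One row modelled on the sample CSV from this thread (abbreviated).
const rows: Row[] = [
  {
    "Artifact Type": "MyRequirementType",
    "Primary Text": "The vehicle must have two wheels.",
    Name: "Vehicle wheels",
  },
];

// Whole-row mode works for any header set.
const wholeRowTexts = rowsToTexts(rows);

// Single-column mode works only when the column name matches a header.
const primaryTexts = rowsToTexts(rows, "Primary Text");
```

Under the same assumption, passing the real header name (for example `"Primary Text"`) instead of `"text"` would be the alternative to removing the argument, if only that one column should be ingested.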

@pinballelectronica

Appreciate that. I am trying to get the doc loaders working with Chroma and this was my last problem.

@georgia210

Hi, I encountered a stack overflow error when ingesting a large docx file. It worked with the PDF loader. Do you have any idea what the reason might be? Thank you!

@molgit

molgit commented Aug 27, 2023

I replaced the 3 changed files and got this error when running ingest:

`> [email protected] ingest

tsx -r dotenv/config scripts/ingest-data.ts

node:internal/errors:484
ErrorCaptureStackTrace(err);
^

Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath './document_loaders/fs/html' is not defined by "exports" in /home/runner/gpt4-pdf-chatbot-langchain/node_modules/langchain/package.json imported from /home/runner/gpt4-pdf-chatbot-langchain/scripts/ingest-data.ts
at __node_internal_captureLargerStackTrace (node:internal/errors:484:5)
at new NodeError (node:internal/errors:393:5)
at throwExportsNotFound (node:internal/modules/esm/resolve:358:9)
at packageExportsResolve (node:internal/modules/esm/resolve:668:3)
at packageResolve (node:internal/modules/esm/resolve:843:14)
at moduleResolve (node:internal/modules/esm/resolve:909:20)
at defaultResolve (node:internal/modules/esm/resolve:1124:11)
at nextResolve (node:internal/modules/esm/loader:163:28)
at u (file:///home/runner/gpt4-pdf-chatbot-langchain/node_modules/@esbuild-kit/esm-loader/dist/index.js:1:2406)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async nextResolve (node:internal/modules/esm/loader:163:22)
at async ESMLoader.resolve (node:internal/modules/esm/loader:841:24)
at async ESMLoader.getModuleJob (node:internal/modules/esm/loader:424:7)
at async ModuleWrap.<anonymous> (node:internal/modules/esm/module_job:78:21)
at async Promise.all (index 11)
at async link (node:internal/modules/esm/module_job:83:9) {
code: 'ERR_PACKAGE_PATH_NOT_EXPORTED'
}

Node.js v18.12.1`

Removing this line helped to resolve the issue:

import { UnstructuredHTMLLoader } from "langchain/document_loaders/fs/html";
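
As background on the error itself: `ERR_PACKAGE_PATH_NOT_EXPORTED` means the installed package has an `exports` map in its package.json and the imported subpath is not listed there, so Node refuses to resolve it. The snippet below is a simplified sketch of that check. `isSubpathExported`, `PackageJson`, and the abbreviated `examplePkg` contents are hypothetical and for illustration only; Node's real resolver also handles wildcard patterns and conditional exports.

```typescript
// Simplified sketch: exact-match lookup only. This is not Node's resolver,
// which also supports patterns such as "./features/*" and condition objects.
interface PackageJson {
  name: string;
  exports?: Record<string, unknown>;
}

function isSubpathExported(pkg: PackageJson, subpath: string): boolean {
  // Without an "exports" field, Node does not restrict subpath imports.
  if (pkg.exports === undefined) return true;
  return Object.prototype.hasOwnProperty.call(pkg.exports, subpath);
}

// Hypothetical, abbreviated package.json; the real file in node_modules
// will list different entries depending on the installed version.
const examplePkg: PackageJson = {
  name: "langchain",
  exports: {
    "./document_loaders/fs/pdf": "./document_loaders/fs/pdf.js",
    "./document_loaders/fs/csv": "./document_loaders/fs/csv.js",
  },
};

// The subpath from the stack trace is absent from this example map.
const htmlExported = isSubpathExported(examplePkg, "./document_loaders/fs/html");
const pdfExported = isSubpathExported(examplePkg, "./document_loaders/fs/pdf");
```

In practice, opening node_modules/langchain/package.json and looking at its `exports` keys shows which loader subpaths the installed version actually provides.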
