pull #344 (Open)

Wants to merge 29 commits into base branch feat/add-multiple-pdfs.

Commits (changes from all 29 commits)
- ae1a1bd fix unique key props warning (mayooear, Mar 21, 2023)
- fccd3b0 Merge pull request #17 from mayooear/fix/next-key-props (mayooear, Mar 21, 2023)
- fe06fc8 general frontend optimization (mayooear, Mar 23, 2023)
- 1b6fac8 specify node version in engines, update README for pnpm use (mayooear, Mar 23, 2023)
- 55f58da Merge pull request #39 from mayooear/frontend-fixes (mayooear, Mar 23, 2023)
- 10c66b0 upgrade langchain, add customPDFLoader (mayooear, Mar 27, 2023)
- 53f5ae6 Merge branch 'main' into feat/upgrade-langchain (mayooear, Mar 27, 2023)
- 46bb0ad Merge pull request #66 from mayooear/feat/upgrade-langchain (mayooear, Mar 27, 2023)
- 581f809 Update .env.example (mayooear, Mar 27, 2023)
- b4c88e1 Merge pull request #67 from mayooear/update/pineconeindexenv (mayooear, Mar 27, 2023)
- 5bd2a3b add directory loader to load multiple pdf files (mayooear, Mar 28, 2023)
- 90381f0 Merge branch 'main' into feat/add-directory-loader (mayooear, Mar 28, 2023)
- ef4046d Merge pull request #71 from mayooear/feat/add-directory-loader (mayooear, Mar 28, 2023)
- 37fc719 Update README.md (mayooear, Apr 1, 2023)
- a6075e5 Update README.md (mayooear, Apr 1, 2023)
- 6db8ba8 Update README.md (mayooear, Apr 3, 2023)
- 7a6d82f langchain retrievers (mayooear, Apr 10, 2023)
- a74abf2 remove merge conflicts (mayooear, Apr 10, 2023)
- b00e3c0 global pnpm installation (mayooear, Apr 11, 2023)
- 191e87c upgrade langchain and pinecone, migrate from pnpm to yarn (mayooear, Apr 13, 2023)
- aff71aa updated README.md (mayooear, Apr 13, 2023)
- f1ee996 Merge pull request #165 from mayooear/feat/retriever (mayooear, Apr 13, 2023)
- 0a6dc57 upgrade dependencies, clean up env files, updated pdfloader (mayooear, May 25, 2023)
- ea4948c Update LangChain version to current, update history passing (jacoblee93, Jul 11, 2023)
- 7f6c375 Update message passing (jacoblee93, Jul 11, 2023)
- ba9b663 Update deps, naming (jacoblee93, Aug 11, 2023)
- 66d183f Merge pull request #376 from jacoblee93/feature_langchain_update (mayooear, Aug 11, 2023)
- 31aec79 Update LangChain and Pinecone client, use expression language for chain (jacoblee93, Nov 13, 2023)
- 138bba4 Merge pull request #434 from jacoblee93/jacob/update_versions (mayooear, Nov 13, 2023)
8 changes: 5 additions & 3 deletions .env.example
@@ -1,6 +1,8 @@
OPENAI_API_KEY=

# Update these with your Supabase details from your project settings > API
PINECONE_API_KEY=
# Update these with your pinecone details from your dashboard.
# PINECONE_INDEX_NAME is in the indexes tab under "index name" in blue
# PINECONE_ENVIRONMENT is in indexes tab under "Environment". Example: "us-east1-gcp"
PINECONE_API_KEY=
PINECONE_ENVIRONMENT=

PINECONE_INDEX_NAME=
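
For context, the three Pinecone variables above are consumed by the client in `utils/pinecone-client.ts`, a file this diff does not touch. Below is a minimal sketch of how that client might be initialized against `@pinecone-database/pinecone` 1.1.0 (the version pinned in `package.json` further down); the validation and option names here are assumptions, not code from the PR:

```ts
// Hypothetical sketch of utils/pinecone-client.ts (not part of this diff).
import { Pinecone } from '@pinecone-database/pinecone';

if (!process.env.PINECONE_API_KEY || !process.env.PINECONE_ENVIRONMENT) {
  throw new Error('Missing Pinecone credentials in .env file');
}

// The 1.x client still requires an environment alongside the API key.
export const pinecone = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY,
  environment: process.env.PINECONE_ENVIRONMENT,
});
```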
2 changes: 2 additions & 0 deletions .gitignore
@@ -38,3 +38,5 @@ next-env.d.ts

#Notion_db
/Notion_DB

.yarn/
56 changes: 35 additions & 21 deletions README.md
@@ -1,31 +1,39 @@
# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Docs
# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Files

Use the new GPT-4 api to build a chatGPT chatbot for Large PDF docs (56 pages used in this example).
Use the new GPT-4 api to build a chatGPT chatbot for multiple Large PDF files.

Tech stack used includes LangChain, Pinecone, Typescript, Openai, and Next.js. LangChain is a framework that makes it easier to build scalable AI/LLM apps and chatbots. Pinecone is a vectorstore for storing embeddings and your PDF in text to later retrieve similar docs.

[Tutorial video](https://www.youtube.com/watch?v=ih9PBGVVOO4)

[Get in touch via twitter if you have questions](https://twitter.com/mayowaoshin)
[Join the discord if you have questions](https://discord.gg/E4Mc77qwjm)

The visual guide of this repo and tutorial is in the `visual guide` folder.

**If you run into errors, please review the troubleshooting section further down this page.**

Prelude: Please make sure you have already downloaded node on your system and the version is 18 or greater.

## Development

1. Clone the repo
1. Clone the repo or download the ZIP

```
git clone [github https url]
```

2. Install packages

First run `npm install yarn -g` to install yarn globally (if you haven't already).

Then run:

```
pnpm install
yarn install
```

After installation, you should now see a `node_modules` folder.

3. Set up your `.env` file

- Copy `.env.example` into `.env`
@@ -37,28 +45,30 @@ OPENAI_API_KEY=
PINECONE_API_KEY=
PINECONE_ENVIRONMENT=

PINECONE_INDEX_NAME=

```

- Visit [openai](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key) to retrieve API keys and insert into your `.env` file.
- Visit [pinecone](https://pinecone.io/) to create and retrieve your API keys.
- Visit [pinecone](https://pinecone.io/) to create and retrieve your API keys, and also retrieve your environment and index name from the dashboard.

4. In the `config` folder, replace the `PINECONE_INDEX_NAME` and `PINECONE_NAME_SPACE` with your own details from your pinecone dashboard.
4. In the `config` folder, replace the `PINECONE_NAME_SPACE` with a `namespace` where you'd like to store your embeddings on Pinecone when you run `npm run ingest`. This namespace will later be used for queries and retrieval.

5. In `utils/makechain.ts` chain change the `QA_PROMPT` for your own usecase. Change `modelName` in `new OpenAIChat` to a different api model if you don't have access to `gpt-4`. See [the OpenAI docs](https://platform.openai.com/docs/models/model-endpoint-compatibility) for a list of supported `modelName`s. For example you could use `gpt-3.5-turbo` if you do not have access to `gpt-4`, yet.
5. In `utils/makechain.ts` chain change the `QA_PROMPT` for your own usecase. Change `modelName` in `new OpenAI` to `gpt-4`, if you have access to `gpt-4` api. Please verify outside this repo that you have access to `gpt-4` api, otherwise the application will not work.

## Convert your PDF to embeddings
## Convert your PDF files to embeddings

1. In `docs` folder replace the pdf with your own pdf doc.
**This repo can load multiple PDF files**

2. In `scripts/ingest-data.ts` replace `filePath` with `docs/{yourdocname}.pdf`
1. Inside `docs` folder, add your pdf files or folders that contain pdf files.

3. Run the script `npm run ingest` to 'ingest' and embed your docs
2. Run the script `yarn run ingest` to 'ingest' and embed your docs. If you run into errors troubleshoot below.

4. Check Pinecone dashboard to verify your namespace and vectors have been added.
3. Check Pinecone dashboard to verify your namespace and vectors have been added.

## Run the app

Once you've verified that the embeddings and content have been successfully added to your Pinecone, you can run the app `npm run dev` to launch the local dev environment and then type a question in the chat interface.
Once you've verified that the embeddings and content have been successfully added to your Pinecone, you can run the app `npm run dev` to launch the local dev environment, and then type a question in the chat interface.

## Troubleshooting

@@ -67,18 +77,22 @@ In general, keep an eye out in the `issues` and `discussions` section of this re
**General errors**

- Make sure you're running the latest Node version. Run `node -v`
- Try a different PDF or convert your PDF to text first. It's possible your PDF is corrupted, scanned, or requires OCR to convert to text.
- `Console.log` the `env` variables and make sure they are exposed.
- Make sure you're using the same versions of LangChain and Pinecone as this repo.
- Check that you've created an `.env` file that contains your valid (and working) API keys.
- If you change `modelName` in `OpenAIChat` note that the correct name of the alternative model is `gpt-3.5-turbo`
- Pinecone indexes of users on the Starter(free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone to reset the counter.
- Check that you've created an `.env` file that contains your valid (and working) API keys, environment and index name.
- If you change `modelName` in `OpenAI`, make sure you have access to the api for the appropriate model.
- Make sure you have enough OpenAI credits and a valid card on your billings account.
- Check that you don't have multiple OPENAPI keys in your global environment. If you do, the local `env` file from the project will be overwritten by systems `env` variable.
- Try to hard code your API keys into the `process.env` variables if there are still issues.

**Pinecone errors**

- Make sure your pinecone dashboard `environment` and `index` matches the one in your `config` folder.
- Make sure your pinecone dashboard `environment` and `index` matches the one in the `pinecone.ts` and `.env` files.
- Check that you've set the vector dimensions to `1536`.
- Switch your Environment in pinecone to `us-east1-gcp` if the other environment is causing issues.

If you're stuck after trying all these steps, delete `node_modules`, restart your computer, then `pnpm install` again.
- Make sure your pinecone namespace is in lowercase.
- Pinecone indexes of users on the Starter(free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone to reset the counter before 7 days.
- Retry from scratch with a new Pinecone project, index, and cloned repo.

## Credit

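Step 5 of the updated README points at `utils/makechain.ts`, which is not included in the visible part of this diff. As a hedged illustration of the `modelName` change the README describes, the model construction might look roughly like this (import path and options assume LangChain 0.0.186, the version this PR pins; the exact file contents are an assumption):

```ts
// Hypothetical excerpt from utils/makechain.ts (that file is not shown in this diff).
import { ChatOpenAI } from 'langchain/chat_models/openai';

const model = new ChatOpenAI({
  temperature: 0, // keep answers focused for question answering
  modelName: 'gpt-3.5-turbo', // switch to 'gpt-4' only if your OpenAI account has gpt-4 API access
});
```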
2 changes: 1 addition & 1 deletion components/layout.tsx
@@ -14,7 +14,7 @@ export default function Layout({ children }: LayoutProps) {
</nav>
</div>
</header>
<div className="container">
<div>
<main className="flex w-full flex-1 flex-col overflow-hidden">
{children}
</main>
8 changes: 6 additions & 2 deletions config/pinecone.ts
@@ -1,8 +1,12 @@
/**
* Change the index and namespace to your own
* Change the namespace to the namespace on Pinecone you'd like to store your embeddings.
*/

const PINECONE_INDEX_NAME = 'langchainjsfundamentals';
if (!process.env.PINECONE_INDEX_NAME) {
throw new Error('Missing Pinecone index name in .env file');
}

const PINECONE_INDEX_NAME = process.env.PINECONE_INDEX_NAME ?? '';

const PINECONE_NAME_SPACE = 'pdf-test'; //namespace is optional for your vectors

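For reference, the index name and namespace exported from this config feed both the ingest script and the chat route. A rough sketch of how `scripts/ingest-data.ts` might use them after this change (that script is not part of the visible diff, so treat the function name and structure below as illustrative):

```ts
// Hedged sketch of the ingest side; scripts/ingest-data.ts itself is not shown in this diff.
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import type { Document } from 'langchain/document';
import { pinecone } from '@/utils/pinecone-client';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';

export async function embedDocuments(docs: Document[]) {
  const index = pinecone.Index(PINECONE_INDEX_NAME);
  // Embed each chunk with OpenAI and upsert the vectors into the configured namespace.
  await PineconeStore.fromDocuments(docs, new OpenAIEmbeddings(), {
    pineconeIndex: index,
    namespace: PINECONE_NAME_SPACE,
    textKey: 'text',
  });
}
```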
5 changes: 5 additions & 0 deletions declarations/pdf-parse.d.ts
@@ -0,0 +1,5 @@
declare module 'pdf-parse/lib/pdf-parse.js' {
import pdf from 'pdf-parse';

export default pdf;
}
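
This declaration exists so TypeScript accepts a deep import of pdf-parse's library entry point, which sidesteps the debug-mode file read that the package's top-level index performs. A rough sketch of how the custom PDF loader added in this PR might use it (the loader itself is not shown in the visible diff, so the function below is illustrative):

```ts
// Hypothetical loader sketch; the actual customPDFLoader added in this PR is not shown here.
import fs from 'fs/promises';
import { Document } from 'langchain/document';

export async function loadPdfAsDocument(filePath: string): Promise<Document> {
  // The deep import matches the module declaration above and avoids pdf-parse's
  // debug branch, which tries to read a bundled test PDF on plain import.
  const { default: pdf } = await import('pdf-parse/lib/pdf-parse.js');
  const raw = await fs.readFile(filePath);
  const parsed = await pdf(raw);
  return new Document({
    pageContent: parsed.text,
    metadata: { source: filePath, pdf_numpages: parsed.numpages },
  });
}
```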
Binary file removed docs/MorseVsFrederick.pdf
4 changes: 2 additions & 2 deletions package.json
@@ -16,11 +16,11 @@
},
"dependencies": {
"@microsoft/fetch-event-source": "^2.0.1",
"@pinecone-database/pinecone": "^0.0.10",
"@pinecone-database/pinecone": "1.1.0",
"@radix-ui/react-accordion": "^1.1.1",
"clsx": "^1.2.1",
"dotenv": "^16.0.3",
"langchain": "0.0.33",
"langchain": "^0.0.186",
"lucide-react": "^0.125.0",
"next": "13.2.3",
"pdf-parse": "1.1.1",
2 changes: 1 addition & 1 deletion pages/_document.tsx
@@ -1,4 +1,4 @@
import { Html, Head, Main, NextScript } from "next/document";
import { Html, Head, Main, NextScript } from 'next/document';

export default function Document() {
return (
86 changes: 53 additions & 33 deletions pages/api/chat.ts
@@ -1,6 +1,7 @@
import type { NextApiRequest, NextApiResponse } from 'next';
import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import type { Document } from 'langchain/document';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { makeChain } from '@/utils/makechain';
import { pinecone } from '@/utils/pinecone-client';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
@@ -11,52 +12,71 @@ export default async function handler(
) {
const { question, history } = req.body;

console.log('question', question);
console.log('history', history);

//only accept post requests
if (req.method !== 'POST') {
res.status(405).json({ error: 'Method not allowed' });
return;
}

if (!question) {
return res.status(400).json({ message: 'No question in the request' });
}
// OpenAI recommends replacing newlines with spaces for best results
const sanitizedQuestion = question.trim().replaceAll('\n', ' ');

const index = pinecone.Index(PINECONE_INDEX_NAME);

/* create vectorstore*/
const vectorStore = await PineconeStore.fromExistingIndex(
index,
new OpenAIEmbeddings({}),
'text',
PINECONE_NAME_SPACE, //optional
);
try {
const index = pinecone.Index(PINECONE_INDEX_NAME);

res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache, no-transform',
Connection: 'keep-alive',
});
/* create vectorstore*/
const vectorStore = await PineconeStore.fromExistingIndex(
new OpenAIEmbeddings({}),
{
pineconeIndex: index,
textKey: 'text',
namespace: PINECONE_NAME_SPACE, //namespace comes from your config folder
},
);

const sendData = (data: string) => {
res.write(`data: ${data}\n\n`);
};
// Use a callback to get intermediate sources from the middle of the chain
let resolveWithDocuments: (value: Document[]) => void;
const documentPromise = new Promise<Document[]>((resolve) => {
resolveWithDocuments = resolve;
});
const retriever = vectorStore.asRetriever({
callbacks: [
{
handleRetrieverEnd(documents) {
resolveWithDocuments(documents);
},
},
],
});

sendData(JSON.stringify({ data: '' }));
//create chain
const chain = makeChain(retriever);

//create chain
const chain = makeChain(vectorStore, (token: string) => {
sendData(JSON.stringify({ data: token }));
});
const pastMessages = history
.map((message: [string, string]) => {
return [`Human: ${message[0]}`, `Assistant: ${message[1]}`].join('\n');
})
.join('\n');
console.log(pastMessages);

try {
//Ask a question
const response = await chain.call({
//Ask a question using chat history
const response = await chain.invoke({
question: sanitizedQuestion,
chat_history: history || [],
chat_history: pastMessages,
});

const sourceDocuments = await documentPromise;

console.log('response', response);
sendData(JSON.stringify({ sourceDocs: response.sourceDocuments }));
} catch (error) {
res.status(200).json({ text: response, sourceDocuments });
} catch (error: any) {
console.log('error', error);
} finally {
sendData('[DONE]');
res.end();
res.status(500).json({ error: error.message || 'Something went wrong' });
}
}
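
One practical consequence of this rewrite is that the route now answers with a single JSON payload, `{ text, sourceDocuments }`, instead of the previous server-sent event stream, so the frontend fetch logic has to change with it. A minimal, hypothetical client-side sketch follows (the actual frontend component is not part of this diff):

```ts
// Hedged sketch of calling the rewritten /api/chat route from the browser.
type ChatTurn = [question: string, answer: string];

export async function askQuestion(question: string, history: ChatTurn[]) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // The handler expects { question, history } in the request body.
    body: JSON.stringify({ question, history }),
  });

  const body = await res.json();
  if (!res.ok) {
    // The handler returns { error } on failures and { message } when the question is missing.
    throw new Error(body.error ?? body.message ?? 'Request failed');
  }

  // On success the handler responds with { text, sourceDocuments }.
  return { text: body.text, sourceDocuments: body.sourceDocuments };
}
```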