inital commit ⚡

Mintplex-Labs · Jun 4, 2023 · 27c5854 · 27c5854
commit 27c5854
Showing 100 changed files with 5,394 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,10 @@
+v-env
+.env
+!.env.example
+
+node_modules
+__pycache__
+v-env
+*.lock
+.DS_Store
+
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+The MIT License
+
+Copyright (c) Mintplex Labs Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,59 @@
+# 🤖 AnythingLLM: A full-stack personalized AI assistant
+
+[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/tim.svg?style=social&label=Follow%20%40Timothy%20Carambat)](https://twitter.com/tcarambat) [![](https://dcbadge.vercel.app/api/server/6UyHPeGZAC?compact=true&style=flat)](https://discord.gg/6UyHPeGZAC)
+
+A full-stack application and tool suite that enables you to turn any document, resource, or piece of content into a piece of data that any LLM can use as reference during chatting. This application runs with very minimal overhead as by default the LLM and vectorDB are hosted remotely, but can be swapped for local instances. Currently this project supports Pinecone and OpenAI.
+
+![Chatting](/images/screenshots/chat.png)
+[view more screenshots](/images/screenshots/SCREENSHOTS.md)
+
+### Watch the demo!
+
+_tbd_
+
+### Product Overview
+AnythingLLM aims to be a full-stack application where you can use commercial off-the-shelf LLMs with Long-term-memory solutions or use popular open source LLM and vectorDB solutions.
+
+AnythingLLM is a full-stack product that you can run locally or host remotely, letting you chat intelligently with any documents you provide it.
+
+AnythingLLM divides your documents into objects called `workspaces`. A Workspace functions a lot like a thread, but with the addition of containerization of your documents. Workspaces can share documents, but they do not talk to each other so you can keep your context for each workspace clean.
+
+Some cool features of AnythingLLM
+- Atomically manage documents to be used in long-term-memory from a simple UI
+- Two chat modes: `conversation` and `query`. Conversation retains previous questions and amendments. Query is simple QA against your documents.
+- Each chat response contains a citation that is linked to the original content
+- Simple technology stack for fast iteration
+- Fully capable of being hosted remotely
+- "Bring your own LLM" model and vector solution. _still in progress_
+- Extremely efficient cost-saving measures for managing very large documents. You'll never pay to embed a massive document or transcript more than once. 90% more cost effective than other LTM chatbots
+
+### Technical Overview
+This monorepo consists of three main sections:
+- `collector`: Python tools that enable you to quickly convert online resources or local documents into an LLM-usable format.
+- `frontend`: A viteJS + React frontend that you can run to easily create and manage all your content the LLM can use.
+- `server`: A nodeJS + express server to handle all the interactions and do all the vectorDB management and LLM interactions.
+
+### Requirements
+- `yarn` and `node` on your machine
+- `python` 3.8+ for running scripts in `collector/`.
+- access to an LLM like `GPT-3.5`, `GPT-4`*.
+- a [Pinecone.io](https://pinecone.io) free account*.
+*You can use drop-in replacements for these. These are just the easiest options for getting up and running fast.
+
+### How to get started
+- `yarn setup` from the project root directory.
+
+This will fill in the required `.env` files you'll need in each of the application sections. Go fill those out before proceeding or else things won't work right.
+
+Next, you will need some content to embed. This could be a YouTube channel, Medium articles, local text files, Word documents, and the list goes on. This is where you will use the `collector/` part of the repo.
+
+[Go set up and run collector scripts](./collector/README.md)
+
+[Learn about documents](./server/documents/DOCUMENTS.md)
+
+[Learn about vector caching](./server/documents/VECTOR_CACHE.md)
+
+### Contributing
+- create issue
+- create PR with branch name format of `<issue number>-<short name>`
+- yee haw let's merge
diff --git a/clean.sh b/clean.sh
@@ -0,0 +1,2 @@
+# Easily kill process on port because sometimes nodemon fails to reboot
+kill -9 $(lsof -t -i tcp:5000)
diff --git a/collector/.env.example b/collector/.env.example
@@ -0,0 +1 @@
+GOOGLE_APIS_KEY=
diff --git a/collector/.gitignore b/collector/.gitignore
@@ -0,0 +1,6 @@
+outputs/*/*.json
+hotdir/*
+hotdir/processed/*
+!hotdir/__HOTDIR__.md
+!hotdir/processed
+
diff --git a/collector/README.md b/collector/README.md
@@ -0,0 +1,45 @@
+# How to collect data for vectorizing
+This process should be run first. This will enable you to collect a ton of data across various sources. Currently the following services are supported:
+- [x] YouTube Channels
+- [x] Medium
+- [x] Substack
+- [x] Arbitrary Link
+- [x] Gitbook
+- [x] Local Files (.txt, .pdf, etc) [See full list](./hotdir/__HOTDIR__.md)
+_these resources are under development or require PR_
+- Twitter
+![Choices](../images/choices.png)
+
+### Requirements
+- [ ] Python 3.8+
+- [ ] Google Cloud Account (for YouTube channels)
+- [ ] `brew install pandoc` [pandoc](https://pandoc.org/installing.html) (for .ODT document processing) 
+
+### Setup
+This example uses python3.9, but it will work with 3.8+. Tested on macOS. Untested on Windows.
+- Install virtualenv for python3.8+ before any other steps: `python3.9 -m pip install virtualenv`
+- `cd collector` from root directory
+- `python3.9 -m virtualenv v-env`
+- `source v-env/bin/activate`
+- `pip install -r requirements.txt`
+- `cp .env.example .env`
+- `python main.py` for interactive collection or `python watch.py` to process local documents.
+- Select the option you want and follow the prompts - Done!
+- run `deactivate` to get back to regular shell
+
+### Outputs
+All JSON file data is cached in the `outputs/` folder. This prevents redundant API calls to services that may have rate limits or quota caps. Clearing out the `outputs/` folder makes the script run as if there were no cache.
+
+As files are processed you will see data being written to both the `collector/outputs` folder as well as the `server/documents` folder. Later in this process, once you boot up the server you will then bulk vectorize this content from a simple UI!
+
+If collection fails at any point in the process it will pick up where it last bailed out so you are not reusing credits.
+
+### How to get a Google Cloud API Key (YouTube data collection only)
+**required to fetch YouTube transcripts and data**
+- Have a google account
+- [Visit the GCP Cloud Console](https://console.cloud.google.com/welcome)
+- Click on dropdown in top right > Create new project. Name it whatever you like
+  - ![GCP Project Bar](../images/gcp-project-bar.png)
+- [Enable YouTube Data API v3](https://console.cloud.google.com/apis/library/youtube.googleapis.com)
+- Once enabled generate a Credential key for this API
+- Paste your key after `GOOGLE_APIS_KEY=` in your `collector/.env` file.
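Review note: the "Outputs" section of `collector/README.md` describes a JSON cache that keeps the collector from repeating API calls. The scripts that implement it are not shown in this part of the diff, so the following is only a minimal sketch of the idea. `cached_fetch` and `fetch` are hypothetical names used for illustration, not functions from this commit.

```python
import json
import tempfile
from pathlib import Path


def cached_fetch(cache_dir, key, fetch):
    """Return cached JSON for `key` if present; otherwise call `fetch` and store the result.

    `cached_fetch` and `fetch` are hypothetical names for illustration only;
    the collector scripts in this commit may organize their cache differently.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        # Cache hit: the remote service is not called again.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = fetch(key)
    cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data


if __name__ == "__main__":
    calls = []

    def fake_fetch(key):
        # Stand-in for a rate-limited API call.
        calls.append(key)
        return {"id": key, "title": "example"}

    with tempfile.TemporaryDirectory() as tmp:
        first = cached_fetch(tmp, "video-1", fake_fetch)
        second = cached_fetch(tmp, "video-1", fake_fetch)
        print(first == second, len(calls))
```

Deleting the cache directory forces the next call to hit the service again, which matches the behavior the README describes for clearing the folder.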
diff --git a/collector/hotdir/__HOTDIR__.md b/collector/hotdir/__HOTDIR__.md
@@ -0,0 +1,17 @@
+### What is the "Hot directory"
+
+This is the location where you can dump all supported file types and have them automatically converted and prepared to be digested by the vectorizing service and selected from the AnythingLLM frontend.
+
+Files dropped in here will only be processed when you are running `python watch.py` from the `collector` directory.
+
+Once converted, the original file is moved to the `hotdir/processed` folder so that it can still be linked to when it is cited as a source document during chatting.
+
+**Supported File types**
+- `.md`
+- `.text`
+- `.pdf`
+
+__requires more development__
+- `.png .jpg etc`
+- `.mp3`
+- `.mp4`
diff --git a/collector/main.py b/collector/main.py
@@ -0,0 +1,81 @@
+import os
+from whaaaaat import prompt, Separator
+from scripts.youtube import youtube
+from scripts.link import link, links
+from scripts.substack import substack
+from scripts.medium import medium
+from scripts.gitbook import gitbook
+
+def main():
+  if os.name == 'nt':
+    methods = {
+      '1': 'YouTube Channel',
+      '2': 'Article or Blog Link',
+      '3': 'Substack',
+      '4': 'Medium',
+      '5': 'Gitbook'
+    }
+    print("There are options for data collection to make this easier for you.\nType the number of the method you wish to execute.")
+    print("1. YouTube Channel\n2. Article or Blog Link (Single)\n3. Substack\n4. Medium\n5. Gitbook\n\n[In development]:\nTwitter\n\n")
+    selection = input("Your selection: ")
+    method = methods.get(str(selection), '')  # unknown input falls through to "Selection was not valid."
+  else:
+    questions = [
+      {
+          "type": "list",
+          "name": "collector",
+          "message": "What kind of data would you like to add to convert into long-term memory?",
+          "choices": [
+              "YouTube Channel",
+              "Substack",
+              "Medium",
+              "Article or Blog Link(s)",
+              "Gitbook",
+              Separator(),
+              {"name": "Twitter", "disabled": "Needs PR"},
+              "Abort",
+          ],
+      },
+    ]
+    method = prompt(questions).get('collector')
+
+  if('Article or Blog Link' in method):
+    questions = [
+      {
+          "type": "list",
+          "name": "collector",
+          "message": "Do you want to scrape a single article/blog/url or many at once?",
+          "choices": [
+            'Single URL',
+            'Multiple URLs',
+            'Abort',
+          ],
+      },
+    ]
+    method = prompt(questions).get('collector')
+    if(method == 'Single URL'):
+      link()
+      exit(0)
+    if(method == 'Multiple URLs'):
+      links()
+      exit(0)
+
+  if(method == 'Abort'): exit(0)
+  if(method == 'YouTube Channel'): 
+    youtube()
+    exit(0)
+  if(method == 'Substack'):
+    substack()
+    exit(0)
+  if(method == 'Medium'):
+    medium()
+    exit(0)
+  if(method == 'Gitbook'):
+    gitbook()
+    exit(0)
+
+  print("Selection was not valid.")
+  exit(1)
+
+if __name__ == "__main__":
+  main()
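Review note: the chain of `if(method == ...)` checks in `main()` could also be written as a lookup table. The following is a minimal sketch, not the author's implementation. `dispatch` is a hypothetical helper, and the lambdas are stand-ins for the real `youtube`, `substack`, `medium` and `gitbook` functions imported at the top of `main.py`.

```python
import sys


def dispatch(method, handlers):
    """Run the handler registered for `method` and return a process exit code.

    `dispatch` is a hypothetical helper for illustration, not part of this commit.
    """
    handler = handlers.get(method)
    if handler is None:
        # Covers both unknown strings and a missing (None) selection.
        print("Selection was not valid.")
        return 1
    handler()
    return 0


if __name__ == "__main__":
    # Stand-in handlers; main.py would register youtube, substack, medium, gitbook here.
    handlers = {
        "YouTube Channel": lambda: print("collecting YouTube channel"),
        "Substack": lambda: print("collecting Substack"),
        "Abort": lambda: None,
    }
    sys.exit(dispatch("Substack", handlers))
```

With this shape, adding a new source only requires one new dictionary entry, and an unrecognized selection is handled in a single place.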