Merge pull request #43 from OoriData/PGVector-testing
PGvector testing
kaischuygon authored Oct 5, 2023
2 parents 6eb8046 + 223f8a9 commit 5a08fe8
Showing 9 changed files with 602 additions and 10 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/main.yml
@@ -20,6 +20,9 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install ruff pytest pytest-mock httpx
pip install asyncpg # this definitely shouldn't be here, but i wanna get the tests running -Osi
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
# Install OgbujiPT itself, as checked out
pip install -U $GITHUB_WORKSPACE
1 change: 1 addition & 0 deletions .gitignore
@@ -167,3 +167,4 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
demo/docker_compose.yaml
293 changes: 293 additions & 0 deletions demo/PGvector_demo.ipynb
@@ -0,0 +1,293 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Database Demo\n",
"\n",
"Sample functionality for creating tables, inserting data and running similarity search with OgbujiPT.\n",
"\n",
"Notes:\n",
"- `pip install jupyter` if notebook is not running\n",
"\n",
"This notebook will attempt to access a database named `PGv` at `sofola:5432`, using the username `oori` and password `example`. If you have a different setup, you can change the connection string in the first cell."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initial setup and Imports"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [],
"source": [
"DB_NAME = 'PGv'\n",
"HOST = 'sofola'\n",
"PORT = 5432\n",
"USER = 'oori'\n",
"PASSWORD = 'example'"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [],
"source": [
"from ogbujipt.embedding_helper import PGvectorConnection\n",
"\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"e_model = SentenceTransformer('all-MiniLM-L6-v2') # Load the embedding model\n",
"\n",
"pacer_copypasta = [ # Demo data\n",
" 'The FitnessGram™ Pacer Test is a multistage aerobic capacity test that progressively gets more difficult as it continues.', \n",
" 'The 20 meter pacer test will begin in 30 seconds. Line up at the start.', \n",
" 'The running speed starts slowly, but gets faster each minute after you hear this signal.', \n",
" '[beep] A single lap should be completed each time you hear this sound.', \n",
" '[ding] Remember to run in a straight line, and run as long as possible.', \n",
" 'The second time you fail to complete a lap before the sound, your test is over.', \n",
" 'The test will begin on the word start. On your mark, get ready, start.'\n",
"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connecting to the database"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Connecting to database...\n",
"Connected to database.\n"
]
}
],
"source": [
"print(\"Connecting to database...\")\n",
"vDB = await PGvectorConnection.create(\n",
" embedding_model=e_model, \n",
" db_name=DB_NAME,\n",
" host=HOST,\n",
" port=int(PORT),\n",
" user=USER,\n",
" password=PASSWORD\n",
" )\n",
"print(\"Connected to database.\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Tables"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PGvector extension created and loaded.\n",
"Table dropped.\n",
"Table created.\n"
]
}
],
"source": [
"# Ensuring that the vector extension is installed\n",
"await vDB.conn.execute('''CREATE EXTENSION IF NOT EXISTS vector;''')\n",
"print(\"PGvector extension created and loaded.\")\n",
"\n",
"# Drop the table if one is found\n",
"await vDB.conn.execute('''DROP TABLE IF EXISTS embeddings;''')\n",
"print(\"Table dropped.\")\n",
"\n",
"# Creating a new table\n",
"await vDB.create_doc_table(table_name='embeddings')\n",
"print(\"Table created.\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inserting Data"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [],
"source": [
"for index, text in enumerate(pacer_copypasta): # For each line in the copypasta\n",
" await vDB.insert_doc_table( # Insert the line into the table\n",
" table_name='embeddings', # The name of the table being inserted into\n",
" content=text, # The text to be embedded\n",
" permission='public', # Permission metadata for access control\n",
" title=f'Pacer Copypasta line {index}', # Title metadata\n",
" page_numbers=[1, 2, 3], # Page number metadata\n",
" tags=['fitness', 'pacer', 'copypasta'], # Tag metadata\n",
" )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Similarity search"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"k = 3 # Setting number of rows to return when searching\n",
"\n",
"from pprint import pprint\n",
"def print_results(results): # Helper function to print results\n",
" print(f'RAW RETURN:') \n",
" pprint(results) # Print the raw results\n",
" print(f'\\nRETURNED TITLE:\\n\"{results[0][\"title\"]}\"') # Print the title of the first result\n",
" print(f'RETURNED CONTENT:\\n\"{results[0][\"content\"]}\"') # Print the content of the first result\n",
" print(f'RETURNED COSINE SIMILARITY:\\n{results[0][\"cosine_similarity\"]:.2f}') # Print the cosine similarity of the first result"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Searching the table with a perfect match:"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Searching data using search string:\n",
"\"[beep] A single lap should be completed each time you hear this sound.\"\n",
"\n",
"RAW RETURN:\n",
"[<Record cosine_similarity=1.0 title='Pacer Copypasta line 3' content='[beep] A single lap should be completed each time you hear this sound.'>,\n",
" <Record cosine_similarity=0.685540756152295 title='Pacer Copypasta line 5' content='The second time you fail to complete a lap before the sound, your test is over.'>,\n",
" <Record cosine_similarity=0.36591741151356405 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n",
"\n",
"RETURNED TITLE:\n",
"\"Pacer Copypasta line 3\"\n",
"RETURNED CONTENT:\n",
"\"[beep] A single lap should be completed each time you hear this sound.\"\n",
"RETURNED COSINE SIMILARITY:\n",
"1.00\n"
]
}
],
"source": [
"search_string = '[beep] A single lap should be completed each time you hear this sound.'\n",
"print(f'Semantic Searching data using search string:\\n\"{search_string}\"\\n')\n",
"\n",
"sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n",
"\n",
"print_results(sim_search)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Searching the table with a partial match:"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Searching data using search string:\n",
"\"straight\"\n",
"\n",
"RAW RETURN:\n",
"[<Record cosine_similarity=0.28423854269729953 title='Pacer Copypasta line 4' content='[ding] Remember to run in a straight line, and run as long as possible.'>,\n",
" <Record cosine_similarity=0.10402820694362547 title='Pacer Copypasta line 6' content='The test will begin on the word start. On your mark, get ready, start.'>,\n",
" <Record cosine_similarity=0.07991296083513344 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n",
"\n",
"RETURNED TITLE:\n",
"\"Pacer Copypasta line 4\"\n",
"RETURNED CONTENT:\n",
"\"[ding] Remember to run in a straight line, and run as long as possible.\"\n",
"RETURNED COSINE SIMILARITY:\n",
"0.28\n"
]
}
],
"source": [
"search_string = 'straight'\n",
"print(f'Semantic Searching data using search string:\\n\"{search_string}\"\\n')\n",
"\n",
"sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n",
"\n",
"print_results(sim_search)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
35 changes: 34 additions & 1 deletion demo/README.md
@@ -31,7 +31,7 @@ support multiprocessing

## chat_web_selects.py

Simple, command-line "chat my web site" demo, but supporting self-hosted LLM.
Command-line "chat my web site" demo, but supporting self-hosted LLM.

Definitely a good idea for you to understand demos/alpaca_multitask_fix_xml.py
before swapping this in.
@@ -65,3 +65,36 @@ though you can easily extend it to e.g. work with multiple docs
dropped in a directory

Note: manual used for above demo downloaded from Hasbro via [Board Game Capital](https://www.boardgamecapital.com/monopoly-rules.htm).

## PGvector_demo.ipynb
A demo of OgbujiPT's PGvector vector store functionality. It takes a sample collection of strings and performs a few example actions with them (see the sketch after this list):

1. sets up a table in the PGvector store
2. vectorizes the strings and inserts them into the PGvector store
3. performs a similarity search with a perfect match for one of the sample strings
4. performs a fuzzy search for a single word that appears in one of the sample strings
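
A minimal sketch of those four steps, assuming the `PGvectorConnection` API used in the notebook above (`create`, `create_doc_table`, `insert_doc_table`, `search_doc_table`) and the same demo connection parameters (`PGv` at `sofola:5432`):

```python
# Sketch only; API names and connection details are taken from the demo notebook
import asyncio
from sentence_transformers import SentenceTransformer
from ogbujipt.embedding_helper import PGvectorConnection

async def main():
    e_model = SentenceTransformer('all-MiniLM-L6-v2')  # embedding model
    vDB = await PGvectorConnection.create(
        embedding_model=e_model, db_name='PGv',
        host='sofola', port=5432, user='oori', password='example')
    # 1. set up a table
    await vDB.create_doc_table(table_name='embeddings')
    # 2. vectorize & insert a string
    await vDB.insert_doc_table(
        table_name='embeddings',
        content='The 20 meter pacer test will begin in 30 seconds. Line up at the start.',
        permission='public', title='Pacer Copypasta line 1',
        page_numbers=[1], tags=['pacer'])
    # 3. & 4. similarity search (perfect match or fuzzy, depending on the query)
    results = await vDB.search_doc_table(
        table_name='embeddings', query_string='start', limit=3)
    print(results[0]['content'], results[0]['cosine_similarity'])

asyncio.run(main())
```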

At Oori, the demo is run using the [official PGvector Docker image](https://hub.docker.com/r/ankane/pgvector) and the following Docker Compose file:
```yaml
version: '3.1'
services:
  db:
    image: ankane/pgvector
    # restart: always
    environment:
      POSTGRES_USER: oori
      POSTGRES_PASSWORD: example
      POSTGRES_DB: PGv
    ports:
      - 5432:5432
    volumes:
      - ./pg_hba.conf:/var/lib/postgresql/pg_hba.conf
  adminer:
    image: adminer
    restart: always
    ports:
      - 8080:8080
```
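
Assuming the file is saved as `demo/docker_compose.yaml` (the path added to `.gitignore` above), something like `docker compose -f demo/docker_compose.yaml up -d` should bring up Postgres with PGvector on port 5432 and the Adminer UI on port 8080.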
2 changes: 1 addition & 1 deletion pylib/__about__.py
@@ -3,4 +3,4 @@
# SPDX-License-Identifier: Apache-2.0
# ogbujipt.about

__version__ = "0.4.1"
__version__ = "0.5.0"
7 changes: 7 additions & 0 deletions pylib/__init__.py
@@ -13,3 +13,10 @@ def oapi_first_choice_text(response):
Given an OpenAI-compatible API response, return the first choice response text
'''
return response['choices'][0]['text']


def oapi_first_choice_content(response):
'''
Given an OpenAI-compatible API response, return the first choice response content
'''
return response['choices'][0]['message']['content']
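
For illustration, a minimal usage sketch of the new helper; the `response` dict here is a made-up example in the OpenAI-compatible chat completion shape the helper expects:

```python
from ogbujipt import oapi_first_choice_content

# Made-up minimal response in the OpenAI-compatible chat completion shape
response = {
    'choices': [
        {'message': {'role': 'assistant', 'content': 'Hello there!'}}
    ]
}

print(oapi_first_choice_content(response))  # Hello there!
```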