delete /pg_test, clean up PGvector_demo.ipynb, and update the demo README.md with relevant information
choccccy committed Oct 3, 2023
1 parent 0a8fba8 commit 149a845
Showing 9 changed files with 326 additions and 417 deletions.
290 changes: 290 additions & 0 deletions demo/PGvector_demo.ipynb
@@ -0,0 +1,290 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Database Demo\n",
"\n",
"Sample functionality for creating tables, inserting data, and running similarity searches with OgbujiPT.\n",
"\n",
"Notes:\n",
"- run `pip install jupyter` if Jupyter is not already installed\n",
"\n",
"This notebook will attempt to access a database named `PGv` at `sofola:5432`, using the username `oori` and password `example`. If your setup differs, change the connection details in the first code cell."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initial setup and Imports"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [],
"source": [
"DB_NAME = 'PGv'\n",
"HOST = 'sofola'\n",
"PORT = 5432\n",
"USER = 'oori'\n",
"PASSWORD = 'example'"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [],
"source": [
"from ogbujipt.embedding_helper import PGvectorConnection\n",
"\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"e_model = SentenceTransformer('all-MiniLM-L6-v2') # Load the embedding model\n",
"\n",
"pacer_copypasta = [ # Demo data\n",
" 'The FitnessGram™ Pacer Test is a multistage aerobic capacity test that progressively gets more difficult as it continues.', \n",
" 'The 20 meter pacer test will begin in 30 seconds. Line up at the start.', \n",
" 'The running speed starts slowly, but gets faster each minute after you hear this signal.', \n",
" '[beep] A single lap should be completed each time you hear this sound.', \n",
" '[ding] Remember to run in a straight line, and run as long as possible.', \n",
" 'The second time you fail to complete a lap before the sound, your test is over.', \n",
" 'The test will begin on the word start. On your mark, get ready, start.'\n",
"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connecting to the database"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Connecting to database...\n",
"Connected to database.\n"
]
}
],
"source": [
"try:\n",
" print(\"Connecting to database...\")\n",
" vDB = await PGvectorConnection.create(\n",
" embedding_model=e_model, \n",
" db_name=DB_NAME,\n",
" host=HOST,\n",
" port=int(PORT),\n",
" user=USER,\n",
" password=PASSWORD\n",
" )\n",
" print(\"Connected to database.\")\n",
"except Exception as e:\n",
" raise e"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Tables"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PGvector extension created and loaded.\n",
"Table dropped.\n",
"Table created.\n"
]
}
],
"source": [
"try: # Ensuring that the vector extension is installed\n",
" await vDB.conn.execute('''CREATE EXTENSION IF NOT EXISTS vector;''')\n",
" print(\"PGvector extension created and loaded.\")\n",
"except Exception as e:\n",
" raise e\n",
"\n",
"try: # Drop the table if one is found\n",
" await vDB.conn.execute('''DROP TABLE IF EXISTS embeddings;''')\n",
" print(\"Table dropped.\")\n",
"except Exception as e:\n",
" raise e\n",
"\n",
"try: # Creating a new table\n",
" await vDB.create_doc_table(table_name='embeddings')\n",
" print(\"Table created.\")\n",
"except Exception as e:\n",
" raise e"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inserting Data"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"for index, text in enumerate(pacer_copypasta): # For each line in the copypasta\n",
" await vDB.insert_doc_table( # Insert the line into the table\n",
" table_name='embeddings', # The name of the table being inserted into\n",
" content=text, # The text to be embedded\n",
" permission='public', # Permission metadata for access control\n",
" title=f'Pacer Copypasta line {index}', # Title metadata\n",
" page_numbers=[1, 2, 3], # Page number metadata\n",
" tags=['fitness', 'pacer', 'copypasta'], # Tag metadata\n",
" )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Similarity search"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [],
"source": [
"k = 3 # Setting number of rows to return when searching"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Searching the table with a perfect match:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Searching data using search string: [beep] A single lap should be completed each time you hear this sound.\n",
"RAW RETURN: [<Record cosine_similarity=1.0 title='Pacer Copypasta line 3' content='[beep] A single lap should be completed each time you hear this sound.'>, <Record cosine_similarity=0.685540756152295 title='Pacer Copypasta line 5' content='The second time you fail to complete a lap before the sound, your test is over.'>, <Record cosine_similarity=0.36591741151356405 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n",
" RETURNED Title: Pacer Copypasta line 3\n",
" RETURNED Content: [beep] A single lap should be completed each time you hear this sound.\n",
"RETURNED Cosine Similarity: 1.00\n"
]
}
],
"source": [
"search_string = '[beep] A single lap should be completed each time you hear this sound.'\n",
"print(f'Semantic Searching data using search string: {search_string}')\n",
"\n",
"try:\n",
" sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n",
"except Exception as e:\n",
" raise e\n",
"\n",
"print(f'RAW RETURN: {sim_search}')\n",
"print()\n",
"print(f' RETURNED Title: {sim_search[0][\"title\"]}')\n",
"print(f' RETURNED Content: {sim_search[0][\"content\"]}')\n",
"print(f'RETURNED Cosine Similarity: {sim_search[0][\"cosine_similarity\"]:.2f}')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Searching the table with a partial match:"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Searching data using search string: Straight\n",
"RAW RETURN: [<Record cosine_similarity=0.28423854269729953 title='Pacer Copypasta line 4' content='[ding] Remember to run in a straight line, and run as long as possible.'>, <Record cosine_similarity=0.10402820694362547 title='Pacer Copypasta line 6' content='The test will begin on the word start. On your mark, get ready, start.'>, <Record cosine_similarity=0.07991296083513344 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n",
" RETURNED Title: Pacer Copypasta line 4\n",
" RETURNED Content: [ding] Remember to run in a straight line, and run as long as possible.\n",
"RETURNED Cosine Similarity: 0.28\n"
]
}
],
"source": [
"search_string = 'Straight'\n",
"print(f'Semantic Searching data using search string: {search_string}')\n",
"\n",
"try:\n",
" sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n",
"except Exception as e:\n",
" raise e\n",
"\n",
"print(f'RAW RETURN: {sim_search}')\n",
"print()\n",
"print(f' RETURNED Title: {sim_search[0][\"title\"]}')\n",
"print(f' RETURNED Content: {sim_search[0][\"content\"]}')\n",
"print(f'RETURNED Cosine Similarity: {sim_search[0][\"cosine_similarity\"]:.2f}')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
35 changes: 34 additions & 1 deletion demo/README.md
@@ -31,7 +31,7 @@ support multiprocessing

## chat_web_selects.py

Simple, command-line "chat my web site" demo, but supporting self-hosted LLM.
Command-line "chat my web site" demo, but supporting self-hosted LLM.

It's definitely a good idea to understand demos/alpaca_multitask_fix_xml.py
before swapping this in.
@@ -65,3 +65,36 @@ though you can easily extend it to e.g. work with multiple docs
dropped in a directory

Note: manual used for above demo downloaded from Hasbro via [Board Game Capital](https://www.boardgamecapital.com/monopoly-rules.htm).

## PGvector_demo.ipynb
A demo of the PGvector vector store functionality of OgbujiPT. It takes an initial sample collection of strings and performs a few example actions with them (sketched in code below the list):

1. sets up a table in the PGvector store
2. vectorizes the sample strings and inserts them into the PGvector store
3. performs a similarity search using an exact copy of one of the sample strings (a perfect match)
4. performs a fuzzy search using a single word that appears in one of the sample strings

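For orientation, here is a minimal sketch of that flow as a standalone script, using the same `PGvectorConnection` calls that appear in the notebook; the connection parameters and sample strings are placeholders for your own setup, not a definitive implementation:

```python
import asyncio

from ogbujipt.embedding_helper import PGvectorConnection
from sentence_transformers import SentenceTransformer

e_model = SentenceTransformer('all-MiniLM-L6-v2')  # embedding model

sample_strings = [  # placeholder demo data
    'The quick brown fox jumps over the lazy dog.',
    'Jackdaws love my big sphinx of quartz.',
]

async def main():
    # 1. Connect and set up the table (connection details are placeholders)
    vDB = await PGvectorConnection.create(
        embedding_model=e_model, db_name='PGv', host='sofola',
        port=5432, user='oori', password='example')
    await vDB.create_doc_table(table_name='embeddings')

    # 2. Vectorize and insert each sample string, with metadata
    for index, text in enumerate(sample_strings):
        await vDB.insert_doc_table(
            table_name='embeddings', content=text, permission='public',
            title=f'Sample line {index}', page_numbers=[1], tags=['sample'])

    # 3 & 4. Similarity search; the query can be an exact copy of a stored
    # string (cosine similarity 1.0) or just a loosely related word
    results = await vDB.search_doc_table(
        table_name='embeddings', query_string='fox', limit=3)
    print(results[0]['title'], results[0]['cosine_similarity'])

asyncio.run(main())
```
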
At Oori, the demo is run using the [official PGvector Docker container](https://hub.docker.com/r/ankane/pgvector) and the following Docker Compose file:
```yaml
version: '3.1'
services:
db:
image: ankane/pgvector
# restart: always
environment:
POSTGRES_USER: oori
POSTGRES_PASSWORD: example
POSTGRES_DB: PGv
ports:
- 5432:5432
volumes:
- ./pg_hba.conf:/var/lib/postgresql/pg_hba.conf
adminer:
image: adminer
restart: always
ports:
- 8080:8080
```
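With this saved as `docker-compose.yml` (alongside the `pg_hba.conf` it mounts), `docker compose up -d` should bring up the database on port 5432 and an Adminer UI on port 8080.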