Merge pull request #43 from OoriData/PGVector-testing
PGvector testing
kaischuygon authored Oct 5, 2023
2 parents 6eb8046 + 223f8a9 commit 5a08fe8
Showing 9 changed files with 602 additions and 10 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/main.yml
@@ -20,6 +20,9 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install ruff pytest pytest-mock httpx
pip install asyncpg # this definitely shouldn't be here, but i wanna get the tests running -Osi
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
# Install OgbujiPT itself, as checked out
pip install -U $GITHUB_WORKSPACE
1 change: 1 addition & 0 deletions .gitignore
@@ -167,3 +167,4 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
demo/docker_compose.yaml
293 changes: 293 additions & 0 deletions demo/PGvector_demo.ipynb
@@ -0,0 +1,293 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Database Demo\n",
"\n",
"Sample functionality for creating tables, inserting data and running similarity search with OgbujiPT.\n",
"\n",
"Notes:\n",
"- `pip install jupyter` if notebook is not running\n",
"\n",
"This notebook will attempt to access a database named `PGv` at `sofola:5432`, using the username `oori` and password `example`. If you have a different setup, you can change the connection string in the first cell."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initial setup and Imports"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [],
"source": [
"DB_NAME = 'PGv'\n",
"HOST = 'sofola'\n",
"PORT = 5432\n",
"USER = 'oori'\n",
"PASSWORD = 'example'"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [],
"source": [
"from ogbujipt.embedding_helper import PGvectorConnection\n",
"\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"e_model = SentenceTransformer('all-MiniLM-L6-v2') # Load the embedding model\n",
"\n",
"pacer_copypasta = [ # Demo data\n",
" 'The FitnessGram™ Pacer Test is a multistage aerobic capacity test that progressively gets more difficult as it continues.', \n",
" 'The 20 meter pacer test will begin in 30 seconds. Line up at the start.', \n",
" 'The running speed starts slowly, but gets faster each minute after you hear this signal.', \n",
" '[beep] A single lap should be completed each time you hear this sound.', \n",
" '[ding] Remember to run in a straight line, and run as long as possible.', \n",
" 'The second time you fail to complete a lap before the sound, your test is over.', \n",
" 'The test will begin on the word start. On your mark, get ready, start.'\n",
"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connecting to the database"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Connecting to database...\n",
"Connected to database.\n"
]
}
],
"source": [
"print(\"Connecting to database...\")\n",
"vDB = await PGvectorConnection.create(\n",
" embedding_model=e_model, \n",
" db_name=DB_NAME,\n",
" host=HOST,\n",
" port=int(PORT),\n",
" user=USER,\n",
" password=PASSWORD\n",
" )\n",
"print(\"Connected to database.\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Tables"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PGvector extension created and loaded.\n",
"Table dropped.\n",
"Table created.\n"
]
}
],
"source": [
"# Ensuring that the vector extension is installed\n",
"await vDB.conn.execute('''CREATE EXTENSION IF NOT EXISTS vector;''')\n",
"print(\"PGvector extension created and loaded.\")\n",
"\n",
"# Drop the table if one is found\n",
"await vDB.conn.execute('''DROP TABLE IF EXISTS embeddings;''')\n",
"print(\"Table dropped.\")\n",
"\n",
"# Creating a new table\n",
"await vDB.create_doc_table(table_name='embeddings')\n",
"print(\"Table created.\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inserting Data"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [],
"source": [
"for index, text in enumerate(pacer_copypasta): # For each line in the copypasta\n",
" await vDB.insert_doc_table( # Insert the line into the table\n",
" table_name='embeddings', # The name of the table being inserted into\n",
" content=text, # The text to be embedded\n",
" permission='public', # Permission metadata for access control\n",
" title=f'Pacer Copypasta line {index}', # Title metadata\n",
" page_numbers=[1, 2, 3], # Page number metadata\n",
" tags=['fitness', 'pacer', 'copypasta'], # Tag metadata\n",
" )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Similarity search"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"k = 3 # Setting number of rows to return when searching\n",
"\n",
"from pprint import pprint\n",
"def print_results(results): # Helper function to print results\n",
" print(f'RAW RETURN:') \n",
" pprint(results) # Print the raw results\n",
" print(f'\\nRETURNED TITLE:\\n\"{results[0][\"title\"]}\"') # Print the title of the first result\n",
" print(f'RETURNED CONTENT:\\n\"{results[0][\"content\"]}\"') # Print the content of the first result\n",
" print(f'RETURNED COSINE SIMILARITY:\\n{results[0][\"cosine_similarity\"]:.2f}') # Print the cosine similarity of the first result"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Searching the table with a perfect match:"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Searching data using search string:\n",
"\"[beep] A single lap should be completed each time you hear this sound.\"\n",
"\n",
"RAW RETURN:\n",
"[<Record cosine_similarity=1.0 title='Pacer Copypasta line 3' content='[beep] A single lap should be completed each time you hear this sound.'>,\n",
" <Record cosine_similarity=0.685540756152295 title='Pacer Copypasta line 5' content='The second time you fail to complete a lap before the sound, your test is over.'>,\n",
" <Record cosine_similarity=0.36591741151356405 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n",
"\n",
"RETURNED TITLE:\n",
"\"Pacer Copypasta line 3\"\n",
"RETURNED CONTENT:\n",
"\"[beep] A single lap should be completed each time you hear this sound.\"\n",
"RETURNED COSINE SIMILARITY:\n",
"1.00\n"
]
}
],
"source": [
"search_string = '[beep] A single lap should be completed each time you hear this sound.'\n",
"print(f'Semantic Searching data using search string:\\n\"{search_string}\"\\n')\n",
"\n",
"sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n",
"\n",
"print_results(sim_search)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Searching the table with a partial match:"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Searching data using search string:\n",
"\"straight\"\n",
"\n",
"RAW RETURN:\n",
"[<Record cosine_similarity=0.28423854269729953 title='Pacer Copypasta line 4' content='[ding] Remember to run in a straight line, and run as long as possible.'>,\n",
" <Record cosine_similarity=0.10402820694362547 title='Pacer Copypasta line 6' content='The test will begin on the word start. On your mark, get ready, start.'>,\n",
" <Record cosine_similarity=0.07991296083513344 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n",
"\n",
"RETURNED TITLE:\n",
"\"Pacer Copypasta line 4\"\n",
"RETURNED CONTENT:\n",
"\"[ding] Remember to run in a straight line, and run as long as possible.\"\n",
"RETURNED COSINE SIMILARITY:\n",
"0.28\n"
]
}
],
"source": [
"search_string = 'straight'\n",
"print(f'Semantic Searching data using search string:\\n\"{search_string}\"\\n')\n",
"\n",
"sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n",
"\n",
"print_results(sim_search)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
35 changes: 34 additions & 1 deletion demo/README.md
@@ -31,7 +31,7 @@ support multiprocessing

## chat_web_selects.py

Simple, command-line "chat my web site" demo, but supporting self-hosted LLM.
Command-line "chat my web site" demo, but supporting self-hosted LLM.

Definitely a good idea for you to understand demos/alpaca_multitask_fix_xml.py
before swapping this in.
@@ -65,3 +65,36 @@ though you can easily extend it to e.g. work with multiple docs
dropped in a directory

Note: manual used for above demo downloaded from Hasbro via [Board Game Capital](https://www.boardgamecapital.com/monopoly-rules.htm).

## PGvector_demo.ipynb
A demo of OgbujiPT's PGvector vector store functionality. It takes a sample collection of strings and performs a few example actions with them (see the sketch after this list):

1. sets up a table in the PGvector store
2. vectorizes the strings and inserts them into the PGvector store
3. performs a similarity search with a perfect match for one of the sample strings
4. performs a fuzzy search for a single word that appears in one of the sample strings
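
A minimal sketch of those four steps, assuming the `PGvectorConnection` API used in the notebook above (`create`, `create_doc_table`, `insert_doc_table`, `search_doc_table`) and the same demo connection parameters (`PGv` at `sofola:5432`):

```python
# Sketch only; API names and connection details are taken from the demo notebook
import asyncio
from sentence_transformers import SentenceTransformer
from ogbujipt.embedding_helper import PGvectorConnection

async def main():
    e_model = SentenceTransformer('all-MiniLM-L6-v2')  # embedding model
    vDB = await PGvectorConnection.create(
        embedding_model=e_model, db_name='PGv',
        host='sofola', port=5432, user='oori', password='example')
    # 1. set up a table
    await vDB.create_doc_table(table_name='embeddings')
    # 2. vectorize & insert a string
    await vDB.insert_doc_table(
        table_name='embeddings',
        content='The 20 meter pacer test will begin in 30 seconds. Line up at the start.',
        permission='public', title='Pacer Copypasta line 1',
        page_numbers=[1], tags=['pacer'])
    # 3. & 4. similarity search (perfect match or fuzzy, depending on the query)
    results = await vDB.search_doc_table(
        table_name='embeddings', query_string='start', limit=3)
    print(results[0]['content'], results[0]['cosine_similarity'])

asyncio.run(main())
```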

At Oori, the demo is run using the [official PGvector Docker image](https://hub.docker.com/r/ankane/pgvector) and the following Docker Compose file:
```yaml
version: '3.1'
services:
  db:
    image: ankane/pgvector
    # restart: always
    environment:
      POSTGRES_USER: oori
      POSTGRES_PASSWORD: example
      POSTGRES_DB: PGv
    ports:
      - 5432:5432
    volumes:
      - ./pg_hba.conf:/var/lib/postgresql/pg_hba.conf
  adminer:
    image: adminer
    restart: always
    ports:
      - 8080:8080
```
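
Assuming the file is saved as `demo/docker_compose.yaml` (the path added to `.gitignore` above), something like `docker compose -f demo/docker_compose.yaml up -d` should bring up Postgres with PGvector on port 5432 and the Adminer UI on port 8080.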
2 changes: 1 addition & 1 deletion pylib/__about__.py
@@ -3,4 +3,4 @@
# SPDX-License-Identifier: Apache-2.0
# ogbujipt.about

__version__ = "0.4.1"
__version__ = "0.5.0"
7 changes: 7 additions & 0 deletions pylib/__init__.py
@@ -13,3 +13,10 @@ def oapi_first_choice_text(response):
Given an OpenAI-compatible API response, return the first choice response text
'''
return response['choices'][0]['text']


def oapi_first_choice_content(response):
'''
Given an OpenAI-compatible API response, return the first choice response content
'''
return response['choices'][0]['message']['content']
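
For illustration, a minimal usage sketch of the new helper; the `response` dict here is a made-up example in the OpenAI-compatible chat completion shape the helper expects:

```python
from ogbujipt import oapi_first_choice_content

# Made-up minimal response in the OpenAI-compatible chat completion shape
response = {
    'choices': [
        {'message': {'role': 'assistant', 'content': 'Hello there!'}}
    ]
}

print(oapi_first_choice_content(response))  # Hello there!
```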