delete /pg_test, clean up PGvector_demo.ipynb, and update the demo README.md with relevant information
choccccy committed Oct 3, 2023
1 parent 0a8fba8 commit 149a845
Showing 9 changed files with 326 additions and 417 deletions.
290 changes: 290 additions & 0 deletions demo/PGvector_demo.ipynb
@@ -0,0 +1,290 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Database Demo\n",
"\n",
"Sample functionality for creating tables, inserting data, and running similarity searches with OgbujiPT.\n",
"\n",
"Notes:\n",
"- run `pip install jupyter` if Jupyter is not already installed\n",
"\n",
"This notebook will attempt to access a database named `PGv` at `sofola:5432`, using the username `oori` and password `example`. If your setup differs, change the connection details in the first code cell."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initial setup and Imports"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [],
"source": [
"DB_NAME = 'PGv'\n",
"HOST = 'sofola'\n",
"PORT = 5432\n",
"USER = 'oori'\n",
"PASSWORD = 'example'"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [],
"source": [
"from ogbujipt.embedding_helper import PGvectorConnection\n",
"\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"e_model = SentenceTransformer('all-MiniLM-L6-v2') # Load the embedding model\n",
"\n",
"pacer_copypasta = [ # Demo data\n",
" 'The FitnessGram™ Pacer Test is a multistage aerobic capacity test that progressively gets more difficult as it continues.', \n",
" 'The 20 meter pacer test will begin in 30 seconds. Line up at the start.', \n",
" 'The running speed starts slowly, but gets faster each minute after you hear this signal.', \n",
" '[beep] A single lap should be completed each time you hear this sound.', \n",
" '[ding] Remember to run in a straight line, and run as long as possible.', \n",
" 'The second time you fail to complete a lap before the sound, your test is over.', \n",
" 'The test will begin on the word start. On your mark, get ready, start.'\n",
"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connecting to the database"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Connecting to database...\n",
"Connected to database.\n"
]
}
],
"source": [
"try:\n",
" print(\"Connecting to database...\")\n",
" vDB = await PGvectorConnection.create(\n",
" embedding_model=e_model, \n",
" db_name=DB_NAME,\n",
" host=HOST,\n",
" port=int(PORT),\n",
" user=USER,\n",
" password=PASSWORD\n",
" )\n",
" print(\"Connected to database.\")\n",
"except Exception as e:\n",
" raise e"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Tables"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PGvector extension created and loaded.\n",
"Table dropped.\n",
"Table created.\n"
]
}
],
"source": [
"try: # Ensuring that the vector extension is installed\n",
" await vDB.conn.execute('''CREATE EXTENSION IF NOT EXISTS vector;''')\n",
" print(\"PGvector extension created and loaded.\")\n",
"except Exception as e:\n",
" raise e\n",
"\n",
"try: # Drop the table if one is found\n",
" await vDB.conn.execute('''DROP TABLE IF EXISTS embeddings;''')\n",
" print(\"Table dropped.\")\n",
"except Exception as e:\n",
" raise e\n",
"\n",
"try: # Creating a new table\n",
" await vDB.create_doc_table(table_name='embeddings')\n",
" print(\"Table created.\")\n",
"except Exception as e:\n",
" raise e"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inserting Data"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"for index, text in enumerate(pacer_copypasta): # For each line in the copypasta\n",
" await vDB.insert_doc_table( # Insert the line into the table\n",
" table_name='embeddings', # The name of the table being inserted into\n",
" content=text, # The text to be embedded\n",
" permission='public', # Permission metadata for access control\n",
" title=f'Pacer Copypasta line {index}', # Title metadata\n",
" page_numbers=[1, 2, 3], # Page number metadata\n",
" tags=['fitness', 'pacer', 'copypasta'], # Tag metadata\n",
" )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Similarity search"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [],
"source": [
"k = 3 # Setting number of rows to return when searching"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Searching the table with a perfect match:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Searching data using search string: [beep] A single lap should be completed each time you hear this sound.\n",
"RAW RETURN: [<Record cosine_similarity=1.0 title='Pacer Copypasta line 3' content='[beep] A single lap should be completed each time you hear this sound.'>, <Record cosine_similarity=0.685540756152295 title='Pacer Copypasta line 5' content='The second time you fail to complete a lap before the sound, your test is over.'>, <Record cosine_similarity=0.36591741151356405 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n",
" RETURNED Title: Pacer Copypasta line 3\n",
" RETURNED Content: [beep] A single lap should be completed each time you hear this sound.\n",
"RETURNED Cosine Similarity: 1.00\n"
]
}
],
"source": [
"search_string = '[beep] A single lap should be completed each time you hear this sound.'\n",
"print(f'Semantic Searching data using search string: {search_string}')\n",
"\n",
"try:\n",
" sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n",
"except Exception as e:\n",
" raise e\n",
"\n",
"print(f'RAW RETURN: {sim_search}')\n",
"print()\n",
"print(f' RETURNED Title: {sim_search[0][\"title\"]}')\n",
"print(f' RETURNED Content: {sim_search[0][\"content\"]}')\n",
"print(f'RETURNED Cosine Similarity: {sim_search[0][\"cosine_similarity\"]:.2f}')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Searching the table with a partial match:"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Searching data using search string: Straight\n",
"RAW RETURN: [<Record cosine_similarity=0.28423854269729953 title='Pacer Copypasta line 4' content='[ding] Remember to run in a straight line, and run as long as possible.'>, <Record cosine_similarity=0.10402820694362547 title='Pacer Copypasta line 6' content='The test will begin on the word start. On your mark, get ready, start.'>, <Record cosine_similarity=0.07991296083513344 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n",
" RETURNED Title: Pacer Copypasta line 4\n",
" RETURNED Content: [ding] Remember to run in a straight line, and run as long as possible.\n",
"RETURNED Cosine Similarity: 0.28\n"
]
}
],
"source": [
"search_string = 'Straight'\n",
"print(f'Semantic Searching data using search string: {search_string}')\n",
"\n",
"try:\n",
" sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n",
"except Exception as e:\n",
" raise e\n",
"\n",
"print(f'RAW RETURN: {sim_search}')\n",
"print()\n",
"print(f' RETURNED Title: {sim_search[0][\"title\"]}')\n",
"print(f' RETURNED Content: {sim_search[0][\"content\"]}')\n",
"print(f'RETURNED Cosine Similarity: {sim_search[0][\"cosine_similarity\"]:.2f}')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
35 changes: 34 additions & 1 deletion demo/README.md
@@ -31,7 +31,7 @@ support multiprocessing

## chat_web_selects.py

Simple, command-line "chat my web site" demo, but supporting self-hosted LLM.
Command-line "chat my web site" demo, but supporting self-hosted LLM.

It's definitely a good idea to understand demos/alpaca_multitask_fix_xml.py
before swapping this in.
@@ -65,3 +65,36 @@ though you can easily extend it to e.g. work with multiple docs
dropped in a directory

Note: manual used for above demo downloaded from Hasbro via [Board Game Capital](https://www.boardgamecapital.com/monopoly-rules.htm).

## PGvector_demo.ipynb
A demo of the PGvector vector store functionality of OgbujiPT. It takes an initial sample collection of strings and performs a few example actions with them (sketched in code below the list):

1. sets up a table in the PGvector store
2. vectorizes the sample strings and inserts them into the PGvector store
3. performs a similarity search using an exact copy of one of the sample strings (a perfect match)
4. performs a fuzzy search using a single word that appears in one of the sample strings

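For orientation, here is a minimal sketch of that flow as a standalone script, using the same `PGvectorConnection` calls that appear in the notebook; the connection parameters and sample strings are placeholders for your own setup, not a definitive implementation:

```python
import asyncio

from ogbujipt.embedding_helper import PGvectorConnection
from sentence_transformers import SentenceTransformer

e_model = SentenceTransformer('all-MiniLM-L6-v2')  # embedding model

sample_strings = [  # placeholder demo data
    'The quick brown fox jumps over the lazy dog.',
    'Jackdaws love my big sphinx of quartz.',
]

async def main():
    # 1. Connect and set up the table (connection details are placeholders)
    vDB = await PGvectorConnection.create(
        embedding_model=e_model, db_name='PGv', host='sofola',
        port=5432, user='oori', password='example')
    await vDB.create_doc_table(table_name='embeddings')

    # 2. Vectorize and insert each sample string, with metadata
    for index, text in enumerate(sample_strings):
        await vDB.insert_doc_table(
            table_name='embeddings', content=text, permission='public',
            title=f'Sample line {index}', page_numbers=[1], tags=['sample'])

    # 3 & 4. Similarity search; the query can be an exact copy of a stored
    # string (cosine similarity 1.0) or just a loosely related word
    results = await vDB.search_doc_table(
        table_name='embeddings', query_string='fox', limit=3)
    print(results[0]['title'], results[0]['cosine_similarity'])

asyncio.run(main())
```
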
At Oori, the demo is run using the [official PGvector Docker container](https://hub.docker.com/r/ankane/pgvector) and the following Docker Compose file:
```yaml
version: '3.1'
services:
db:
image: ankane/pgvector
# restart: always
environment:
POSTGRES_USER: oori
POSTGRES_PASSWORD: example
POSTGRES_DB: PGv
ports:
- 5432:5432
volumes:
- ./pg_hba.conf:/var/lib/postgresql/pg_hba.conf
adminer:
image: adminer
restart: always
ports:
- 8080:8080
```
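With this saved as `docker-compose.yml` (alongside the `pg_hba.conf` it mounts), `docker compose up -d` should bring up the database on port 5432 and an Adminer UI on port 8080.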