-
Notifications
You must be signed in to change notification settings - Fork 8
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #43 from OoriData/PGVector-testing
Pg vector testing
- Loading branch information
Showing
9 changed files
with
602 additions
and
10 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,293 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Database Demo\n", | ||
"\n", | ||
"Sample functionality for creating tables, inserting data and running similarity search with OgbujiPT.\n", | ||
"\n", | ||
"Notes:\n", | ||
"- `pip install jupyter` if notebook is not running\n", | ||
"\n", | ||
"This notebook will attempt to access a database named `PGv` at `sofola:5432`, using the username `oori` and password `example`. If you have a different setup, you can change the connection string in the first cell." | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Initial setup and Imports" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 118, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"DB_NAME = 'PGv'\n", | ||
"HOST = 'sofola'\n", | ||
"PORT = 5432\n", | ||
"USER = 'oori'\n", | ||
"PASSWORD = 'example'" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 119, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from ogbujipt.embedding_helper import PGvectorConnection\n", | ||
"\n", | ||
"from sentence_transformers import SentenceTransformer\n", | ||
"\n", | ||
"e_model = SentenceTransformer('all-MiniLM-L6-v2') # Load the embedding model\n", | ||
"\n", | ||
"pacer_copypasta = [ # Demo data\n", | ||
" 'The FitnessGram™ Pacer Test is a multistage aerobic capacity test that progressively gets more difficult as it continues.', \n", | ||
" 'The 20 meter pacer test will begin in 30 seconds. Line up at the start.', \n", | ||
" 'The running speed starts slowly, but gets faster each minute after you hear this signal.', \n", | ||
" '[beep] A single lap should be completed each time you hear this sound.', \n", | ||
" '[ding] Remember to run in a straight line, and run as long as possible.', \n", | ||
" 'The second time you fail to complete a lap before the sound, your test is over.', \n", | ||
" 'The test will begin on the word start. On your mark, get ready, start.'\n", | ||
"]" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Connecting to the database" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 120, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Connecting to database...\n", | ||
"Connected to database.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print(\"Connecting to database...\")\n", | ||
"vDB = await PGvectorConnection.create(\n", | ||
" embedding_model=e_model, \n", | ||
" db_name=DB_NAME,\n", | ||
" host=HOST,\n", | ||
" port=int(PORT),\n", | ||
" user=USER,\n", | ||
" password=PASSWORD\n", | ||
" )\n", | ||
"print(\"Connected to database.\")" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Create Tables" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 121, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"PGvector extension created and loaded.\n", | ||
"Table dropped.\n", | ||
"Table created.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Ensuring that the vector extension is installed\n", | ||
"await vDB.conn.execute('''CREATE EXTENSION IF NOT EXISTS vector;''')\n", | ||
"print(\"PGvector extension created and loaded.\")\n", | ||
"\n", | ||
"# Drop the table if one is found\n", | ||
"await vDB.conn.execute('''DROP TABLE IF EXISTS embeddings;''')\n", | ||
"print(\"Table dropped.\")\n", | ||
"\n", | ||
"# Creating a new table\n", | ||
"await vDB.create_doc_table(table_name='embeddings')\n", | ||
"print(\"Table created.\")" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Inserting Data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 122, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for index, text in enumerate(pacer_copypasta): # For each line in the copypasta\n", | ||
" await vDB.insert_doc_table( # Insert the line into the table\n", | ||
" table_name='embeddings', # The name of the table being inserted into\n", | ||
" content=text, # The text to be embedded\n", | ||
" permission='public', # Permission metadata for access control\n", | ||
" title=f'Pacer Copypasta line {index}', # Title metadata\n", | ||
" page_numbers=[1, 2, 3], # Page number metadata\n", | ||
" tags=['fitness', 'pacer', 'copypasta'], # Tag metadata\n", | ||
" )" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Similarity search" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 123, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"k = 3 # Setting number of rows to return when searching\n", | ||
"\n", | ||
"from pprint import pprint\n", | ||
"def print_results(results): # Helper function to print results\n", | ||
" print(f'RAW RETURN:') \n", | ||
" pprint(results) # Print the raw results\n", | ||
" print(f'\\nRETURNED TITLE:\\n\"{results[0][\"title\"]}\"') # Print the title of the first result\n", | ||
" print(f'RETURNED CONTENT:\\n\"{results[0][\"content\"]}\"') # Print the content of the first result\n", | ||
" print(f'RETURNED COSINE SIMILARITY:\\n{results[0][\"cosine_similarity\"]:.2f}') # Print the cosine similarity of the first result" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Searching the table with a perfect match:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 124, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Semantic Searching data using search string:\n", | ||
"\"[beep] A single lap should be completed each time you hear this sound.\"\n", | ||
"\n", | ||
"RAW RETURN:\n", | ||
"[<Record cosine_similarity=1.0 title='Pacer Copypasta line 3' content='[beep] A single lap should be completed each time you hear this sound.'>,\n", | ||
" <Record cosine_similarity=0.685540756152295 title='Pacer Copypasta line 5' content='The second time you fail to complete a lap before the sound, your test is over.'>,\n", | ||
" <Record cosine_similarity=0.36591741151356405 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n", | ||
"\n", | ||
"RETURNED TITLE:\n", | ||
"\"Pacer Copypasta line 3\"\n", | ||
"RETURNED CONTENT:\n", | ||
"\"[beep] A single lap should be completed each time you hear this sound.\"\n", | ||
"RETURNED COSINE SIMILARITY:\n", | ||
"1.00\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"search_string = '[beep] A single lap should be completed each time you hear this sound.'\n", | ||
"print(f'Semantic Searching data using search string:\\n\"{search_string}\"\\n')\n", | ||
"\n", | ||
"sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n", | ||
"\n", | ||
"print_results(sim_search)" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Searching the table with a partial match:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 125, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Semantic Searching data using search string:\n", | ||
"\"straight\"\n", | ||
"\n", | ||
"RAW RETURN:\n", | ||
"[<Record cosine_similarity=0.28423854269729953 title='Pacer Copypasta line 4' content='[ding] Remember to run in a straight line, and run as long as possible.'>,\n", | ||
" <Record cosine_similarity=0.10402820694362547 title='Pacer Copypasta line 6' content='The test will begin on the word start. On your mark, get ready, start.'>,\n", | ||
" <Record cosine_similarity=0.07991296083513344 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]\n", | ||
"\n", | ||
"RETURNED TITLE:\n", | ||
"\"Pacer Copypasta line 4\"\n", | ||
"RETURNED CONTENT:\n", | ||
"\"[ding] Remember to run in a straight line, and run as long as possible.\"\n", | ||
"RETURNED COSINE SIMILARITY:\n", | ||
"0.28\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"search_string = 'straight'\n", | ||
"print(f'Semantic Searching data using search string:\\n\"{search_string}\"\\n')\n", | ||
"\n", | ||
"sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)\n", | ||
"\n", | ||
"print_results(sim_search)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "env", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.4" | ||
}, | ||
"orig_nbformat": 4 | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,4 +3,4 @@ | |
# SPDX-License-Identifier: Apache-2.0 | ||
# ogbujipt.about | ||
|
||
__version__ = "0.4.1" | ||
__version__ = "0.5.0" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.