AlgoETS/SimilityVectorEmbedding

Simility Vector Embedding

Overview

This project combines natural language processing with a vector database to find similar movies based on their descriptions and metadata. It uses PostgreSQL with the pgvector extension and NLP embedding models to run similarity searches over large datasets.

Usage

Use the Jupyter notebook to work through the project: generating embeddings, inserting data into the database, and querying for similar movies.

Database Setup and Data Handling

Database

Working with Embeddings

Embeddings are generated with models such as BERT or Sentence Transformers and stored in pgvector, which uses them for fast cosine similarity searches.

Finding Similar Movies

Similar movies are found with SQL queries, issued from Python functions, that rank movies by how close their embeddings are to the embedding of a given query movie.

Understanding Vector Querying and Cosine Similarity

Vector Querying with pgvector

Pgvector is a PostgreSQL extension that facilitates efficient storage and querying of high-dimensional vectors. In this project, we leverage pgvector to handle vector data derived from movie embeddings. These embeddings represent the semantic content of movie descriptions and metadata, allowing for advanced querying capabilities like nearest neighbor searches.
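As a minimal sketch of what storing vectors with pgvector can look like, the snippet below builds the SQL for a `movies` table and formats a Python list in pgvector's text form (`[x1,x2,...]`). The column names, the dimension of 384, and the helpers `to_pgvector_literal` and `insert_params` are assumptions made for illustration; they are not taken from this repository's notebook.

```python
# Sketch only: the table layout and the dimension are assumptions for illustration.
EMBEDDING_DIM = 384  # assumed; must match the output size of the embedding model

# Enables the extension that provides the "vector" column type.
CREATE_EXTENSION_SQL = "CREATE EXTENSION IF NOT EXISTS vector;"

CREATE_TABLE_SQL = f"""
CREATE TABLE IF NOT EXISTS movies (
    id        TEXT PRIMARY KEY,
    title     TEXT NOT NULL,
    embedding vector({EMBEDDING_DIM})
);
"""

# %s placeholders follow the DB-API "format" paramstyle used by psycopg.
INSERT_SQL = "INSERT INTO movies (id, title, embedding) VALUES (%s, %s, %s::vector);"


def to_pgvector_literal(values):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def insert_params(movie_id, title, embedding):
    """Build the parameter tuple for INSERT_SQL, checking the dimension first."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    return (movie_id, title, to_pgvector_literal(embedding))
```

With a database cursor `cur` (assumed here to come from psycopg), a row could then be inserted with `cur.execute(INSERT_SQL, insert_params(movie_id, title, embedding))`.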

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. This metric is widely used in natural language processing to assess how similar two documents (or in this case, movie descriptions) are irrespective of their size. Mathematically, it's defined as:

$$
\text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}
$$

where $\mathbf{A}$ and $\mathbf{B}$ are two vectors, and $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ are their Euclidean norms.
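To make the definition concrete, here is a small pure-Python sketch of the same formula. The function name `cosine_similarity` is chosen for illustration; in the project itself the computation is done by pgvector inside PostgreSQL.

```python
import math


def cosine_similarity(a, b):
    """Return (a . b) / (|a| * |b|) for two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [2, 0]))   # 1.0: same direction, different length
print(cosine_similarity([1, 0], [0, 3]))   # 0.0: orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0: opposite direction
```

The first call shows why the metric ignores magnitude: scaling a vector changes its norm and the dot product by the same factor, so the ratio is unchanged.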

Implementing Cosine Similarity in PostgreSQL with pgvector

pgvector supports several distance metrics, including cosine distance (the <=> operator in SQL). Cosine distance is 1 minus cosine similarity, so smaller values mean more similar vectors. With this operator the distance is computed directly inside the SQL query, which keeps similarity searches efficient. Here's how you can find similar movies based on cosine distance:

SELECT title, embedding
FROM movies
ORDER BY embedding <=> (SELECT embedding FROM movies WHERE title = %s) ASC
LIMIT 10;

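This SQL command returns the ten movies whose embeddings have the smallest cosine distance to the given movie's embedding, that is, the highest cosine similarity. The %s placeholder is filled in with the movie title by the database driver. Because the query movie is itself a row in movies, it is returned too, at a distance of 0 (or very close to it).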
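A Python wrapper around a query like the one above might look like the following sketch. It is a variation, not the notebook's code: it excludes the query movie from the results and converts cosine distance back to similarity. The function name `find_similar_movies` is hypothetical, and `cursor` is assumed to be any DB-API cursor (for example from psycopg) connected to a database with the `movies` table.

```python
# Sketch only: a variant of the query above that skips the query movie itself.
SIMILAR_MOVIES_SQL = """
SELECT m.title, m.embedding <=> q.embedding AS cosine_distance
FROM movies AS m,
     (SELECT embedding FROM movies WHERE title = %s LIMIT 1) AS q
WHERE m.title <> %s
ORDER BY cosine_distance ASC
LIMIT %s;
"""


def find_similar_movies(cursor, title, limit=10):
    """Return (title, cosine_similarity) pairs, most similar first."""
    # The title is passed twice: once to look up the query embedding,
    # once to exclude that movie from the results.
    cursor.execute(SIMILAR_MOVIES_SQL, (title, title, limit))
    # pgvector's <=> returns cosine distance, so similarity = 1 - distance.
    return [(row[0], 1.0 - float(row[1])) for row in cursor.fetchall()]
```

Passing the title as a parameter, rather than formatting it into the SQL string, lets the driver handle quoting and avoids SQL injection. If no movie has the given title, the subquery is empty and the function returns an empty list.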

Other Distance Functions Supported by pgvector

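pgvector also supports other distance metrics: L2 (Euclidean), L1 (Manhattan), and inner (dot) product. Choose among them based on the needs of your query and the characteristics of your data:

  • L2 Distance (Euclidean), operator <->: the straight-line distance between two vectors; suitable for measuring absolute differences, and sensitive to vector magnitude.
  • L1 Distance (Manhattan), operator <+> (pgvector 0.7.0 and later): the sum of absolute coordinate differences; sometimes preferred in high-dimensional data spaces.
  • Dot Product, operator <#>: returns the negative inner product, so that ascending order puts the largest dot products first.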
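The following pure-Python sketch shows what each of these metrics computes, using two small example vectors. The function names are illustrative. The operator mapping in the comments follows the pgvector documentation, where `<->` is L2 distance, `<+>` is L1 distance, and `<#>` is the negative inner product.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences (<->)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences (<+>)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def inner_product(a, b):
    """Dot product; pgvector's <#> returns the negative of this value."""
    return sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

print(l2_distance(a, b))     # 5.0  (differences are -3, -4, 0)
print(l1_distance(a, b))     # 7.0  (3 + 4 + 0)
print(-inner_product(a, b))  # -25.0, the value <#> would report
```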

JSON

Movie Entry

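Each movie in movies.json is a JSON object with French field names (titre = title, annee = year, pays = country, langue = language, duree = runtime, resume = summary, realisateur = director, scenariste = writer, acteur = actor, personnage = character). Here is an example entry: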

{
  "titre": "George of the Jungle",
  "annee": "1997",
  "pays": "USA",
  "langue": "English",
  "duree": "92",
  "resume": "George grows up in the jungle raised by apes. Based on the Cartoon series.",
  "genre": ["Action", "Adventure", "Comedy", "Family", "Romance"],
  "realisateur": {"_id": "918873", "__text": "Sam Weisman"},
  "scenariste": ["Jay Ward", "Dana Olsen"],
  "role": [
    {"acteur": {"_id": "409", "__text": "Brendan Fraser"}, "personnage": "George of the Jungle"},
    {"acteur": {"_id": "5182", "__text": "Leslie Mann"}, "personnage": "Ursula Stanhope"}
  ],
  "poster": "https://m.media-amazon.com/images/M/MV5BNTdiM2VjYjYtZjEwNS00ZWU5LWFkZGYtZGYxMDcwMzY1OTEzL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMTczNjQwOTY@._V1_SY150_CR0,0,101,150_.jpg",
  "_id": "119190"
}
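Before an entry like this can be embedded, its fields have to be turned into text. The sketch below shows one possible way to do that; the helper `movie_to_text`, the choice of fields, and the output layout are assumptions for illustration, not the notebook's actual preprocessing. The sample entry is a shortened copy of the example above.

```python
import json

# Shortened copy of the example entry, used only to demonstrate the helper.
SAMPLE_ENTRY = """
{
  "titre": "George of the Jungle",
  "annee": "1997",
  "resume": "George grows up in the jungle raised by apes. Based on the Cartoon series.",
  "genre": ["Action", "Adventure", "Comedy", "Family", "Romance"],
  "realisateur": {"_id": "918873", "__text": "Sam Weisman"},
  "role": [
    {"acteur": {"_id": "409", "__text": "Brendan Fraser"}, "personnage": "George of the Jungle"},
    {"acteur": {"_id": "5182", "__text": "Leslie Mann"}, "personnage": "Ursula Stanhope"}
  ],
  "_id": "119190"
}
"""


def movie_to_text(movie):
    """Join the descriptive fields of one movie entry into a single string."""
    director = movie.get("realisateur", {}).get("__text", "")
    actors = [role["acteur"]["__text"] for role in movie.get("role", [])]
    parts = [
        f"Title: {movie.get('titre', '')}",
        f"Year: {movie.get('annee', '')}",
        f"Genres: {', '.join(movie.get('genre', []))}",
        f"Director: {director}",
        f"Cast: {', '.join(actors)}",
        f"Summary: {movie.get('resume', '')}",
    ]
    return "\n".join(parts)


movie = json.loads(SAMPLE_ENTRY)
print(movie_to_text(movie))
```

The resulting string is what would be passed to the embedding model, and the `_id` field is a natural key for the corresponding database row.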

IMDb dataset

https://developer.imdb.com/non-commercial-datasets/
