This project combines natural language processing with a vector database to find similar movies based on their descriptions and metadata. It uses PostgreSQL with the pgvector extension and NLP embedding models to run similarity searches over a large movie dataset.
Use the Jupyter notebook to run the project. It covers generating embeddings, inserting data into the database, and querying for similar movies.
Embeddings are generated with models such as BERT or Sentence Transformers and stored in pgvector, which runs fast cosine similarity searches over them.
SQL queries and Python functions then find the movies most similar to a given query movie by comparing their embeddings.
pgvector is a PostgreSQL extension for storing and querying high-dimensional vectors. In this project, pgvector holds the embeddings computed for each movie. These embeddings represent the semantic content of the movie descriptions and metadata, which makes queries such as nearest neighbor searches possible.
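As a rough sketch of what storing embeddings in pgvector could look like, the snippet below prepares the SQL statements and the parameters for one insert. The `movies` table and its `title` and `embedding` columns follow the query shown later in this README; the `id` column, the dimension of 384, and the helper names are assumptions made for illustration and depend on the embedding model and schema actually used.

```python
# Assumption: the embedding model outputs 384-dimensional vectors.
EMBEDDING_DIM = 384

# Hypothetical schema; run each statement once with a database cursor.
SETUP_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE TABLE IF NOT EXISTS movies ("
    "id text PRIMARY KEY, title text NOT NULL, "
    f"embedding vector({EMBEDDING_DIM}))",
]

# The text parameter is cast to pgvector's vector type on the server side.
INSERT_SQL = (
    "INSERT INTO movies (id, title, embedding) VALUES (%s, %s, %s::vector)"
)


def to_pgvector_literal(embedding):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def insert_params(movie_id, title, embedding):
    """Build the parameter tuple for INSERT_SQL, checking the dimension first."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}"
        )
    return (movie_id, title, to_pgvector_literal(embedding))
```

With a driver that uses `%s` placeholders, such as psycopg2, an insert would then be `cursor.execute(INSERT_SQL, insert_params(movie_id, title, embedding))`.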
Cosine similarity measures the cosine of the angle between two vectors. This metric is widely used in natural language processing to assess how similar two documents (or in this case, movie descriptions) are irrespective of their size. Mathematically, it's defined as:
$$ \text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} $$
where $\mathbf{A}$ and $\mathbf{B}$ are two vectors, and $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ are their norms.
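The formula can be checked by hand with a few lines of plain Python. This is only an illustrative sketch (the function names are not taken from the notebook); it also shows cosine distance, which is one minus the similarity and is the quantity pgvector's `<=>` operator returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """One minus the cosine similarity: 0 for same direction, 2 for opposite."""
    return 1.0 - cosine_similarity(a, b)
```

Two vectors pointing in the same direction, such as `[1, 2, 3]` and `[2, 4, 6]`, have a similarity of about 1 whatever their lengths, which is why the metric ignores the size of the documents being compared.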
pgvector supports several distance operators, including cosine distance (written `<=>` in SQL), which equals one minus the cosine similarity. With this operator, cosine distances are computed directly within SQL queries, which is critical for efficient similarity searches. Here’s how you can find similar movies based on cosine distance:
SELECT title, embedding
FROM movies
ORDER BY embedding <=> (SELECT embedding FROM movies WHERE title = %s) ASC
LIMIT 10;
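This SQL command returns the ten rows whose embeddings are closest, by cosine distance, to the embedding of the given movie. Because the query movie is itself in the table, it typically comes back first with a distance of 0. The `%s` placeholder is filled in with the movie title by the database driver, and the subquery must match exactly one row, so titles need to be unique.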
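A minimal sketch of how the query above might be wrapped in a Python function. The name `find_similar_movies`, the `distance` alias, and the `WHERE title <> %s` filter that leaves out the query movie are assumptions for illustration, not necessarily the notebook's code. `cursor` can be any DB-API cursor that uses `%s` placeholders, for example one from psycopg2.

```python
# Variation on the README query: returns the distance instead of the raw
# embedding and skips the query movie itself.
SIMILAR_MOVIES_SQL = """
    SELECT title,
           embedding <=> (SELECT embedding FROM movies WHERE title = %s LIMIT 1)
               AS distance
    FROM movies
    WHERE title <> %s
    ORDER BY distance ASC
    LIMIT %s
"""


def find_similar_movies(cursor, title, limit=10):
    """Return (title, cosine_distance) pairs, closest first.

    If the title is not in the table, the subquery yields NULL, every
    distance is NULL, and the result is an empty list.
    """
    cursor.execute(SIMILAR_MOVIES_SQL, (title, title, limit))
    return [
        (row_title, float(distance))
        for row_title, distance in cursor.fetchall()
        if distance is not None
    ]
```

Passing the title as a query parameter, rather than formatting it into the SQL string, lets the driver handle quoting and avoids SQL injection.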
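pgvector also supports other distance metrics: L2 (Euclidean), L1 (Manhattan), and dot product. Which one to use depends on the needs of your query and the characteristics of your data:
- L2 Distance (Euclidean), operator `<->`: measures the straight-line distance between vectors, so it is sensitive to their magnitudes as well as their directions.
- L1 Distance (Manhattan), operator `<+>` (pgvector 0.7.0 and later): sums the absolute differences per dimension; sometimes preferred for high-dimensional data.
- Dot Product, operator `<#>`: returns the negative inner product, so that smaller values still mean more similar vectors.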
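To make the difference between these metrics concrete, here is a small plain-Python sketch of each one. The function names are illustrative only. The operator mapping reflects pgvector's SQL operators; note that `<#>` returns the negative inner product and that `<+>` is only available in newer pgvector releases (0.7.0 and later).

```python
import math

# pgvector SQL operator for each metric implemented below.
PGVECTOR_OPERATORS = {
    "l2": "<->",
    "l1": "<+>",
    "negative_inner_product": "<#>",
}


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1_distance(a, b):
    """Manhattan distance: sum of the absolute differences per dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def negative_inner_product(a, b):
    """Negated dot product, so that smaller values mean more similar."""
    return -sum(x * y for x, y in zip(a, b))
```

For the vectors `[0, 0]` and `[3, 4]`, the L2 distance is 5 while the L1 distance is 7, which shows that the two metrics can rank neighbors differently.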
Here is an example of how a movie is represented in `movies.json`:
{
"titre": "George of the Jungle",
"annee": "1997",
"pays": "USA",
"langue": "English",
"duree": "92",
"resume": "George grows up in the jungle raised by apes. Based on the Cartoon series.",
"genre": ["Action", "Adventure", "Comedy", "Family", "Romance"],
"realisateur": {"_id": "918873", "__text": "Sam Weisman"},
"scenariste": ["Jay Ward", "Dana Olsen"],
"role": [
{"acteur": {"_id": "409", "__text": "Brendan Fraser"}, "personnage": "George of the Jungle"},
{"acteur": {"_id": "5182", "__text": "Leslie Mann"}, "personnage": "Ursula Stanhope"}
],
"poster": "https://m.media-amazon.com/images/M/MV5BNTdiM2VjYjYtZjEwNS00ZWU5LWFkZGYtZGYxMDcwMzY1OTEzL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMTczNjQwOTY@._V1_SY150_CR0,0,101,150_.jpg",
"_id": "119190"
}
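Before a record like this can be embedded, its description and metadata have to be turned into a single piece of text. The sketch below shows one possible way to do that. Which fields to include, the English labels, and the function names are assumptions for illustration; the field keys (`titre`, `resume`, `genre`, and so on) are taken from the record above.

```python
def _as_list(value):
    """Wrap a single value in a list; treat None as an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def movie_to_text(movie):
    """Flatten one movies.json record into text for an embedding model."""
    director = (movie.get("realisateur") or {}).get("__text", "")
    actors = [
        (role.get("acteur") or {}).get("__text", "")
        for role in _as_list(movie.get("role"))
        if isinstance(role, dict)
    ]
    fields = [
        ("Title", movie.get("titre", "")),
        ("Year", movie.get("annee", "")),
        ("Genres", ", ".join(_as_list(movie.get("genre")))),
        ("Director", director),
        ("Cast", ", ".join(name for name in actors if name)),
        ("Summary", movie.get("resume", "")),
    ]
    # Skip empty fields so missing metadata does not add noise to the text.
    return "\n".join(f"{label}: {value}" for label, value in fields if value)
```

The resulting string can then be passed to whichever embedding model the notebook uses. Putting the summary and the metadata in the same text is what lets the embedding reflect both.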
https://developer.imdb.com/non-commercial-datasets/
- https://www.youtube.com/watch?v=QdDoFfkVkcw
- https://www.machinelearningplus.com/nlp/cosine-similarity/
- https://www.youtube.com/watch?v=Yhtjd7yGGGA
- https://sbert.net
- https://huggingface.co/spaces/mteb/leaderboard
- https://github.com/rabbit-hole-syndrome/open-source-embeddings
- https://sbert.net/docs/pretrained_models.html
- https://cookbook.openai.com/examples/visualizing_embeddings_in_2d
- https://platform.openai.com/docs/guides/embeddings
- https://colab.research.google.com/github/qdrant/examples/blob/master/qdrant_101_audio_data/03_qdrant_101_audio.ipynb
- https://qdrant.tech/documentation/examples/recommendation-system-ovhcloud/
- https://colab.research.google.com/github/qdrant/examples/blob/master/qdrant_101_text_data/qdrant_and_text_data.ipynb
- https://www.youtube.com/watch?v=Vkazja71BkA
- https://www.youtube.com/watch?v=p1LtVo_1Q7A