Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store vector embeddings for talks, and display related talks #19

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

crohr
Copy link

@crohr crohr commented Jun 20, 2023

This PR adds a "Talks you might be interested in" below each talk. This uses the newly released vector support in meilisearch to find talks that share some similarity with the current one.

Support for vector search is currently non-existant in meilisearch-rails and meilisearch-ruby so we cannot easily (?) fetch the similarity scores to keep only the most interesting ones, but the results are quite good as it is.

To make it work, you need to specify an OpenAI API key in .env, and launch a reindex as follows:

Talk.reembed!

Note that vector support is still a bit fiddly, so you may have to start from a fresh meilisearch database if it doesn't work for you on the first try (tip: inspect GET localhost:7700/tasks to see if anything is going wrong when indexing).

Demo below:

rubyvideo-meilisearch-vectors.mp4

Closes #18.

@crohr crohr changed the title Store vector embeddings for talks Store vector embeddings for talks, and display related talks Jun 20, 2023
@crohr crohr marked this pull request as ready for review June 20, 2023 18:58
app/models/talk.rb Outdated Show resolved Hide resolved
@adrienpoly
Copy link
Owner

Thanks @crohr for this prototype it looks very promising.

Embeddings

To move forward if we want to have this in production I guess we could as a first step put in place the logic to compute the embeddings and store them in the talk model. With some kind of logic to recompute them every time title / description changes.
This should prevent recomputing the embeddings for every reindex

Then Meillisearch would just index that new column from our model.

Can this work or I am missing something?

UI

I supposed you put the list below for testing. Ultimately this should feed the cards that are on the right and replace this random suggestion. But while developing let's keep it like that. At some point, we might want to deploy this feature to prod behind a feature flag so that will make it easier to test

Devops

To release it to prod I will need to update the Meillisearch engine. We probably need to wait a bit as it seems pretty new and they highly recommend waiting. This being said whatever result we get out of it will always be better than a random suggestion.

@crohr
Copy link
Author

crohr commented Jun 21, 2023

@Kerollmops I've tried using will_save_change_to__vectors? (see https://github.com/meilisearch/meilisearch-rails#custom-attribute-definition) to make meilisearch avoid reindexing vectors in case title or description hasn't changed, but it looks like _vectors is not seen as an attribute when querying the index settings, and therefore the code in meilisearch-rails doesn't go through the will_save method.

  # this is never called
  def will_save_change_to__vectors?
    will_save_change_to_title? || will_save_change_to_description?
  end

Any reason why this line calls settings.get_attributes instead of get_attributes (get_attributes does have the _vectors key)?

@crohr
Copy link
Author

crohr commented Jun 21, 2023

Embeddings

To move forward if we want to have this in production I guess we could as a first step put in place the logic to compute the embeddings and store them in the talk model. With some kind of logic to recompute them every time title / description changes. This should prevent recomputing the embeddings for every reindex

Then Meillisearch would just index that new column from our model.

Can this work or I am missing something?

I tried to selectively tell meilisearch to ignore _vectors when title or description hasn't changed, but it doesn't seem to work. So yes storing there embedding in sqlite and recomputing with classic AR callbacks if title or description changes would work.

UI

I supposed you put the list below for testing. Ultimately this should feed the cards that are on the right and replace this random suggestion. But while developing let's keep it like that. At some point, we might want to deploy this feature to prod behind a feature flag so that will make it easier to test

Yes, kept it simple for now, you're the UI guy :) Also, I didn't really notice that videos on the right were supposed to be "more like the current one". I think it could make sense to keep both exploratory videos on the right and related videos below the one you've just visioned, but maybe I'm wrong.

Devops

To release it to prod I will need to update the Meillisearch engine. We probably need to wait a bit as it seems pretty new and they highly recommend waiting. This being said whatever result we get out of it will always be better than a random suggestion.

Not sure how you deploy the meilisearch container / process on the server, but yes since there is nothing in meilisearch that can't be reindexed from sqlite, I think it's ok to deploy alpha/beta software in that case.

@Kerollmops
Copy link

Hey @crohr 👋

Any reason why this line calls settings.get_attributes instead of get_attributes (get_attributes does have the _vectors key)?

Sorry for the delay. I will summon @brunoocasali on this one. I would expect this to work as for the Ruby integration, _vectors should look the same as any other field.

Note that, currently, Meilisearch isn't particularly smart when only the title or the description is updated it will reindex the document entirely, the vectors too! I am currently working on something that could help in this regard...

@Kerollmops
Copy link

Hey @crohr 👋 Do you plan to release this recommendation system? We improved the Vector Store solution since then 🧇

@adrienpoly
Copy link
Owner

@Kerollmops thanks for the update I will look into updating the Meilisearch version into the hosting platform. There is an official docker image available now? If I understand correctly I need a 1.3+ version to enable it right?

@Kerollmops
Copy link

@adrienpoly Indeed, you need a v1.3.x and we provide a Docker image and all sort of binaries ☺️

@crohr
Copy link
Author

crohr commented Sep 5, 2023

Hey @crohr 👋 Do you plan to release this recommendation system? We improved the Vector Store solution since then 🧇

Sure, I will have another look this week, thanks for the update!

attribute :title
attribute :description
attribute :slug
attribute :video_id
# ⚠️ This `video_id` attribute makes indexing (silently) fail with v1.3.2 of meilisearch. Error message from meilisearch (GET /tasks):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kerollmops I wonder if there is a regression here with v1.3.2, since it fails to index even if I specifically set the primary_key: :id as above.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently specified the primary key to index one of my datasets and haven't had issues. Are you sure it was introduced in v1.3.2? Is it related to the ruby SDK? I used the primaryKey query parameter with the POST /indexes/{indexname}/documents route.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @crohr why do you need to specify the primary key?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just an attempt to fix the issue I'm explaining in the comment, but it doesn't make a difference (and shouldn't since this is the default)

attribute :video_provider
attribute :thumbnail_sm
attribute :thumbnail_md
attribute :thumbnail_lg
attribute :speakers do
speakers.pluck(:name)
end
# ⚠️ This must return nil and not an empty array if no vector is available.
# Otherwise all other indexing tasks with non-zero vector arrays will silently fail, since the engine will expect all vectors to have the same length.
Copy link
Author

@crohr crohr Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kerollmops this was by far the most time-consuming issue I had with meilisearch vector indexing: seems like meilisearch rejects any document that doesn't have a vector of the same size as the first one it has seen. If you mistakenly send a zero-sized vector (or incorrectly sized vector) for your first document, then all other documents with a different vector size will be rejected. This should be put in a big warning somewhere, or hopefully have some better error message in the server logs (had to inspect the tasks list to see this error).

Copy link

@Kerollmops Kerollmops Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the report @crohr. I shared your feedback with the team and will see what we can do. We recently fixed a bug with the _vectors field that was producing issues when specifying multiple vectors for a single document. This one will be released in v1.3.3.

# Recomputes embedding for all talks that don't have one yet.
def self.reembed!(sleep_interval: 2.seconds, limit: nil)
# required for querying vectors (not indexing)
MeiliSearch::Rails.client.http_patch "/experimental-features", {vectorStore: true}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kerollmops could this be auto-enabled in some way if we start to send vectors?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not. You must enable it by hand 😞

@crohr
Copy link
Author

crohr commented Sep 6, 2023

@adrienpoly ready to be reviewed, I've updated the description with the new task to run

This was referenced Oct 18, 2023
@adrienpoly
Copy link
Owner

thanks @crohr I made a bit of preparation work in #64 to integrate on the front end part the results of the suggestions and to isolate then into a frame so that initial page load for the Talk#show route is not coupled to this suggestion method.

My next step is to upgrade Meilisearch in the prod environment and it is not as plug and play as I hope it would be... Anyway it is not neither a very high traffic site so If search is down for a little time that should be ok 😄

Will try to look back into this soon. Thanks for your work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Display videos on similar topics?
4 participants