Need a more performant way to bulk generate embeddings for terms #759

Open
1 task done
dkotter opened this issue Apr 16, 2024 · 3 comments · May be fixed by #779
Assignees: Sidsector9
Labels: type:bug Something isn't working.
Milestone: 3.1.0

Comments

dkotter (Collaborator) commented Apr 16, 2024

Describe the bug

In v2.2.0 of ClassifAI we added the ability to classify content against your own existing terms using OpenAI Embeddings. For this to work, embedding data needs to be generated for each term as well as for the post those terms are being compared to.

The post embedding data is generated on the fly when the comparison is triggered, but we don't want to do that for terms since there may be hundreds or thousands of them, so term embeddings are generated in bulk when the feature is first set up. This has always been a known limitation: if you have lots of terms, that bulk process will probably run into timeouts, memory issues, or rate-limit issues with OpenAI.
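
For context, here's a minimal sketch of why every term needs a stored embedding before classification can run, assuming term embeddings live in term meta and terms are ranked by cosine similarity against the post embedding (the meta key and function names below are placeholders, not ClassifAI's actual implementation):

```php
/**
 * Hypothetical illustration: score every term in a taxonomy against a
 * post embedding using cosine similarity.
 */
function classifai_demo_rank_terms( array $post_embedding, string $taxonomy ): array {
	$scores = [];
	$terms  = get_terms(
		[
			'taxonomy'   => $taxonomy,
			'hide_empty' => false,
		]
	);

	if ( is_wp_error( $terms ) ) {
		return $scores;
	}

	foreach ( $terms as $term ) {
		$term_embedding = get_term_meta( $term->term_id, 'classifai_demo_embedding', true );

		if ( empty( $term_embedding ) ) {
			continue; // This is why every term needs an embedding generated up front.
		}

		$dot = 0.0;
		$a   = 0.0;
		$b   = 0.0;

		foreach ( $post_embedding as $i => $value ) {
			$dot += $value * $term_embedding[ $i ];
			$a   += $value * $value;
			$b   += $term_embedding[ $i ] * $term_embedding[ $i ];
		}

		if ( ! $a || ! $b ) {
			continue;
		}

		$scores[ $term->term_id ] = $dot / ( sqrt( $a ) * sqrt( $b ) );
	}

	arsort( $scores );

	return $scores;
}
```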

In #758 we are making some changes to how OpenAI Embeddings work, but we have not yet fixed this issue. Ideally it gets fixed and added to the same release, since those changes require all embeddings to be regenerated.

There are two issues I'm currently aware of:

  1. We generate these embeddings when the settings are saved, but only for taxonomies that are turned on. On the first save, the taxonomy settings haven't been persisted yet, so nothing is generated; you have to save a second time for things to work.
  2. We generate an embedding for each term that doesn't currently have one saved, during this same process (which, again, fires when the settings are saved). For sites with 1000+ terms, this will almost certainly lead to timeouts or memory issues, and sites with far fewer terms will probably run into OpenAI rate limits.

Ideally we would introduce some sort of queue management system to address this, making it a general enough solution that other features added in the future can reuse it. There are existing tools we could look at, like Action Scheduler or Cavalcade, but we may be fine just building a lightweight system on top of the scheduled event system in WordPress.
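
As a rough illustration of the "lightweight system on top of scheduled events" option, something like the following could chunk term IDs into batches and schedule one cron event per batch, staggered to stay under rate limits (all hook, meta key, and helper names here are hypothetical, including classifai_demo_generate_embedding()):

```php
/**
 * Illustrative only: chunk term IDs and schedule one single event per batch,
 * so a single request never processes the full term list.
 */
function classifai_demo_queue_term_embeddings( string $taxonomy, int $batch_size = 50 ) {
	$term_ids = get_terms(
		[
			'taxonomy'   => $taxonomy,
			'hide_empty' => false,
			'fields'     => 'ids',
		]
	);

	if ( is_wp_error( $term_ids ) ) {
		return;
	}

	foreach ( array_chunk( $term_ids, $batch_size ) as $i => $batch ) {
		// Stagger batches a minute apart to avoid hammering the API.
		wp_schedule_single_event( time() + ( $i * MINUTE_IN_SECONDS ), 'classifai_demo_process_batch', [ $batch ] );
	}
}

add_action(
	'classifai_demo_process_batch',
	function ( array $term_ids ) {
		foreach ( $term_ids as $term_id ) {
			if ( get_term_meta( $term_id, 'classifai_demo_embedding', true ) ) {
				continue; // Already has an embedding.
			}

			// classifai_demo_generate_embedding() stands in for the real API call.
			$embedding = classifai_demo_generate_embedding( $term_id );

			if ( ! is_wp_error( $embedding ) ) {
				update_term_meta( $term_id, 'classifai_demo_embedding', $embedding );
			}
		}
	}
);
```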

Steps to Reproduce

  1. Set up the Classification Feature with OpenAI Embeddings as the Provider
  2. Turn on at least one taxonomy and hit save
  3. Notice that no embeddings are actually generated
  4. Hit save again and notice the embeddings get generated

You can also generate 1000+ terms and run this process again, though note this will cost money since it makes API requests. I've tested locally using an embeddings model run through Ollama, and at around 1000 terms I run into memory issues.
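
For seeding that many terms locally, something like this works (WP-CLI's `wp term generate` command is another option); the taxonomy and naming here are just examples:

```php
// Illustrative helper to create a large number of test terms,
// roughly equivalent to `wp term generate post_tag --count=1500`.
function classifai_demo_seed_terms( int $count = 1500, string $taxonomy = 'post_tag' ) {
	for ( $i = 1; $i <= $count; $i++ ) {
		wp_insert_term( sprintf( 'Test term %d', $i ), $taxonomy );
	}
}
```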

Screenshots, screen recording, code snippet

No response

Environment information

No response

WordPress information

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
dkotter added the type:bug label Apr 16, 2024
dkotter added this to the 3.1.0 milestone Apr 16, 2024
jeffpaul mentioned this issue May 28, 2024
Sidsector9 self-assigned this May 29, 2024
Sidsector9 (Member) commented:

I've investigated both Action Scheduler and Cavalcade and found that the latter requires disabling WP-Cron.
For this reason I think Action Scheduler is a more reasonable candidate.

I have a branch with Action Scheduler implemented; however, I'm facing intermittent PHP memory exhaustion errors, which I suspect have to do with scheduling jobs inside the for() loop. I'll fix that and push the branch this week.
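
One pattern that may help with the memory issue, assuming the standard Action Scheduler API, is to enqueue one action per batch of term IDs rather than one action per term inside the loop. Rough sketch, with hypothetical hook and helper names:

```php
/**
 * Illustrative Action Scheduler usage: one async action per batch of term IDs
 * instead of one per term, which keeps the number of queued actions small.
 */
function classifai_demo_as_queue( array $term_ids, int $batch_size = 50 ) {
	foreach ( array_chunk( $term_ids, $batch_size ) as $batch ) {
		as_enqueue_async_action( 'classifai_demo_embed_terms', [ $batch ], 'classifai' );
	}
}

add_action(
	'classifai_demo_embed_terms',
	function ( array $term_ids ) {
		foreach ( $term_ids as $term_id ) {
			// classifai_demo_generate_embedding() stands in for the real API call.
			$embedding = classifai_demo_generate_embedding( $term_id );

			if ( ! is_wp_error( $embedding ) ) {
				update_term_meta( $term_id, 'classifai_demo_embedding', $embedding );
			}
		}
	}
);
```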

dkotter (Collaborator, Author) commented Jun 3, 2024

@Sidsector9 Worth noting that on a different (private) project, @iamdharmesh implemented https://github.com/deliciousbrains/wp-background-processing to solve this, so that's another tool we can look into. I know he compared that to Action Scheduler and had a few reasons why he decided to use that one, so it may be worth talking to him.
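
For reference, a minimal sketch of how that library is typically used (assuming its documented WP_Background_Process base class; the task body and names here are placeholders, and newer releases of the library may namespace the class):

```php
/**
 * Illustrative only: each queued item is a term ID; task() generates and
 * stores the embedding, then returns false to remove the item from the queue.
 */
class Classifai_Demo_Embedding_Process extends WP_Background_Process {

	protected $action = 'classifai_demo_term_embeddings';

	protected function task( $item ) {
		// classifai_demo_generate_embedding() stands in for the real API call.
		$embedding = classifai_demo_generate_embedding( $item );

		if ( ! is_wp_error( $embedding ) ) {
			update_term_meta( $item, 'classifai_demo_embedding', $embedding );
		}

		return false; // Item processed; remove it from the queue.
	}
}

// Usage: push term IDs onto the queue, then save and dispatch.
function classifai_demo_dispatch_embeddings( array $term_ids ) {
	$process = new Classifai_Demo_Embedding_Process();

	foreach ( $term_ids as $term_id ) {
		$process->push_to_queue( $term_id );
	}

	$process->save()->dispatch();
}
```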

Sidsector9 (Member) commented:

@dkotter Dharmesh and I discussed this last week and concluded that either is a good choice, as both have their pros and cons.

I decided to go ahead with Action Scheduler to align with Woo's decision to migrate all background-processing jobs to AS. Related: woocommerce/woocommerce#44246
