-
Ideally there would be an API to control when each scheduled job is triggered. Currently the API only allows completely stopping the queue, which doesn't give us much control over when tasks are added and executed. Something like this could be added, for example:

```ts
myMethod(ctx: RequestContext) {
  const acfQueue = this.jobQueueService.getJobQueues().find(q => q.name === 'apply-collection-filters');
  acfQueue.stop(); // This stops registration of new jobs, but it also prevents execution of tasks

  // A method like this could be useful when doing imports
  acfQueue.pauseForContext(ctx);
  products.forEach(async productData => await this.productService.create(ctx, { ...productData }));
  acfQueue.resumeForContext(ctx);
}
```

We also want to be able to prevent the queue from taking new jobs while still being able to execute the ones already added:

```ts
onInit() {
  const acfQueue = this.jobQueueService.getJobQueues().find(q => q.name === 'apply-collection-filters');
  // Sleeping queues don't accept new tasks, but will still run
  // already-added ones (maybe a better name is needed)
  acfQueue.sleep();
}

addProducts(ctx: RequestContext) {
  products.forEach(async productData => await this.productService.create(ctx, { ...productData }));
  const acfQueue = this.jobQueueService.getJobQueues().find(q => q.name === 'apply-collection-filters');
  // This will "wake" the queue in order to add the job, then it will go back to "sleep"
  acfQueue.add({ ...data }, { wake: true });
}
```
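To make the proposed semantics concrete, here is a minimal self-contained sketch of how such a `sleep`/`wake` wrapper could behave. `InnerQueue`, `SleepableQueue`, and the defer-while-sleeping policy are all hypothetical illustrations for this proposal, not existing Vendure APIs:

```ts
// Hypothetical: the minimal surface of an underlying queue.
interface InnerQueue<T> {
  add(data: T): Promise<void>;
}

class SleepableQueue<T> {
  private sleeping = false;
  private deferred: T[] = [];

  constructor(private inner: InnerQueue<T>) {}

  /** Stop accepting new jobs; jobs already in the queue keep running. */
  sleep() {
    this.sleeping = true;
  }

  /** Accept new jobs again, enqueueing anything deferred in the meantime. */
  async wakeUp() {
    this.sleeping = false;
    for (const data of this.deferred.splice(0)) {
      await this.inner.add(data);
    }
  }

  /** `wake: true` bypasses the sleeping state for a single job. */
  async add(data: T, options?: { wake?: boolean }): Promise<void> {
    if (this.sleeping && !options?.wake) {
      this.deferred.push(data); // defer rather than drop; a design choice
      return;
    }
    return this.inner.add(data);
  }
}
```

Whether a sleeping queue should defer or silently drop new jobs is exactly the kind of semantic question the API design would need to settle.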
-
Thank you both @Izayda & @skid for your thoughts on this. Here are my initial thoughts:

If I understand you correctly, you are asking whether it is possible to calculate the Collections a given ProductVariant should be included in? This might be possible; I'll need to do some investigation.

So I see 2 kinda distinct issues/use-cases here:

1. Bulk operations (imports, mass updates) triggering a flood of redundant collection-filter and search-index jobs.
2. The more general need to control when queued jobs actually fire.

For the 1st issue, I agree with the idea of making it possible to turn off automatic triggering of these jobs during bulk operations. Ideally we should not need to run a full reindex after these tasks; that's overkill. So a possibility might be to have some kind of "buffer" for all modified Products/Variants. Once we finish all our changes, we "flush" this buffer, which will apply the collection filters and then update the search index for these affected products/variants. Re-reading @Izayda's suggestions, I think this workflow is very similar to what you are saying, but without the need to explicitly start a batch session and pass an ID around. It would need to work nicely with the Admin UI as well as programmatically.

For the 2nd issue, I think that @skid's suggestion is worth exploring too. The general ability to better control the firing of jobs could come in useful in many situations.

I'll spend a bit of time exploring these this week and note any further ideas or issues with what has been suggested.
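As a rough sketch of that buffer/flush workflow (all names here are hypothetical illustrations, not actual Vendure APIs): the buffer collects modified variant IDs, and flushing applies the collection filters before reindexing, touching each entity exactly once.

```ts
// Hypothetical sketch: collect modified variants, then do one consolidated pass.
class ModifiedEntityBuffer {
  private variantIds = new Set<string>();

  constructor(
    private applyCollectionFilters: (ids: string[]) => Promise<void>,
    private updateSearchIndex: (ids: string[]) => Promise<void>,
  ) {}

  /** Record a modified variant; the Set de-duplicates repeat modifications. */
  record(variantId: string) {
    this.variantIds.add(variantId);
  }

  /** Apply collection filters, then reindex each affected variant exactly once. */
  async flush() {
    const ids = [...this.variantIds];
    this.variantIds.clear();
    await this.applyCollectionFilters(ids);
    // Runs after the filters, so the index sees up-to-date collection data.
    await this.updateSearchIndex(ids);
  }
}
```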
-
Alright, I thought about this some more and here's a proposal: we introduce a new option to the DefaultSearchPlugin (the exact same solution can apply to the ElasticsearchPlugin too):

```ts
DefaultSearchPlugin.init({ useEventBatching: true }),
```

along with new GraphQL definitions:

```graphql
extend type Query {
  batchedSearchIndexEventCount: Int!
}

extend type Mutation {
  flushBatchedSearchIndexEvents: Boolean!
}
```

If we have `useEventBatching` set to `true`, the events that would normally create search index jobs get buffered instead. So running an import or just updating lots of products/variants/collections/assets will not trigger any jobs. All these events will just get buffered in the batch.

In the Admin UI, we can make a call to the `batchedSearchIndexEventCount` query to show how many events are pending, and expose a control for triggering the `flushBatchedSearchIndexEvents` mutation. When this mutation is triggered, the DefaultSearchPlugin will read all the batched events, and de-duplicate all the entities by ID so each is only re-indexed once. It would then run the resulting index-update work a single time for the whole batch.
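As a rough illustration of the de-duplication step the flush resolver could perform (the event shape and function name are assumptions made for this sketch, not the actual plugin internals):

```ts
// Assumed minimal shape of a buffered event, for illustration only.
interface BufferedIndexEvent {
  entityType: 'Product' | 'ProductVariant' | 'Collection';
  entityId: string;
}

function dedupeEvents(events: BufferedIndexEvent[]): BufferedIndexEvent[] {
  const seen = new Map<string, BufferedIndexEvent>();
  for (const event of events) {
    // Keying on type + id means each entity is re-indexed at most once,
    // no matter how many times it was touched during the bulk operation.
    seen.set(`${event.entityType}:${event.entityId}`, event);
  }
  return [...seen.values()];
}
```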
-
Thinking now of a solution which can sit transparently between the call to `JobQueue.add()` and the job actually being added to the queue:

```ts
/**
 * This class is used to control the buffering of jobs. It can be injected into your services.
 */
@Injectable()
export class JobQueueBuffer {
  private processors: JobQueueBufferProcessor[];

  addProcessor(processor: JobQueueBufferProcessor);
  removeProcessor(processor: JobQueueBufferProcessor);
  add(job: Job): Promise<Job>;
  bufferSize(forProcessors?: JobQueueBufferProcessor[]): Promise<number>;
  flush(forProcessors?: JobQueueBufferProcessor[]): Promise<void>;
}

// This is used to control which jobs get buffered, and how
// the buffered jobs then get batched before processing
export interface JobQueueBufferProcessor {
  collect(job: Job): Promise<boolean>;
  reduce(collectedJobs: Job[]): Promise<Job[]>;
}

// This defines how the buffered jobs get physically stored
export interface JobQueueBufferStorageStrategy {
  add(processorId: string, job: Job): Promise<Job>;
  bufferSize(processorIds?: string[]): Promise<number>;
  flush(processorIds?: string[]): Promise<void>;
}
```

Here's how it would work:
```ts
class SearchIndexBuffer implements JobQueueBufferProcessor {
  collect(job: Job) {
    return (
      job.queueName === 'apply-collection-filters' ||
      job.queueName === 'update-search-index'
    );
  }

  reduce(jobs: Job[]) {
    const updateSearchIndexJobs: Array<Job<UpdateIndexQueueJobData>> = jobs
      .filter(job => job.queueName === 'update-search-index');
    const updateVariantJobs = updateSearchIndexJobs.filter(job => job.data.type === 'update-variants-by-id');
    // Merge the variant ids of all the individual jobs into a single batched job
    const allVariantIds = unique(updateVariantJobs.reduce((ids, job) => [...ids, ...job.data.ids], []));
    const batchedUpdateVariantsJob = updateVariantJobs[0];
    batchedUpdateVariantsJob.data.ids = allVariantIds;
    // ...then logic for consolidating the other jobs in a similar way
    return consolidatedJobs;
  }
}

// in the DefaultSearchPlugin
onApplicationBootstrap() {
  this.jobQueueBuffer.addProcessor(new SearchIndexBuffer());
}
```
Here's another example that implements a pause/resume behaviour:

```ts
myMethod() {
  // MyProcessor is designed to collect all the jobs fired during product creation
  const myProcessor = new MyProcessor();
  this.jobQueueBuffer.addProcessor(myProcessor);

  products.forEach(async productData => await this.productService.create(ctx, { ...productData }));

  await this.jobQueueBuffer.flush([myProcessor]);
  this.jobQueueBuffer.removeProcessor(myProcessor);
}
```
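The `JobQueueBufferStorageStrategy` is only declared above, never shown with an implementation. Here is a minimal in-memory sketch against that proposed interface; a real implementation would presumably persist to the database so that buffered jobs survive a server restart:

```ts
// In-memory sketch of the proposed storage strategy, keyed by processor id.
class InMemoryBufferStorage implements JobQueueBufferStorageStrategy {
  private buffers = new Map<string, Job[]>();

  async add(processorId: string, job: Job): Promise<Job> {
    const buffer = this.buffers.get(processorId) ?? [];
    buffer.push(job);
    this.buffers.set(processorId, buffer);
    return job;
  }

  async bufferSize(processorIds?: string[]): Promise<number> {
    const ids = processorIds ?? [...this.buffers.keys()];
    return ids.reduce((sum, id) => sum + (this.buffers.get(id)?.length ?? 0), 0);
  }

  async flush(processorIds?: string[]): Promise<void> {
    const ids = processorIds ?? [...this.buffers.keys()];
    for (const id of ids) {
      // The JobQueueBuffer would first run each processor's reduce() over
      // these jobs and re-add the consolidated results to the real queues.
      this.buffers.delete(id);
    }
  }
}
```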
-
This thread became more interesting with each message :) The batching solution is great!
-
Currently, bulk import/update of products generates a huge amount of similar operations:

- `apply-collections-filters` job. Problems: one such job is queued per operation, so the queue fills up with near-identical jobs.
- `update-index-job`. Problems:
  - The update is needed at the `product variant` level, but is executed at the `product level`, because each job gets the product for the variant and reindexes all variants for that product. So there will be many equal jobs.
  - It can run before the `apply-collection-job`, so outdated collections data can be passed to the index.

The aim of this discussion: to determine the issues that should be resolved to eliminate these problems.
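To make the granularity problem concrete, here is a schematic sketch (the function and data shapes are illustrative, not actual Vendure internals) of how updating three variants of one product yields three identical product-level reindex jobs:

```ts
// Schematic only: each variant update enqueues a job keyed by the parent
// product, and each such job reindexes ALL variants of that product.
declare function enqueueUpdateIndexJob(productId: string): void;

const variants = [
  { id: 'v1', productId: 'p1' },
  { id: 'v2', productId: 'p1' },
  { id: 'v3', productId: 'p1' },
];

for (const variant of variants) {
  // Three equal jobs for product p1; each one reindexes v1, v2 and v3,
  // so the same work is done three times over.
  enqueueUpdateIndexJob(variant.productId);
}
```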
Original discussion in Slack by @skid: https://vendure-ecommerce.slack.com/archives/CKYMF0ZTJ/p1632747675049800
There are probably several possible ways to tackle this. Is it possible to implement the third way (to get the collection filters to update from the updated data of a variant)? If yes, I think a combination of the 2nd and 3rd ways can be implemented. If no, only the second one.

@michaelbromley, we need your input and decision.