[RFC] Queues #12
-
Alerting: @BenEllerby mentioned the possibility of alerting via different methods, including Slack webhooks. That's very interesting; I'm curious to see if others have that running with SNS. I've found this AWS article, but it mentions using a Lambda function for the trigger. If anyone knows of a simpler approach, please share!
-
I don't know if it's outside the scope of this, but the amount of yml required to create a webhook endpoint for a queue is huge. In my current Bref.sh project, about half of my serverless.yml file is just the setup for a webhook endpoint that dumps the body of the webhook into a queue. Conversely, the queue handler is tiny.
-
Is it possible to subscribe to all dead letter queues and consume messages from them?
-
The goal of this discussion is to get feedback on the "Queues" component.
If you are new to Lift, a quick intro: it's a Serverless plugin that can be installed via npm and enabled in any serverless.yml file.
Here is what we are planning so far.
Use case
Some tasks are too long to be processed synchronously, for example in an API response.
Instead, they can be processed in the background via a job queue and worker.
Quick start
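To make the proposal concrete, here is a minimal sketch of what a serverless.yml could look like; the queues section, the serverless-lift plugin name and the report-generation queue are assumptions based on the rest of this RFC:

```yaml
service: my-app

provider:
    name: aws

plugins:
    # Assumed package name for the Lift plugin
    - serverless-lift

queues:
    # "report-generation" is a placeholder queue name
    report-generation:
        worker:
            handler: report-generation.handler
```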
How it works
The "asynchronous processing" component deploys the following resources:
Permissions
Lift will automatically add IAM permissions to all Lambda functions in the stack: sqs:SendMessage to the SQS queue. That way, by default, all functions can publish into SQS without having to set up IAM (in the spirit of "it works by default").
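For illustration, any function could then push a job to the queue with the AWS SDK, without extra IAM configuration; the QUEUE_URL environment variable used here is an assumption (it could be set via the reference variables described below):

```js
// Illustration only: publish a job from any Lambda function in the stack
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.handler = async () => {
    await sqs.sendMessage({
        QueueUrl: process.env.QUEUE_URL, // assumed environment variable
        MessageBody: JSON.stringify({ reportId: 123 }),
    }).promise();
};
```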
References
The component will introduce ${queues:xxx} variables to easily reference queues in serverless.yml, without using CloudFormation. For example:
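A sketch of how such a reference could be used; the api function and the REPORT_QUEUE_URL variable name are placeholders:

```yaml
queues:
    report-generation:
        worker:
            handler: report-generation.handler

functions:
    api:
        handler: api.handler
        environment:
            # Placeholder variable name: expose the queue URL to the function
            REPORT_QUEUE_URL: ${queues:report-generation.queueUrl}
```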
In the example above, ${queues:report-generation.queueUrl} will reference the created queue.
Configuration reference
Worker
The Lambda "worker" function is configured inside the queue, instead of being defined in the
functions
section.The only required value is the
handler
: this should point to the code that handles SQS messages. The handler should be written to handle SQS events, for example in JavaScript:All settings allowed for functions can be used under the
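A minimal sketch of such a handler; the file name and the message payload are assumptions:

```js
// report-generation.js (placeholder file name)
exports.handler = async (event) => {
    // SQS invokes the worker with a batch of records
    for (const record of event.Records) {
        const job = JSON.parse(record.body);
        // ... process the job, e.g. generate the report
        console.log('Processing job', job);
    }
};
```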
All settings allowed for functions can be used under the worker key. For example:
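For example, assuming the usual Serverless Framework function settings (memorySize, timeout, ...) are supported:

```yaml
queues:
    report-generation:
        worker:
            handler: report-generation.handler
            # Standard function settings (assumed to be passed through as-is)
            memorySize: 512
            timeout: 10
```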
Lift will automatically configure the function to be triggered by SQS. It is not necessary to define events on the function.
Alarm
It is possible to configure email alerts in case jobs end up in the dead letter queue:
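A sketch of what this could look like, assuming an alarm option that takes a destination email address:

```yaml
queues:
    report-generation:
        worker:
            handler: report-generation.handler
        # Assumed option: email to notify when jobs land in the dead letter queue
        alarm: alerting@mycompany.com
```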
Retries
Default: 3 retries.
The maxRetries option configures how many times each job will be retried when failing. If the job still fails after reaching the max retry count, it will be moved to the dead letter queue for storage.
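For example, assuming a maxRetries option on the queue:

```yaml
queues:
    report-generation:
        worker:
            handler: report-generation.handler
        # Retry failed jobs up to 5 times before moving them to the dead letter queue
        maxRetries: 5
```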
Retry delay
When Lambda fails to process an SQS job (i.e. the code throws an error), the job will be retried after a delay. That delay is also called the "Visibility Timeout" in SQS.
By default, Lift configures the retry delay to be 6 times the worker function's timeout, per AWS recommendations. Since Serverless deploys functions with a timeout of 6 seconds by default, that means jobs will be retried every 36 seconds.
It is possible to change the function's timeout; the retry delay will be configured accordingly:
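For example, assuming the worker accepts the usual timeout setting:

```yaml
queues:
    report-generation:
        worker:
            handler: report-generation.handler
            # 10 seconds timeout => 60 seconds retry delay (6 x timeout)
            timeout: 10
```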
It is also possible to set the retry delay directly:
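A sketch assuming a hypothetical retryDelay option expressed in seconds:

```yaml
queues:
    report-generation:
        worker:
            handler: report-generation.handler
        # Hypothetical option name: retry failed jobs after 2 minutes
        retryDelay: 120
```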
Lift will throw an error if the retry delay is lower than the function's timeout.
Batch size
When the SQS queue contains more than one job to process, it can invoke Lambda with a batch of multiple messages at once.
By default, Lambda will be invoked with up to 10 messages at a time. It is possible to change the batch size to anywhere between 1 and 10,000.
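For example, assuming a batchSize option on the queue:

```yaml
queues:
    report-generation:
        worker:
            handler: report-generation.handler
        # Assumed option name: invoke the worker with batches of up to 5 messages
        batchSize: 5
```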
Batch window
If the batch size is greater than 10, SQS will wait for some time to collect jobs (i.e. to build the batch).
If the batch size is less than 10, this setting has no effect.
By default, Lift sets a maximum batch window of 5 seconds. That means messages are collected into a batch for up to 5 seconds before Lambda is invoked.
The window maximum duration can be lowered (for less latency) or increased (to get larger batches):
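For example, assuming a maxBatchingWindow option expressed in seconds:

```yaml
queues:
    report-generation:
        worker:
            handler: report-generation.handler
        # Assumed option name: wait up to 10 seconds to build larger batches
        maxBatchingWindow: 10
```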