Add retries for schema registry errors #683

gordon-rennie · 2021-09-05T00:00:46Z

Fixes #663. Updated from abandoned PR #673: rebased to target 3.x and improved the retry interface. Restructured the PR description for readability.

Goal

Prevent transient schema registry API errors from blowing up Vulcan serialisation/deserialisation by adding retries with sensible default config.

Implementation

- Add a new trait SchemaRegistryClientRetry[F[_]] that lets a retry strategy be specified for an arbitrary computation, F[A], that uses the schema registry.
- Add a default implementation that retries on the schema registry exceptions identified in the underlying Java code (see "What Throwables should we retry?" below).
  - The default strategy backs off between retries, which requires Sync to be upgraded to Async on some AvroSettings constructors; this change is therefore breaking and targets 3.x.

Retry Interface Design

Feedback is very welcome here. I took care in designing this, drawing significant influence from discussions on issue #623 about commit retry.

Proposed Interface

trait SchemaRegistryClientRetry[F[_]] {

  def withRetry[A](action: F[A]): F[A]
}

// AvroSettings.scala
sealed abstract class AvroSettings[F[_]] {
  ...
  def schemaRegistryClientRetry: SchemaRegistryClientRetry[F]
}

Benefits: very flexible and composable; users can bring their own F to perform logging etc. Easy interop with e.g. the cats-retry library. Robust against future implementation changes in Vulcan or the underlying Java libraries, because the type parameter lets the retry be applied to any computation.
Cons: requires an intermediate trait to carry the type parameter [A], because a Scala 2 function value can't be polymorphic, e.g. we can't write def schemaRegistryClientRetry: F[A] => F[A].

To avoid the need for the type parameter and the new trait, I also considered a variation of this proposal in which each of the three Vulcan call sites that use the schema registry (directly or indirectly) takes its own non-type-parameterised anonymous function as a retry strategy (e.g. def schemaRegistryGetSchemaRetry: F[ParsedSchema] => F[ParsedSchema]). However, this would leave the retry config tightly coupled to the implementation details of the existing ser/des.
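
To illustrate why the type parameter sits on the method, here is a minimal sketch of a custom strategy. It is not code from this PR: Thunk is a stand-in for a real effect type such as IO, and RetrySketch, logged and its parameters are hypothetical names chosen for the example.

```scala
import scala.util.control.NonFatal

// Same shape as the proposed trait: the type parameter lives on the method,
// so a single value can wrap actions of any result type.
trait SchemaRegistryClientRetry[F[_]] {
  def withRetry[A](action: F[A]): F[A]
}

object RetrySketch {
  // Stand-in for a real effect type: a suspended computation that may throw.
  type Thunk[A] = () => A

  // Hypothetical strategy: run the action up to maxAttempts times in total,
  // reporting every failure that is going to be retried.
  def logged(maxAttempts: Int, log: String => Unit): SchemaRegistryClientRetry[Thunk] =
    new SchemaRegistryClientRetry[Thunk] {
      def withRetry[A](action: Thunk[A]): Thunk[A] = () => {
        def loop(attempt: Int): A =
          try {
            action()
          } catch {
            // When the guard fails (last attempt), the error propagates unchanged.
            case NonFatal(e) if attempt < maxAttempts =>
              log(s"attempt $attempt failed: ${e.getMessage}")
              loop(attempt + 1)
          }
        loop(1)
      }
    }
}
```

Because withRetry is polymorphic, the same strategy value can wrap a schema lookup, a serializer call, or any other action, whatever its result type.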

Alternative 1

(this was my first iteration)

// AvroSettings.scala
sealed abstract class AvroSettings[F[_]] {
  ...
  def schemaRegistryRetryMaxAttempts: Int
  def schemaRegistryRetryBackoffStrategy: Int => FiniteDuration
}

Benefits: simple interface with no new traits.
Cons: inflexible; no provision to add logging etc. No cats-retry interop.
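
For a sense of what a value for schemaRegistryRetryBackoffStrategy could look like under this alternative, here is a minimal sketch of a capped exponential backoff. BackoffSketch and exponential are hypothetical names, and the cap is an assumption added for the example rather than something this PR proposes.

```scala
import scala.concurrent.duration._

object BackoffSketch {
  // Delay before the retry that follows failed attempt number `attempt`
  // (1-based): base * 2^(attempt - 1), never exceeding `max`.
  def exponential(base: FiniteDuration, max: FiniteDuration): Int => FiniteDuration =
    attempt => {
      val shift = math.max(attempt - 1, 0)
      // Compare against max / factor first so the multiplication cannot overflow.
      if (shift >= 62 || base > max / (1L << shift)) max
      else base * (1L << shift)
    }
}
```

With a base of 100 millis and a cap of 1 second, the delays for attempts 1 to 5 come out as 100, 200, 400, 800 and 1000 millis.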

Alternative 2

Implement tagless final traits as facades over the relevant Java schema registry classes, and allow retries to be specified on their implementations.

Cons: this requires a huge change.

What `Throwable`s should we retry?

Some code analysis was required to determine where the schema registry is used and what exceptions it can throw. I wanted to avoid retrying on every error, which would include deterministic failures.

- In general, schema registry client operations can throw IOException or RestClientException.
- For serialization, the error handling in the Java KafkaAvroSerializer maps both IOException and RestClientException (unless the HTTP status is a client error) to SerializationException.
- For deserialization, the Vulcan code calls the schema registry client directly, and the underlying Java code makes further calls (1, 2, 3) whose errors are again mapped to SerializationException.

TL;DR: this new helper function identifies the exceptions worth retrying:

  val isRetriable: Throwable => Boolean = {
    case _: SerializationException     => true
    case _: IOException                => true
    // getStatus is the HTTP status; getErrorCode is the registry's own code (e.g. 40401)
    case apiError: RestClientException => apiError.getStatus >= 500
    case _                             => false
  }

I made the function public so users of cats-retry etc. have it available to them for easy re-use.
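
As a rough illustration of that kind of re-use, the sketch below gates a hand-written retry loop on a predicate. It runs without the Kafka or Confluent libraries on the classpath, so RegistryHttpError is a hypothetical stand-in for an HTTP-level registry error, and isTransient and retryWhen are illustrative names rather than anything in fs2-kafka.

```scala
import java.io.IOException
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object PredicateRetrySketch {
  // Hypothetical stand-in for an HTTP-level registry error; not a library class.
  final case class RegistryHttpError(status: Int) extends Exception(s"HTTP $status")

  // Transient failures are worth retrying; deterministic ones are not.
  val isTransient: Throwable => Boolean = {
    case _: IOException            => true
    case RegistryHttpError(status) => status >= 500
    case _                         => false
  }

  // Run the action up to maxAttempts times in total, retrying only
  // when the predicate accepts the error.
  @tailrec
  def retryWhen[A](maxAttempts: Int, shouldRetry: Throwable => Boolean)(action: () => A): A =
    Try(action()) match {
      case Success(a) => a
      case Failure(e) if maxAttempts > 1 && shouldRetry(e) =>
        retryWhen(maxAttempts - 1, shouldRetry)(action)
      case Failure(e) => throw e
    }
}
```

A 404-style error fails on the first attempt, while an IOException is retried until the attempts run out.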

TODO

Once the interface is agreed, I need to add unit tests and docs.

modules/vulcan/src/main/scala/fs2/kafka/vulcan/AvroSettings.scala

gordon-rennie · 2021-09-05T00:07:16Z

modules/vulcan/src/main/scala/fs2/kafka/vulcan/SchemaRegistryClientRetry.scala

+  def Default[F[_]](implicit F: Async[F]): SchemaRegistryClientRetry[F] =
+    new SchemaRegistryClientRetry[F] {
+      override def withRetry[A](action: F[A]): F[A] = {
+        def retry(attempt: Int, action: F[A]): F[A] =
+          action.handleErrorWith(
+            err =>
+              if ((attempt + 1) <= 10 && isRetriable(err))
+                F.sleep((10 * Math.pow(2, attempt.toDouble)).millis) >> retry(attempt + 1, action)
+              else
+                F.raiseError(err)
+          )
+
+        retry(attempt = 1, action)
+      }


The Java SchemaRegistryClient uses an underlying RestService with 60 second timeouts, so ten attempts is probably far too many if timeouts are occurring. I'm thinking of mitigating this by reducing to ~3 attempts with a steeper backoff: 100 millis, then 1000 millis.
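
To put numbers on this, the sketch below works out the delay schedule implied by the diff above and a rough worst case. It assumes every attempt blocks for one full 60 second timeout, which is a simplification; DefaultScheduleSketch and its members are hypothetical names used only for the arithmetic.

```scala
object DefaultScheduleSketch {
  val maxAttempts = 10
  val timeoutMillis = 60000L

  // Delay after a failed attempt, mirroring `10 * Math.pow(2, attempt)` millis.
  def delayMillis(attempt: Int): Long = (10 * math.pow(2, attempt.toDouble)).toLong

  // Attempts 1 to 9 are followed by a sleep; a failure on attempt 10 is raised.
  val delays: List[Long] = (1 until maxAttempts).toList.map(delayMillis)

  val totalBackoffMillis: Long = delays.sum

  // Upper bound if every attempt waits for the full timeout before failing.
  val worstCaseMillis: Long = maxAttempts * timeoutMillis + totalBackoffMillis

  // Same bound for 3 attempts with delays of 100 and 1000 millis.
  val proposedWorstCaseMillis: Long = 3 * timeoutMillis + 100L + 1000L
}
```

Under that assumption the current settings sleep for about 10.2 seconds in total but can block for a little over 10 minutes, while the 3-attempt variant is bounded at roughly 3 minutes.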

gordon-rennie · 2021-10-07T12:31:48Z

Hi @bplommer, we spoke about this issue in #663 - I'm curious to know how you feel the PR turned out? It sprawled a little due to adding configuration of the retries -- figuring out what that interface should look like has easily been the most time-consuming part.

bplommer · 2021-10-07T12:47:09Z

Hi @gordon-rennie - so sorry for dropping the ball on this! I'll do my best to look at this properly over the weekend.

gordon-rennie · 2021-10-07T13:22:31Z

> Hi @gordon-rennie - so sorry for dropping the ball on this! I'll do my best to look at this properly over the weekend.

No worries at all, thanks!

modules/vulcan/src/main/scala/fs2/kafka/vulcan/AvroSerializer.scala

modules/vulcan/src/main/scala/fs2/kafka/vulcan/SchemaRegistryClientRetry.scala

bplommer · 2021-10-17T19:59:47Z

This looks great! I think I'm happy with the approach - any thoughts on this @vlovgr?

vlovgr · 2021-10-19T07:11:53Z

Agree with @bplommer -- this looks great! Thanks @gordon-rennie!

gordon-rennie · 2023-04-19T09:55:27Z

Closing as master has drifted significantly. The last time I reviewed the situation, new changes on master had made this approach unattractive; it would be better implemented with a pure interface wrapping the Java schema registry client, with retry combinators applied at the call sites.

Add retries to schema registry calls within Avro ser/des code

b7cca24

gordon-rennie and others added 4 commits October 18, 2021 17:25

Use cats.syntax.all._

9da07c8

Co-authored-by: Ben Plommer <[email protected]>

Prefer method over function

6c644c5

Co-authored-by: Ben Plommer <[email protected]>

Better toString

e0fb7ea

Co-authored-by: Ben Plommer <[email protected]>

Better toString

6abf297

Co-authored-by: Ben Plommer <[email protected]>

gordon-rennie closed this Apr 19, 2023
