Replies: 1 comment
-
Another variant of this problem occurs when we are using an in-order pub-sub API, e.g. an MQTT API, or any ephemeral event source to feed Pulsar. If we are running only one instance of the Pulsar Producer or Pulsar Source and the instance crashes or has network issues, some messages might never reach Pulsar. An obvious HA solution would have several identical instances of Producers running in parallel in different AZs, feeding one Pulsar topic with multiple copies of each message. How then do we retain only one copy of each unique message? Keep-last topic compaction could easily mess up the original order of the messages. Instead we can use a Pulsar Function or a Consumer-Producer to implement something similar to keep-first compaction. The Function needs to keep state somewhere for the set of keys or hashes of handled messages, maybe in Bookkeeper or another topic. This latter part of this HA feeder pattern would feel much more ergonomic with a built-in, keep-first, on-the-fly topic compaction. |
Beta Was this translation helpful? Give feedback.
-
Is your feature request related to a problem? Please describe.
It's often the case that a producer gets interrupted in the process of producing a series of messages to a topic, perhaps due to an application restart or crash. In many cases it is useful that only one message per key is produced to the topic. For example, if the messages represent emails to be sent, we may want only one email message to be sent to each email address on a list. Using sequence IDs may not be feasible in many such cases, because the underlying list is based on some dynamic features and is constantly changing, or because the underlying data store does not guarantee ordering. When the producer restarts, it should be able to start over and allow Pulsar to ignore the messages with keys it has already seen.
Describe the solution you'd like
Pulsar already supports topic compaction, in which it only keeps the latest message for each key. I propose that it also be possible, with some configuration, to keep the earliest message for a given key within the retention period. In other words, if Pulsar receives a new message with the same key, Pulsar will discard the message.
Describe alternatives you've considered
It is also possible to achieve something similar by storing keys that have already been produced in some other data store, but that requires making sure the secondary data store is in sync with the messages in Pulsar.
Beta Was this translation helpful? Give feedback.
All reactions