
AP inbox: filter incoming records more aggressively #1341

Open
snarfed opened this issue Sep 22, 2024 · 13 comments

@snarfed
Owner

snarfed commented Sep 22, 2024

Inspired by #1329, we can probably filter out a lot more incoming AP inbox deliveries before we store the activities and enqueue receive tasks for them. AP is now the majority of our receive tasks, and more tasks are still noops (204) than acted on (200 or 202), below.

[screenshots: receive task counts, showing more noops (204) than handled (200/202)]
@snarfed snarfed added the now label Sep 22, 2024
@qazmlp

qazmlp commented Sep 22, 2024

When doing so, please return appropriate HTTP status/error codes and clear messages for why an activity is unhandled.

If your architecture allows it, I'm quite sure it would also be more than fine to handle incoming activities as synchronous transactions (with appropriate parallelism and resource limits on relevant sections).

Backpressure appears to be an intentional aspect of the ActivityPub protocol, to the point where I probably wouldn't use a classic task queue to implement an inbox myself, at least where doing so allows for more accurate status codes, including for logical authorisation (e.g. reply permissions).

@snarfed
Owner Author

snarfed commented Sep 22, 2024

These generally aren't errors, they're valid activities that we just don't handle. For example, a non-bridged user replying to a bridged user, or a bridged user reposting a non-bridged user. I expect we'll still return 202 or maybe 204, like normal.

@snarfed
Owner Author

snarfed commented Sep 22, 2024

I'm also curious what you mean by backpressure. The inbox delivery part of the AP spec is very brief, https://www.w3.org/TR/activitypub/#delivery, and seems entirely deterministic. Nothing on flow control or congestion control, which is what I generally understand backpressure to mean, at least coming from networking. Do you mean just surfacing more errors synchronously?

If so, agreed, that would be a huge help with debugging interop in general on the fediverse. It's tough: handling activities async is a really useful way to spread out load and handle it with a fixed set of resources, without auto scaling, which often isn't available to small or medium-sized instances. Definitely makes debugging harder though!

@qazmlp

qazmlp commented Sep 22, 2024

It's a few layers down, just plain HTTP and TCP. You can accept connections more slowly and send the response more slowly and that'll back things up across the network into the sender's outbox queue (depending a bit on the implementation).

You don't really need to race spikes on ActivityPub either, as the sender should back off and retry while spreading out requests more. Here's Mastodon's implementation of that: https://github.com/mastodon/mastodon/blob/7ed9c590b98610f8d68deab9ef8df260eec6d8f0/app/workers/activitypub/delivery_worker.rb#L11-L21

Mastodon allows you to go pretty slow. It times out individual requests with a 30-second deadline, which should give you plenty of time to do all the processing required: https://github.com/mastodon/mastodon/blob/7ed9c590b98610f8d68deab9ef8df260eec6d8f0/app/lib/request.rb#L15-L45

(Mastodon isn't super smart about this, so if a particular inbox is slow-without-timeout, that'll likely slow down the 'push' queue in general because there's afaict no limit on parallelism by inbox, but it won't cause permanent delivery failures.)
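
For illustration, here is a minimal Python sketch of the kind of sender-side retry schedule described above: a quartic backoff with jitter so failed deliveries spread out over hours rather than hammering a slow inbox. This is an assumption-laden stand-in, not Mastodon's actual Sidekiq worker; `deliver` and `backoff_seconds` are hypothetical names, and the formula is only modelled on the linked code.

```python
import random
import time

import requests  # any HTTP client would do


def backoff_seconds(attempt: int) -> float:
    # Roughly quartic backoff with jitter, modelled on (not copied from)
    # Mastodon's retry schedule: attempt 0 waits ~15s, attempt 7 waits ~40min.
    return attempt ** 4 + 15 + random.uniform(0, 10) * (attempt + 1)


def deliver(inbox_url: str, body: bytes, headers: dict, max_attempts: int = 8):
    """POST an activity, backing off and retrying on transient failures."""
    for attempt in range(max_attempts):
        try:
            resp = requests.post(inbox_url, data=body, headers=headers, timeout=30)
            if resp.status_code < 500 and resp.status_code != 429:
                return resp  # delivered, or an 'unsalvageable' 4xx: stop retrying
        except requests.RequestException:
            pass  # network error: treat as salvageable and retry
        time.sleep(backoff_seconds(attempt))
    raise RuntimeError(f'gave up delivering to {inbox_url}')
```

In a real delivery worker the loop and sleep would be replaced by the job framework's own retry mechanism; the point is only how quickly the delays grow.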

> If so, agreed, that would be a huge help with debugging interop in general on the fediverse. It's tough: handling activities async is a really useful way to spread out load and handle it with a fixed set of resources, without auto scaling, which often isn't available to small or medium-sized instances. Definitely makes debugging harder though!

I do think you should keep the delivery queue for outbound activities async!

I'm talking strictly about the 'receive, validate, authenticate and authorise' aspects of processing the incoming requests.


As an aside:

Somewhat annoyingly, I don't think Mastodon respects HTTP 429s and Retry-After, so it'll back off and retry as normal unless it receives a status code indicating 'unsalvageable' failure: https://github.com/mastodon/mastodon/blob/7ed9c590b98610f8d68deab9ef8df260eec6d8f0/app/helpers/jsonld_helper.rb#L207-L209

It'll also back off if there's a pileup of potentially salvageable HTTP request failures for a specific inbox, in which case it'll instantly reschedule requests to that inbox for a short time, I think. I'm not sure if that consumes a retry for them, since I'm unfamiliar with the Sidekiq API that's being used.

@qazmlp

qazmlp commented Sep 22, 2024

> These generally aren't errors, they're valid activities that we just don't handle. For example, a non-bridged user replying to a bridged user, or a bridged user reposting a non-bridged user. I expect we'll still return 202 or maybe 204, like normal.

It would be nice to at least produce 400 Bad Request for the invalid ones.

In my personal opinion, it would also be "good form" to reject activities where a user is expected to be notified (i.e. appears in the "to" field individually) but either the post's visibility is too low or the author of the incoming activity isn't bridged. It's not currently meaningful, but it would allow instances to inform their users a bit more cleanly about certain interaction failures in the future. 409 Conflict [1] is what I would choose for this, as the semantics appear to match up nicely: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/409

That is of course very subjective on my part, and I'm neglecting e.g. the case where the "to" audience contains both a web and a Bluesky user and the delivery only partially succeeds. Implementing #1340 instead would also largely address the potential user confusion I'm concerned about here.

Footnotes

  1. Mastodon treats this as "unsalvageable", which causes the Sidekiq task to succeed cleanly and without adverse effects on other deliveries.
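
A rough Flask-style sketch of the status-code mapping being discussed in this thread: 400 for malformed payloads, 204 for valid-but-unhandled types, 409 for the "addressed to a bridged user but the author isn't bridged" case, and 202 when a receive task is enqueued. The helper functions are hypothetical stubs, not Bridgy Fed's actual checks.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical placeholder checks; real code would verify HTTP signatures,
# look up bridged users in the datastore, etc.
def is_valid_activity(activity):
    return bool(activity and activity.get('id') and activity.get('actor'))

def author_is_bridged(activity):
    return False  # stub

def addresses_bridged_user(activity):
    return bool(activity.get('to'))  # stub: a bridged user is addressed individually

def handled_type(activity):
    return activity.get('type') in {'Create', 'Follow', 'Like', 'Announce',
                                    'Undo', 'Delete', 'Update'}

def enqueue_receive_task(activity):
    pass  # stub: hand off to the real task queue


@app.post('/inbox')
def inbox():
    activity = request.get_json(silent=True)

    if not is_valid_activity(activity):
        return 'invalid activity', 400        # malformed payload

    if not handled_type(activity):
        return '', 204                        # valid, but nothing for us to do

    if addresses_bridged_user(activity) and not author_is_bridged(activity):
        # addressed to a bridged user, but the author isn't bridged, so the
        # interaction can't be delivered: the 409 proposed above
        return 'author is not bridged', 409

    enqueue_receive_task(activity)
    return '', 202                            # accepted for async processing
```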

@snarfed
Owner Author

snarfed commented Sep 22, 2024

Right. HTTP 429 is the closest we have to an actual application-level backpressure mechanism, and it's not in the AP spec or otherwise implemented to "slow down" inbox delivery on the sender's side in any fediverse software that we know of. Would be nice!

> Mastodon allows you to go pretty slow. It times out individual requests with a 30-second deadline, which should give you plenty of time to do all the processing required

It's not about individual requests, it's about overall volume. Again, task queues let you throttle processing internally to handle an unbounded, bursty volume of traffic with a fixed set of resources, primarily CPU and memory. You can't do that if you handle everything synchronously. You either capacity plan for peak traffic, which is prohibitively expensive, or you auto scale, which isn't available architecturally on the basic VPSes or shared hosts that most instances are on.
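
As a sketch of the property being described (illustrative only, not the project's actual task plumbing): a fixed pool of workers drains a queue at whatever rate the hardware sustains, so a burst of deliveries grows the backlog and adds latency rather than demanding more CPU at once.

```python
import queue
import threading

task_queue: "queue.Queue[dict]" = queue.Queue()
NUM_WORKERS = 4  # fixed resources: throughput is capped here


def handle_activity(activity: dict):
    pass  # stub for the real receive logic


def worker():
    while True:
        activity = task_queue.get()
        try:
            handle_activity(activity)
        finally:
            task_queue.task_done()


for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()


def on_inbox_delivery(activity: dict):
    # The web handler only enqueues and returns 202 immediately, so a spike of
    # thousands of deliveries costs backlog memory, not a CPU spike.
    task_queue.put(activity)
```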

@snarfed
Owner Author

snarfed commented Sep 22, 2024

Regardless, again, I'm very much with you that the fediverse's async handling makes interop and debugging harder! I wrote about this somewhat at length a bit ago: http://web.archive.org/web/20240127153750/https://fed.brid.gy/docs#error-handling

BF originally handled all inbox deliveries synchronously. I eventually moved it to a task queue, which is why I removed that part of the docs.

@qazmlp

qazmlp commented Sep 22, 2024

> It's not about individual requests, it's about overall volume. Again, task queues let you throttle processing internally to handle an unbounded, bursty volume of traffic with a fixed set of resources, primarily CPU and memory. You can't do that if you handle everything synchronously. You either capacity plan for peak traffic, which is prohibitively expensive, or you auto scale, which isn't available architecturally on the basic VPSes or shared hosts that most instances are on.

Synchronously (wrt. network responses) doesn't necessarily mean forgoing async primitives internally or resource-limiting critical sections. What I'm saying is that it's okay for a receiving service to go visibly over capacity in ActivityPub and fail to accept some inbox connections, without that leading to delivery failures.

Rejecting some connections or delaying the start of reading from the connection a bit to smear out bursts throttles the process externally across the adjacent ActivityPub network, without having to buffer any data in the inbox (so memory doesn't become a larger concern due to this). This can cause delays during sustained bursts, but won't easily cause missed events.

(If you do have to kick off an indefinite queued process that involves other instances like a parent-post fetch though, then yes it makes a lot of sense to say 202 Accepted and "queue up the continuation" there, so to speak.)
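
A minimal sketch of the synchronous-with-limits approach being proposed (assumed names, not working Bridgy Fed code): each delivery is processed inside a bounded critical section; waiting briefly for a slot smears out bursts, and refusing beyond that pushes the load back onto the sender's retry queue.

```python
import threading

from flask import Flask, request

app = Flask(__name__)

MAX_IN_FLIGHT = 16  # resource-limiting critical section
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def process_activity_sync(activity):
    pass  # stub: validate, authenticate, authorise, store


@app.post('/inbox')
def inbox():
    # Waiting a few seconds for a slot smears out a burst; refusing after that
    # relies on the sender backing off and retrying later.
    if not slots.acquire(timeout=5):
        return 'over capacity, try again later', 503  # or 429 plus Retry-After
    try:
        process_activity_sync(request.get_json(silent=True))
        return '', 200  # fully processed, not merely queued
    finally:
        slots.release()
```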


In case you have concerns about fairness and service-wide brown-outs:

Dedicated load balancers like HAProxy allow for very clean traffic policing by URL path based on connection queue lengths and per-client limits (with windowing to permit smaller connection bursts). This allows requests from small instances to always go through in a timely manner while a large instance that would exceed the available resources due to its own user activity can be forced to back off more or less politely.
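
The same policing idea can also be sketched in application code if a dedicated load balancer isn't available (a hedged illustration with made-up limits, not an HAProxy config): track recent deliveries per sending host and ask only the hosts over budget to back off.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

WINDOW_SECS = 60
MAX_PER_WINDOW = 300  # per sending host; an arbitrary illustrative budget

_recent: dict[str, deque] = defaultdict(deque)  # host -> recent delivery times


def over_budget(actor_id: str) -> bool:
    """Return True if this actor's host has used up its per-window budget,
    in which case the handler would respond 429/503 so only that host backs
    off while smaller instances keep flowing. (Single-threaded sketch.)"""
    host = urlparse(actor_id).netloc
    now = time.monotonic()
    window = _recent[host]
    while window and now - window[0] > WINDOW_SECS:
        window.popleft()
    if len(window) >= MAX_PER_WINDOW:
        return True
    window.append(now)
    return False
```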

I didn't look up the numbers for other software, but Mastodon's retry smear follows the fourth power of the retry count and tops out in the several-hours range for the last step. I think it's reasonable to expect ActivityPub servers to react fairly gracefully if they get rate-limited at the transport layer, and to support long delivery windows.


All that said, it is a matter of preference. To me it does seem practically feasible to put the resource burden mainly on the party that immediately causes the inbound traffic.

I also really don't know a lot about service architecture that uses cloud services rather than dedicated servers or a VPS, so there's a good chance that what I wrote isn't applicable to Bridgy Fed for another reason.

@snarfed
Owner Author

snarfed commented Sep 23, 2024

It's definitely an interesting idea! Not one I expect to pursue much here in Bridgy Fed - I generally want to be a follower on interop-sensitive techniques like this, not a leader - but I'd definitely encourage you to try it yourself if you're interested!

Fwiw, here are all the devils in the details I see so far. (I expect there are more.)

  • As we determined, HTTP 429 may be the ideal application-level backpressure mechanism, but few if any fediverse servers currently support it.
  • We could return it anyway, or a similar error code, and Mastodon may be good at retrying, but lots of other fediverse servers will be mediocre or bad at it. I'm reluctant to sacrifice interop and the user-facing quality of the bridge for this.
  • "Holding" requests to slow them down is interesting, but technically problematic. When a spike of thousands of inbox deliveries come in in just a few seconds, I'd have to hold all of those network connections open while I process them all. Depending on web server, framework, etc, that's often not easy or cheap.
  • Also, if you're on serverless or another form of auto scaling - which BF's public-facing serving is - request latency is one of the primary factors. You can tune auto scaling to some degree, but only somewhat. If this technique makes you consistently spin up and pay for lots more instances than you'd otherwise need, that's a problem.
  • Just like fediverse servers have a range of retrying, and we can't really know them all ahead of time, fediverse servers will have a range of timeouts. Mastodon's 30s may be nice, but it's just one.
  • ...and depending on CPU and memory, 30s often won't be anywhere near enough to work through several thousand activities.
  • To some degree, retries are counterproductive. Yes, they spread out load over time. However, they do that by adding load, which may well exacerbate these problems more than help them.
  • Many activities require external fetches to handle, and it's hard to know whether a given activity will require one until you dive into actually processing it.

@qazmlp

qazmlp commented Sep 23, 2024

Thank you for this! (And thank you for your patience with me. I know I go off on not-that-helpful tangents sometimes.)

I saved this list so that I'll have it on hand when I eventually get around to doing something practically with AP.

(I have some vague project ideas, but I'm currently too swamped with other things to start any new ones. Still, I feel like I will be able to tackle them much more efficiently thanks to the knowledge I'm gaining here.)

@snarfed
Owner Author

snarfed commented Oct 1, 2024

One catch here is that unlike Bluesky, we default to bridging web sites to/from the fediverse, so incoming AP activities aren't as easy to discard as Bluesky events. Doable, but I suspect it may be less of a win.

@snarfed
Owner Author

snarfed commented Oct 1, 2024

I did a brief survey of a small sample of incoming AP activities that we didn't handle, like #1329 (comment).

  • 6x repost of non-bridged note
  • 2x delete non-bridged note
  • update question (ie poll)

OK, so lots of reposts of non-bridged posts. Most of those will be because the author isn't bridged, which means they'll be relatively straightforward to filter out. For the rest, and for the deletes, we have to look up the object itself to see if it was bridged.

Updates of polls (ie poll answers) will also be straightforward to filter out based on type. (We should already be doing this!)

Incidentally, I also saw lots of ATProto undo follows. We pass through all ATProto deletes and undos since we can't know whether the original object they're deleting/undoing was bridged until we look it up. I wonder if we could do better there.
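
For concreteness, a rough sketch of the pre-enqueue filter described above (hypothetical helper `is_bridged`, not actual Bridgy Fed code): drop reposts whose author clearly isn't bridged and updates of Question objects, and fall through to the normal path whenever only an object lookup could decide.

```python
def should_enqueue(activity: dict, is_bridged) -> bool:
    """Cheap pre-filter run before storing an activity or enqueueing a receive
    task. `is_bridged` is an assumed predicate over actor ids; deletes and bare
    object URLs still go through, since only a lookup can tell whether the
    object itself was bridged."""
    type_ = activity.get('type')
    obj = activity.get('object')
    obj = obj if isinstance(obj, dict) else {'id': obj}
    author = obj.get('attributedTo')

    # Repost of a post whose author clearly isn't bridged: nothing to do.
    if type_ == 'Announce' and author and not is_bridged(author):
        return False

    # Update of a Question, ie poll votes changing: not bridged anyway.
    if type_ == 'Update' and obj.get('type') == 'Question':
        return False

    return True
```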

@snarfed
Owner Author

snarfed commented Oct 1, 2024

I'm going to deprioritize this in favor of #1354.

@snarfed snarfed removed the now label Oct 1, 2024
@Tamschi Tamschi added the infra label Oct 31, 2024
@snarfed snarfed added the now label Nov 16, 2024