Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚡️ Remove server-side Presence transformations #538

Closed
wants to merge 1 commit into from

Conversation

alecgibson
Copy link
Collaborator

@alecgibson alecgibson commented Jan 17, 2022

Fixes share/sharedb-mongo#123

This change addresses a performance issue with Presence, whereby every
single Presence update will cause a Database call to fetch the latest
ops, and transform Presence up.

This can be quite costly if there are a lot of Presence updates, and
adversely impact the "normal" operation of ShareDB. This is particularly
strange for the "happy case", where two clients are exchanging Presence
on the same version of a document (where there should be no need to
query the database at all for Presence).

We already have mechanisms in the client to transform presence, as well
as re-request stale presence, so we can remove the server-side Presence
transformations all together.

This alleviates pressure on the server, at the cost of decreased
performance in edge cases, where clients need to re-request Presence
from other clients, or catch up on cached ops. However, this should be a
sensible change to make, since we're trying to optimise for the happy
path, rather than edge cases.

Note that this changes a test case: we have to resubscribe doc1 in
this test case, otherwise it's never up-to-date, and presence2
continues to throw out its presence updates. This could happen in a
real-world situation (where doc1 stays offline), but it's arguable
that this is uninteresting presence anyway, since that client is not
"live".

@coveralls
Copy link

coveralls commented Jan 17, 2022

Coverage Status

Coverage decreased (-0.01%) to 97.391% when pulling febb54f on presence-performance into aa720e9 on master.

This change addresses a performance issue with Presence, whereby every
single Presence update will cause a Database call to fetch the latest
ops, and transform Presence up.

This can be quite costly if there are a lot of Presence updates, and
adversely impact the "normal" operation of ShareDB. This is particularly
strange for the "happy case", where two clients are exchanging Presence
on the same version of a document (where there should be no need to
query the database at all for Presence).

We already have mechanisms in the client to transform presence, as well
as re-request stale presence, so we can remove the server-side Presence
transformations all together.

This alleviates pressure on the server, at the cost of decreased
performance in edge cases, where clients need to re-request Presence
from other clients, or catch up on cached ops. However, this should be a
sensible change to make, since we're trying to optimise for the happy
path, rather than edge cases.

Note that this changes a test case: we have to resubscribe `doc1` in
this test case, otherwise it's never up-to-date, and `presence2`
continues to throw out its presence updates. This could happen in a
real-world situation (where `doc1` stays offline), but it's arguable
that this is uninteresting presence anyway, since that client is not
"live".
@yaroslavputria
Copy link

yaroslavputria commented Jan 18, 2022

@alecgibson, I've run the tests. Now it's faster a bit, but still, there are slow cases of op applying.
Seems that it's still because of presence - just look what is the difference in CPU consumptions when presence is enabled and disabled (tests are the same).

CPU usage of one of the sharedb pods.
tempsnip

New slow.log
The mongo queries log includes only queries which are currently sent to mongo but have not been processed yet (so this is kind of a pending responds queue from mongo). Also, I extended stack size a bit.

BTW, I have throttling for sending presence not often than every second.

@alecgibson
Copy link
Collaborator Author

@yaroslavputria any chance of more details on that CPU profile, like a flame chart? I can't really tell you what the issue is without more details, especially because your attached logs now have no mention of presence at all.

Also can you please remind me of the parameters of your load testing? I think Presence in its current form has always been vulnerable to some fan-out effects; we can try to minimise them if you can give us more details, please!

@curran
Copy link
Contributor

curran commented Jan 18, 2022

Very nice!

@ericyhwang
Copy link
Contributor

  1. Yeah, CPU profiler data from Node would definitely be the most helpful, as that would let us identify exactly what code is hot.

Some other things that could also help:
2) On a client, you can set connection.debug = true to log out all the client's messages to/from the server. That can help identify if there's some fan-out caused by the client code.
3) Similarly, on the server, backend.use('receive', (context) => console.log('RECV', context.data)) and backend.use('reply', console.log('SEND', context.reply)) will print out the messages the server is receiving/sending.

@yaroslavputria
Copy link

@alecgibson, @ericyhwang, thanks for the hints!

any chance of more details on that CPU profile, like a flame chart?

I'll try to get CPU profiling (but I think it will be on the next week).

Also can you please remind me of the parameters of your load testing?

30 users typing within 6 documents (5 users per document).
Typing simulation inserts a symbol and immediately moves the cursor after it - it happens every 100-400ms, and every 15 symbols there is an additional timeout for 3 seconds - I assume it's close to real typing.

@alecgibson
Copy link
Collaborator Author

@yaroslavputria could you please also try this by not moving the cursor immediately after the inserted character? Presence should already do all the heavy lifting for you by transforming ops on remote clients by the applied op, so you don't actually need to broadcast this information. Submitting presence at the same time as an op actually hits all sorts of fun presence edge cases, which is probably why you're seeing such a heavy CPU load. We transform presence client side to avoid the need for this sort of thing all together.

You'll still need to set the presence at the start of the test, obviously, and if you want to flex other bits of presence, you could add random presence moves in between ops.

@yaroslavputria
Copy link

yaroslavputria commented Jan 22, 2022

@alecgibson, I've just seen your last comment.
I'm not sure I understand what you mean. I already have throttling for submitting local presence (it's been submitted not often than ~1 second). And I was wrong when I said that the cursor is immediately set after the newly inserted character, it's being set after the previous text insert (I'm good with such a "delay" - I just want to simulate a realistic load on presence sharing). What do you suggest - should I submit presence more rarely (throttle more) or should the cursor be set somewhere within older ops? I understand that there is presence transforming on the client side, but I think it doesn't move the cursor after the newly inserted text, it just sticks to the previous character - or am I wrong?

@alecgibson
Copy link
Collaborator Author

@yaroslavputria I've updated the Presence example in #539

Basically, you only ever need to update your Presence when the user "actively" moves their selection. If their selection update is the result of an op, there's no need to send the information.

For example:

  1. Start with foo
  2. Click with your mouse to place your cursor: fo|o
  3. This needs a Presence update to be sent
  4. Move the cursor with the keyboard arrow: foo|
  5. This also needs a Presence update to be sent
  6. Type some new content: foobar|
  7. This does not need a Presence update to be sent, because the remote client will transform your current position by the new op. By not sending your Presence, you reduce network and performance overheads, and avoid edge cases around Presence and ops arriving out of order (which are heavy to recover from).

That being said, people might be moving around a document as other people are typing, so it's probably still worth trying different performance setups. A CPU flame chart would be super useful to debug where the performance bottlenecks are happening.

@yaroslavputria
Copy link

@alecgibson, thanks for the clarifications.
You are right, there are too many presence updates that cause such an overload. I've got flame charts - the hottest code relates to the presence messaging. From your clarifications now I understand that having presence handler on the "text-change" quill event is redundant (I've removed it) - this was the first reason for my performance issues. The second one was in my test simulation approach - now I do not set the cursor position each time when I add text.
Now everything looks good. Thanks everybody for the help!

@alecgibson
Copy link
Collaborator Author

Closing this as no longer required. If we think this might help in future, we can reopen.

@alecgibson alecgibson closed this Jul 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Applying ops under load
5 participants