
Applying ops under load #123

Open
yaroslavputria opened this issue Dec 29, 2021 · 15 comments

@yaroslavputria

Hi Guys!

About the project: it's a Quill-based collaborative rich text editor. On the back-end side, there is a load balancer over 5 instances of ShareDB connected through Redis; sharedb-mongo is the DB adapter.
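
For reference, a minimal sketch of a backend wired up like that (connection URLs and database name are placeholders; the real setup will differ):

```js
const ShareDB = require('sharedb');
const ShareDBMongo = require('sharedb-mongo');
const redisPubSub = require('sharedb-redis-pubsub');

// One of the 5 instances behind the load balancer; all of them share the same
// Mongo database and Redis pub/sub so ops and presence fan out across instances.
const db = ShareDBMongo('mongodb://localhost:27017/yyp_ca_notes_editor');
const pubsub = redisPubSub('redis://localhost:6379');

const backend = new ShareDB({
  db: db,
  pubsub: pubsub,
  presence: true // presence turns out to matter later in this thread
});
```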

I'm doing a kind of load testing. I'm running end-to-end tests which simulate users editing: each user produces changes at some frequency. I'm limited in resources, so I can run about 50 chromedriver instances (so 50 users). As I have 5 instances of ShareDB, I expect 5 users per document working concurrently, so these tests simulate 50 users typing within 10 documents. The consumed CPU and RAM look good. When I noticed that under this load it takes a while for users to get in sync, I added logs to trace how long applying an op takes. I see that applying an op mostly takes milliseconds, BUT sometimes it takes more than 10 seconds: that's the case when the commit is unsuccessful and ShareDB has to re-fetch the snapshot/ops (to re-transform the op). Most of this time is spent accessing Mongo.

Is there anything I can improve, or is this a limit?
I use only one Mongo collection - would using multiple collections improve the experience?

Applying op timing:
```

+++ getSnapshot before 0
### collection is get 0
### beforeSnapshotLookup middlewate 0
### snapshot is fetched 0.001
+++ getSnapshot after 0.001
~~~ apply 0.001
~~~ commit 0.001
~~~ afterWrite 0.004
+++ getSnapshot before 0
### collection is get 0
### beforeSnapshotLookup middlewate 0
### snapshot is fetched 0.001
+++ getSnapshot after 0.001
~~~ apply 0.001
~~~ commit 0.001
~~~ afterWrite 0.003
+++ getSnapshot before 0
### collection is get 0
### beforeSnapshotLookup middlewate 0
### snapshot is fetched 0.851
+++ getSnapshot after 0.851
+++ getOpsToSnapshot before 0.851
### OP collection is get 0.851
### ops are fetched 1.774
+++ getOpsToSnapshot after 1.774
+++ ops transformed and will be applied 1.774
~~~ apply 1.774
~~~ commit 1.774
+++ unsuccessful commit -> do retry
+++ getSnapshot before 2.15
### collection is get 2.15
### beforeSnapshotLookup middlewate 2.15
### snapshot is fetched 2.152
+++ getSnapshot after 2.152
+++ getOpsToSnapshot before 2.152
### OP collection is get 2.152
### ops are fetched 2.154
+++ getOpsToSnapshot after 2.154
+++ ops transformed and will be applied 2.154
~~~ apply 2.154
~~~ commit 2.154
~~~ afterWrite 2.241
```
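
For reference, timings like the ones above can be collected with ShareDB backend middleware; a hypothetical sketch (not necessarily the instrumentation used here), assuming the `backend` from the setup sketch above and, for simplicity, one in-flight submit per connection:

```js
['submit', 'apply', 'commit', 'afterWrite'].forEach(function(action) {
  backend.use(action, function(context, next) {
    const custom = context.agent.custom; // per-connection scratch object
    if (action === 'submit') custom.submitStart = Date.now();
    const elapsed = (Date.now() - custom.submitStart) / 1000;
    console.log('~~~ ' + action + ' ' + elapsed.toFixed(3));
    next();
  });
});
```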
@alecgibson
Contributor

Thanks for the work you've done so far! Do you happen to have the MongoDB logs? They should contain the queries being made, as well as how long those queries are taking; that would really help narrow down whether the issue is a slow query or something in sharedb-mongo (or both!).

@ericyhwang
Contributor

Yeah, thanks for providing the detailed timing logs, those are illustrative!

On the first attempt, the two long delays are:

```
### beforeSnapshotLookup middlewate 0
### snapshot is fetched 0.851
...
### OP collection is get 0.851
### ops are fetched 1.774
```

Which are indeed both in sharedb-mongo + Mongo.

As Alec mentions, the slowness could be coming from:

  • MongoDB itself being slow at fulfilling the read requests for the snapshot and ops
  • Or the Node process may be blocked on something, perhaps on processing other requests

It would be interesting to take a performance profile of, say, 2 Node processes handling requests from 20 clients, to see if the bottleneck is Node process CPU, time spent waiting on Mongo, or something else entirely.

With just 2 Node processes, it should be feasible to run them both with --inspect, attach Chrome Dev Tools to both, and run the Performance profiler on both while running the load test. Or use your other favorite profiling tool.
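
If attaching --inspect at launch turns out to be awkward, one hedged alternative is opening the inspector programmatically behind a debug flag and then attaching Chrome Dev Tools to that port; a sketch (ENABLE_INSPECTOR is a hypothetical variable):

```js
const inspector = require('inspector');

// Hypothetical opt-in via an environment variable; the port can then be
// reached (or port-forwarded) and attached to from chrome://inspect.
if (process.env.ENABLE_INSPECTOR === 'true') {
  inspector.open(9229, '0.0.0.0');
  console.log('Inspector listening at', inspector.url());
}
```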

@yaroslavputria
Author

@alecgibson @ericyhwang thanks for the hints - I will collect more info.

@yaroslavputria
Author

I haven't done profiling yet - it requires more effort to hook into our Kubernetes infrastructure... But I'll do it.

Meanwhile, I set the Mongo log level to debug. I also added a log wrapped in nextTick before fetching ops/snapshots, to check whether Node.js is overloaded. In the cases where fetching the ops/snapshot takes too long, my nextTick log fires right after just a few Mongo log lines (so Node.js does not appear to be overloaded), but there are a lot of calls to Mongo (sometimes hundreds) before the ops/snapshot is actually fetched.
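
A rough sketch of the nextTick check described above (the actual logging here may differ); the callback runs as soon as the current synchronous work finishes, so a large delay would point at the Node process rather than Mongo:

```js
// Hypothetical helper: call right before asking sharedb-mongo for a snapshot/ops.
function logEventLoopDelay(label) {
  const scheduledAt = Date.now();
  process.nextTick(function() {
    console.log(label + ' nextTick fired after ' + (Date.now() - scheduledAt) + ' ms');
  });
}

logEventLoopDelay('before getSnapshot');
```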

Adapting the example above:

```
### beforeSnapshotLookup middlewate 0
// here, mongo logs a lot of queries
### snapshot is fetched 0.851 // looks like this fetch is queued
```

Attaching new logs with my comments (separated with //):
quick.log
slow.log

@ericyhwang
Contributor

Based on that, it currently seems like the slowness is due to Mongo time and not Node time, so it's probably best to look further at Mongo for now instead of trying to get Node profiling working in Kubernetes.

The first slow part of the slow log has a bunch of o_NOTES_EDITOR_SHAREDB queries. One thing to check - Is there an appropriate index { d: 1, v: 1 } on o_NOTES_EDITOR_SHAREDB? If not, those would take a long time in Mongo and could slow down other queries.

Later on, there are also multiple queries against the NOTES_EDITOR_SHAREDB snapshot collection, which should be fast since they're id lookups. But if there are enough happening at once, that could slow things down. Or it could be residual processing on the op queries.
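
For reference, one way to check this from the mongo shell (collection name taken from the logs above):

```js
// List the indexes on the ops collection; the suggested index shows up as
// key { "d": 1, "v": 1 } if it exists.
db.o_NOTES_EDITOR_SHAREDB.getIndexes()

// If it is missing, it can be created (background builds avoid blocking
// writes on older MongoDB versions):
db.o_NOTES_EDITOR_SHAREDB.createIndex({ d: 1, v: 1 }, { background: true })
```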

@yaroslavputria
Author

Hi @ericyhwang,

> One thing to check - Is there an appropriate index { d: 1, v: 1 } on o_NOTES_EDITOR_SHAREDB?

Thank you for trying to guide me, but I didn't get this - how do I check that?

> But if there are enough happening at once, that could slow things down. Or it could be residual processing on the op queries.

This is also unclear to me - what is "enough" in this context? I'm actually doing this load testing to understand what I need in order to let up to 500 users work concurrently (up to 10 users per document). At some point (~5 users per doc), scaling ShareDB stopped improving the experience.

@alecgibson
Contributor

@yaroslavputria
Author

🤦, thanks @alecgibson.

Documents collection indexes:

```json
[
   {
      "v":2,
      "key":{
         "_id":1
      },
      "name":"_id_",
      "ns":"yyp_ca_notes_editor.NOTES_EDITOR_SHAREDB"
   }
]
```

Ops collection indexes:

```json
[
   {
      "v":2,
      "key":{
         "_id":1
      },
      "name":"_id_",
      "ns":"yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB"
   },
   {
      "v":2,
      "key":{
         "d":1,
         "v":1
      },
      "name":"d_1_v_1",
      "ns":"yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB",
      "background":true
   },
   {
      "v":2,
      "key":{
         "src":1,
         "seq":1,
         "v":1
      },
      "name":"src_1_seq_1_v_1",
      "ns":"yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB",
      "background":true
   }
]
```

@alecgibson
Contributor

Out of curiosity, have you got getOpsWithoutStrictLinking: true?
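
For anyone following along, that option is passed to the sharedb-mongo constructor; a sketch (URL and database name are placeholders):

```js
const ShareDBMongo = require('sharedb-mongo');

const db = ShareDBMongo('mongodb://localhost:27017/yyp_ca_notes_editor', {
  // Relaxes the strict op-link checks when reading ops, trading some
  // consistency guarantees for cheaper getOps() queries.
  getOpsWithoutStrictLinking: true
});
```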

@alecgibson
Contributor

Also do you have access to the MongoDB logs from the mongod process itself? By default I think this lives at /var/log/mongodb/mongodb.log and it might contain a bit more detailed information, such as how long each query is taking.
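
In addition to mongod.log, the database profiler can capture per-query timings; a hedged sketch from the mongo shell (the threshold is illustrative):

```js
// Record operations slower than 50 ms in the system.profile collection.
db.setProfilingLevel(1, { slowms: 50 })

// Inspect the most recent slow operations, including execution time in millis.
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
```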

@yaroslavputria
Author

> Out of curiosity, have you got getOpsWithoutStrictLinking: true?

Yes, before I created this issue I had tried getOpsWithoutStrictLinking.

> MongoDB logs from the mongod process

I'll try to collect them.

@yaroslavputria
Author

I added more logs to figure out what that queue of queries delaying getSnapshot during a slow op is - and now I see that it's caused by broadcasting presence data. That bunch of identical queries is a set of getSnapshotOpLink requests (getSnapshotOpLink is part of transformPresenceToLatestVersion). So clients produce this load by sharing presence data - I will look into this further.

See the attached log file (it has more stacktrace information).
slow.log
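
For context, a rough sketch of the client-side presence usage that drives this broadcasting (connection, docId, and the selection shape are illustrative; the real editor integration, e.g. the Quill cursors binding, may differ):

```js
// `connection` is a ShareDB client Connection; every local selection change
// submits presence, which the server transforms against the latest ops
// (transformPresenceToLatestVersion) before broadcasting it to other clients.
const presence = connection.getDocPresence('NOTES_EDITOR_SHAREDB', docId);

presence.subscribe(function(error) {
  if (error) console.error(error);
});

presence.on('receive', function(presenceId, range) {
  // render the remote user's cursor/selection
});

const localPresence = presence.create();
localPresence.submit({ index: 0, length: 5 }, function(error) {
  if (error) console.error(error);
});
```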

@yaroslavputria
Author

yaroslavputria commented Jan 17, 2022

Without presence it's faster than light.

@alecgibson
Contributor

@yaroslavputria thanks for all the digging you've done on this!

Here are some of my initial thoughts:

We definitely don't want Presence bogging down "real" ShareDB work. In theory, Presence itself is OT, so it should be able to work at a similar performance level.

I think the issue here is a combination of:

  • Presence's over-reliance on db.getOps()
  • The performance of db.getOps() itself?

Let's compare with SubmitRequest.submit(). In SubmitRequest.submit(), we fetch the snapshot once, apply the op to it and try to commit; we only fetch ops when the commit fails because of a concurrent op and the submitted op needs re-transforming.

This means that in the "happy" case (clients don't submit concurrent ops), we never have to fetch any ops, and we reuse the snapshot that we already had.

Contrast this with Agent._broadcastPresence: even in the happy case, we call getOps() every time there's a Presence update.

There's possibly some trickery we can do here to re-optimise for the "happy" case, because we transform Presence both server-side and client-side.

If you turned off server-side transformations, this would massively reduce the calls to the Database, because we'd only ever call it if the client needs to catch up some presence. This would come at the cost of some potential lag introduced into the Presence system where ops collide with Presence.

I've opened share/sharedb#538 to continue discussion down that thread.

@yaroslavputria would you be able to please re-run your performance test using that branch?

@yaroslavputria
Author

Hi @alecgibson,
Yes, I'm going to re-run my tests using that branch.
Thanks!
