
Deepmind Performance Analysis #15

jubeless opened this issue Mar 20, 2020 · 5 comments


jubeless commented Mar 20, 2020

We ran a reprocessing of eos-mainnet from block 111,172,000 to 111,188,000, knowing that the "problem area" is around block 111,172,500.

Test 1: Mindreader without deepmind (DM) (disabled in config.ini)
image: gcr.io/eoscanada-shared-services/eos-mindreader:v2.0.1-dm-v10.4-712cf00-98a6fc0
This is our baseline test.

Test 2: Mindreader with deepmind (DM) enabled & no console reader (CR) using a custom manageos branch
image: eos-mindreader:v2.0.3-dm-base-ubuntu-18.04-a8059ea
This is a unique scenario to understand the impact of deepmind (DM) in isolation (not running a console reader).

Test 3: Mindreader with deepmind (DM) enabled & console reader (CR) enabled
image: gcr.io/eoscanada-shared-services/eos-mindreader:v2.0.1-dm-v10.4-712cf00-98a6fc0
This is a reproduction of the production environment.

Test 4: Mindreader with deepmind (DM), no ABI serializer, no output (the output variants of Tests 4–7 are illustrated after the test list)
image: gcr.io/eoscanada-shared-services/eos-mindreader:v2.0.3-dm-no-json-data-ubuntu-18.04-a8059ea

Test 5: Mindreader with deepmind (DM), no ABI serializer & output in hex, with EOS binary encoding
image: gcr.io/eoscanada-shared-services/eos-mindreader:v2.0.3-dm-pack-with-hex-output-ubuntu-18.04-a8059ea

Test 6: Mindreader with deepmind (DM), EOS binary encoding but no output
image: gcr.io/eoscanada-shared-services/eos-mindreader:v2.0.3-dm-pack-no-output-ubuntu-18.04-a8059ea

Test 7: Mindreader with deepmind (DM), ABI serializer but no output
image: gcr.io/eoscanada-shared-services/eos-mindreader:v2.0.3-dm-no-json-data-with-computation-ubuntu-18.04-a8059ea
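To make the Test 4–7 variants concrete, here is a rough conceptual sketch in Go (nodeos itself is C++, and all type and field names below are invented for illustration, not taken from the actual code) of the output strategies being compared: running the ABI serializer to produce JSON, emitting the EOS binary encoding as hex, or skipping the output entirely.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// transfer is an invented stand-in for an ABI-decoded action payload.
type transfer struct {
	From     string `json:"from"`
	To       string `json:"to"`
	Quantity string `json:"quantity"`
}

func main() {
	// Stand-in for EOS binary-encoded (packed) action data.
	packed := []byte{0x0a, 0x42, 0x99, 0x01}

	// Test 5 style: keep the EOS binary encoding and emit it as hex,
	// which is cheap (Test 6 additionally skips the output itself).
	fmt.Println("hex:", hex.EncodeToString(packed))

	// Test 3/7 style: run the ABI serializer to produce JSON, the
	// expensive step (Test 7 computes it but skips the output).
	out, err := json.Marshal(transfer{From: "alice", To: "bob", Quantity: "1.0000 EOS"})
	if err != nil {
		panic(err)
	}
	fmt.Println("json:", string(out))
}
```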

General Conclusion:

  • The mindreader with DM only (Test 2), relative to the mindreader with no DM (Test 1), is on average 753% slower: 1,000 blocks took on average 5m21s vs 38s.
  • The mindreader with DM & CR (Test 3), relative to the mindreader with no DM, is on average 786% slower: 1,000 blocks took on average 5m57s vs 41s.
  • Enabling deepmind adds significant latency, as seen in Test 2. Adding the console reader on top of DM (Test 3) adds a further but negligible latency, as seen by comparing the results of Tests 2 & 3. (The slowdown arithmetic is sketched below.)
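For clarity on how the slowdown percentages are derived, a minimal sketch using the rounded averages quoted above (the reported 753%/786% figures were presumably computed from the unrounded per-run data, so the results here differ slightly):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Average time to process 1,000 blocks, taken from the tests above.
	type comparison struct {
		name             string
		baseline, withDM time.Duration
	}
	comps := []comparison{
		{"DM only (Test 2 vs Test 1)", 38 * time.Second, 5*time.Minute + 21*time.Second},
		{"DM + CR (Test 3 vs Test 1)", 41 * time.Second, 5*time.Minute + 57*time.Second},
	}
	for _, c := range comps {
		// Relative slowdown: extra time as a percentage of the baseline.
		pct := (c.withDM.Seconds() - c.baseline.Seconds()) / c.baseline.Seconds() * 100
		fmt.Printf("%s: %.0f%% slower\n", c.name, pct) // ~745% and ~771% from rounded inputs
	}
}
```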

Reproc YAMLs:

reproc-yaml.zip


abourget commented Mar 20, 2020

Conclusion:

  • There might have been some big blocks, but a single big block (or a few of them) is not alone responsible for jamming the chain for 20 minutes.
  • There is, in the worst cases, an 8x performance hit caused by deep-mind instrumentation. This needs to, and can, be addressed separately.

More research:

  • How many SWITCHED_FORK events did we have during that stalling period? Was deepmind constantly trying to execute and re-execute some blocks? Was it simply stalled?
    • Instrument a count of SWITCHED_FORK events at the deep-mind level, and a count of lines we process between each block. This is data we could gather WHILE it's running next time, and it would indicate what is happening (see the sketch after this list).
  • Ideally, next time this happens, we would be ready to profile nodeos, like we did in the early days, to see where it is at when it seems completely stalled for 20 minutes.
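A minimal sketch of that instrumentation, reading the deep-mind stream on stdin. It assumes lines are prefixed with "DMLOG" and that a block is closed by an "ACCEPTED_BLOCK" record; both record names are assumptions based on this discussion, not confirmed against the console reader's actual vocabulary.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Deep-mind lines can be very large; raise the scanner's max token size.
	scanner.Buffer(make([]byte, 0, 1024*1024), 100*1024*1024)

	forkSwitches, linesSinceBlock := 0, 0
	for scanner.Scan() {
		line := scanner.Text()
		linesSinceBlock++
		switch {
		case strings.HasPrefix(line, "DMLOG SWITCHED_FORK"): // assumed record name
			forkSwitches++
		case strings.HasPrefix(line, "DMLOG ACCEPTED_BLOCK"): // assumed record name
			fmt.Fprintf(os.Stderr, "block done: %d lines since last block, %d fork switches so far\n",
				linesSinceBlock, forkSwitches)
			linesSinceBlock = 0
		}
	}
}
```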

@abourget

Backup nodes showed heavy transaction drops, while mindreaders didn't.

Questions:

  • Are backup nodes configured to speculatively execute transactions, or to process or relay transactions?
  • Why did the mindreaders not show the same signs? Logging configuration? Read-only options that would cause those to be ignored?
  • Are the dropped trx something we see normally, or are they correlated with the event? A stackdriver log with a small graph would help correlate. <- @fproulx-eoscanada


abourget commented Mar 20, 2020

Analysis of the Prometheus graphs reveals that:

  • Our data shows mindreader caught up ~1,800 blocks (15 minutes' worth of blocks) in around 15 seconds, which, according to the data above, is impossible: at Test 3's throughput of roughly 0.36s per block, 1,800 blocks should take over 10 minutes.

[two Prometheus graph screenshots]

Conclusion:

  • It seems that something is incongruent between nodeos' actual execution and the head block drift data we report.
  • We need better instrumentation closer to nodeos to understand what's going on inside it.

Next time we can:

  • Perhaps query nodeos directly to see if it's progressing when our instrumentation reports drift (a polling sketch follows below).
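A minimal sketch of that cross-check, polling nodeos' standard /v1/chain/get_info endpoint (the local URL and poll interval are assumptions) and reporting whether head_block_num is actually advancing:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// chainInfo captures the one get_info field we care about.
type chainInfo struct {
	HeadBlockNum uint32 `json:"head_block_num"`
}

func main() {
	var last uint32
	for {
		resp, err := http.Get("http://localhost:8888/v1/chain/get_info") // assumed local API address
		if err != nil {
			fmt.Println("get_info failed:", err)
		} else {
			var info chainInfo
			if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
				fmt.Println("decode failed:", err)
			} else if info.HeadBlockNum == last {
				fmt.Println("nodeos appears stalled at block", last)
			} else {
				fmt.Printf("head advanced %d -> %d\n", last, info.HeadBlockNum)
				last = info.HeadBlockNum
			}
			resp.Body.Close()
		}
		time.Sleep(5 * time.Second)
	}
}
```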


maoueh commented Mar 28, 2020

Some reproduction steps:


jubeless commented Apr 1, 2020

Updated experiment results:
[results chart]

data:
mindreader-analysis.xlsx
