ci(benchmarks): add/enable startup benchmark to run in ci #12563

Merged
merged 6 commits into main from brettlangdon/benchmark.startup on Mar 4, 2025

Conversation

brettlangdon
Member

@brettlangdon brettlangdon commented Feb 28, 2025

We had a previous half-finished startup benchmark. This PR cleans it up, adds a config, and enables it in CI.

This test uses subprocess.check_call() to measure the time it takes to start and tear down a Python process with the given options/commands.
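
For context, here is a minimal sketch of the measurement approach, assuming a plain pyperf harness (pyperf is what produces the "Mean +- std dev" output shown below). The variant list and structure here are illustrative only; the real scenario is driven by the repo's benchmark framework and its config.yaml.

```python
import subprocess
import sys

import pyperf


def run_variant(args):
    # Start a fresh process and wait for it to exit; the child's full
    # start + teardown time is what gets measured.
    subprocess.check_call(args)


if __name__ == "__main__":
    runner = pyperf.Runner()
    # Illustrative variants only; the real list lives in the scenario config.
    runner.bench_func("startup-baseline", run_variant, [sys.executable, "-c", ""])
    runner.bench_func(
        "startup-import_ddtrace_auto",
        run_variant,
        [sys.executable, "-c", "import ddtrace.auto"],
    )
```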

Benchmark summary

tl;dr: "How long does it take to get the user's original program started when dd-trace-py is added?"

This benchmark is designed to measure the overhead added to process start time when loading/initializing dd-trace-py.

There are variants which also include importing a popular library/integration (flask) to help evaluate the impact on import/startup overhead when loading that library.

This benchmark can/should be updated to include other variants to validate certain features/products being enabled/disabled and the impact on startup overhead.

Notes:

  • We are also measuring process teardown overhead, since the total timing covers starting a process and waiting for it to exit. This means that if a variant generates any data to flush, it may block shutdown waiting on that flush and distort the measured startup overhead.
  • We are, sort of, able to use PROFILE_BENCHMARKS=1 here. viztracer supports tracing into subprocesses, but it will only inject into processes whose command starts with python, so the ddtrace-run variants won't get traced. I've also had mixed results depending on the variant, but it is technically possible and worth playing with.
    • Generating profiles locally is probably much easier and more consistent, e.g. viztracer python -c "import ddtrace.auto" is a faster/easier way to get profiling data for a given variant.
    • The profiles can get MASSIVE and then difficult to load, because we are starting a ton of subprocesses, and generating profiles for each of them. Another reason why running locally might be better/faster/etc.
  • We have added an env config option for future use; we are not currently using it. I imagine it could be used for product/feature enablement/disablement (see the sketch after this list).
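
As a hypothetical illustration of that env option, a variant could merge extra environment variables into the child process before timing it; the variable below is just an example toggle, not something this PR wires up.

```python
import os
import subprocess
import sys

# Hypothetical per-variant environment override to enable/disable a product,
# merged over the parent environment before timing the child process.
variant_env = {"DD_PROFILING_ENABLED": "true"}  # example only

subprocess.check_call(
    [sys.executable, "-c", "import ddtrace.auto"],
    env={**os.environ, **variant_env},
)
```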

Variants

  • baseline: python -c ""
  • baseline_flask: python -c "import flask"
  • ddtrace_run: ddtrace-run python -c ""
  • ddtrace_run_flask: ddtrace-run python -c "import flask"
  • import_ddtrace: python -c "import ddtrace"
  • import_ddtrace_flask: python -c "import ddtrace; import flask"
  • import_ddtrace_auto: python -c "import ddtrace.auto"
  • import_ddtrace_auto_flask: python -c "import ddtrace.auto; import flask"
  • ddtrace_run_import_ddtrace_auto: ddtrace-run python -c "import ddtrace.auto"
  • ddtrace_run_import_ddtrace_auto_flask: ddtrace-run python -c "import ddtrace.auto; import flask"

Results

Comparing 2.21.3 against 3.1.0.

The following results were run locally from my laptop with ./scripts/perf-run-scenario startup "ddtrace~=2.21" "ddtrace~=3.0".

2.21.3 raw results
startup-baseline: Mean +- std dev: 5.07 ms +- 0.10 ms
startup-baseline_flask: Mean +- std dev: 65.2 ms +- 1.2 ms
startup-ddtrace_run: Mean +- std dev: 183 ms +- 3 ms
startup-ddtrace_run_flask: Mean +- std dev: 233 ms +- 3 ms
startup-import_ddtrace: Mean +- std dev: 66.0 ms +- 0.8 ms
startup-import_ddtrace_flask: Mean +- std dev: 111 ms +- 2 ms
startup-import_ddtrace_auto: Mean +- std dev: 127 ms +- 2 ms
startup-import_ddtrace_auto_flask: Mean +- std dev: 175 ms +- 4 ms
startup-ddtrace_run_import_ddtrace_auto: Mean +- std dev: 187 ms +- 2 ms
startup-ddtrace_run_import_ddtrace_auto_flask: Mean +- std dev: 236 ms +- 3 ms
3.1.0 raw results
startup-baseline: Mean +- std dev: 5.10 ms +- 0.09 ms
startup-baseline_flask: Mean +- std dev: 68.2 ms +- 1.7 ms
startup-ddtrace_run: Mean +- std dev: 194 ms +- 4 ms
startup-ddtrace_run_flask: Mean +- std dev: 244 ms +- 4 ms
startup-import_ddtrace: Mean +- std dev: 72.0 ms +- 1.2 ms
startup-import_ddtrace_flask: Mean +- std dev: 119 ms +- 4 ms
startup-import_ddtrace_auto: Mean +- std dev: 128 ms +- 2 ms
startup-import_ddtrace_auto_flask: Mean +- std dev: 178 ms +- 4 ms
startup-ddtrace_run_import_ddtrace_auto: Mean +- std dev: 193 ms +- 3 ms
startup-ddtrace_run_import_ddtrace_auto_flask: Mean +- std dev: 245 ms +- 6 ms
Comparison results

| Benchmark | startup/2.21.3/results.json | startup/3.1.0/results.json |
|---|---|---|
| startup-baseline_flask | 65.2 ms | 68.2 ms: 1.05x slower |
| startup-ddtrace_run | 183 ms | 194 ms: 1.06x slower |
| startup-ddtrace_run_flask | 233 ms | 244 ms: 1.05x slower |
| startup-import_ddtrace | 66.0 ms | 72.0 ms: 1.09x slower |
| startup-import_ddtrace_flask | 111 ms | 119 ms: 1.07x slower |
| startup-import_ddtrace_auto | 127 ms | 128 ms: 1.01x slower |
| startup-import_ddtrace_auto_flask | 175 ms | 178 ms: 1.02x slower |
| startup-ddtrace_run_import_ddtrace_auto | 187 ms | 193 ms: 1.03x slower |
| startup-ddtrace_run_import_ddtrace_auto_flask | 236 ms | 245 ms: 1.04x slower |
| Geometric mean | (ref) | 1.04x slower |

Benchmark hidden because not significant (1): startup-baseline

Analysis

  1. The overhead added in 3.x is fairly consistent across all variants, including baseline_flask (which does not load ddtrace at all). This suggests the benchmark may have a high level of variance, rather than 3.x actually being slower.
    1. Note: If this benchmark does have a high level of variance, it might be too noisy to run on every PR, or we might need a different threshold for alerting on a regression.
    2. Running the benchmark a few times locally, I see the same slowdown consistently, but baseline_flask also being consistently slower is confusing.
  2. I was expecting import flask to add a consistent overhead when comparing baseline_flask against import_ddtrace_flask, but it is actually faster to import flask once ddtrace has already been imported (with no patching). My hypothesis is that we share some dependencies (importlib.metadata, for example), and ddtrace pays the cost of importing those deps up front, so a later import flask benefits (see the -X importtime sketch after this list).
    1. Goes from about 63 ms to import flask down to a consistent 50 ms.
    2. baseline_flask (68.2 ms) - baseline (5.1 ms) = 63 ms to import flask
    3. import_ddtrace_auto_flask (178 ms) - import_ddtrace_auto (128 ms) = 50 ms to import flask
    4. ddtrace_run_flask (244 ms) - ddtrace_run (194 ms) = 50 ms to import flask
  3. I was expecting ddtrace_run_import_ddtrace_auto to have the same overhead as the import_ddtrace_auto variant; however, the plain import ddtrace.auto variant is faster. This is likely because ddtrace-run does some work of its own and then starts a subprocess that essentially runs import ddtrace.auto. Comparing ddtrace_run against ddtrace_run_import_ddtrace_auto shows the same overhead, which makes sense.
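
One way to check the shared-dependency hypothesis in point 2 (outside of this benchmark) is CPython's -X importtime output, which reports cumulative import time per module. A rough sketch, using a hypothetical helper for local investigation only:

```python
import subprocess
import sys


def flask_import_us(pre_imports: str = "") -> int:
    """Cumulative microseconds -X importtime attributes to importing flask in a
    fresh interpreter, optionally after some other imports (hypothetical helper)."""
    code = f"{pre_imports}; import flask" if pre_imports else "import flask"
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # -X importtime writes "import time: self | cumulative | module" lines to stderr.
    for line in proc.stderr.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[2] == "flask":
            return int(parts[1])
    raise RuntimeError("flask not found in -X importtime output")


print("flask alone:        ", flask_import_us(), "us")
print("flask after ddtrace:", flask_import_us("import ddtrace"), "us")
```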

Next steps

  1. It will be interesting to see results from this benchmark on new PRs; we need to keep a close eye on whether it is "flaky".
  2. It would be interesting to see product teams add variants that explicitly enable/disable their products/features to measure the startup impact of each.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@brettlangdon brettlangdon requested review from a team as code owners February 28, 2025 14:16
Contributor

github-actions bot commented Feb 28, 2025

CODEOWNERS have been resolved as:

benchmarks/startup/config.yaml                                          @DataDog/apm-core-python
benchmarks/startup/requirements_scenario.txt                            @DataDog/apm-core-python
benchmarks/startup/scenario.py                                          @DataDog/apm-core-python

@christophe-papazian
Contributor

It would be nice to simulate a real prod env. Looking at metabase, I see that many flask users also installed:
flask_cors
flask_sqlalchemy
flask_caching
flask_jwt_extended
flask_babel
flask_migrate
flask_appbuilder
flask_limiter
flask_compress
flask_session

For Django users:
django-cors-headers
django-oauth-toolkit
django-storages
django-appconf
django-filter
django_extensions

For FastAPI:
fastapi_cli
fastapi.utils
fastapi_pagination
fastapi.cli
fastapi_utils

for all users, most used:
certifi
pytz
cryptography
deprecated
botocore
boto3
aiohappyeyeballs
aiohttp
starlette
propcache
sqlalchemy

@pr-commenter

pr-commenter bot commented Feb 28, 2025

Benchmarks

Benchmark execution time: 2025-03-04 14:08:14

Comparing candidate commit 09ee968 in PR branch brettlangdon/benchmark.startup with baseline commit 0924eae in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 429 metrics, 2 unstable metrics.

scenario:iast_aspects-ospathbasename_aspect

  • 🟩 execution_time [-318.218ns; -269.757ns] or [-8.852%; -7.504%]

@emmettbutler emmettbutler added the changelog/no-changelog A changelog entry is not required for this PR. label Feb 28, 2025
@brettlangdon
Member Author

@christophe-papazian thinking more about this, we might just want to create dedicated tests for more realistic apps. I'd like to keep this benchmark as "micro" as we can, especially since we do have the flask_simple and django_simple benchmarks which do try to replicate an actual app (ok, not a lot of extensions/extras, but at least spin up django with an endpoint and hit it).

Maybe a better follow-up is adding flask_startup and django_startup type benchmarks instead, to focus on the added startup overhead for various framework setups.

I think that and this benchmark could work together. e.g. if we see an increase in this benchmark, we'd expect to see an increase in the flask and django scenarios... or if we see an increase in the django startup benchmark, but not this one, then it helps narrow the problem.

Not 100% sold in either direction, cool with iterating as a follow-up?

@brettlangdon
Member Author

Note: the microbenchmark run for this new scenario is failing in CI because the scenario technically exists on the baseline, but it is missing its config.yaml, so it fails with that file missing.

FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'

I'll investigate; not enabling the benchmark in CI in this PR and doing that as a follow-up might be the better option. I'll keep it enabled in CI for now so that, as we iterate, we can validate that it executes properly on the candidate branch.

@gaogaotiantian

viztracer supports tracing into subprocesses, but it will only inject into processes which start with python, so the ddtrace-run variants won't get traced.

This is added in gaogaotiantian/viztracer#562. Not sure when the next release will be but the feature will be there.

@brettlangdon
Member Author

hey @gaogaotiantian ! thanks for responding, and thanks for viztracer !

@gaogaotiantian

No problem! Let me know if you need anything from viztracer :)

@datadog-dd-trace-py-rkomorn

Datadog Report

Branch report: brettlangdon/benchmark.startup
Commit report: a02e37f
Test service: dd-trace-py

✅ 0 Failed, 43 Passed, 215 Skipped, 55.93s Total duration (4m 51.98s time saved)

@brettlangdon brettlangdon enabled auto-merge (squash) March 4, 2025 13:24
@brettlangdon brettlangdon merged commit bcd4bf6 into main Mar 4, 2025
313 checks passed
@brettlangdon brettlangdon deleted the brettlangdon/benchmark.startup branch March 4, 2025 14:09
brettlangdon added a commit that referenced this pull request Mar 4, 2025
This finishes up the work from #12563 and enables the benchmarks to run.

They were failing before because the scenario technically existed on the
`main` branch, but not in a state where it could run successfully, so
the benchmark scenario would always fail.
