|
| 1 | +--- |
| 2 | +adr: "0020" |
| 3 | +status: In progress |
| 4 | +date: 2023-07-13 |
| 5 | +tags: [server] |
| 6 | +--- |
| 7 | + |
| 8 | +# 0020 - Observability with OpenTelemetry |
| 9 | + |
| 10 | +<AdrTable frontMatter={frontMatter}></AdrTable> |
| 11 | + |
| 12 | +## Context and Problem Statement |
| 13 | + |
| 14 | +As the codebase has matured over the years, the number of users on the platform has also grown
| 15 | +significantly, and more insight is needed into how services are performing at a fine-grained
| 16 | +level. External profilers can certainly be attached in any running environment, but the platform
| 17 | +itself needs to offer internal metrics, both to support self-hosted customers running the product
| 18 | +and to enable engineers to improve it and tackle performance issues with solid data and evidence
| 19 | +as to what should change and why.
| 20 | + |
| 21 | +## Considered Options |
| 22 | + |
| 23 | +:::note |
| 24 | + |
| 25 | +Bitwarden currently uses [Datadog][dd] as its monitoring tool and desires to increase its usage by |
| 26 | +engineers across the board to improve what we deliver. |
| 27 | + |
| 28 | +::: |
| 29 | + |
| 30 | +- **Maintain current observability options** - Expect those running the platform to configure what |
| 31 | + they need outside of it for log collection and profiling / monitoring. |
| 32 | +- **Extend the platform to specifically support Datadog** - [Tracing for Datadog][ddtracer] exists |
| 33 | + in package form and could be coded into application startup. Datadog-specific signals and metrics |
| 34 | + can be collected via code and sent to the platform. |
| 35 | +- **Implement native instrumentation** - Add logic via what's available from |
| 36 | + [`System.Diagnostics`][native] for custom instrumentation, and expect profiling to be configured |
| 37 | + per the first option above. |
| 38 | +- **Use open observability standards** - Use [OpenTelemetry][otel] to emit signals on the
| 39 | + console and use its own eventing approach for instrumentation and metrics data.
| 40 | + |
| 41 | +## Decision Outcome |
| 42 | + |
| 43 | +Chosen option: **Use open observability standards**. |
| 44 | + |
| 45 | +A strong alternative exists in using only native instrumentation and not tying the platform to the
| 46 | +implementation of any specific ecosystem -- even an open standard like OpenTelemetry. .NET closely
| 47 | +supports OpenTelemetry metric collection, but the real value lies in how that data is consumed
| 48 | +through output mechanisms like OTLP. A profiler attached to running components is independent of
| 49 | +the availability of metrics via other means such as collection by an agent.
| 50 | + |
| 51 | +Making metrics accessible through configuration wins out over expecting a profiler to be set up
| 52 | +and managed.
| 53 | + |
| 54 | +### Positive Consequences |
| 55 | + |
| 56 | +- Console logging of metrics, if enabled, fits well into container and orchestration tools, and
| 57 | + those environments can install agents to collect it.
| 58 | +- No new dependencies that are merely aligned with the Bitwarden-specific cloud and its service |
| 59 | + providers. |
| 60 | +- Components can be monitored with far more detail and lead to future improvements. |
| 61 | +- Use of an open standard like OpenTelemetry creates future flexibility for monitoring and
| 62 | + observability to grow with the expansion of that ecosystem, one example being OTLP export
| 63 | + rather than just console logging.
| 64 | + |
| 65 | +### Negative Consequences |
| 66 | + |
| 67 | +- Addition of the OpenTelemetry dependency across all services. |
| 68 | +- Proprietary profiler implementations may offer signal information that OpenTelemetry can't, |
| 69 | + including automatic instrumentation. |
| 70 | +- With the capability to capture signals within the platform comes the burden of needing to maintain |
| 71 | + clear policies around not capturing sensitive data. |
| 72 | + |
| 73 | +### Plan |
| 74 | + |
| 75 | +.NET Core's `System.Diagnostics` library supports the emission of metrics compatible with |
| 76 | +OpenTelemetry, and traces and metrics within the platform will become available on the console and |
| 77 | +via OTLP export. Configuration will be provided to turn either on or off with new application |
| 78 | +settings, e.g.:
| 79 | + |
| 80 | +```json |
| 81 | +{ |
| 82 | + "OpenTelemetry": { |
| 83 | + "UseTracingExporter": "Console", |
| 84 | + "UseMetricsExporter": "Console", |
| 85 | + "Otlp": { |
| 86 | + "Endpoint": "http://localhost:4318" |
| 87 | + } |
| 88 | + } |
| 89 | +} |
| 90 | +``` |
| 91 | + |
| 92 | +Console and OTLP options will be available for the metrics and tracing export, along with the |
| 93 | +ability to specify a gRPC or HTTP endpoint for OTLP. Segmentation of activities will continue to be |
| 94 | +made using the configurable `ProjectName`. |
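
As a hedged illustration of the OTLP side of these settings, the fragment below switches both exporters to OTLP and points them at a collector. The value `"Otlp"` is an assumption (the example above only shows `"Console"`), and how the platform distinguishes gRPC from HTTP is not specified here. By OTLP convention, port 4317 is the default for gRPC and 4318 for HTTP.

```json
{
  "OpenTelemetry": {
    "UseTracingExporter": "Otlp",
    "UseMetricsExporter": "Otlp",
    "Otlp": {
      "Endpoint": "http://localhost:4317"
    }
  }
}
```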
| 95 | + |
| 96 | +The initial implementation will provide default instrumentation details coming from ASP.NET Core and
| 97 | +any HTTP clients in use. Within Bitwarden, automatic instrumentation (profiler) may be explored at
| 98 | +a future date, but a code-first solution is desired to allow for more control and less setup during
| 99 | +installation. It is expected that local processes will ingest logs / exports as desired.
| 100 | + |
| 101 | +Software development lifecycle enhancements will be made to clarify best practices and review |
| 102 | +requirements for logging or monitoring changes. A [deep dive](/architecture/deep-dives) will be |
| 103 | +added on logging and monitoring to showcase patterns for adding signal collection in code. Only |
| 104 | +component runtime signals will be collected to start; no application payloads such as input and |
| 105 | +output data will be collected in signals. |
| 106 | + |
| 107 | +Over time and where needed, application logic to track custom [signals][otelsignals] (activities and
| 108 | +meters) will be added for deeper insights, especially in critical code paths. Standards will be
| 109 | +developed and documented in the above deep dive on how to approach metric collection without also
| 110 | +collecting sensitive information. Core utility classes will be developed to centralize
| 111 | +OpenTelemetry usage and make its use in components easier and more generic.
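
A minimal sketch of what custom activities and meters could look like using only `System.Diagnostics`, assuming .NET 6 or later. The `Telemetry` class, the `Example.Service` source name, and the instrument and tag names are all hypothetical and are not taken from the Bitwarden codebase. The sketch only shows the general shape of a centralized helper that records runtime signals without payload data.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical central helper; all names here are illustrative assumptions.
public static class Telemetry
{
    // Listeners (such as an OpenTelemetry SDK) subscribe to these by name.
    public static readonly ActivitySource Source = new("Example.Service");
    public static readonly Meter AppMeter = new("Example.Service");

    // Counts completed operations; carries no input or output data.
    public static readonly Counter<long> OperationsCompleted =
        AppMeter.CreateCounter<long>("example.operations.completed");
}

public static class Program
{
    public static void Main()
    {
        // StartActivity returns null when no listener is registered,
        // so every call on the activity is null-conditional.
        using (var activity = Telemetry.Source.StartActivity("example-operation"))
        {
            // Tag only runtime details, never sensitive payloads.
            activity?.SetTag("example.item_count", 3);
        }

        // Record one completed operation with a low-cardinality tag.
        Telemetry.OperationsCompleted.Add(
            1, new KeyValuePair<string, object?>("outcome", "success"));

        Console.WriteLine("done");
    }
}
```

Keeping the `ActivitySource` and `Meter` in one place means components only reference the helper, and the exporter wiring can change without touching call sites.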
| 112 | + |
| 113 | +Observability functionality will be moved to a new shared library -- separate from the core -- for |
| 114 | +host-oriented utilities. This library will be distributed as a NuGet package so that local `server` |
| 115 | +projects as well as new, independent repositories for services can receive the benefits. |
| 116 | + |
| 117 | +[dd]: https://www.datadoghq.com/ |
| 118 | +[ddtracer]: https://www.nuget.org/packages/Datadog.Trace.Bundle |
| 119 | +[native]: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation |
| 120 | +[otel]: https://opentelemetry.io/ |
| 121 | +[otelsignals]: https://opentelemetry.io/docs/concepts/signals/ |