feat: prometheus sink for gqlmetrics, ensure all operation are tracked efficiently #2332
base: main
Conversation
Walkthrough

This PR restructures Prometheus schema field usage metrics export from a sampling-based approach to a configurable batch exporter framework. It introduces a generic `Exporter` pattern, separates GraphQL and Prometheus metric exporters into dedicated implementations, removes per-operation sampling logic, and updates configuration/wiring across the router core and metrics pipeline.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Router image scan passed: ✅ no security vulnerabilities found in the image.
Router-nonroot image scan passed: ✅ no security vulnerabilities found in the image.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/core/graph_server.go (1)
758-775: Shutdown ordering: stop the exporter before closing the metric store.
Currently `metricStore` is shut down before `prometheusMetricsExporter`. If the exporter is still flushing, writes can fail. Shut down the exporter first, then the stores.
Apply:
```diff
-	if s.metricStore != nil {
-		if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
-			err = errors.Join(err, aErr)
-		}
-	}
-	...
-	if s.prometheusMetricsExporter != nil {
-		if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
-			err = errors.Join(err, aErr)
-		}
-	}
+	if s.prometheusMetricsExporter != nil {
+		if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
+			err = errors.Join(err, aErr)
+		}
+	}
+	if s.metricStore != nil {
+		if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
+			err = errors.Join(err, aErr)
+		}
+	}
```
🧹 Nitpick comments (14)
router/internal/graphqlmetrics/graphql_metrics_sink.go (2)
29-36: Handle potential nil logger in `NewGraphQLMetricsSink`
`cfg.Logger.With(...)` will panic if `Logger` is ever nil (e.g. in tests or ad-hoc usage). Consider either guarding and falling back to `zap.NewNop()` or documenting/enforcing a non-nil logger requirement at call sites.
70-91: Consider tightening retry classification in `IsRetryableError`
Right now any non-`connect.Error` (including context cancellations or local/config errors) is treated as retryable. If such errors can reach this helper, consider explicitly short-circuiting on `context.Canceled`/`context.DeadlineExceeded` or other known permanent failures in the caller to avoid pointless retries.
router/pkg/config/config.schema.json (1)
1249-1281: Exporter schema matches the new Prometheus usage exporter config
The `schema_usage.exporter` block (batch_size, queue_size, interval, export_timeout) and defaults are consistent and align with the new Go config type. If you want stricter validation, you could optionally add `"additionalProperties": false` to the `exporter` object to catch typos in config keys, but the current schema is functionally fine.
router/internal/graphqlmetrics/exporter_test.go (1)
369-375: Avoid fixed sleeps in tests; poll until metrics are observed.
Replace `time.Sleep` with polling (e.g., `require.Eventually`) or call `Shutdown` before assertions to reduce flakiness.
Example:
```diff
- time.Sleep(200 * time.Millisecond)
+ require.Eventually(t, func() bool {
+     _, err := promRegistry.Gather()
+     return err == nil && len(c.publishedAggregations) > 0
+ }, 2*time.Second, 50*time.Millisecond)
```
Also applies to: 371-375
router-tests/testenv/testenv.go (1)
279-290: Clamp exporter settings to safe minimums in tests.
If a caller supplies zeros, normalize to sane minimums (e.g., BatchSize >= 1, QueueSize >= 1, Interval > 0) before wiring to the router metric config to prevent silent stalls.
Also applies to: 1508-1552
router-tests/prometheus_improved_test.go (1)
82-96: Reduce flakiness: replace fixed sleeps with Eventually/polling.
Use `require.Eventually` to wait for the metrics flush instead of fixed sleeps. Keeps tests stable on slow CI.
Also applies to: 166-171, 229-233, 269-274, 315-321, 359-365, 413-419
router/core/graph_server.go (1)
923-949: Defaults/validation pass-through for exporter settings.
When wiring settings from `cfg.Exporter`, ensure they are validated/clamped (BatchSize, QueueSize, Interval, ExportTimeout) before use to prevent zero/negative configs from stalling the exporter.
router/core/router.go (2)
841-851: Prometheus schema usage: consider logging exporter settings as well
The enablement log for Prometheus schema field usage is clear. Since batching/queue behavior is now configurable, consider also logging the effective exporter settings (batch size, queue size, interval, export timeout) here to simplify debugging misconfigurations.
2324-2332: Exporter settings mapping is correct; be aware of validation constraints
The mapping from `cfg.Metrics.Prometheus.SchemaFieldUsage.Exporter` into `rmetric.PrometheusSchemaFieldUsageExporter` is straightforward and correct. Note that the generic exporter's `validate()` requires all these numeric fields to be strictly positive, so a YAML/env override of `0` or negative values will now fail exporter creation. If that's not desired, you may want to clamp or normalize these values earlier in config handling.
router/core/router_metrics.go (1)
135-159: Prometheus export path is lightweight; consider extra guard and hash semantics
The Prometheus-specific `SchemaUsageInfo` payload deliberately includes only type field metrics, operation hash/name, and schema version, which is a good fit for local metric cardinality and memory use.
Two minor points:
- To harden against future misuse, you could early-return if `m.prometheusMetricsExporter` is nil, even though current callers gate on `PrometheusUsageInfoTrack`.
- This path uses `operationContext.sha256Hash` directly instead of `HashString()`. Please confirm that `sha256Hash` is always initialized and matches the intended "operation SHA" semantics for all operation types.
router/internal/graphqlmetrics/exporter.go (4)
78-111: `NewExporter` wiring is correct; consider guarding nil settings
Constructor behavior is good: it enforces a non-nil sink, installs a default retryability handler, validates settings, and starts the batching goroutine with a component-scoped logger.
One robustness nit: `settings` is dereferenced without a nil check when creating the queue and during validation. Since this is an internal API, it's probably fine, but you might want to either:
- Treat `nil` as "use `NewDefaultExporterSettings()`", or
- Explicitly return an error like `fmt.Errorf("settings cannot be nil")`.
This avoids surprising panics if a future caller accidentally passes `nil`.
113-143: Validation enforces positive values; retries disabled still require fields
`validate()` enforces strictly positive values for batch/queue sizes, interval, export timeout, and all retry fields. That's good for catching misconfigurations early.
Note that it requires retry fields to be positive even when `RetryOptions.Enabled` is `false`. If you later expose `Enabled` as a user-facing toggle, you may want to relax validation so that only enabled retries need fully-populated options.
179-282: Batch export and retry logic are reasonable; consider respecting shutdown context for backoff
`exportBatch` wraps the shared export context with a per-call timeout and delegates to the sink, logging failures at debug. `exportBatchWithRetry` adds bounded exponential backoff with:
- Retryability decided by `isRetryableError`.
- Max duration, interval, and max attempts from settings.
- Clear logs on non-retryable errors and after exhausting retries.
One potential refinement: the retry loop currently sleeps with `time.Sleep` and doesn't inspect any shutdown signal or context, so a stubbornly failing sink can delay shutdown up to the cumulative backoff time. If you ever need faster shutdowns under failure, you could periodically check `e.exportRequestContext.Done()` or a dedicated shutdown flag inside the retry loop to abort early.
344-381: Shutdown sequence is mostly robust; document single-use expectation
Shutdown:
- Stops accepting traffic, signals the batching goroutine to drain and exit, and then polls `inflightBatches` until zero or context cancel.
- Defers cancellation of export contexts and sink closure, with logs on errors and on completion.
Two small caveats:
- `Shutdown` is not idempotent: calling it twice will panic on the second close of `acceptTrafficSema`/`shutdownSignal`. It would help to either document "must only be called once" or add a guard (e.g., `sync.Once`).
- Synchronous `Record(..., true)` calls are not tracked in `inflightBatches`, so they won't delay shutdown waiting; they may still race with `sink.Close` if invoked concurrently. That's likely acceptable for metrics but worth keeping in mind.
Overall, the shutdown strategy is consistent with a best-effort metrics pipeline.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
router-tests/prometheus_improved_test.go (16 hunks)
router-tests/testenv/testenv.go (3 hunks)
router/core/graph_server.go (4 hunks)
router/core/operation_metrics.go (3 hunks)
router/core/router.go (3 hunks)
router/core/router_config.go (1 hunks)
router/core/router_metrics.go (4 hunks)
router/internal/graphqlmetrics/exporter.go (5 hunks)
router/internal/graphqlmetrics/exporter_test.go (7 hunks)
router/internal/graphqlmetrics/graphql_exporter.go (1 hunks)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1 hunks)
router/internal/graphqlmetrics/prometheus_exporter.go (1 hunks)
router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
router/internal/graphqlmetrics/sink.go (1 hunks)
router/pkg/config/config.go (1 hunks)
router/pkg/config/config.schema.json (1 hunks)
router/pkg/metric/config.go (2 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
router/internal/graphqlmetrics/prometheus_exporter.go, router/core/router_config.go, router/core/router_metrics.go, router/core/graph_server.go, router/core/router.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router-tests/testenv/testenv.go, router-tests/prometheus_improved_test.go, router/internal/graphqlmetrics/exporter_test.go, router/core/operation_metrics.go
📚 Learning: 2025-09-29T10:28:07.122Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2192
File: router/pkg/config/config.go:1028-1029
Timestamp: 2025-09-29T10:28:07.122Z
Learning: The deprecation strategy for IntrospectionEnabled field in router/pkg/config/config.go is to first mark it as deprecated, then migrate all call sites to use IntrospectionConfig.Enabled, and finally remove the deprecated field. The envDefault tag should be kept during the deprecation period for backward compatibility.
Applied to files:
router-tests/testenv/testenv.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
Applied to files:
router/core/router_config.go, router/core/graph_server.go, router/core/router.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
router/core/router_config.go, router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
router/core/router_config.go, router/core/router_metrics.go, router/core/graph_server.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.
Applied to files:
router-tests/prometheus_improved_test.go, router/core/graph_server.go, router/core/router.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/config.schema.json
📚 Learning: 2025-07-21T15:06:36.664Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2067
File: router/pkg/config/config.schema.json:1637-1644
Timestamp: 2025-07-21T15:06:36.664Z
Learning: In the Cosmo router project, when extending JSON schema validation for security-sensitive fields like JWKS secrets, backwards compatibility is maintained by implementing warnings in the Go code rather than hard validation constraints in the schema. This allows existing configurations to continue working while alerting users to potential security issues.
Applied to files:
router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
router/core/operation_metrics.go, router/core/router_metrics.go, router/core/graph_server.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.
Applied to files:
router/core/operation_metrics.go, router/pkg/metric/config.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.
Applied to files:
router/core/graph_server.go
🧬 Code graph analysis (15)
router/internal/graphqlmetrics/graphql_exporter.go (2)
- router/internal/graphqlmetrics/exporter.go (3): `Exporter` (16-31), `ExporterSettings` (50-61), `NewExporter` (82-111)
- router/internal/graphqlmetrics/graphql_metrics_sink.go (3): `NewGraphQLMetricsSink` (30-36), `GraphQLMetricsSinkConfig` (23-27), `IsRetryableError` (72-91)
router/internal/graphqlmetrics/prometheus_exporter.go (3)
- router/internal/graphqlmetrics/exporter.go (3): `Exporter` (16-31), `ExporterSettings` (50-61), `NewExporter` (82-111)
- router/pkg/metric/metric_store.go (1): `Store` (149-162)
- router/internal/graphqlmetrics/prometheus_sink.go (2): `NewPrometheusSink` (41-47), `PrometheusSinkConfig` (25-29)
router-tests/testenv/testenv.go (2)
- router/pkg/config/config.go (2): `PrometheusSchemaFieldUsageExporter` (121-126), `PrometheusSchemaFieldUsage` (115-119)
- router/pkg/metric/config.go (2): `PrometheusSchemaFieldUsageExporter` (48-53), `PrometheusSchemaFieldUsage` (42-46)
router/pkg/config/config.go (3)
- router/internal/graphqlmetrics/exporter.go (1): `Exporter` (16-31)
- router-tests/testenv/testenv.go (1): `PrometheusSchemaFieldUsageExporter` (285-290)
- router/pkg/metric/config.go (1): `PrometheusSchemaFieldUsageExporter` (48-53)
router/internal/graphqlmetrics/prometheus_sink.go (3)
- router/pkg/metric/metric_store.go (1): `Store` (149-162)
- router/pkg/otel/attributes.go (5): `WgOperationName` (10-10), `WgOperationType` (11-11), `WgOperationSha256` (59-59), `WgGraphQLFieldName` (60-60), `WgGraphQLParentType` (61-61)
- router/pkg/graphqlschemausage/schemausage.go (1): `TypeFieldMetrics` (28-28)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)
- router/internal/graphqlmetrics/aggregate.go (1): `AggregateSchemaUsageInfoBatch` (9-27)
router/core/router_config.go (2)
- router/internal/graphqlmetrics/graphql_exporter.go (1): `GraphQLMetricsExporter` (13-15)
- router/internal/graphqlmetrics/prometheus_exporter.go (1): `PrometheusMetricsExporter` (13-15)
router/internal/graphqlmetrics/exporter.go (1)
- router/internal/graphqlmetrics/sink.go (2): `Sink` (9-18), `SinkErrorHandler` (23-23)
router-tests/prometheus_improved_test.go (4)
- router-tests/testenv/testenv.go (3): `Run` (105-122), `PrometheusSchemaFieldUsageExporter` (285-290), `Environment` (1763-1799)
- router/pkg/config/config.go (1): `PrometheusSchemaFieldUsageExporter` (121-126)
- router/pkg/metric/config.go (1): `PrometheusSchemaFieldUsageExporter` (48-53)
- router/pkg/otel/attributes.go (2): `WgOperationName` (10-10), `WgOperationType` (11-11)
router/internal/graphqlmetrics/exporter_test.go (1)
- router/internal/graphqlmetrics/graphql_exporter.go (1): `NewGraphQLMetricsExporter` (19-39)
router/core/operation_metrics.go (1)
- router/core/router_metrics.go (1): `RouterMetrics` (14-21)
router/core/router_metrics.go (4)
- router/internal/graphqlmetrics/graphql_exporter.go (1): `GraphQLMetricsExporter` (13-15)
- router/internal/graphqlmetrics/prometheus_exporter.go (1): `PrometheusMetricsExporter` (13-15)
- router/core/context.go (4): `OperationType` (503-503), `OperationTypeQuery` (506-506), `OperationTypeMutation` (507-507), `OperationTypeSubscription` (508-508)
- router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16): `OperationType` (26-26, 58-60, 62-64, 71-73), `OperationType_QUERY` (29-29), `OperationType_MUTATION` (30-30), `OperationType_SUBSCRIPTION` (31-31), `SchemaUsageInfo` (130-153, 168-168, 183-185), `OperationInfo` (362-373, 388-388, 403-405), `SchemaInfo` (428-435, 450-450, 465-467)
router/pkg/metric/config.go (3)
- router/internal/graphqlmetrics/exporter.go (1): `Exporter` (16-31)
- router-tests/testenv/testenv.go (1): `PrometheusSchemaFieldUsageExporter` (285-290)
- router/pkg/config/config.go (1): `PrometheusSchemaFieldUsageExporter` (121-126)
router/core/graph_server.go (4)
- router/internal/graphqlmetrics/prometheus_exporter.go (2): `PrometheusMetricsExporter` (13-15), `NewPrometheusMetricsExporter` (19-45)
- router/pkg/config/config.go (1): `Prometheus` (99-113)
- router/internal/graphqlmetrics/exporter.go (3): `ExporterSettings` (50-61), `Exporter` (16-31), `RetryOptions` (33-38)
- router/core/router_metrics.go (1): `NewRouterMetrics` (43-52)
router/core/router.go (4)
- router/internal/graphqlmetrics/graphql_exporter.go (1): `NewGraphQLMetricsExporter` (19-39)
- router/pkg/config/config.go (3): `Prometheus` (99-113), `PrometheusSchemaFieldUsageExporter` (121-126), `Metrics` (137-142)
- router/internal/graphqlmetrics/exporter.go (1): `Exporter` (16-31)
- router/pkg/metric/config.go (1): `PrometheusSchemaFieldUsageExporter` (48-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_test
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: image_scan
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
🔇 Additional comments (19)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)
40-61: Export path and aggregation look correct
Batch empty-check, aggregation via `AggregateSchemaUsageInfoBatch`, auth header setup, and debug logging are all straightforward and concurrency-safe for this sink implementation. No functional issues from this diff.
router/core/router_config.go (1)
37-38: New exporter fields fit the updated metrics wiring
Adding `gqlMetricsExporter` and `prometheusMetricsExporter` to `Config` keeps the router's metrics wiring explicit and type-safe; no issues with this change.
router/internal/graphqlmetrics/sink.go (1)
7-23: `Sink[T]` abstraction is clear and compatible with the exporter design
The generic `Sink[T]` interface and `SinkErrorHandler` are minimal, well-documented, and match the batch-export semantics used by the generic exporter. No changes needed here.
router/pkg/metric/config.go (1)
42-53: Verification confirms that all constructions of `PrometheusSchemaFieldUsage` with the `Exporter` field properly handle initialization:
- Production (router.go:2324): reads all `Exporter` fields from a validated config source.
- Tests (testenv.go:1509-1530): explicitly provides test-friendly defaults when enabled (BatchSize: 100, QueueSize: 1000, Interval: 100ms, ExportTimeout set).
No manual constructions with uninitialized `Exporter` fields exist in the codebase. The code is correct as-is.
router/internal/graphqlmetrics/prometheus_exporter.go (1)
31-41: LGTM on exporter wrapper.
Constructor and delegation are clear; disabling retries for the local Prometheus sink is appropriate.
Also applies to: 47-58
router/internal/graphqlmetrics/graphql_exporter.go (1)
31-39: LGTM on GraphQL exporter wrapper.
API is minimal and mirrors the generic exporter cleanly.
Also applies to: 41-52
router-tests/prometheus_improved_test.go (1)
351-358: Review comment is incorrect; no changes required.
The repository targets Go 1.25 (confirmed across all `go.mod` files), which fully supports the `for range n` syntax introduced in Go 1.22. The codebase consistently uses this syntax in multiple test files (prometheus_improved_test.go, authentication_test.go, plugintest/tracing_test.go). No compatibility concern exists; the code is idiomatic for the current toolchain.
Likely an incorrect or invalid review comment.
router/internal/graphqlmetrics/prometheus_sink.go (1)
79-86: Remove this review comment; the Go version compatibility concern is invalid.
The codebase targets Go 1.25, and `slices.Concat` was introduced in Go 1.22, so the "newer Go" compatibility argument doesn't apply. The code is compatible with the project's Go version.
While the nil vs. empty slice suggestion (`[]attribute.KeyValue{}` → `nil`) is a minor stylistic improvement, the original review overstates the concern about Go compatibility and allocation efficiency in this context.
router/pkg/config/config.go (1)
115-126: No changes required; validation is already in place.
The `PrometheusSchemaFieldUsageExporter` fields are validated both at the JSON schema level and at runtime:
- Runtime: the `Exporter[T].validate()` method (router/internal/graphqlmetrics/exporter.go:113-142) enforces BatchSize > 0, QueueSize > 0, Interval > 0, and ExportTimeout > 0.
- Schema: config.schema.json (lines 1252-1278) defines `minimum: 1` for batch_size/queue_size and `duration: {minimum: "1s"}` for interval/export_timeout.
Both layers prevent misconfiguration, so no additional validation is needed.
router/core/router.go (1)
821-839: GraphQL metrics exporter wiring looks consistent
Using `NewGraphQLMetricsExporter` with `NewDefaultExporterSettings()` cleanly adapts to the new generic exporter API; error handling and assignment into `r.gqlMetricsExporter` remain correct.
router/core/router_metrics.go (3)
14-21: RouterMetrics wiring for dual exporters is sound
Extending `RouterMetrics` and `routerMetrics` to carry both the GraphQL and Prometheus exporters, plus an explicit `exportEnabled` flag for cloud usage, cleanly separates concerns. `NewRouterMetrics` correctly threads these through, and `StartOperation`'s `PrometheusUsageInfoTrack: m.prometheusMetricsExporter != nil` gives a cheap, stable switch for local Prometheus tracking.
Also applies to: 26-41, 44-51
57-67: StartOperation gating correctly enables Prometheus usage tracking
Passing `TrackUsageInfo` and `PrometheusUsageInfoTrack` into `OperationMetricsOptions` ensures:
- Cloud GraphQL usage export is still controlled centrally via `exportEnabled`.
- Prometheus schema usage export is enabled solely by the presence of a Prometheus exporter.
This matches the "track all operations" goal without coupling the two sinks.
75-81: Exporter accessors are minimal and sufficient
`GqlMetricsExporter()` and `PrometheusMetricsExporter()` simply expose the underlying exporters, matching the new Prometheus path needed by the graph mux. No additional state is leaked.
router/core/operation_metrics.go (2)
32-40: OperationMetrics option plumbing is consistent
Adding `trackUsageInfo` and `prometheusUsageInfoTrack` to `OperationMetrics` and threading them via `OperationMetricsOptions` keeps the decision about which sinks to hit local to the operation instance. The constructor correctly snapshots these booleans at operation start, avoiding any mid-request config races.
Also applies to: 91-100, 105-117
77-88: Dual export in Finish aligns with tracking goals
The new export block:
- Respects `SkipLoader` and `reqContext.operation == nil` as before.
- Uses `trackUsageInfo` to guard GraphQL (cloud) schema usage export.
- Uses `prometheusUsageInfoTrack` to independently guard the Prometheus schema usage path.
This achieves "always track operations" for Prometheus while preserving existing cloud usage behavior.
router/internal/graphqlmetrics/exporter.go (4)
13-31: Generic exporter structure and defaults look solid
The generic `Exporter[T]` plus `ExporterSettings` and `NewDefaultExporterSettings()` give a reusable, thread-safe batching primitive. The fields (`queue`, `inflightBatches`, `exportRequestContext`) and defaults (batch size, queue size, interval, timeouts, retry) are reasonable and match the use cases for GraphQL and Prometheus sinks.
Also applies to: 50-76
145-177: Traffic gating and queue backpressure are correctly implemented
`acceptTraffic()` uses a closed channel to signal shutdown, and `Record`:
- Sends synchronously via `exportBatch` when requested.
- Uses a non-blocking send to a bounded `queue` otherwise, logging and dropping when full.
This prevents producer goroutines from blocking indefinitely on shutdown or overload. The synchronous path intentionally ignores the error (it just logs inside `exportBatch`), which matches a "best-effort metrics" philosophy.
285-319: Batching loop and shutdown signal handling look correct
The `start` loop:
- Maintains a preallocated buffer sized to `BatchSize`.
- Flushes on timer ticks and when the buffer fills.
- On shutdown, calls `drainQueue(buffer)` and exits cleanly.
This ensures no items are lost between tick and shutdown, and keeps the batching goroutine single-responsibility.
321-342: drainQueue correctly flushes remaining items during shutdown
`drainQueue` empties the queue into the buffer, sending full batches as they accumulate and finally sending any partial batch, with helpful debug logging. Given `acceptTrafficSema` is closed before `shutdownSignal`, no new async items should be enqueued during draining, so this is race-free.
Actionable comments posted: 0
🧹 Nitpick comments (1)
router/pkg/config/config_test.go (1)
764-787: LGTM! Precedence test correctly validates file config priority.
The test correctly verifies that file-specified exporter values override environment variables, using clearly distinguishable values (1024 vs 9999, 5s vs 99s) to prove the behavior.
Optional: Consider testing the precedence behavior for all four exporter fields (BatchSize, QueueSize, Interval, ExportTimeout) for completeness, though testing two fields is sufficient to prove the precedence mechanism works.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
router/pkg/config/config_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router/pkg/config/config_test.go
🧬 Code graph analysis (1)
router/pkg/config/config_test.go (1)
router/pkg/config/config.go (5)
`LoadConfig` (1120-1232), `Config` (1014-1089), `Telemetry` (157-163), `Metrics` (137-142), `Prometheus` (99-113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: build-router
- GitHub Check: image_scan
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
- GitHub Check: build_push_image
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
router/pkg/config/config_test.go (3)
689-707: LGTM! Well-structured test for exporter defaults. The test correctly verifies that exporter configuration falls back to sensible defaults (8192/16384 for batch/queue sizes, 10s for interval/timeout) when only `enabled: true` is specified.
709-736: LGTM! Comprehensive test for custom file configuration. The test correctly verifies that all exporter fields (batch_size, queue_size, interval, export_timeout) can be customized via YAML configuration.
738-762: LGTM! Proper environment variable loading test. The test correctly verifies that exporter configuration can be loaded from environment variables using the `PROMETHEUS_SCHEMA_FIELD_USAGE_EXPORTER_*` prefix pattern.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
router/pkg/config/testdata/config_defaults.json (1 hunks)
router/pkg/config/testdata/config_full.json (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/testdata/config_defaults.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
- GitHub Check: image_scan
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (4)
router/pkg/config/testdata/config_full.json (2)
93-102: Schema change is properly tested via fixture-based tests. The config_full.json file is exercised by config parsing tests in `router/pkg/config/config_test.go` at lines 320 and 1171. These tests use goldie for fixture-based JSON comparison: they load the YAML configuration and validate that the parsed output matches the `config_full.json` fixture exactly. Since the fixture already contains the new `Exporter` schema with `BatchSize`, `QueueSize`, `Interval`, and `ExportTimeout` fields, the goldie tests will automatically catch any schema mismatches. If tests are passing, the schema change is validated.
96-101: Configuration field names correctly match the Go struct definition. Verification confirms the Exporter configuration in `config_full.json` (lines 96-101) is properly structured:

- JSON field names (`BatchSize`, `QueueSize`, `Interval`, `ExportTimeout`) match the Go struct field names in `router/pkg/config/config.go:121-126`
- Field values are correct (integers for int fields, nanoseconds for time.Duration fields)
- Both `config_full.json` and `config_defaults.json` consistently reflect the schema change
- Tests and production code in `router/core/router.go:2327-2330` correctly use these fields

router/pkg/config/testdata/config_defaults.json (2)
63-72: Structural change from SampleRate to Exporter is sound. The replacement of the probabilistic `SampleRate` sampling approach with a configurable `Exporter` object containing BatchSize, QueueSize, Interval, and ExportTimeout properly reflects the PR objective to track all operations deterministically using batching instead of sampling.
66-71: Testdata values are correct and consistent with config schema defaults. Verification confirms the Exporter values in the testdata file align with the actual configuration schema. The `router/pkg/config/config.go` struct defines `PrometheusSchemaFieldUsageExporter` with `envDefault` tags that match exactly:

- `BatchSize`: `envDefault:"8192"` (testdata: 8192) ✓
- `QueueSize`: `envDefault:"16384"` (testdata: 16384) ✓
- `Interval`: `envDefault:"10s"` (testdata: 10000000000 ns = 10s) ✓
- `ExportTimeout`: `envDefault:"10s"` (testdata: 10000000000 ns = 10s) ✓

The testdata correctly reflects the source-of-truth defaults. Any discrepancy with the PR description likely reflects earlier proposal values that were finalized to these values in the implementation.
Actionable comments posted: 0
🧹 Nitpick comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)
81-86: Consider passing `nil` instead of an empty slice for the unused parameter. Since attributes are being passed via the `WithAttributeSet` option, the `sliceAttr` parameter is unused. Passing `nil` instead of `[]attribute.KeyValue{}` would be more idiomatic and avoid an unnecessary empty slice allocation. Apply this diff:

```diff
 s.metricStore.MeasureSchemaFieldUsage(
 	ctx,
 	int64(count),
-	[]attribute.KeyValue{},
+	nil,
 	otelmetric.WithAttributeSet(attribute.NewSet(allAttrs...)),
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
- Store (149-162)
router/pkg/otel/attributes.go (5)
- WgOperationName (10-10)
- WgOperationType (11-11)
- WgOperationSha256 (59-59)
- WgGraphQLFieldName (60-60)
- WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
- TypeFieldMetrics (28-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-router
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan (nonroot)
- GitHub Check: image_scan
- GitHub Check: build_push_image
- GitHub Check: build_test
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)
127-130: No change needed — parent type and field-name extraction are correct. `TypeFieldUsageInfo.TypeNames` is populated from `ParentTypeNames` such that index 0 is the immediate parent type, and `Path` contains the path segments with the field name as the last element. Verified in router/pkg/graphqlschemausage/schemausage.go (ParentTypeNames -> TypeNames), the protobuf comment (graphqlmetrics/gen/proto/.../graphqlmetrics.pb.go), and tests (router/internal/graphqlmetrics/exporter_test.go, aggregate_test.go) showing examples like Path ["user","name"] with TypeNames ["User","String"].
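The indexing convention described above can be illustrated with a small struct that mirrors the relevant fields of `TypeFieldUsageInfo`. The struct and helper names below are assumptions for illustration, not the actual router types:

```go
package main

import "fmt"

// usageInfo mirrors the two fields of TypeFieldUsageInfo relevant here.
type usageInfo struct {
	Path      []string // path segments; the field name is the last element
	TypeNames []string // index 0 is the immediate parent type
}

// parentAndField extracts the label values a Prometheus sink would use.
func parentAndField(u usageInfo) (parent, field string) {
	if len(u.TypeNames) > 0 {
		parent = u.TypeNames[0]
	}
	if len(u.Path) > 0 {
		field = u.Path[len(u.Path)-1]
	}
	return parent, field
}

func main() {
	// Mirrors the example cited from the tests: Path ["user","name"], TypeNames ["User","String"].
	p, f := parentAndField(usageInfo{
		Path:      []string{"user", "name"},
		TypeNames: []string{"User", "String"},
	})
	fmt.Println(p, f) // User name
}
```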
Actionable comments posted: 0
♻️ Duplicate comments (1)
router/pkg/config/config.go (1)
116-119: Back-compat: removal of sample_rate needs a migration story or explicit deprecation notice. Dropping SampleRate in favor of Exporter will break existing configs. Either:
- accept the breaking change and document a clear migration; or
- add a temporary compatibility path that maps sample_rate to exporter settings.
Also ensure any references to SampleRate are removed across code, tests, docs. Reusing the schema’s deprecation mechanism can help guide users.
```shell
#!/bin/bash
# Find lingering references (code, tests, docs)
rg -n -C2 -S -i '\bsample_rate\b|SampleRate|schema_usage\.sample_rate|PROMETHEUS_SCHEMA_FIELD_USAGE_SAMPLE_RATE'
```
🧹 Nitpick comments (2)
router/pkg/config/config.go (1)
121-126: Type naming duplication and defaults: consider renaming and confirm queue_size default.
- PrometheusSchemaFieldUsageExporter is also declared in router/pkg/metric/config.go. Keeping the same name in two packages is legal but confusing. Consider renaming this one to PrometheusSchemaFieldUsageExporterConfig or PrometheusSchemaFieldUsageSettings to reduce ambiguity.
- Default queue_size here is 12800, while the PR example shows 10240. Please reconcile or update examples to avoid drift.
router/pkg/config/config.schema.json (1)
1249-1281: Harden exporter schema: disallow unknown keys and (optionally) bound sizes. Add `additionalProperties: false` under `schema_usage.exporter` to catch typos (e.g., "batchsize"). Optionally add conservative maximums to bound memory use.
```diff
-      "exporter": {
-        "type": "object",
-        "description": "Configuration for the schema usage exporter",
+      "exporter": {
+        "type": "object",
+        "description": "Configuration for the schema usage exporter",
+        "additionalProperties": false,
         "properties": {
           "batch_size": {
             "type": "integer",
             "default": 4096,
             "minimum": 1,
+            "maximum": 1048576,
             "description": "The maximum number of schema usage items to be applied in a single batch. The default value is 4096."
           },
           "queue_size": {
             "type": "integer",
             "default": 12800,
             "minimum": 1,
+            "maximum": 10485760,
             "description": "The maximum number of schema usage items allowed in queue at a given time. The default value is 12800."
           },
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
router/pkg/config/config.go (1 hunks)
router/pkg/config/config.schema.json (1 hunks)
router/pkg/config/config_test.go (1 hunks)
router/pkg/config/testdata/config_defaults.json (1 hunks)
router/pkg/config/testdata/config_full.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- router/pkg/config/testdata/config_defaults.json
- router/pkg/config/config_test.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router/pkg/config/testdata/config_full.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/testdata/config_full.json
router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.
Applied to files:
router/pkg/config/testdata/config_full.json
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
router/pkg/config/config.schema.json
🧬 Code graph analysis (1)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
- Exporter (16-31)
router/pkg/metric/config.go (1)
- PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
- PrometheusSchemaFieldUsageExporter (285-290)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: image_scan (nonroot)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: build_push_image
- GitHub Check: image_scan
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: build_test
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/pkg/config/testdata/config_full.json (1)
96-101: Fixture update aligns with new exporter config. Values and duration encodings look correct. LGTM.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
router/core/context.go (2 hunks)
router/core/router_metrics.go (5 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.
Applied to files:
router/core/context.go
🧬 Code graph analysis (2)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
- GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
- PrometheusMetricsExporter (13-15)
router/core/context.go (4)
- OperationType (503-503)
- OperationTypeQuery (506-506)
- OperationTypeMutation (507-507)
- OperationTypeSubscription (508-508)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
- OperationType (26-26)
- OperationType (58-60)
- OperationType (62-64)
- OperationType (71-73)
- OperationType_QUERY (29-29)
- OperationType_MUTATION (30-30)
- OperationType_SUBSCRIPTION (31-31)
- SchemaUsageInfo (130-153)
- SchemaUsageInfo (168-168)
- SchemaUsageInfo (183-185)
- OperationInfo (362-373)
- OperationInfo (388-388)
- OperationInfo (403-405)
- SchemaInfo (428-435)
- SchemaInfo (450-450)
- SchemaInfo (465-467)
router/core/context.go (2)
router/pkg/graphqlschemausage/schemausage.go (2)
- TypeFieldMetrics (28-28)
- TypeFieldUsageInfo (40-48)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
- TypeFieldUsageInfo (476-493)
- TypeFieldUsageInfo (508-508)
- TypeFieldUsageInfo (523-525)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-router
- GitHub Check: build_push_image (nonroot)
- GitHub Check: build_push_image
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
- GitHub Check: image_scan
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (8)
router/core/context.go (2)
546-549: Context for new caching field. The addition of `typeFieldUsageInfoMetrics` as a cached conversion result aligns with the PR's goal to efficiently track all operations without sampling. This avoids repeated allocations when multiple exporters need the same data.
601-608: The review comment is incorrect and should be disregarded. The premise of a race condition is based on a misunderstanding of the codebase. Here's why:

1. operationContext is per-request: In `graphql_prehandler.go` line 229, a new `operationContext` instance is created fresh for each HTTP request: `requestContext.operation = &operationContext{clientInfo: clientInfo}`. This is a standard per-request context pattern.
2. No concurrent access to the same instance: Since each request gets its own operationContext instance, different requests cannot race on the same instance's fields. The lazy initialization on lines 604-605 is only accessed by the single request handler processing that specific request.
3. Confusion between shared data and shared context: The comment in `router_metrics.go:100-102` refers to the cached plan data (`typeFieldUsageInfo`) being reused across requests via the planner cache. This is separate from the operationContext wrapper itself, which is always fresh per request. The two concepts are distinct.

Likely an incorrect or invalid review comment.
router/core/router_metrics.go (6)
17-19: LGTM: Clean interface extension for Prometheus support. The interface changes clearly separate GraphQL and Prometheus export paths while maintaining backward compatibility with the updated `GraphQLMetricsExporter` type.
26-31: LGTM: Struct fields updated to support dual export paths. The addition of `prometheusMetricsExporter` alongside the updated `gqlMetricsExporter` type enables the router to export metrics to both GraphQL and Prometheus sinks.
43-52: LGTM: Constructor properly initializes new fields. All new fields are correctly initialized from the config.
57-69: LGTM: Prometheus tracking enabled conditionally. Line 64 correctly sets `PrometheusUsageInfoTrack` based on exporter presence, ensuring tracking overhead is only incurred when the Prometheus exporter is configured.
75-81: LGTM: Accessor methods updated. The accessor methods correctly return the new exporter types.
83-133: LGTM: Efficient use of cached metrics conversion. Line 109 now uses `GetTypeFieldUsageInfoMetrics()`, which avoids repeated allocations by caching the converted metrics data. This optimization aligns with the PR's goal to track all operations efficiently. Note: see the comment on `context.go` lines 601-608 regarding potential race conditions in the caching implementation.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/pkg/config/config.schema.json (1)
1236-1282: Lock down schema_usage to reject legacy SampleRate and unknown keys. Currently `additionalProperties` is not set on `schema_usage`, so a legacy `sample_rate` may be silently accepted and ignored. Tighten it and keep the exporter properties as-is.
```diff
-    "schema_usage": {
-      "type": "object",
-      "description": "Configure schema field usage metrics for Prometheus",
-      "properties": {
+    "schema_usage": {
+      "type": "object",
+      "description": "Configure schema field usage metrics for Prometheus",
+      "additionalProperties": false,
+      "properties": {
         "enabled": {
```

Optionally, add a custom pre-validation check to emit a clearer error if sample_rate is present.
♻️ Duplicate comments (1)
router/core/router_metrics.go (1)
19-19: Consider using `GQLMetricsExporter` for proper Go initialism formatting. As noted in a previous review, Go's style guide recommends using all caps for initialisms like "GQL" rather than "Gql".
🧹 Nitpick comments (10)
router/internal/graphqlmetrics/exporter.go (2)
52-63: Clarify the QueueSize comment. The queue holds items (T), not batches. Update the comment to avoid confusion.
382-416: Shutdown sequencing could prolong termination. Requests are canceled only in defer, and backoff sleeps are not cancelable. The diff above cancels earlier and makes sleeps cancelable to avoid long waits on shutdown.
router/pkg/config/config.schema.json (3)
1179-1185: Default mismatch: listen_addr. Schema default is "localhost:8088" but code (config.go Line 102) uses "127.0.0.1:8088" and the defaults files do too. Align the schema default.
```diff
-          "default": "localhost:8088"
+          "default": "127.0.0.1:8088"
```
462-466: Default + description mismatch: websocket.forward_upgrade_query_params.enabled. Schema sets default false, the description says default true, and the code default is true. Align to true.
```diff
-          "default": false,
-          "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."
+          "default": true,
+          "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."
```
633-637: Default mismatch: access_logs.add_stacktrace. Schema default is true; code default is false (config.go Line 883). Align to false.
```diff
-          "default": true
+          "default": false
```

router/core/router_metrics.go (2)
140-170: Document the hash field choice. The nil check at lines 141-143 properly addresses the defensive programming concern from the previous review. However, line 159 uses `operationContext.sha256Hash` (original operation hash) while the GraphQL export path uses `operationContext.HashString()` (normalized hash with variables). According to the previous review analysis, this difference is intentional: Prometheus uses the original hash for operation correlation. Please add a brief comment at line 159 explaining this choice to prevent future confusion.
Apply this diff to document the hash semantics:
```diff
 item := &graphqlmetricsv1.SchemaUsageInfo{
 	TypeFieldMetrics: operationContext.GetTypeFieldUsageInfoMetrics(),
 	OperationInfo: &graphqlmetricsv1.OperationInfo{
 		Type: opType,
-		Hash: operationContext.sha256Hash,
+		Hash: operationContext.sha256Hash, // Original operation hash for Prometheus correlation (vs normalized hash used in GraphQL export)
 		// parsed operation names are re-used across requests
 		// for that reason, we need to copy the name, or it might get corrupted
 		Name: strings.Clone(operationContext.name),
 	},
```
145-153: Consider extracting the repeated operation type mapping. The operation type switch (lines 145-153) is duplicated from the GraphQL export method (lines 93-101). Extracting this into a helper function would improve maintainability.

Example refactor:

```go
func operationTypeToProto(opType OperationType) graphqlmetricsv1.OperationType {
	switch opType {
	case OperationTypeQuery:
		return graphqlmetricsv1.OperationType_QUERY
	case OperationTypeMutation:
		return graphqlmetricsv1.OperationType_MUTATION
	case OperationTypeSubscription:
		return graphqlmetricsv1.OperationType_SUBSCRIPTION
	default:
		return graphqlmetricsv1.OperationType_QUERY
	}
}
```

Then use `opType := operationTypeToProto(operationContext.opType)` in both export methods.

router/pkg/graphqlschemausage/schemausage.go (3)
64-104: Type field visitor preallocation and path handling look correct. Preallocating `typeFieldUsageInfo` with a small fixed capacity and switching to `pathCopy := make([]string, len(path)+1)` plus `copy` avoids the extra allocation of `slices.Clone(append(...))` while still guaranteeing that:

- `path == nil` is handled safely (`len(path)` is 0, `copy` is a no-op).
- Each `pathCopy` is independent and only shared intentionally between the direct field entry and its `IndirectInterfaceField` variant.
- Recursion passes `pathCopy` down and always re-allocates for children, so parent paths are never mutated.

The constants (capacity 32 and per-field path allocations) are modest and appropriate for typical query shapes; if future workloads show much larger field counts, benchmarks in this package can guide tuning, but there's no correctness or obvious perf risk here.

107-131: Argument usage slice preallocation is safe and aligns with the usage pattern. Initializing `visitor.usage` with `make([]*graphqlmetrics.ArgumentUsageInfo, 0, 16)` is a straightforward allocation win for common cases and doesn't change behavior; all appends still produce the same sequence of entries as before, and the walker lifecycle remains unchanged. If production traces show higher argument counts per operation, you can revisit the `16` heuristic, but as-is this is a low-risk, localized improvement.

159-244: Input usage preallocation and enum handling changes preserve semantics. The new preallocations in input usage look good:

- `usage: make([]*graphqlmetrics.InputUsageInfo, 0, 16)` in `GetInputUsageInfo` matches the typical small number of variables and reduces slice growth.
- In `traverseVariable`, setting `usageInfo.Path = []string{parentTypeName, fieldName}` when `parentTypeName != ""` allocates an exact-length slice and is equivalent to the previous append-based construction, while avoiding extra capacity and potential future aliasing.
- For enum arrays, `usageInfo.EnumValues = make([]string, len(arr))` followed by index assignment is correct and avoids repeated `append` calls.

These changes don't alter how `appendUniqueUsage` and `infoEquals` behave, so deduplication semantics remain the same while allocations are slightly reduced.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
router/core/router_metrics.go (5 hunks)
router/internal/graphqlmetrics/exporter.go (5 hunks)
router/internal/graphqlmetrics/exporter_bench_test.go (1 hunks)
router/pkg/config/config.go (1 hunks)
router/pkg/config/config.schema.json (1 hunks)
router/pkg/config/testdata/config_defaults.json (1 hunks)
router/pkg/config/testdata/config_full.json (1 hunks)
router/pkg/graphqlschemausage/schemausage.go (6 hunks)
router/pkg/graphqlschemausage/schemausage_bench_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/testdata/config_defaults.json
router/pkg/config/config.schema.json
router/core/router_metrics.go
router/pkg/config/testdata/config_full.json
router/pkg/graphqlschemausage/schemausage.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
router/pkg/config/config.schema.json
router/core/router_metrics.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function at lines 571-578, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the computed query hash before any APQ operations occur. There's also a test case that verifies this behavior.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.
Applied to files:
router/core/router_metrics.go
router/pkg/config/testdata/config_full.json
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the query body before any APQ operations occur.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-12T13:50:45.964Z
Learnt from: Noroth
Repo: wundergraph/cosmo PR: 2132
File: router-plugin/plugin.go:139-146
Timestamp: 2025-08-12T13:50:45.964Z
Learning: In the Cosmo router plugin system, the plugin framework creates its own logger independently. When creating a logger in NewRouterPlugin, it's only needed for the gRPC server setup (passed via setup.GrpcServerInitOpts.Logger) and doesn't need to be assigned back to serveConfig.Logger.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router/pkg/config/testdata/config_full.json
🧬 Code graph analysis (6)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
Sink (9-18), SinkErrorHandler (23-23)
router/pkg/config/config.go (2)
router/internal/graphqlmetrics/exporter.go (1)
Exporter (17-33)
router/pkg/metric/config.go (1)
PrometheusSchemaFieldUsageExporter(48-53)
router/core/router_metrics.go (5)
router/core/operation_metrics.go (1)
OperationMetrics (31-40)
router/internal/graphqlmetrics/graphql_exporter.go (1)
GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
PrometheusMetricsExporter (13-15)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (10)
OperationType (26-26, 58-60, 62-64, 71-73), OperationType_QUERY (29-29), OperationType_MUTATION (30-30), OperationType_SUBSCRIPTION (31-31), SchemaUsageInfo (130-153, 168-168, 183-185)
router/core/context.go (4)
OperationType (503-503), OperationTypeQuery (506-506), OperationTypeMutation (507-507), OperationTypeSubscription (508-508)
router/internal/graphqlmetrics/exporter_bench_test.go (1)
router/internal/graphqlmetrics/exporter.go (3)
ExporterSettings (52-63), RetryOptions (35-40), NewExporter (84-120)
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
router/pkg/graphqlschemausage/schemausage_test.go (1)
FakeFactory (527-529)
router/pkg/graphqlschemausage/schemausage.go (4)
GetTypeFieldUsageInfo (14-23), GetArgumentUsageInfo (107-124), GetInputUsageInfo (159-171), TypeFieldMetrics (26-26)
router/pkg/graphqlschemausage/schemausage.go (1)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (9)
TypeFieldUsageInfo (476-493, 508-508, 523-525), ArgumentUsageInfo (569-582, 597-597, 612-614), InputUsageInfo (644-659, 674-674, 689-691)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: build-router
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./events)
- GitHub Check: image_scan (nonroot)
- GitHub Check: image_scan
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_test
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: build_test
🔇 Additional comments (10)
router/internal/graphqlmetrics/exporter_bench_test.go (1)
27-66: Benchmarks cover key hot paths; LGTM
Good coverage of buffer pooling, high-throughput, batch cycling, growth, and parallel Record. Minor: time.Sleep in benchmarks can add variance, but acceptable here.
Also applies to: 68-111, 112-156, 157-197, 198-242
router/pkg/config/testdata/config_defaults.json (1)
66-71: Defaults alignment looks good
Exporter defaults match code/env defaults (4096/12800/1s/10s). LGTM.
router/pkg/config/testdata/config_full.json (1)
96-101: Config shape migration verified
Exporter object replaces SampleRate as intended; values are consistent with defaults/examples. LGTM.
router/pkg/config/config.go (1)
116-119: The two PrometheusSchemaFieldUsageExporter definitions serve distinct purposes and should not be unified:
- router/pkg/config/config.go:121 is a configuration struct with YAML and environment variable tags for unmarshaling external configuration with defaults.
- router/pkg/metric/config.go:48 is an internal runtime struct used by the metrics package to track exporter state, with no serialization concerns.
This separation of concerns is intentional and appropriate. Consolidating them would couple the metric package to configuration unmarshaling logic and add unnecessary struct tags where they don't belong. The apparent duplication is justified.
Likely an incorrect or invalid review comment.
router/core/router_metrics.go (4)
3-5: Good use of standard library.
The strings package import supports the replacement of the custom strCopy implementation with strings.Clone(), which is a cleaner approach.
24-53: LGTM!
The struct fields and constructor are updated consistently. The new prometheusMetricsExporter field is properly initialized in NewRouterMetrics.
76-82: LGTM!
The getter methods correctly expose both exporters through the RouterMetrics interface.
84-138: LGTM!
The method improvements are well-executed:
- Added a defensive nil check for the exporter (lines 89-91)
- Replaced the custom strCopy with the standard strings.Clone (line 122)
- Updated to use GetTypeFieldUsageInfoMetrics() for cleaner data access
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
106-194: Benchmarks are well-structured and focus on the right hot paths
Each benchmark:
- Calls setupBenchmark once, then uses b.ResetTimer() and b.ReportAllocs() to keep setup out of the measured section.
- Exercises a single concern per benchmark (GetTypeFieldUsageInfo, GetArgumentUsageInfo, GetInputUsageInfo, IntoGraphQLMetrics, and the combined end-to-end flow), with errors surfaced via b.Fatal.
- Prevents compiler optimizations by assigning results to a blank variable.
The end-to-end benchmark intentionally reuses the precomputed plan, docs, and variables to focus on schema usage extraction rather than planning, which matches the PR's perf focus. This all looks good and should give you stable allocation/CPU signals for regression tracking.
19-104: No issues found; original concern is unfounded.
The pattern of calling astjson.ParseBytes(op.Input.Variables) is safe and well-established. The Variables field originates as json.RawMessage (a []byte type with an omitempty tag), which defaults to an empty slice rather than nil when unmarshaled. This empty slice is valid JSON input for astjson.ParseBytes. The identical pattern already exists and is tested in schemausage_test.go (line 206) with successful results, confirming the code path is sound.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/internal/graphqlmetrics/exporter_test.go (1)
70-71: Minor: prefer require.NoError for errors
Switch to require.NoError for clarity. Applies to all similar occurrences.
- require.Nil(t, err)
+ require.NoError(t, err)
Also applies to: 149-150, 188-189, 222-223, 345-346, 421-422
♻️ Duplicate comments (3)
router/internal/graphqlmetrics/exporter.go (3)
56-57: Doc nit: QueueSize counts items, not batches
The queue holds items (chan T), not batches. Update the comment for accuracy.
- // QueueSize is the maximum number of batches allowed in queue at a given time.
+ // QueueSize is the maximum number of items allowed in the queue at a given time.
272-290: Backoff sleep not cancelable; goroutines may linger on shutdown
Make backoff waits respect e.exportRequestContext so Shutdown can exit promptly.
-	// Wait for the specified backoff period
-	sleepDuration := b.Duration()
-	e.logger.Debug("Retrying export after backoff",
+	// Wait for the specified backoff period (cancelable)
+	sleepDuration := b.Duration()
+	e.logger.Debug("Retrying export after backoff",
 		zap.Int("batch_size", len(batch)),
 		zap.Int("retry", retry),
 		zap.Duration("sleep", sleepDuration),
 	)
-
-	time.Sleep(sleepDuration)
+	select {
+	case <-e.exportRequestContext.Done():
+		e.logger.Debug("Export cancelled during backoff")
+		return
+	case <-time.After(sleepDuration):
+	}
381-394: Shutdown can hang; cancel exports earlier
Cancel export requests before waiting, not in the defer, so sleeping/blocked exports exit promptly.
func (e *Exporter[T]) Shutdown(ctx context.Context) error {
	e.logger.Debug("Shutdown started")
	ticker := time.NewTicker(time.Millisecond * 100)
-	defer func() {
-		ticker.Stop()
-		// Cancel all export requests
-		e.cancelAllExportRequests()
-		// Close the sink
-		if err := e.sink.Close(ctx); err != nil {
-			e.logger.Error("Error closing sink", zap.Error(err))
-		}
-		e.logger.Debug("Shutdown complete")
-	}()
+	defer func() {
+		ticker.Stop()
+		// Close the sink
+		if err := e.sink.Close(ctx); err != nil {
+			e.logger.Error("Error closing sink", zap.Error(err))
+		}
+		e.logger.Debug("Shutdown complete")
+	}()

	// First close the acceptTrafficSema to stop accepting new items
	close(e.acceptTrafficSema)
	// Then trigger the shutdown signal for the exporter goroutine to stop
	// It will drain the queue and send the remaining items
	close(e.shutdownSignal)
+	// Cancel all export requests to unblock backoff/sinks started before shutdown
+	e.cancelAllExportRequests()
🧹 Nitpick comments (5)
router/internal/graphqlmetrics/prometheus_sink.go (1)
104-106: Use int64 for counts to match the metric API and avoid unnecessary casts
Store counts as int64; remove the per-iteration int conversions.
-func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int {
-	aggregatedCounts := make(map[aggregatedUsageKey]int)
+func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int64 {
+	aggregatedCounts := make(map[aggregatedUsageKey]int64)
@@
-	// Increment count, using field.Count if available, otherwise 1
-	if field.Count > 0 {
-		aggregatedCounts[key] += int(field.Count)
-	} else {
-		aggregatedCounts[key]++
-	}
+	// Increment count, using field.Count if available, otherwise 1
+	if field.Count > 0 {
+		aggregatedCounts[key] += int64(field.Count)
+	} else {
+		aggregatedCounts[key] += 1
+	}
- for key, count := range aggregatedCounts { + for key, count := range aggregatedCounts { @@ - int64(count), + count,Also applies to: 140-145, 81-87
router/demo.config.yaml (1)
16-25: LGTM: concise demo of Prometheus schema_usage
Sample config is coherent. Optional: add a comment noting include_operation_sha increases label cardinality.
router/internal/graphqlmetrics/exporter_test.go (1)
385-391: Reduce flakiness: prefer Eventually over fixed Sleep
Replace the fixed sleep with Eventually to wait for the flush deterministically.
- time.Sleep(200 * time.Millisecond)
-
- defer require.Nil(t, e.Shutdown(context.Background()))
-
- require.Equal(t, 1, len(c.publishedAggregations))
- require.Equal(t, 5, len(c.publishedAggregations[0]))
+ require.Eventually(t, func() bool {
+ 	c.mu.Lock()
+ 	defer c.mu.Unlock()
+ 	return len(c.publishedAggregations) == 1 && len(c.publishedAggregations[0]) == 5
+ }, 2*time.Second, 50*time.Millisecond)
+ defer require.NoError(t, e.Shutdown(context.Background()))
router/pkg/config/config.schema.json (1)
1249-1281: Tighten exporter schema and fix unit doc
Disallow unknown keys and include 'ms' in the interval docs to match usage.
- "exporter": {
- 	"type": "object",
- 	"description": "Configuration for the schema usage exporter",
- 	"properties": {
+ "exporter": {
+ 	"type": "object",
+ 	"description": "Configuration for the schema usage exporter",
+ 	"additionalProperties": false,
+ 	"properties": {
@@
- "interval": {
+ "interval": {
  	"type": "string",
  	"default": "2s",
  	"duration": { "minimum": "100ms" },
- 	"description": "The interval at which the schema usage queue is flushed. The period is specified as a string with a number and a unit, e.g. 10s, 1m. The supported units are 's', 'm', 'h'."
+ 	"description": "The interval at which the schema usage queue is flushed. Specify a duration, e.g. 100ms, 10s, 1m. Supported units: 'ms', 's', 'm', 'h'."
  },
107-114: Prefer pooling []T (not *[]T) to simplify and avoid pointer indirection
Safer and simpler: pool raw slices; ownership is already fixed by deferring the put in the goroutine.
- batchBufferPool: &sync.Pool{
- 	New: func() any {
- 		// Pre-allocate slice with batch size capacity
- 		buffer := make([]T, 0, settings.BatchSize)
- 		return &buffer
- 	},
- },
+ batchBufferPool: &sync.Pool{
+ 	New: func() any {
+ 		return make([]T, 0, settings.BatchSize)
+ 	},
+ },
@@
-func (e *Exporter[T]) getBatchBuffer() []T {
-	bufferPtr := e.batchBufferPool.Get().(*[]T)
-	buffer := *bufferPtr
-	// Ensure the buffer is empty (should already be, but be defensive)
-	return buffer[:0]
-}
+func (e *Exporter[T]) getBatchBuffer() []T {
+	bufAny := e.batchBufferPool.Get()
+	if bufAny == nil {
+		return make([]T, 0, e.settings.BatchSize)
+	}
+	buf := bufAny.([]T)
+	return buf[:0]
+}
@@
-func (e *Exporter[T]) putBatchBuffer(buffer []T) {
-	// Reset the slice to zero length while keeping capacity
-	buffer = buffer[:0]
-	e.batchBufferPool.Put(&buffer)
-}
+func (e *Exporter[T]) putBatchBuffer(buf []T) {
+	e.batchBufferPool.Put(buf[:0])
+}
Also applies to: 154-161, 163-169
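As a standalone sketch of the pooling pattern this comment proposes (pooling []T values directly; batchSize and the int element type are placeholders):

```go
package main

import (
	"fmt"
	"sync"
)

const batchSize = 4

// pool holds []int values directly; callers reslice to length 0 on both
// get and put so a reused buffer never leaks old elements.
var pool = sync.Pool{
	New: func() any { return make([]int, 0, batchSize) },
}

func getBuffer() []int { return pool.Get().([]int)[:0] }

func putBuffer(b []int) { pool.Put(b[:0]) } // reset length, keep capacity

func main() {
	b := getBuffer()
	b = append(b, 1, 2, 3)
	fmt.Println(len(b), cap(b))
	putBuffer(b)
	b2 := getBuffer()
	fmt.Println(len(b2)) // always empty, whether reused or freshly allocated
}
```

One trade-off worth noting: putting a bare slice into a sync.Pool boxes the slice header into an interface, which is the allocation the original *[]T indirection avoids; the comment's point is that the simpler form is usually worth that cost.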
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
router/demo.config.yaml (2 hunks)
router/internal/graphqlmetrics/exporter.go (5 hunks)
router/internal/graphqlmetrics/exporter_test.go (10 hunks)
router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
router/pkg/config/config.go (1 hunks)
router/pkg/config/config.schema.json (1 hunks)
router/pkg/config/config_test.go (1 hunks)
router/pkg/config/testdata/config_defaults.json (1 hunks)
router/pkg/config/testdata/config_full.json (1 hunks)
router/pkg/graphqlschemausage/schemausage.go (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- router/pkg/config/config_test.go
- router/pkg/graphqlschemausage/schemausage.go
- router/pkg/config/testdata/config_full.json
- router/pkg/config/testdata/config_defaults.json
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router/internal/graphqlmetrics/exporter_test.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/config.schema.json
🧬 Code graph analysis (4)
router/internal/graphqlmetrics/exporter_test.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
NewGraphQLMetricsExporter (19-39)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
RequestInfo(75-82)RequestInfo(97-97)RequestInfo(112-114)
router/pkg/config/config.go (2)
router/pkg/metric/config.go (1)
PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
PrometheusSchemaFieldUsageExporter(285-290)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
Store (149-162)
router/pkg/otel/attributes.go (5)
WgOperationName (10-10), WgOperationType (11-11), WgOperationSha256 (59-59), WgGraphQLFieldName (60-60), WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
TypeFieldMetrics(26-26)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
Sink (9-18), SinkErrorHandler (23-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: integration_test (./events)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan
- GitHub Check: build_test
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/pkg/config/config.go (1)
116-119: LGTM: exporter config shape and defaults
Exporter block and env tags look consistent with the schema and intended wiring.
Also applies to: 121-126
router/internal/graphqlmetrics/prometheus_sink.go (1)
121-131: Avoid potential nil deref on field entries
TypeFieldMetrics is a slice of pointers; a nil element will panic on Path/TypeNames access.
-	for _, field := range usageInfo.TypeFieldMetrics {
+	for _, field := range usageInfo.TypeFieldMetrics {
+		if field == nil {
+			continue
+		}
 		// Skip fields without valid parent type or path
 		if len(field.Path) == 0 || len(field.TypeNames) < 1 {
 			continue
 		}
⛔ Skipped due to learnings
Learnt from: SkArchon Repo: wundergraph/cosmo PR: 2137 File: router/pkg/pubsub/redis/adapter.go:114-116 Timestamp: 2025-08-14T16:47:03.744Z Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
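The nil-guard from the diff above, shown standalone (fieldUsage is a hypothetical stand-in for the proto's TypeFieldUsageInfo pointer type):

```go
package main

import "fmt"

type fieldUsage struct {
	Path      []string
	TypeNames []string
}

// countValid walks a slice of pointers, skipping nil entries and entries
// without a parent type or path, mirroring the suggested guard.
func countValid(fields []*fieldUsage) int {
	n := 0
	for _, f := range fields {
		if f == nil {
			continue // a nil element would otherwise panic on f.Path
		}
		if len(f.Path) == 0 || len(f.TypeNames) < 1 {
			continue
		}
		n++
	}
	return n
}

func main() {
	fields := []*fieldUsage{
		{Path: []string{"user", "id"}, TypeNames: []string{"User"}},
		nil, // must not panic
		{Path: nil, TypeNames: []string{"User"}}, // skipped: no path
	}
	fmt.Println(countValid(fields))
}
```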
0cf3876 to 307df0c (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (4)
router/internal/exporter/exporter.go (2)
26-33: Tighten shutdown/backoff behavior and simplify the batch buffer pool
Two areas are worth tightening up here:
- Shutdown responsiveness under long backoff
exportBatchWithRetry uses time.Sleep(sleepDuration) in the retry loop, and Shutdown only calls cancelAllExportRequests() in a defer after waiting for inflightBatches to reach 0. Under a large MaxDuration or slow sinks, shutdown can be substantially delayed because those sleeps are not interruptible.
You can make retries respect shutdown by:
- Cancelling
exportRequestContext at the start of Shutdown (not only in the defer).
- Making the backoff wait cancelable:
func (e *Exporter[T]) exportBatchWithRetry(batch []T) { @@ - for retry < e.settings.RetryOptions.MaxRetry { + for retry < e.settings.RetryOptions.MaxRetry { retry++ @@ - // Wait for the specified backoff period - sleepDuration := b.Duration() - e.logger.Debug("Retrying export after backoff", + // Wait for the specified backoff period (cancelable) + sleepDuration := b.Duration() + e.logger.Debug("Retrying export after backoff", zap.Int("batch_size", len(batch)), zap.Int("retry", retry), zap.Duration("sleep", sleepDuration), ) - - time.Sleep(sleepDuration) + select { + case <-e.exportRequestContext.Done(): + e.logger.Debug("Export cancelled during backoff") + return + case <-time.After(sleepDuration): + } @@ func (e *Exporter[T]) Shutdown(ctx context.Context) error { e.logger.Debug("Shutdown started") - ticker := time.NewTicker(time.Millisecond * 100) - defer func() { - ticker.Stop() - // Cancel all export requests - e.cancelAllExportRequests() + // Cancel export requests early to unblock long-running exports/backoff + e.cancelAllExportRequests() + + ticker := time.NewTicker(time.Millisecond * 100) + defer func() { + ticker.Stop() // Close the sink if err := e.sink.Close(ctx); err != nil { e.logger.Error("Error closing sink", zap.Error(err)) } e.logger.Debug("Shutdown complete") }()
- Batch buffer pool ergonomics
- The current pool stores
*[]T and re-wraps slice headers via getBatchBuffer/putBatchBuffer. This works, but it's less idiomatic and slightly harder to reason about than pooling []T values directly.
- Consider switching the pool to hold
[]Tvalues and dropping the pointer dance:- batchBufferPool *sync.Pool // Pool for batch slice buffers to reduce allocations + batchBufferPool *sync.Pool // Pool of []T to reduce allocations @@ - batchBufferPool: &sync.Pool{ - New: func() any { - // Pre-allocate slice with batch size capacity - buffer := make([]T, 0, settings.BatchSize) - return &buffer - }, - }, + batchBufferPool: &sync.Pool{ + New: func() any { + // Pre-allocate slice with batch size capacity + return make([]T, 0, settings.BatchSize) + }, + }, @@ func (e *Exporter[T]) getBatchBuffer() []T { - bufferPtr := e.batchBufferPool.Get().(*[]T) - buffer := *bufferPtr - // Ensure the buffer is empty (should already be, but be defensive) - return buffer[:0] + bufAny := e.batchBufferPool.Get() + if bufAny == nil { + return make([]T, 0, e.settings.BatchSize) + } + buf := bufAny.([]T) + return buf[:0] } @@ func (e *Exporter[T]) putBatchBuffer(buffer []T) { - // Reset the slice to zero length while keeping capacity - buffer = buffer[:0] - e.batchBufferPool.Put(&buffer) + e.batchBufferPool.Put(buffer[:0]) } @@ func (e *Exporter[T]) prepareAndSendBatch(batch []T) { @@ - go func() { + go func(b []T) { defer e.inflightBatches.Dec() - defer e.putBatchBuffer(batch) // Return buffer to pool after export completes - e.exportBatchWithRetry(batch) - }() + defer e.putBatchBuffer(b) // Return buffer to pool after export completes + e.exportBatchWithRetry(b) + }(batch)This keeps the concurrent semantics you already fixed (buffer returned only after export completes) while making shutdown behavior more predictable and the pool implementation easier to follow.
Based on learnings
Also applies to: 107-114, 154-170, 230-310, 377-413
122-152: Gate retry validation behind RetryOptions.Enabled to allow clean disablement
As written,
validate()rejects zero/negativeMaxDuration,Interval, andMaxRetryeven when retries are disabled. This makes common patterns like “enabled=falsewith zero-valued retry fields” invalid, and it diverges from the repo’s existing behavior of not over-validating options that are turned off (e.g., retry transport config). Consider only enforcing these constraints when retries are actually enabled.func (e *Exporter[T]) validate() error { @@ - if e.settings.ExportTimeout <= 0 { - return fmt.Errorf("export timeout must be positive") - } - - if e.settings.RetryOptions.MaxDuration <= 0 { - return fmt.Errorf("retry max duration must be positive") - } - - if e.settings.RetryOptions.Interval <= 0 { - return fmt.Errorf("retry interval must be positive") - } - - if e.settings.RetryOptions.MaxRetry <= 0 { - return fmt.Errorf("retry max retry must be positive") - } + if e.settings.ExportTimeout <= 0 { + return fmt.Errorf("export timeout must be positive") + } + + if e.settings.RetryOptions.Enabled { + if e.settings.RetryOptions.MaxDuration <= 0 { + return fmt.Errorf("retry max duration must be positive") + } + if e.settings.RetryOptions.Interval <= 0 { + return fmt.Errorf("retry interval must be positive") + } + if e.settings.RetryOptions.MaxRetry <= 0 { + return fmt.Errorf("retry max retry must be positive") + } + } @@ return nil }This lets config set
enabled: falsewith zero defaults without tripping validation, while still protecting enabled retries from bad values.Based on learnings
router/internal/graphqlmetrics/prometheus_sink.go (1)
39-45: Guard against nil Logger in NewPrometheusSink to avoid panics
cfg.Logger.With(...) will panic if cfg.Logger is nil. This was already raised in a previous review and is still present. Initialize a local logger with a
zap.NewNop()fallback before callingWith:func NewPrometheusSink(cfg PrometheusSinkConfig) *PrometheusSink { - return &PrometheusSink{ - metricStore: cfg.MetricStore, - logger: cfg.Logger.With(zap.String("component", "prometheus_sink")), - includeOpSha: cfg.IncludeOpSha, - } + logger := cfg.Logger + if logger == nil { + logger = zap.NewNop() + } + return &PrometheusSink{ + metricStore: cfg.MetricStore, + logger: logger.With(zap.String("component", "prometheus_sink")), + includeOpSha: cfg.IncludeOpSha, + } }router/core/router_metrics.go (1)
140-170: Prometheus schema usage export path looks correct; consider documenting hash choice.
The Prometheus-specific exporter mirrors the GraphQL path, is defensively guarded on
prometheusMetricsExporter == nil, and uses GetTypeFieldUsageInfoMetrics plus strings.Clone to avoid mutating shared data. Using operationContext.sha256Hash here (vs. HashString() in the GraphQL exporter) aligns with the distinction between original vs. normalized hashes; adding a short comment to that effect would make the intentional difference clearer to future readers.
- Hash: operationContext.sha256Hash,
+ // Use the original operation hash for Prometheus correlation; GraphQL exporter uses the
+ // normalized analytics hash via HashString().
+ Hash: operationContext.sha256Hash,
🧹 Nitpick comments (6)
router/pkg/metric/config.go (1)
48-53: Consider consolidating duplicate type definitions.
The
PrometheusSchemaFieldUsageExporter type is defined identically in both the router/pkg/config and router/pkg/metric packages (and also in router-tests/testenv). While this separation may be intentional for package isolation, it creates maintenance overhead when fields need to be updated. Consider:
- Defining the type once in a shared package
- Or using type aliases to reference a single source of truth
This is a minor maintainability concern and can be addressed in a future refactor if desired.
router/pkg/config/config.schema.json (1)
1249-1280: Clarify exporter.interval units and queue_size semantics in schema description
The exporter config itself looks correct and matches
PrometheusSchemaFieldUsageExporter, but the descriptions are slightly misleading:
- Line 1269 sets a minimum of
"100ms" while the description lists supported units as only 's', 'm', 'h'. Either mention 'ms' as supported or raise the minimum to 1s to match the text.
queue_sizeas “number of schema usage items allowed in queue”; in the generic exporter,QueueSizeis actually the number of batches in the internal queue, each up tobatch_size. Consider rephrasing to “batches” (or “batches of up to batch_size items”) so users can reason about memory/cardinality correctly.router-tests/prometheus_improved_test.go (2)
82-84: Consider replacing fixedtime.Sleepwith a poll/helper to reduce flakinessAll these tests wait for exporter flushes using fixed sleeps (200ms or 100ms). With parallel subtests and CI variance, this can intermittently be too short or unnecessarily long.
Consider a small helper that polls
promRegistry.Gather()until the expected metric family appears or a timeout elapses, instead of relying on fixedSleepdurations. This will make the tests more robust to timing jitter while still exercising the exporter behavior.Also applies to: 166-168, 229-231, 315-317, 359-361, 413-415
177-199: Ordering assumptions inmultiple usages of field are countedmay be brittleThis test assumes a specific ordering of
schemaUsageMetricswhen asserting on[0],[1],[2]. With the new aggregation path and Prometheus sink, the series order is effectively an implementation detail and may change.To avoid order-dependent flakiness, consider locating the three metrics by
(field_name, parent_type)label pairs (similar to thefieldTypePairsmap used in the first subtest) and asserting the counters on those, instead of relying on positions 0/1/2.router/core/router.go (1)
33-33: GraphQL metrics exporter wiring looks correct; consider future configurability.Using
graphqlmetrics.NewGraphQLMetricsExporterwithexporter.NewDefaultExporterSettings()is a clean migration to the generic exporter and keeps behavior predictable. If you later want to tune batch/queue/interval per deployment, this spot is the natural hook to pass settings derived from config instead of the hardcoded defaults.Also applies to: 822-833
router/demo.config.yaml (1)
8-25: Demo config matches new telemetry and schema_usage model.The added
graphql_metrics,traffic_shaping.all.retry, andtelemetry.metrics.{otlp,prometheus.schema_usage}sections are consistent with the updated config structs and MetricConfigFromTelemetry wiring; enablingprometheus.schema_usagewithotlp.enabled: falseclearly showcases the new Prometheus sink path. You might optionally add commented-outbatch_size,queue_size, andexport_timeoutunderschema_usage.exporterto advertise those knobs in the demo file.Also applies to: 40-40
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (26)
router-tests/prometheus_improved_test.go (16 hunks)
router-tests/testenv/testenv.go (3 hunks)
router/core/context.go (2 hunks)
router/core/graph_server.go (4 hunks)
router/core/operation_metrics.go (3 hunks)
router/core/router.go (4 hunks)
router/core/router_config.go (1 hunks)
router/core/router_metrics.go (5 hunks)
router/demo.config.yaml (2 hunks)
router/internal/exporter/exporter.go (1 hunks)
router/internal/exporter/exporter_bench_test.go (1 hunks)
router/internal/exporter/sink.go (1 hunks)
router/internal/graphqlmetrics/exporter.go (0 hunks)
router/internal/graphqlmetrics/graphql_exporter.go (1 hunks)
router/internal/graphqlmetrics/graphql_exporter_test.go (17 hunks)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1 hunks)
router/internal/graphqlmetrics/prometheus_exporter.go (1 hunks)
router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
router/pkg/config/config.go (1 hunks)
router/pkg/config/config.schema.json (1 hunks)
router/pkg/config/config_test.go (1 hunks)
router/pkg/config/testdata/config_defaults.json (1 hunks)
router/pkg/config/testdata/config_full.json (1 hunks)
router/pkg/graphqlschemausage/schemausage.go (7 hunks)
router/pkg/graphqlschemausage/schemausage_bench_test.go (1 hunks)
router/pkg/metric/config.go (2 hunks)
💤 Files with no reviewable changes (1)
- router/internal/graphqlmetrics/exporter.go
🚧 Files skipped from review as they are similar to previous changes (7)
- router/core/router_config.go
- router/pkg/graphqlschemausage/schemausage.go
- router/pkg/config/config_test.go
- router/internal/graphqlmetrics/graphql_metrics_sink.go
- router/internal/graphqlmetrics/graphql_exporter.go
- router/core/context.go
- router/pkg/config/testdata/config_full.json
🧰 Additional context used
🧠 Learnings (25)
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
- router/pkg/config/config.schema.json
- router/internal/exporter/exporter.go
- router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
- router/pkg/config/config.schema.json
- router/core/router_metrics.go
- router/pkg/config/testdata/config_defaults.json
- router/core/operation_metrics.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.
Applied to files:
- router/pkg/metric/config.go
- router/core/operation_metrics.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
- router/internal/graphqlmetrics/graphql_exporter_test.go
- router-tests/prometheus_improved_test.go
- router/internal/exporter/exporter.go
- router/core/router_metrics.go
- router-tests/testenv/testenv.go
- router/core/operation_metrics.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.
Applied to files:
- router-tests/prometheus_improved_test.go
- router/core/router_metrics.go
- router/core/graph_server.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.
Applied to files:
- router/internal/exporter/exporter.go
- router/core/operation_metrics.go
📚 Learning: 2025-08-28T10:07:27.650Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
Applied to files:
router/internal/exporter/exporter.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().
Applied to files:
- router/internal/exporter/exporter.go
- router/core/operation_metrics.go
📚 Learning: 2025-09-17T20:50:58.518Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/internal/expr/retry_expression.go:63-67
Timestamp: 2025-09-17T20:50:58.518Z
Learning: In router/internal/expr/retry_expression.go, the maintainer SkArchon accepts the "silent no-retry" behavior when a retry expression program isn't precompiled (returning false, nil on cache miss). This is considered acceptable because config reloading is intended as a dev feature, making the potential footgun less problematic in the expected usage context.
Applied to files:
router/internal/exporter/exporter.go
📚 Learning: 2025-08-26T17:26:47.012Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/retrytransport/retry_transport.go:84-90
Timestamp: 2025-08-26T17:26:47.012Z
Learning: In the Cosmo router retry system, MaxDuration has a default value of 15 seconds in config defaults. When MaxDuration is explicitly set to 0, it represents a user choice to disable retry duration limits entirely, so clamping Retry-After durations to 0 in this case is intentional behavior.
Applied to files:
router/internal/exporter/exporter.go
📚 Learning: 2025-08-26T17:26:47.012Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/retrytransport/retry_transport.go:84-90
Timestamp: 2025-08-26T17:26:47.012Z
Learning: In the Cosmo router retry system, MaxDuration has a default value of 10 seconds (envDefault:"10s") in the config. When MaxDuration is explicitly set to 0, it represents a user choice to disable retry duration limits entirely, so clamping Retry-After durations to 0 in this case is intentional behavior.
Applied to files:
router/internal/exporter/exporter.go
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.
Applied to files:
router/internal/exporter/exporter.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
- router/core/router_metrics.go
- router/core/router.go
- router/core/graph_server.go
- router/internal/graphqlmetrics/prometheus_exporter.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function at lines 571-578, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the computed query hash before any APQ operations occur. There's also a test case that verifies this behavior.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the query body before any APQ operations occur.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.
Applied to files:
- router/core/router_metrics.go
- router/core/operation_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
- router/core/router_metrics.go
- router/core/graph_server.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
- router/core/router_metrics.go
- router/core/graph_server.go
📚 Learning: 2025-08-12T13:50:45.964Z
Learnt from: Noroth
Repo: wundergraph/cosmo PR: 2132
File: router-plugin/plugin.go:139-146
Timestamp: 2025-08-12T13:50:45.964Z
Learning: In the Cosmo router plugin system, the plugin framework creates its own logger independently. When creating a logger in NewRouterPlugin, it's only needed for the gRPC server setup (passed via setup.GrpcServerInitOpts.Logger) and doesn't need to be assigned back to serveConfig.Logger.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.
Applied to files:
- router/core/router_metrics.go
- router/core/graph_server.go
📚 Learning: 2025-09-29T10:28:07.122Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2192
File: router/pkg/config/config.go:1028-1029
Timestamp: 2025-09-29T10:28:07.122Z
Learning: The deprecation strategy for IntrospectionEnabled field in router/pkg/config/config.go is to first mark it as deprecated, then migrate all call sites to use IntrospectionConfig.Enabled, and finally remove the deprecated field. The envDefault tag should be kept during the deprecation period for backward compatibility.
Applied to files:
router-tests/testenv/testenv.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
Applied to files:
- router/core/router.go
- router/core/graph_server.go
📚 Learning: 2025-11-19T15:13:57.821Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2273
File: router/core/graphql_handler.go:0-0
Timestamp: 2025-11-19T15:13:57.821Z
Learning: In the Cosmo router (wundergraph/cosmo), error handling follows a two-phase pattern: (1) Prehandler phase handles request parsing, validation, and setup errors using `httpGraphqlError` and `writeOperationError` (in files like graphql_prehandler.go, operation_processor.go, parse_multipart.go, batch.go); (2) Execution phase handles resolver execution errors using `WriteError` in GraphQLHandler.ServeHTTP. Because all `httpGraphqlError` instances are caught in the prehandler before ServeHTTP is invoked, any error type checks for `httpGraphqlError` in the execution-phase WriteError method are unreachable code.
Applied to files:
- router/core/router.go
- router/core/graph_server.go
📚 Learning: 2025-09-03T11:38:45.855Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2183
File: router/internal/traceclient/traceclient.go:115-118
Timestamp: 2025-09-03T11:38:45.855Z
Learning: In the Cosmo codebase, the exprContext obtained from getExprContext in TraceInjectingRoundTripper.processConnectionMetrics is expected to always be non-nil as part of the existing system behavior, so defensive nil checks are not needed.
Applied to files:
router/core/operation_metrics.go
🧬 Code graph analysis (14)
router/internal/exporter/exporter_bench_test.go (1)
- router/internal/exporter/exporter.go (3): ExporterSettings (52-63), RetryOptions (35-40), NewExporter (84-120)
router/pkg/metric/config.go (2)
- router/pkg/config/config.go (1): PrometheusSchemaFieldUsageExporter (121-126)
- router-tests/testenv/testenv.go (1): PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/config/config.go (2)
- router-tests/testenv/testenv.go (1): PrometheusSchemaFieldUsageExporter (285-290)
- router/pkg/metric/config.go (1): PrometheusSchemaFieldUsageExporter (48-53)
router/internal/graphqlmetrics/graphql_exporter_test.go (2)
- router/internal/graphqlmetrics/graphql_exporter.go (1): NewGraphQLMetricsExporter (20-40)
- router/internal/exporter/exporter.go (2): ExporterSettings (52-63), RetryOptions (35-40)
router-tests/prometheus_improved_test.go (4)
- router-tests/testenv/testenv.go (4): Run (105-122), PrometheusSchemaFieldUsageExporter (285-290), Environment (1763-1799), GraphQLRequest (1939-1947)
- router/pkg/config/config.go (1): PrometheusSchemaFieldUsageExporter (121-126)
- router/pkg/metric/config.go (1): PrometheusSchemaFieldUsageExporter (48-53)
- router/pkg/otel/attributes.go (2): WgOperationName (10-10), WgOperationType (11-11)
router/internal/exporter/exporter.go (1)
- router/internal/exporter/sink.go (2): Sink (9-18), SinkErrorHandler (23-23)
router/core/router_metrics.go (5)
- router/core/operation_metrics.go (1): OperationMetrics (31-40)
- router/internal/graphqlmetrics/graphql_exporter.go (1): GraphQLMetricsExporter (14-16)
- router/internal/graphqlmetrics/prometheus_exporter.go (1): PrometheusMetricsExporter (14-16)
- graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (13): OperationInfo (362-373), OperationInfo (388-388), OperationInfo (403-405), OperationType (26-26), OperationType (58-60), OperationType (62-64), OperationType (71-73), SchemaUsageInfo (130-153), SchemaUsageInfo (168-168), SchemaUsageInfo (183-185), SchemaInfo (428-435), SchemaInfo (450-450), SchemaInfo (465-467)
- router/core/context.go (4): OperationType (503-503), OperationTypeQuery (506-506), OperationTypeMutation (507-507), OperationTypeSubscription (508-508)
router-tests/testenv/testenv.go (2)
- router/pkg/config/config.go (2): PrometheusSchemaFieldUsageExporter (121-126), PrometheusSchemaFieldUsage (115-119)
- router/pkg/metric/config.go (2): PrometheusSchemaFieldUsageExporter (48-53), PrometheusSchemaFieldUsage (42-46)
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
- router/pkg/graphqlschemausage/schemausage_test.go (1): FakeFactory (527-529)
- router/pkg/graphqlschemausage/schemausage.go (4): GetTypeFieldUsageInfo (14-23), GetArgumentUsageInfo (108-125), GetInputUsageInfo (160-172), TypeFieldMetrics (26-26)
router/core/router.go (4)
- router/internal/graphqlmetrics/graphql_exporter.go (1): NewGraphQLMetricsExporter (20-40)
- router/internal/exporter/exporter.go (2): NewDefaultExporterSettings (65-78), Exporter (17-33)
- router/pkg/config/config.go (3): Prometheus (99-113), PrometheusSchemaFieldUsageExporter (121-126), Metrics (137-142)
- router/pkg/metric/config.go (1): PrometheusSchemaFieldUsageExporter (48-53)
router/core/graph_server.go (4)
- router/internal/graphqlmetrics/prometheus_exporter.go (2): PrometheusMetricsExporter (14-16), NewPrometheusMetricsExporter (20-46)
- router/pkg/config/config.go (1): Prometheus (99-113)
- router/internal/exporter/exporter.go (3): ExporterSettings (52-63), Exporter (17-33), RetryOptions (35-40)
- router/core/router_metrics.go (1): NewRouterMetrics (44-53)
router/internal/graphqlmetrics/prometheus_sink.go (3)
- router/pkg/metric/metric_store.go (1): Store (149-162)
- router/pkg/otel/attributes.go (5): WgOperationName (10-10), WgOperationType (11-11), WgOperationSha256 (59-59), WgGraphQLFieldName (60-60), WgGraphQLParentType (61-61)
- router/pkg/graphqlschemausage/schemausage.go (1): TypeFieldMetrics (26-26)
router/core/operation_metrics.go (1)
- router/core/router_metrics.go (1): RouterMetrics (15-22)
router/internal/graphqlmetrics/prometheus_exporter.go (3)
- router/internal/exporter/exporter.go (3): Exporter (17-33), ExporterSettings (52-63), NewExporter (84-120)
- router/pkg/metric/metric_store.go (1): Store (149-162)
- router/internal/graphqlmetrics/prometheus_sink.go (2): NewPrometheusSink (40-46), PrometheusSinkConfig (24-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: build_test
- GitHub Check: Analyze (go)
🔇 Additional comments (22)
router/internal/exporter/sink.go (1)
7-23: Sink interface and error handler look well-shaped for generic exporters
The concurrency-safe `Sink[T]` contract and `SinkErrorHandler` callback cleanly match the needs of the generic `Exporter[T]`; no changes are needed here from a correctness or API-design standpoint.
13-238: Benchmarks comprehensively cover exporter hot paths
The mock sink and the set of benchmarks (allocation, throughput, batch cycling, buffer growth, and parallel recording) are well-structured and map directly onto the exporter's key behaviors. Setup is kept outside the measured sections, and Shutdown is called to flush state, which should give useful and stable signals for future optimizations.
router/pkg/graphqlschemausage/schemausage_bench_test.go (1)
17-194: Schema-usage benchmarks realistically exercise the extraction pipeline
The `setupBenchmark` helper builds a realistic plan and variable set, and each benchmark cleanly focuses on a specific stage (field, argument, input usage, protobuf conversion, and end-to-end). Timer reset and allocation reporting are placed correctly so setup cost isn't counted. This is a solid basis for tracking allocation and CPU impact of the new exporter wiring.
32-39: Usage tracking gating between GraphQL and Prometheus exporters looks correct
The added `trackUsageInfo`/`prometheusTrackUsageInfo` flags and the `reqContext.operation != nil && !SkipLoader` guard cleanly separate GraphQL vs Prometheus schema-usage export while avoiding emissions for planning-only operations. The options struct wiring in `newOperationMetrics` matches the struct fields, so callers can independently toggle both paths.
Also applies to: 77-88, 92-99, 109-116
router/pkg/config/testdata/config_defaults.json (1)
66-71: LGTM! Configuration structure correctly updated.
The replacement of the scalar `SampleRate` field with a nested `Exporter` configuration object aligns with the PR's goal of moving to a batch exporter framework. The default values are consistent with the Go struct definitions.
115-126: LGTM! Configuration types properly defined.
The new nested configuration structure maintains type safety and follows the codebase's environment variable naming conventions. The `envPrefix` propagation ensures clean environment variable names like `PROMETHEUS_SCHEMA_FIELD_USAGE_EXPORTER_BATCH_SIZE`.
6-6: LGTM! Import added for new time.Duration fields.
The time package import is correctly added to support the `Interval` and `ExportTimeout` duration fields in the exporter configuration.
38-40: LGTM! Imports correctly added for new Prometheus exporter.
The imports for the internal `exporter` and `graphqlmetrics` packages are necessary for the new Prometheus schema field usage exporter functionality.
523-523: LGTM! Exporter field properly added to graphMux.
The `prometheusMetricsExporter` field is correctly typed and positioned within the `graphMux` struct for lifecycle management.
771-775: LGTM! Shutdown logic properly implemented.
The Prometheus exporter shutdown is correctly integrated into the `graphMux.Shutdown` method with proper nil checking and error handling.
951-958: LGTM! RouterMetrics properly wired with Prometheus exporter.
The `prometheusMetricsExporter` is correctly passed to `NewRouterMetrics` alongside the existing `gqlMetricsExporter`, enabling both export paths as intended.
924-949: Retry option defaults are necessary per exporter validation logic.
The default values for `MaxRetry`, `MaxDuration`, and `Interval` at lines 934-936 are required. The exporter's `validate()` method (router/internal/exporter/exporter.go:139-149) unconditionally validates that all retry option fields must be positive, regardless of the `Enabled` flag. While these fields are not used when `Enabled=false` (early return at line 261), the validation still enforces positive values, making the defaults necessary to pass validation.
The implementation is correct and follows the existing benchmark test patterns.
router/internal/graphqlmetrics/prometheus_exporter.go (3)
20-46: LGTM! Constructor properly initializes Prometheus exporter.
The constructor correctly:
- Creates a `PrometheusSink` with the provided configuration
- Sets up a non-retrying error handler (appropriate for local Prometheus metrics)
- Wraps the generic exporter with a clean API
The comments explaining why retry is disabled are helpful.
51-59: LGTM! Public API methods are well-documented and properly delegate.
The `RecordUsage` and `Shutdown` methods provide clear documentation and correctly delegate to the underlying generic exporter. The comments explain behavior clearly, including the `synchronous` parameter and return value semantics.
1-59: LGTM! Well-designed wrapper for Prometheus metrics export.
This new file provides a clean, focused API for Prometheus schema usage metrics. The design:
- Properly wraps the generic exporter with Prometheus-specific configuration
- Uses composition over inheritance effectively
- Maintains type safety with the generic `Exporter[*graphqlmetrics.SchemaUsageInfo]`
- Provides clear documentation
- Follows established patterns in the codebase
The implementation is straightforward and maintainable.
router-tests/testenv/testenv.go (1)
279-290: Prometheus schema usage exporter wiring in testenv matches production config
The new `PrometheusSchemaFieldUsageExporter` test type and the construction of `promSchemaUsage` (including test-friendly defaults and optional overrides) are consistent with `router/pkg/metric.PrometheusSchemaFieldUsageExporter` and the new exporter-based design. This should keep tests aligned with real configuration behavior while shortening flush intervals.
Also applies to: 1508-1552
router-tests/prometheus_improved_test.go (1)
352-357: The code uses valid Go 1.22+ syntax and does not require changes
The `for range N` syntax at lines 352-357 and 406-411 is valid Go code. Go 1.22 introduced the ability to range over integers, making `for range n` valid for iterating n times in loops. The codebase requires Go 1.25 or later, which fully supports this syntax. The suggested replacement with `for i := 0; i < N; i++` would be an unnecessary regression to older patterns. The code is correct as written.
Likely an incorrect or invalid review comment.
router/internal/graphqlmetrics/graphql_exporter_test.go (1)
73-109: The review comment is incorrect and should be disregarded.
Go 1.22 introduced support for ranging over integers, enabling the syntax `for i := range n` where `n` is an `int`. This project requires Go 1.25 (per `router/go.mod`), which fully supports this feature. The code at all specified line ranges is valid Go and compiles without error. The loops correctly iterate from 0 to `totalItems-1` as intended.
Likely an incorrect or invalid review comment.
router/core/router.go (2)
842-852: Prometheus schema usage enabling & logging behavior is coherent with deferred exporter creation.
The new block correctly guards schema-usage behavior on `PromSchemaFieldUsage.Enabled` and defers actual exporter creation to graph mux wiring, while logging `include_operation_sha` for visibility. This matches the design where each mux has its own metric store and keeps bootstrap light.
2324-2333: Telemetry → Prometheus schema_usage exporter mapping is consistent.
`MetricConfigFromTelemetry` now fully maps `cfg.Metrics.Prometheus.SchemaFieldUsage.{Enabled,IncludeOperationSha,Exporter.*}` into `rmetric.PrometheusSchemaFieldUsage`, matching the config structs in `router/pkg/config` and `router/pkg/metric/config.go`. This should preserve schema validation guarantees and remove the old sampling knob cleanly.
15-22: RouterMetrics wiring for GraphQL & Prometheus exporters is sound.
The interface and struct/config changes cleanly separate GraphQL and Prometheus exporters while keeping initialization centralized in `NewRouterMetrics`. Using `exportEnabled` for the GraphQL path and `prometheusMetricsExporter != nil` for `PrometheusTrackUsageInfo` is a straightforward way to gate the two sinks independently, and the new getters (`GQLMetricsExporter`, `PrometheusMetricsExporter`) preserve encapsulation.
Also applies to: 26-52, 59-68, 76-82
84-138: GraphQL schema usage export now robust to missing exporter and preserves cached structures.
The extra `gqlMetricsExporter == nil` guard ensures `ExportSchemaUsageInfo` is safe even if called directly from a `RouterMetrics` implementation. Reusing `GetTypeFieldUsageInfoMetrics` and the cached argument/input slices while cloning the operation name via `strings.Clone` respects the comments about not mutating shared state and should keep CPU/memory overhead low for unsampled exports.
This PR ensures that we never drop operation usage data due to sampling. Sampling is eliminated entirely while maintaining the same CPU efficiency; an increase in total allocated memory is expected. With this change, operations with low occurrence are always tracked.
I achieved this by making the gqlmetrics batch collector generic and using it for both implementations.
Fixes ENG-8486
CPU
Memory
Config
Summary by CodeRabbit
Release Notes
The `sample_rate` setting is replaced with a new `exporter` configuration structure that includes batch size, queue size, flush interval, and export timeout settings.
Checklist