
Conversation


@StarpTech StarpTech commented Nov 13, 2025

This PR ensures that we never drop operation usage data due to sampling. Sampling has been eliminated entirely while maintaining the same CPU utilization. An increase in total allocated memory is expected. With this change, operations with low occurrence are always tracked.

I achieved this by making the gqlmetrics batch collector generic and reusing it for both implementations.

Fixes ENG-8486

CPU

CleanShot 2025-11-13 at 23 36 01@2x

Memory

CleanShot 2025-11-13 at 23 38 28@2x

Config

telemetry:
  metrics:
    otlp:
      enabled: false
    prometheus:
      schema_usage:
        enabled: true
        exporter:
          batch_size: 4096
          export_timeout: 10s
          interval: 10s
          queue_size: 10240

Summary by CodeRabbit

Release Notes

  • Configuration Changes
    • Schema field usage metrics configuration for Prometheus has been updated: the sample_rate setting is replaced with a new exporter configuration structure that includes batch size, queue size, flush interval, and export timeout settings.


Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • I have read the Contributors Guide.


coderabbitai bot commented Nov 13, 2025

Walkthrough

This PR restructures Prometheus schema field usage metrics export from a sampling-based approach to a configurable batch exporter framework. It introduces a generic Exporter pattern, separates GraphQL and Prometheus metric exporters into dedicated implementations, removes per-operation sampling logic, and updates configuration/wiring across the router core and metrics pipeline.

Changes

Cohort / File(s) Summary
Generic Exporter Framework
router/internal/exporter/exporter.go, router/internal/exporter/sink.go, router/internal/exporter/exporter_bench_test.go
New generic batch exporter with configurable batching, queuing, retry semantics, and background flush loop; new Sink interface and SinkErrorHandler type for extensible export backends; benchmark tests for throughput and allocation analysis.
GraphQL Metrics Exporter
router/internal/graphqlmetrics/graphql_exporter.go, router/internal/graphqlmetrics/graphql_metrics_sink.go, router/internal/graphqlmetrics/graphql_exporter_test.go
New GraphQL-specific exporter wrapping generic Exporter; new sink implementing Connect RPC-based aggregation and retry logic; updated tests to use NewGraphQLMetricsExporter and validate aggregation-based publishing.
Prometheus Metrics Exporter
router/internal/graphqlmetrics/prometheus_exporter.go, router/internal/graphqlmetrics/prometheus_sink.go
New Prometheus-specific exporter wrapping generic Exporter; new sink implementing OpenTelemetry-based field usage metrics emission with aggregation per operation/field pair.
Old Exporter Removal
router/internal/graphqlmetrics/exporter.go
Removed GraphQL-specific Exporter type, all associated methods, retry/batching logic, and queue management; replaced with generic pattern.
Configuration Schema
router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/metric/config.go, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json
Replaced SampleRate field in PrometheusSchemaFieldUsage with nested Exporter configuration (BatchSize, QueueSize, Interval, ExportTimeout); updated JSON schema with exporter defaults and validation rules.
Test Environment
router-tests/testenv/testenv.go, router-tests/prometheus_improved_test.go
Introduced PrometheusSchemaFieldUsageExporter struct; updated configureRouter to construct and populate exporter settings; refactored test cases to use dynamic metric verification via fieldTypePairs map instead of index-based assertions; added time-based flush waits.
Router Core Wiring
router/core/graph_server.go, router/core/router.go, router/core/router_config.go
Added prometheusMetricsExporter field to graph mux; instantiate PrometheusMetricsExporter when PromSchemaFieldUsage is enabled; renamed GraphQL exporter constructor to NewGraphQLMetricsExporter; updated Shutdown to handle prometheus exporter lifecycle.
Operation Metrics
router/core/operation_metrics.go
Removed sampling logic (shouldSampleOperation, usageKey, SampleRate fields); replaced with boolean PrometheusTrackUsageInfo flag; simplified Finish to conditionally export via trackUsageInfo and prometheusTrackUsageInfo; removed per-field counting and SHA256 aggregation.
Router Metrics
router/core/router_metrics.go
Renamed GqlMetricsExporter to GQLMetricsExporter; added PrometheusMetricsExporter field and accessor; added ExportSchemaUsageInfoPrometheus method; refactored StartOperation to set PrometheusTrackUsageInfo flag; replaced unsafebytes with strings.Clone; updated ExportSchemaUsageInfo to use GetTypeFieldUsageInfoMetrics.
Context & Schema Usage
router/core/context.go, router/pkg/graphqlschemausage/schemausage.go, router/pkg/graphqlschemausage/schemausage_bench_test.go
Added GetTypeFieldUsageInfoMetrics accessor with lazy caching; optimized allocations via pre-allocation and exact capacity slicing; added comprehensive benchmark tests for schema usage operations.
Configuration Tests & Demo
router/pkg/config/config_test.go, router/demo.config.yaml
Added test groups for exporter defaults, custom exporter config from file/environment, and precedence validation; updated demo config with exporter interval settings and expanded events providers (NATS, Kafka, Redis).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Key attention areas:
    • Generic Exporter[T] pattern in router/internal/exporter/exporter.go — verify thread safety of queue/buffer management, retry backoff logic, and graceful shutdown semantics.
    • PrometheusMetricsExporter and PrometheusSink integration in router/internal/graphqlmetrics/prometheus_*.go — ensure aggregation logic correctly groups by operation/field pairs and metric emission is accurate.
    • Router wiring in router/core/graph_server.go and router/core/router_metrics.go — confirm lifecycle management (startup/shutdown) and delegation paths between GraphQL and Prometheus exporters.
    • Sampling removal in router/core/operation_metrics.go — verify all per-request metric export now flows through exporter-driven paths and no logic gaps exist from the sampling removal.
    • Test refactoring in router-tests/prometheus_improved_test.go — validate that dynamic metric verification via fieldTypePairs is equivalent to previous fixed assertions and covers all expected field/type combinations.

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.14% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly summarizes the main change: introducing a Prometheus sink for gqlmetrics and ensuring all operations are tracked efficiently without sampling.


github-actions bot commented Nov 13, 2025

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-6614c9681446c4fce85a3463dcdc1b5037b1a9e8

@github-actions

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-ced880bbbe6f9957e486f7deb9115c6b40f2a1bb-nonroot


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/core/graph_server.go (1)

758-775: Shutdown ordering: stop the exporter before closing the metric store.

Currently metricStore is shut down before prometheusMetricsExporter. If the exporter is still flushing, writes can fail. Shut down the exporter first, then the metric store.

Apply:

- if s.metricStore != nil {
-   if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
-     err = errors.Join(err, aErr)
-   }
- }
- ...
- if s.prometheusMetricsExporter != nil {
-   if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
-     err = errors.Join(err, aErr)
-   }
- }
+ if s.prometheusMetricsExporter != nil {
+   if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
+     err = errors.Join(err, aErr)
+   }
+ }
+ if s.metricStore != nil {
+   if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
+     err = errors.Join(err, aErr)
+   }
+ }
🧹 Nitpick comments (14)
router/internal/graphqlmetrics/graphql_metrics_sink.go (2)

29-36: Handle potential nil logger in NewGraphQLMetricsSink

cfg.Logger.With(...) will panic if Logger is ever nil (e.g. in tests or ad-hoc usage). Consider either guarding and falling back to zap.NewNop() or documenting/enforcing a non-nil logger requirement at call sites.


70-91: Consider tightening retry classification in IsRetryableError

Right now any non-connect.Error (including context cancellations or local/config errors) is treated as retryable. If such errors can reach this helper, consider explicitly short‑circuiting on context.Canceled / context.DeadlineExceeded or other known permanent failures in the caller to avoid pointless retries.

router/pkg/config/config.schema.json (1)

1249-1281: Exporter schema matches the new Prometheus usage exporter config

The schema_usage.exporter block (batch_size, queue_size, interval, export_timeout) and defaults are consistent and align with the new Go config type. If you want stricter validation, you could optionally add "additionalProperties": false to the exporter object to catch typos in config keys, but the current schema is functionally fine.
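A sketch of the suggested stricter schema; property names are taken from the summary above, while types and minimums are assumptions that may differ from the real config.schema.json:

```json
"exporter": {
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "batch_size":     { "type": "integer", "minimum": 1 },
    "queue_size":     { "type": "integer", "minimum": 1 },
    "interval":       { "type": "string" },
    "export_timeout": { "type": "string" }
  }
}
```

With "additionalProperties": false, a typo such as batch_sizes would fail validation instead of being silently ignored.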

router/internal/graphqlmetrics/exporter_test.go (1)

369-375: Avoid fixed sleeps in tests; poll until metrics observed.

Replace time.Sleep with polling (e.g., require.Eventually) or call Shutdown before assertions to reduce flakiness.

Example:

- time.Sleep(200 * time.Millisecond)
+ require.Eventually(t, func() bool {
+   _, err := promRegistry.Gather()
+   return err == nil && len(c.publishedAggregations) > 0
+ }, 2*time.Second, 50*time.Millisecond)

Also applies to: 371-375

router-tests/testenv/testenv.go (1)

279-290: Clamp exporter settings to safe minimums in tests.

If a caller supplies zeros, normalize to sane minimums (e.g., BatchSize>=1, QueueSize>=1, Interval>0) before wiring to router metric config to prevent silent stalls.

Also applies to: 1508-1552

router-tests/prometheus_improved_test.go (1)

82-96: Reduce flakiness: replace fixed sleeps with Eventually/polling.

Use require.Eventually to wait for metrics flush instead of fixed sleeps. Keeps tests stable on slow CI.

Also applies to: 166-171, 229-233, 269-274, 315-321, 359-365, 413-419

router/core/graph_server.go (1)

923-949: Defaults/validation pass-through for exporter settings.

When wiring settings from cfg.Exporter, ensure they are validated/clamped (BatchSize, QueueSize, Interval, ExportTimeout) before use to prevent zero/negative configs from stalling the exporter.

router/core/router.go (2)

841-851: Prometheus schema usage: consider logging exporter settings as well

The enablement log for Prometheus schema field usage is clear. Since batching/queue behavior is now configurable, consider also logging the effective exporter settings (batch size, queue size, interval, export timeout) here to simplify debugging misconfigurations.


2324-2332: Exporter settings mapping is correct; be aware of validation constraints

The mapping from cfg.Metrics.Prometheus.SchemaFieldUsage.Exporter into rmetric.PrometheusSchemaFieldUsageExporter is straightforward and correct. Note that the generic exporter’s validate() requires all these numeric fields to be strictly positive, so a YAML/env override of 0 or negative values will now fail exporter creation. If that’s not desired, you may want to clamp or normalize these values earlier in config handling.

router/core/router_metrics.go (1)

135-159: Prometheus export path is lightweight; consider extra guard and hash semantics

The Prometheus-specific SchemaUsageInfo payload deliberately includes only type field metrics, operation hash/name, and schema version, which is a good fit for local metric cardinality and memory use.

Two minor points:

  • To harden against future misuse, you could early-return if m.prometheusMetricsExporter is nil, even though current callers gate on PrometheusUsageInfoTrack.
  • This path uses operationContext.sha256Hash directly instead of HashString(). Please confirm that sha256Hash is always initialized and matches the intended “operation SHA” semantics for all operation types.
router/internal/graphqlmetrics/exporter.go (4)

78-111: NewExporter wiring is correct; consider guarding nil settings

Constructor behavior is good: it enforces a non-nil sink, installs a default retryability handler, validates settings, and starts the batching goroutine with a component-scoped logger.

One robustness nit: settings is dereferenced without a nil check when creating the queue and during validation. Since this is an internal API, it’s probably fine, but you might want to either:

  • Treat nil as “use NewDefaultExporterSettings()”, or
  • Explicitly return an error like fmt.Errorf("settings cannot be nil").

This avoids surprising panics if a future caller accidentally passes nil.


113-143: Validation enforces positive values; retries disabled still require fields

validate() enforces strictly positive values for batch/queue sizes, interval, export timeout, and all retry fields. That’s good for catching misconfigurations early.

Note that it requires retry fields to be positive even when RetryOptions.Enabled is false. If you later expose Enabled as a user-facing toggle, you may want to relax validation so that only enabled retries need fully-populated options.


179-282: Batch export and retry logic are reasonable; consider respecting shutdown context for backoff

exportBatch wraps the shared export context with a per-call timeout and delegates to the sink, logging failures at debug. exportBatchWithRetry adds bounded exponential backoff with:

  • Retryability decided by isRetryableError.
  • Max duration, interval, and max attempts from settings.
  • Clear logs on non-retryable errors and after exhausting retries.

One potential refinement: the retry loop currently sleeps with time.Sleep and doesn’t inspect any shutdown signal or context, so a stubbornly failing sink can delay shutdown up to the cumulative backoff time. If you ever need faster shutdowns under failure, you could periodically check e.exportRequestContext.Done() or a dedicated shutdown flag inside the retry loop to abort early.


344-381: Shutdown sequence is mostly robust; document single-use expectation

Shutdown:

  • Stops accepting traffic, signals the batching goroutine to drain and exit, and then polls inflightBatches until zero or context cancel.
  • Defers cancellation of export contexts and sink closure, with logs on errors and on completion.

Two small caveats:

  • Shutdown is not idempotent: calling it twice will panic on the second close of acceptTrafficSema/shutdownSignal. It would help to either document “must only be called once” or add a guard (e.g., sync.Once).
  • Synchronous Record(..., true) calls are not tracked in inflightBatches, so they won’t delay shutdown waiting; they may still race with sink.Close if invoked concurrently. That’s likely acceptable for metrics but worth keeping in mind.

Overall, the shutdown strategy is consistent with a best-effort metrics pipeline.
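The sync.Once guard suggested above can be sketched as follows (field names are simplified from the description):

```go
package main

import (
	"fmt"
	"sync"
)

// Exporter holds a shutdown signal channel; closing it twice would panic,
// so Shutdown is guarded with sync.Once to make repeated calls safe no-ops.
type Exporter struct {
	shutdownOnce   sync.Once
	shutdownSignal chan struct{}
}

func NewExporter() *Exporter {
	return &Exporter{shutdownSignal: make(chan struct{})}
}

func (e *Exporter) Shutdown() {
	e.shutdownOnce.Do(func() {
		close(e.shutdownSignal) // executes at most once
	})
}

func main() {
	e := NewExporter()
	e.Shutdown()
	e.Shutdown() // second call is a harmless no-op
	fmt.Println("no panic on double shutdown")
}
```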

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac085ab and bcddb08.

📒 Files selected for processing (17)
  • router-tests/prometheus_improved_test.go (16 hunks)
  • router-tests/testenv/testenv.go (3 hunks)
  • router/core/graph_server.go (4 hunks)
  • router/core/operation_metrics.go (3 hunks)
  • router/core/router.go (3 hunks)
  • router/core/router_config.go (1 hunks)
  • router/core/router_metrics.go (4 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_test.go (7 hunks)
  • router/internal/graphqlmetrics/graphql_exporter.go (1 hunks)
  • router/internal/graphqlmetrics/graphql_metrics_sink.go (1 hunks)
  • router/internal/graphqlmetrics/prometheus_exporter.go (1 hunks)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
  • router/internal/graphqlmetrics/sink.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/metric/config.go (2 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/internal/graphqlmetrics/prometheus_exporter.go
  • router/core/router_config.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router-tests/testenv/testenv.go
  • router-tests/prometheus_improved_test.go
  • router/internal/graphqlmetrics/exporter_test.go
  • router/core/operation_metrics.go
📚 Learning: 2025-09-29T10:28:07.122Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2192
File: router/pkg/config/config.go:1028-1029
Timestamp: 2025-09-29T10:28:07.122Z
Learning: The deprecation strategy for IntrospectionEnabled field in router/pkg/config/config.go is to first mark it as deprecated, then migrate all call sites to use IntrospectionConfig.Enabled, and finally remove the deprecated field. The envDefault tag should be kept during the deprecation period for backward compatibility.

Applied to files:

  • router-tests/testenv/testenv.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_config.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/core/router_config.go
  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_config.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/prometheus_improved_test.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-07-21T15:06:36.664Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2067
File: router/pkg/config/config.schema.json:1637-1644
Timestamp: 2025-07-21T15:06:36.664Z
Learning: In the Cosmo router project, when extending JSON schema validation for security-sensitive fields like JWKS secrets, backwards compatibility is maintained by implementing warnings in the Go code rather than hard validation constraints in the schema. This allows existing configurations to continue working while alerting users to potential security issues.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/operation_metrics.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router/core/operation_metrics.go
  • router/pkg/metric/config.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/graph_server.go
🧬 Code graph analysis (15)
router/internal/graphqlmetrics/graphql_exporter.go (2)
router/internal/graphqlmetrics/exporter.go (3)
  • Exporter (16-31)
  • ExporterSettings (50-61)
  • NewExporter (82-111)
router/internal/graphqlmetrics/graphql_metrics_sink.go (3)
  • NewGraphQLMetricsSink (30-36)
  • GraphQLMetricsSinkConfig (23-27)
  • IsRetryableError (72-91)
router/internal/graphqlmetrics/prometheus_exporter.go (3)
router/internal/graphqlmetrics/exporter.go (3)
  • Exporter (16-31)
  • ExporterSettings (50-61)
  • NewExporter (82-111)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/internal/graphqlmetrics/prometheus_sink.go (2)
  • NewPrometheusSink (41-47)
  • PrometheusSinkConfig (25-29)
router-tests/testenv/testenv.go (2)
router/pkg/config/config.go (2)
  • PrometheusSchemaFieldUsageExporter (121-126)
  • PrometheusSchemaFieldUsage (115-119)
router/pkg/metric/config.go (2)
  • PrometheusSchemaFieldUsageExporter (48-53)
  • PrometheusSchemaFieldUsage (42-46)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (28-28)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)
router/internal/graphqlmetrics/aggregate.go (1)
  • AggregateSchemaUsageInfoBatch (9-27)
router/core/router_config.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
router-tests/prometheus_improved_test.go (4)
router-tests/testenv/testenv.go (3)
  • Run (105-122)
  • PrometheusSchemaFieldUsageExporter (285-290)
  • Environment (1763-1799)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsageExporter (121-126)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/pkg/otel/attributes.go (2)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
router/internal/graphqlmetrics/exporter_test.go (1)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/core/operation_metrics.go (1)
router/core/router_metrics.go (1)
  • RouterMetrics (14-21)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
  • OperationInfo (362-373)
  • OperationInfo (388-388)
  • OperationInfo (403-405)
  • SchemaInfo (428-435)
  • SchemaInfo (450-450)
  • SchemaInfo (465-467)
router/pkg/metric/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsageExporter (121-126)
router/core/graph_server.go (4)
router/internal/graphqlmetrics/prometheus_exporter.go (2)
  • PrometheusMetricsExporter (13-15)
  • NewPrometheusMetricsExporter (19-45)
router/pkg/config/config.go (1)
  • Prometheus (99-113)
router/internal/graphqlmetrics/exporter.go (3)
  • ExporterSettings (50-61)
  • Exporter (16-31)
  • RetryOptions (33-38)
router/core/router_metrics.go (1)
  • NewRouterMetrics (43-52)
router/core/router.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/pkg/config/config.go (3)
  • Prometheus (99-113)
  • PrometheusSchemaFieldUsageExporter (121-126)
  • Metrics (137-142)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
🔇 Additional comments (19)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)

40-61: Export path and aggregation look correct

Batch empty-check, aggregation via AggregateSchemaUsageInfoBatch, auth header setup, and debug logging are all straightforward and concurrency-safe for this sink implementation. No functional issues from this diff.

router/core/router_config.go (1)

37-38: New exporter fields fit the updated metrics wiring

Adding gqlMetricsExporter and prometheusMetricsExporter to Config keeps the router’s metrics wiring explicit and type‑safe; no issues with this change.

router/internal/graphqlmetrics/sink.go (1)

7-23: Sink[T] abstraction is clear and compatible with the exporter design

The generic Sink[T] interface and SinkErrorHandler are minimal, well‑documented, and match the batch‑export semantics used by the generic exporter. No changes needed here.

router/pkg/metric/config.go (1)

42-53: Exporter field initialization verified at all construction sites

Verification confirms that all constructions of PrometheusSchemaFieldUsage with the Exporter field properly handle initialization:

  1. Production (router.go:2324): Reads all Exporter fields from validated config source
  2. Tests (testenv.go:1509-1530): Explicitly provides test-friendly defaults when enabled (BatchSize: 100, QueueSize: 1000, Interval: 100ms, ExportTimeout set)

No manual constructions with uninitialized Exporter fields exist in the codebase. The code is correct as-is.

router/internal/graphqlmetrics/prometheus_exporter.go (1)

31-41: LGTM on exporter wrapper.

Constructor and delegation are clear; disabling retries for local Prom sink is appropriate.

Also applies to: 47-58

router/internal/graphqlmetrics/graphql_exporter.go (1)

31-39: LGTM on GraphQL exporter wrapper.

API is minimal and mirrors the generic exporter cleanly.

Also applies to: 41-52

router-tests/prometheus_improved_test.go (1)

351-358: Review comment is incorrect; no changes required.

The repository targets Go 1.25 (confirmed across all go.mod files), which fully supports the for range n syntax introduced in Go 1.22. The codebase consistently uses this syntax in multiple test files (prometheus_improved_test.go, authentication_test.go, plugintest/tracing_test.go). No compatibility concern exists; the code is idiomatic for the current toolchain.

Likely an incorrect or invalid review comment.

router/internal/graphqlmetrics/prometheus_sink.go (1)

79-86: Remove this review comment; the Go version compatibility concern is invalid.

The codebase targets Go 1.25, and slices.Concat was introduced in Go 1.22, so the "newer Go" compatibility argument doesn't apply. The code is compatible with the project's Go version.

While the nil vs. empty slice suggestion ([]attribute.KeyValue{} → nil) is a minor stylistic improvement, the original review overstates the concern about Go compatibility and allocation efficiency in this context.

router/pkg/config/config.go (1)

115-126: No changes required; validation is already in place.

The PrometheusSchemaFieldUsageExporter fields are validated both at the JSON schema level and at runtime:

  • Runtime: The Exporter[T].validate() method (router/internal/graphqlmetrics/exporter.go:113–142) enforces BatchSize > 0, QueueSize > 0, Interval > 0, and ExportTimeout > 0.
  • Schema: The config.schema.json (lines 1252–1278) defines minimum: 1 for batch_size/queue_size and duration: {minimum: "1s"} for interval/export_timeout.

Both layers prevent misconfiguration, so no additional validation is needed.

router/core/router.go (1)

821-839: GraphQL metrics exporter wiring looks consistent

Using NewGraphQLMetricsExporter with NewDefaultExporterSettings() cleanly adapts to the new generic exporter API; error handling and assignment into r.gqlMetricsExporter remain correct.

router/core/router_metrics.go (3)

14-21: RouterMetrics wiring for dual exporters is sound

Extending RouterMetrics and routerMetrics to carry both the GraphQL and Prometheus exporters, plus an explicit exportEnabled flag for cloud usage, cleanly separates concerns. NewRouterMetrics correctly threads these through, and StartOperation’s PrometheusUsageInfoTrack: m.prometheusMetricsExporter != nil gives a cheap, stable switch for local Prometheus tracking.

Also applies to: 26-41, 44-51


57-67: StartOperation gating correctly enables Prometheus usage tracking

Passing TrackUsageInfo and PrometheusUsageInfoTrack into OperationMetricsOptions ensures:

  • Cloud GraphQL usage export is still controlled centrally via exportEnabled.
  • Prometheus schema usage export is enabled solely by the presence of a Prometheus exporter.
    This matches the “track all operations” goal without coupling the two sinks.

75-81: Exporter accessors are minimal and sufficient

GqlMetricsExporter() and PrometheusMetricsExporter() simply expose the underlying exporters, matching the new Prometheus path needed by the graph mux. No additional state is leaked.

router/core/operation_metrics.go (2)

32-40: OperationMetrics option plumbing is consistent

Adding trackUsageInfo and prometheusUsageInfoTrack to OperationMetrics and threading them via OperationMetricsOptions keeps the decision about which sinks to hit local to the operation instance. The constructor correctly snapshots these booleans at operation start, avoiding any mid-request config races.

Also applies to: 91-100, 105-117


77-88: Dual export in Finish aligns with tracking goals

The new export block:

  • Respects SkipLoader and reqContext.operation == nil as before.
  • Uses trackUsageInfo to guard GraphQL (cloud) schema usage export.
  • Uses prometheusUsageInfoTrack to independently guard the Prometheus schema usage path.

This achieves “always track operations” for Prometheus while preserving existing cloud usage behavior.

router/internal/graphqlmetrics/exporter.go (4)

13-31: Generic exporter structure and defaults look solid

The generic Exporter[T] plus ExporterSettings and NewDefaultExporterSettings() give a reusable, thread-safe batching primitive. The fields (queue, inflightBatches, exportRequestContext) and defaults (batch size, queue size, interval, timeouts, retry) are reasonable and match the use cases for GraphQL and Prometheus sinks.

Also applies to: 50-76


145-177: Traffic gating and queue backpressure are correctly implemented

acceptTraffic() uses a closed channel to signal shutdown, and Record:

  • Sends synchronously via exportBatch when requested.
  • Uses a non-blocking send to a bounded queue otherwise, logging and dropping when full.

This prevents producer goroutines from blocking indefinitely on shutdown or overload. The synchronous path intentionally ignores the error (just logs inside exportBatch), which matches a “best-effort metrics” philosophy.
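The non-blocking enqueue path can be sketched with a `select`/`default` send on a bounded channel (a generic illustration, not the exporter's exact code):

```go
package main

import "fmt"

// tryEnqueue performs a non-blocking send to a bounded queue. It
// returns false when the queue is full, so the caller can log and
// drop instead of blocking the producer goroutine.
func tryEnqueue[T any](queue chan T, item T) bool {
	select {
	case queue <- item:
		return true
	default:
		return false // queue full: drop
	}
}

func main() {
	q := make(chan int, 2)
	fmt.Println(tryEnqueue(q, 1)) // true
	fmt.Println(tryEnqueue(q, 2)) // true
	fmt.Println(tryEnqueue(q, 3)) // false: bounded queue is full
}
```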


285-319: Batching loop and shutdown signal handling look correct

The start loop:

  • Maintains a preallocated buffer sized to BatchSize.
  • Flushes on timer ticks and when the buffer fills.
  • On shutdown, calls drainQueue(buffer) and exits cleanly.

This ensures no items are lost between tick and shutdown, and keeps the batching goroutine single-responsibility.


321-342: drainQueue correctly flushes remaining items during shutdown

drainQueue empties the queue into the buffer, sending full batches as they accumulate and finally sending any partial batch, with helpful debug logging. Given acceptTrafficSema is closed before shutdownSignal, no new async items should be enqueued during draining, so this is race-free.
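The drain-on-shutdown pattern described above can be sketched like this (simplified: the real implementation also handles timers and export errors):

```go
package main

import "fmt"

// drainQueue empties the queue into fixed-size batches, flushing full
// batches as they accumulate and a final partial batch at the end.
// It assumes no new items are being enqueued concurrently.
func drainQueue[T any](queue chan T, batchSize int, flush func([]T)) {
	buffer := make([]T, 0, batchSize)
	for {
		select {
		case item := <-queue:
			buffer = append(buffer, item)
			if len(buffer) == batchSize {
				flush(buffer)
				buffer = buffer[:0]
			}
		default:
			// Queue is empty: flush any remaining partial batch.
			if len(buffer) > 0 {
				flush(buffer)
			}
			return
		}
	}
}

func main() {
	q := make(chan int, 10)
	for i := 0; i < 5; i++ {
		q <- i
	}
	var batches [][]int
	drainQueue(q, 2, func(b []int) {
		cp := append([]int(nil), b...) // copy: buffer is reused
		batches = append(batches, cp)
	})
	fmt.Println(len(batches)) // 3 batches: [0 1], [2 3], [4]
}
```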

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
router/pkg/config/config_test.go (1)

764-787: LGTM! Precedence test correctly validates file config priority.

The test correctly verifies that file-specified exporter values override environment variables, using clearly distinguishable values (1024 vs 9999, 5s vs 99s) to prove the behavior.

Optional: Consider testing the precedence behavior for all four exporter fields (BatchSize, QueueSize, Interval, ExportTimeout) for completeness, though testing two fields is sufficient to prove the precedence mechanism works.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcddb08 and 6f21eb9.

📒 Files selected for processing (1)
  • router/pkg/config/config_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/config_test.go
🧬 Code graph analysis (1)
router/pkg/config/config_test.go (1)
router/pkg/config/config.go (5)
  • LoadConfig (1120-1232)
  • Config (1014-1089)
  • Telemetry (157-163)
  • Metrics (137-142)
  • Prometheus (99-113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
router/pkg/config/config_test.go (3)

689-707: LGTM! Well-structured test for exporter defaults.

The test correctly verifies that exporter configuration falls back to sensible defaults (8192/16384 for batch/queue sizes, 10s for interval/timeout) when only enabled: true is specified.


709-736: LGTM! Comprehensive test for custom file configuration.

The test correctly verifies that all exporter fields (batch_size, queue_size, interval, export_timeout) can be customized via YAML configuration.


738-762: LGTM! Proper environment variable loading test.

The test correctly verifies that exporter configuration can be loaded from environment variables using the PROMETHEUS_SCHEMA_FIELD_USAGE_EXPORTER_* prefix pattern.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0e9d36 and 3fd209b.

📒 Files selected for processing (2)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_defaults.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (4)
router/pkg/config/testdata/config_full.json (2)

93-102: Schema change is properly tested via fixture-based tests.

The config_full.json file is exercised by config parsing tests in router/pkg/config/config_test.go at lines 320 and 1171. These tests use goldie for fixture-based JSON comparison: they load the YAML configuration and validate the parsed output matches the config_full.json fixture exactly. Since the fixture already contains the new Exporter schema with BatchSize, QueueSize, Interval, and ExportTimeout fields, the goldie tests will automatically catch any schema mismatches. If tests are passing, the schema change is validated.


96-101: Configuration field names correctly match Go struct definition.

Verification confirms the Exporter configuration in config_full.json (lines 96-101) is properly structured:

  • JSON field names (BatchSize, QueueSize, Interval, ExportTimeout) match Go struct field names in router/pkg/config/config.go:121-126
  • Field values are correct (integers for int fields, nanoseconds for time.Duration fields)
  • Both config_full.json and config_defaults.json consistently reflect the schema change
  • Tests and production code in router/core/router.go:2327-2330 correctly use these fields
router/pkg/config/testdata/config_defaults.json (2)

63-72: Structural change from SampleRate to Exporter is sound.

The replacement of the probabilistic SampleRate sampling approach with a configurable Exporter object containing BatchSize, QueueSize, Interval, and ExportTimeout properly reflects the PR objective to track all operations deterministically using batching instead of sampling.


66-71: Testdata values are correct and consistent with config schema defaults.

Verification confirms the Exporter values in the testdata file align with the actual configuration schema. The router/pkg/config/config.go struct defines PrometheusSchemaFieldUsageExporter with envDefault tags that match exactly:

  • BatchSize: envDefault:"8192" (testdata: 8192) ✓
  • QueueSize: envDefault:"16384" (testdata: 16384) ✓
  • Interval: envDefault:"10s" (testdata: 10000000000 ns = 10s) ✓
  • ExportTimeout: envDefault:"10s" (testdata: 10000000000 ns = 10s) ✓

The testdata correctly reflects the source-of-truth defaults. Any discrepancy with the PR description likely reflects earlier proposal values that were finalized to these values in the implementation.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)

81-86: Consider passing nil instead of an empty slice for the unused parameter.

Since attributes are being passed via WithAttributeSet option, the sliceAttr parameter is unused. Passing nil instead of []attribute.KeyValue{} would be more idiomatic and avoid an unnecessary empty slice allocation.

Apply this diff:

 		s.metricStore.MeasureSchemaFieldUsage(
 			ctx,
 			int64(count),
-			[]attribute.KeyValue{},
+			nil,
 			otelmetric.WithAttributeSet(attribute.NewSet(allAttrs...)),
 		)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fd209b and 0200c3c.

📒 Files selected for processing (1)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (28-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)

127-130: No change needed — parent type and field-name extraction are correct.

TypeFieldUsageInfo.TypeNames is populated from ParentTypeNames such that index 0 is the immediate parent type, and Path contains the path segments with the field name as the last element. Verified in router/pkg/graphqlschemausage/schemausage.go (ParentTypeNames -> TypeNames), the protobuf comment (graphqlmetrics/gen/proto/.../graphqlmetrics.pb.go), and tests (router/internal/graphqlmetrics/exporter_test.go, aggregate_test.go) showing examples like Path ["user","name"] with TypeNames ["User","String"].
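Based on that convention (TypeNames[0] is the immediate parent type; the last Path element is the field name), the extraction can be sketched as below — struct and helper names are illustrative:

```go
package main

import "fmt"

// fieldUsage mirrors the described shape of TypeFieldUsageInfo.
type fieldUsage struct {
	Path      []string
	TypeNames []string
}

// parentAndField extracts the (parentType, fieldName) pair, guarding
// against empty slices.
func parentAndField(u fieldUsage) (parent, field string) {
	if len(u.TypeNames) > 0 {
		parent = u.TypeNames[0]
	}
	if len(u.Path) > 0 {
		field = u.Path[len(u.Path)-1]
	}
	return parent, field
}

func main() {
	u := fieldUsage{
		Path:      []string{"user", "name"},
		TypeNames: []string{"User", "String"},
	}
	p, f := parentAndField(u)
	fmt.Println(p, f) // User name
}
```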

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
router/pkg/config/config.go (1)

116-119: Back-compat: removal of sample_rate needs migration story or explicit deprecation notice.

Dropping SampleRate in favor of Exporter will break existing configs. Either:

  • accept the breaking change and document a clear migration; or
  • add a temporary compatibility path that maps sample_rate to exporter settings.

Also ensure any references to SampleRate are removed across code, tests, docs. Reusing the schema’s deprecation mechanism can help guide users.

#!/bin/bash
# Find lingering references (code, tests, docs)
rg -n -C2 -S -i '\bsample_rate\b|SampleRate|schema_usage\.sample_rate|PROMETHEUS_SCHEMA_FIELD_USAGE_SAMPLE_RATE'
🧹 Nitpick comments (2)
router/pkg/config/config.go (1)

121-126: Type naming duplication and defaults: consider renaming and confirm queue_size default.

  • PrometheusSchemaFieldUsageExporter is also declared in router/pkg/metric/config.go. Keeping the same name in two packages is legal but confusing. Consider renaming this one to PrometheusSchemaFieldUsageExporterConfig or PrometheusSchemaFieldUsageSettings to reduce ambiguity.
  • Default queue_size here is 12800, while the PR example shows 10240. Please reconcile or update examples to avoid drift.
router/pkg/config/config.schema.json (1)

1249-1281: Harden exporter schema: disallow unknown keys and (optionally) bound sizes.

Add additionalProperties: false under schema_usage.exporter to catch typos (e.g., “batchsize”). Optionally add conservative maximums to bound memory use.

-                    "exporter": {
-                      "type": "object",
-                      "description": "Configuration for the schema usage exporter",
+                    "exporter": {
+                      "type": "object",
+                      "description": "Configuration for the schema usage exporter",
+                      "additionalProperties": false,
                       "properties": {
                         "batch_size": {
                           "type": "integer",
                           "default": 4096,
                           "minimum": 1,
+                          "maximum": 1048576,
                           "description": "The maximum number of schema usage items to be applied in a single batch. The default value is 4096."
                         },
                         "queue_size": {
                           "type": "integer",
                           "default": 12800,
                           "minimum": 1,
+                          "maximum": 10485760,
                           "description": "The maximum number of schema usage items allowed in queue at a given time. The default value is 12800."
                         },
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0200c3c and 45c510b.

📒 Files selected for processing (5)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/config_test.go (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/pkg/config/testdata/config_defaults.json
  • router/pkg/config/config_test.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_full.json
  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
🧬 Code graph analysis (1)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/pkg/config/testdata/config_full.json (1)

96-101: Fixture update aligns with new exporter config.

Values and duration encodings look correct. LGTM.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45c510b and 4b706fc.

📒 Files selected for processing (2)
  • router/core/context.go (2 hunks)
  • router/core/router_metrics.go (5 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/context.go
🧬 Code graph analysis (2)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
  • OperationInfo (362-373)
  • OperationInfo (388-388)
  • OperationInfo (403-405)
  • SchemaInfo (428-435)
  • SchemaInfo (450-450)
  • SchemaInfo (465-467)
router/core/context.go (2)
router/pkg/graphqlschemausage/schemausage.go (2)
  • TypeFieldMetrics (28-28)
  • TypeFieldUsageInfo (40-48)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
  • TypeFieldUsageInfo (476-493)
  • TypeFieldUsageInfo (508-508)
  • TypeFieldUsageInfo (523-525)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (8)
router/core/context.go (2)

546-549: Context for new caching field.

The addition of typeFieldUsageInfoMetrics as a cached conversion result aligns with the PR's goal to efficiently track all operations without sampling. This avoids repeated allocations when multiple exporters need the same data.


601-608: The review comment is incorrect and should be disregarded.

The premise of a race condition is based on a misunderstanding of the codebase. Here's why:

  1. operationContext is per-request: In graphql_prehandler.go line 229, a new operationContext instance is created fresh for each HTTP request: requestContext.operation = &operationContext{clientInfo: clientInfo}. This is a standard per-request context pattern.

  2. No concurrent access to the same instance: Since each request gets its own operationContext instance, different requests cannot race on the same instance's fields. The lazy initialization on lines 604-605 is only accessed by the single request handler processing that specific request.

  3. Confusion between shared data and shared context: The comment in router_metrics.go:100-102 refers to the cached plan data (typeFieldUsageInfo) being reused across requests via the planner cache. This is separate from the operationContext wrapper itself, which is always fresh per request. The two concepts are distinct.

Likely an incorrect or invalid review comment.

router/core/router_metrics.go (6)

17-19: LGTM: Clean interface extension for Prometheus support.

The interface changes clearly separate GraphQL and Prometheus export paths while maintaining backward compatibility with the updated GraphQLMetricsExporter type.


26-31: LGTM: Struct fields updated to support dual export paths.

The addition of prometheusMetricsExporter alongside the updated gqlMetricsExporter type enables the router to export metrics to both GraphQL and Prometheus sinks.


43-52: LGTM: Constructor properly initializes new fields.

All new fields are correctly initialized from the config.


57-69: LGTM: Prometheus tracking enabled conditionally.

Line 64 correctly sets PrometheusUsageInfoTrack based on exporter presence, ensuring tracking overhead is only incurred when the Prometheus exporter is configured.


75-81: LGTM: Accessor methods updated.

The accessor methods correctly return the new exporter types.


83-133: LGTM: Efficient use of cached metrics conversion.

Line 109 now uses GetTypeFieldUsageInfoMetrics() which avoids repeated allocations by caching the converted metrics data. This optimization aligns with the PR's goal to track all operations efficiently.

Note: See the comment on context.go lines 601-608 regarding potential race conditions in the caching implementation.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/pkg/config/config.schema.json (1)

1236-1282: Lock down schema_usage to reject legacy SampleRate and unknown keys

Currently additionalProperties is not set on schema_usage, so legacy sample_rate may be silently accepted and ignored. Tighten it and keep exporter properties as-is.

-                "schema_usage": {
-                  "type": "object",
-                  "description": "Configure schema field usage metrics for Prometheus",
-                  "properties": {
+                "schema_usage": {
+                  "type": "object",
+                  "description": "Configure schema field usage metrics for Prometheus",
+                  "additionalProperties": false,
+                  "properties": {
                     "enabled": {

Optionally, add a custom pre-validation check to emit a clearer error if sample_rate is present.

♻️ Duplicate comments (1)
router/core/router_metrics.go (1)

19-19: Consider using GQLMetricsExporter for proper Go initialism formatting.

As noted in a previous review, Go's style guide recommends using all caps for initialisms like "GQL" rather than "Gql".

🧹 Nitpick comments (10)
router/internal/graphqlmetrics/exporter.go (2)

52-63: Clarify QueueSize comment

The queue holds items (T), not batches. Update the comment to avoid confusion.


382-416: Shutdown sequencing could prolong termination

Requests are canceled only in defer, and backoff sleeps are not cancelable. The diff above cancels earlier and makes sleeps cancelable to avoid long waits on shutdown.

router/pkg/config/config.schema.json (3)

1179-1185: Default mismatch: listen_addr

Schema default is "localhost:8088" but code (config.go Line 102) uses "127.0.0.1:8088" and defaults files do too. Align schema default.

-                  "default": "localhost:8088"
+                  "default": "127.0.0.1:8088"

462-466: Default + description mismatch: websocket.forward_upgrade_query_params.enabled

Schema sets default false, description says default true, and code default is true. Align to true.

-              "default": false,
-              "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."
+              "default": true,
+              "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."

633-637: Default mismatch: access_logs.add_stacktrace

Schema default true; code default is false (config.go Line 883). Align to false.

-          "default": true
+          "default": false
router/core/router_metrics.go (2)

140-170: Document the hash field choice.

The nil check at lines 141-143 properly addresses the defensive programming concern from the previous review. However, line 159 uses operationContext.sha256Hash (original operation hash) while the GraphQL export path uses operationContext.HashString() (normalized hash with variables).

According to the previous review analysis, this difference is intentional: Prometheus uses the original hash for operation correlation. Please add a brief comment at line 159 explaining this choice to prevent future confusion.

Apply this diff to document the hash semantics:

 	item := &graphqlmetricsv1.SchemaUsageInfo{
 		TypeFieldMetrics: operationContext.GetTypeFieldUsageInfoMetrics(),
 		OperationInfo: &graphqlmetricsv1.OperationInfo{
 			Type: opType,
-			Hash: operationContext.sha256Hash,
+			Hash: operationContext.sha256Hash, // Original operation hash for Prometheus correlation (vs normalized hash used in GraphQL export)
 			// parsed operation names are re-used across requests
 			// for that reason, we need to copy the name, or it might get corrupted
 			Name: strings.Clone(operationContext.name),
 		},

145-153: Consider extracting the repeated operation type mapping.

The operation type switch (lines 145-153) is duplicated from the GraphQL export method (lines 93-101). Extracting this into a helper function would improve maintainability.

Example refactor:

func operationTypeToProto(opType OperationType) graphqlmetricsv1.OperationType {
	switch opType {
	case OperationTypeQuery:
		return graphqlmetricsv1.OperationType_QUERY
	case OperationTypeMutation:
		return graphqlmetricsv1.OperationType_MUTATION
	case OperationTypeSubscription:
		return graphqlmetricsv1.OperationType_SUBSCRIPTION
	default:
		return graphqlmetricsv1.OperationType_QUERY
	}
}

Then use opType := operationTypeToProto(operationContext.opType) in both export methods.

router/pkg/graphqlschemausage/schemausage.go (3)

64-104: Type field visitor preallocation and path handling look correct

Preallocating typeFieldUsageInfo with a small fixed capacity and switching to pathCopy := make([]string, len(path)+1) plus copy avoids the extra allocation of slices.Clone(append(...)) while still guaranteeing that:

  • path == nil is handled safely (len(path) is 0, copy is no-op).
  • Each pathCopy is independent and only shared intentionally between the direct field entry and its IndirectInterfaceField variant.
  • Recursion passes pathCopy down and always re-allocates for children, so parent paths are never mutated.

The constants (capacity 32 and per-field path allocations) are modest and appropriate for typical query shapes; if future workloads show much larger field counts, benchmarks in this package can guide tuning, but there’s no correctness or obvious perf risk here.
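The copy-per-child pattern described above can be sketched as a standalone helper; `childPath` is a hypothetical name, but the allocation shape mirrors the reviewed code.

```go
package main

import "fmt"

// childPath returns an independent copy of path with next appended,
// mirroring the pathCopy := make([]string, len(path)+1); copy(...) pattern:
// the parent slice is never aliased, so recursion cannot corrupt it,
// and a nil parent path is handled safely (copy is a no-op).
func childPath(path []string, next string) []string {
	pathCopy := make([]string, len(path)+1)
	copy(pathCopy, path)
	pathCopy[len(path)] = next
	return pathCopy
}

func main() {
	parent := []string{"Query", "user"}
	a := childPath(parent, "name")
	b := childPath(parent, "email")
	a[0] = "Mutated" // must not affect the parent or a sibling copy
	fmt.Println(parent[0], b[0])
}
```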


107-131: Argument usage slice preallocation is safe and aligns with usage pattern

Initializing visitor.usage with make([]*graphqlmetrics.ArgumentUsageInfo, 0, 16) is a straightforward allocation win for common cases and doesn’t change behavior; all appends still produce the same sequence of entries as before, and the walker lifecycle remains unchanged.

If production traces show higher argument counts per operation, you can revisit the 16 heuristic, but as-is this is a low-risk, localized improvement.


159-244: Input usage preallocation and enum handling changes preserve semantics

The new preallocations in input usage look good:

  • usage: make([]*graphqlmetrics.InputUsageInfo, 0, 16) in GetInputUsageInfo matches the typical small number of variables and reduces slice growth.
  • In traverseVariable, setting usageInfo.Path = []string{parentTypeName, fieldName} when parentTypeName != "" allocates an exact-length slice and is equivalent to the previous append-based construction, while avoiding extra capacity and potential future aliasing.
  • For enum arrays, usageInfo.EnumValues = make([]string, len(arr)) followed by index assignment is correct and avoids repeated append calls.

These changes don’t alter how appendUniqueUsage and infoEquals behave, so deduplication semantics remain the same while allocations are slightly reduced.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86480f7 and bb6d441.

📒 Files selected for processing (9)
  • router/core/router_metrics.go (5 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_bench_test.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/graphqlschemausage/schemausage.go (6 hunks)
  • router/pkg/graphqlschemausage/schemausage_bench_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_defaults.json
  • router/pkg/config/config.schema.json
  • router/core/router_metrics.go
  • router/pkg/config/testdata/config_full.json
  • router/pkg/graphqlschemausage/schemausage.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
  • router/core/router_metrics.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function at lines 571-578, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the computed query hash before any APQ operations occur. There's also a test case that verifies this behavior.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router/core/router_metrics.go
  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the query body before any APQ operations occur.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-12T13:50:45.964Z
Learnt from: Noroth
Repo: wundergraph/cosmo PR: 2132
File: router-plugin/plugin.go:139-146
Timestamp: 2025-08-12T13:50:45.964Z
Learning: In the Cosmo router plugin system, the plugin framework creates its own logger independently. When creating a logger in NewRouterPlugin, it's only needed for the gRPC server setup (passed via setup.GrpcServerInitOpts.Logger) and doesn't need to be assigned back to serveConfig.Logger.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/testdata/config_full.json
🧬 Code graph analysis (6)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
router/pkg/config/config.go (2)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (17-33)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/core/router_metrics.go (5)
router/core/operation_metrics.go (1)
  • OperationMetrics (31-40)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (10)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
router/internal/graphqlmetrics/exporter_bench_test.go (1)
router/internal/graphqlmetrics/exporter.go (3)
  • ExporterSettings (52-63)
  • RetryOptions (35-40)
  • NewExporter (84-120)
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
router/pkg/graphqlschemausage/schemausage_test.go (1)
  • FakeFactory (527-529)
router/pkg/graphqlschemausage/schemausage.go (4)
  • GetTypeFieldUsageInfo (14-23)
  • GetArgumentUsageInfo (107-124)
  • GetInputUsageInfo (159-171)
  • TypeFieldMetrics (26-26)
router/pkg/graphqlschemausage/schemausage.go (1)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (9)
  • TypeFieldUsageInfo (476-493)
  • TypeFieldUsageInfo (508-508)
  • TypeFieldUsageInfo (523-525)
  • ArgumentUsageInfo (569-582)
  • ArgumentUsageInfo (597-597)
  • ArgumentUsageInfo (612-614)
  • InputUsageInfo (644-659)
  • InputUsageInfo (674-674)
  • InputUsageInfo (689-691)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: build_test
🔇 Additional comments (10)
router/internal/graphqlmetrics/exporter_bench_test.go (1)

27-66: Benchmarks cover key hot paths; LGTM

Good coverage of buffer pooling, high-throughput, batch cycling, growth, and parallel Record. Minor: time.Sleep in benchmarks can add variance, but acceptable here.

Also applies to: 68-111, 112-156, 157-197, 198-242

router/pkg/config/testdata/config_defaults.json (1)

66-71: Defaults alignment looks good

Exporter defaults match code/env defaults (4096/12800/1s/10s). LGTM.

router/pkg/config/testdata/config_full.json (1)

96-101: Config shape migration verified

Exporter object replaces SampleRate as intended; values consistent with defaults/examples. LGTM.

router/pkg/config/config.go (1)

116-119: Separate config and runtime structs are intentional

The two PrometheusSchemaFieldUsageExporter definitions serve distinct purposes and should not be unified:

  • router/pkg/config/config.go:121 is a configuration struct with YAML and environment variable tags for unmarshaling external configuration with defaults
  • router/pkg/metric/config.go:48 is an internal runtime struct used by the metrics package to track exporter state, with no serialization concerns

This separation of concerns is intentional and appropriate. Consolidating them would couple the metric package to configuration unmarshaling logic and add unnecessary struct tags where they don't belong. The apparent duplication is justified.

Likely an incorrect or invalid review comment.

router/core/router_metrics.go (4)

3-5: Good use of standard library.

The strings package import supports the replacement of the custom strCopy implementation with strings.Clone(), which is a cleaner approach.


24-53: LGTM!

The struct fields and constructor are updated consistently. The new prometheusMetricsExporter field is properly initialized in NewRouterMetrics.


76-82: LGTM!

The getter methods correctly expose both exporters through the RouterMetrics interface.


84-138: LGTM!

The method improvements are well-executed:

  • Added defensive nil check for the exporter (lines 89-91)
  • Replaced custom strCopy with standard strings.Clone (line 122)
  • Updated to use GetTypeFieldUsageInfoMetrics() for cleaner data access
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)

106-194: Benchmarks are well-structured and focus on the right hot paths

Each benchmark:

  • Calls setupBenchmark once, then uses b.ResetTimer() and b.ReportAllocs() to keep setup out of the measured section.
  • Exercises a single concern per benchmark (GetTypeFieldUsageInfo, GetArgumentUsageInfo, GetInputUsageInfo, IntoGraphQLMetrics, and the combined end-to-end flow), with errors surfaced via b.Fatal.
  • Prevents compiler optimizations by assigning results to a blank variable.

The end-to-end benchmark intentionally reuses the precomputed plan, docs, and variables to focus on schema usage extraction rather than planning, which matches the PR’s perf focus. This all looks good and should give you stable allocation/CPU signals for regression tracking.
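The benchmark structure described above (setup outside the timed section, `ResetTimer`/`ReportAllocs`, and a sink assignment to defeat dead-code elimination) can be sketched as follows. `extract` is a stand-in for a schema-usage extraction call, and `testing.Benchmark` is used here only so the sketch runs as a plain program.

```go
package main

import (
	"fmt"
	"testing"
)

// extract is a stand-in for a schema-usage extraction call.
func extract(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	var sink []int // assign results here so the compiler cannot elide the call
	result := testing.Benchmark(func(b *testing.B) {
		// Setup would happen here, once, before the timer starts.
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			sink = extract(32)
		}
	})
	_ = sink
	fmt.Println(result.N > 0)
}
```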


19-104: No issues found; original concern is unfounded.

The pattern of calling astjson.ParseBytes(op.Input.Variables) is safe and well-established. The Variables field originates as json.RawMessage (a []byte type with omitempty tag), which defaults to an empty slice rather than nil when unmarshaled. This empty slice is valid JSON input for astjson.ParseBytes. The identical pattern already exists and is tested in schemausage_test.go (line 206) with successful results, confirming the code path is sound.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/graphqlmetrics/exporter_test.go (1)

70-71: Minor: prefer require.NoError for errors

Switch to require.NoError for clarity. Applies to all similar occurrences.

- require.Nil(t, err)
+ require.NoError(t, err)

Also applies to: 149-150, 188-189, 222-223, 345-346, 421-422

♻️ Duplicate comments (3)
router/internal/graphqlmetrics/exporter.go (3)

56-57: Doc nit: QueueSize counts items, not batches

The queue holds items (chan T), not batches. Update comment for accuracy.

- // QueueSize is the maximum number of batches allowed in queue at a given time.
+ // QueueSize is the maximum number of items allowed in the queue at a given time.

272-290: Backoff sleep not cancelable; goroutines may linger on shutdown

Make backoff waits respect e.exportRequestContext so Shutdown can exit promptly.

-    // Wait for the specified backoff period
-    sleepDuration := b.Duration()
-    e.logger.Debug("Retrying export after backoff",
+    // Wait for the specified backoff period (cancelable)
+    sleepDuration := b.Duration()
+    e.logger.Debug("Retrying export after backoff",
       zap.Int("batch_size", len(batch)),
       zap.Int("retry", retry),
       zap.Duration("sleep", sleepDuration),
     )
-
-    time.Sleep(sleepDuration)
+    select {
+    case <-e.exportRequestContext.Done():
+      e.logger.Debug("Export cancelled during backoff")
+      return
+    case <-time.After(sleepDuration):
+    }

381-394: Shutdown can hang; cancel exports earlier

Cancel export requests before waiting, not in defer, so sleeping/blocked exports exit promptly.

 func (e *Exporter[T]) Shutdown(ctx context.Context) error {
   e.logger.Debug("Shutdown started")
 
   ticker := time.NewTicker(time.Millisecond * 100)
-  defer func() {
-    ticker.Stop()
-    // Cancel all export requests
-    e.cancelAllExportRequests()
-    // Close the sink
-    if err := e.sink.Close(ctx); err != nil {
-      e.logger.Error("Error closing sink", zap.Error(err))
-    }
-    e.logger.Debug("Shutdown complete")
-  }()
+  defer func() {
+    ticker.Stop()
+    // Close the sink
+    if err := e.sink.Close(ctx); err != nil {
+      e.logger.Error("Error closing sink", zap.Error(err))
+    }
+    e.logger.Debug("Shutdown complete")
+  }()
 
   // First close the acceptTrafficSema to stop accepting new items
   close(e.acceptTrafficSema)
   // Then trigger the shutdown signal for the exporter goroutine to stop
   // It will drain the queue and send the remaining items
   close(e.shutdownSignal)
+  // Cancel all export requests to unblock backoff/sinks started before shutdown
+  e.cancelAllExportRequests()
🧹 Nitpick comments (5)
router/internal/graphqlmetrics/prometheus_sink.go (1)

104-106: Use int64 for counts to match metric API and avoid unnecessary casts

Store counts as int64; remove per-iteration int conversions.

-func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int {
-	aggregatedCounts := make(map[aggregatedUsageKey]int)
+func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int64 {
+	aggregatedCounts := make(map[aggregatedUsageKey]int64)
@@
-			// Increment count, using field.Count if available, otherwise 1
-			if field.Count > 0 {
-				aggregatedCounts[key] += int(field.Count)
-			} else {
-				aggregatedCounts[key]++
-			}
+			// Increment count, using field.Count if available, otherwise 1
+			if field.Count > 0 {
+				aggregatedCounts[key] += int64(field.Count)
+			} else {
+				aggregatedCounts[key] += 1
+			}

And in Export:

-	for key, count := range aggregatedCounts {
+	for key, count := range aggregatedCounts {
@@
-			int64(count),
+			count,

Also applies to: 140-145, 81-87

router/demo.config.yaml (1)

16-25: LGTM: concise demo of Prometheus schema_usage

Sample config is coherent. Optional: add a comment noting include_operation_sha increases label cardinality.

router/internal/graphqlmetrics/exporter_test.go (1)

385-391: Reduce flakiness: prefer Eventually over fixed Sleep

Replace fixed sleep with Eventually to wait for flush deterministically.

- time.Sleep(200 * time.Millisecond)
- 
- defer require.Nil(t, e.Shutdown(context.Background()))
- 
- require.Equal(t, 1, len(c.publishedAggregations))
- require.Equal(t, 5, len(c.publishedAggregations[0]))
+ require.Eventually(t, func() bool {
+   c.mu.Lock()
+   defer c.mu.Unlock()
+   return len(c.publishedAggregations) == 1 && len(c.publishedAggregations[0]) == 5
+ }, 2*time.Second, 50*time.Millisecond)
+ defer require.NoError(t, e.Shutdown(context.Background()))
router/pkg/config/config.schema.json (1)

1249-1281: Tighten exporter schema and fix unit doc

Disallow unknown keys and include 'ms' in interval docs to match usage.

-                    "exporter": {
-                      "type": "object",
-                      "description": "Configuration for the schema usage exporter",
-                      "properties": {
+                    "exporter": {
+                      "type": "object",
+                      "description": "Configuration for the schema usage exporter",
+                      "additionalProperties": false,
+                      "properties": {
@@
-                        "interval": {
+                        "interval": {
                           "type": "string",
                           "default": "2s",
                           "duration": {
                             "minimum": "100ms"
                           },
-                          "description": "The interval at which the schema usage queue is flushed. The period is specified as a string with a number and a unit, e.g. 10s, 1m. The supported units are 's', 'm', 'h'."
+                          "description": "The interval at which the schema usage queue is flushed. Specify a duration, e.g. 100ms, 10s, 1m. Supported units: 'ms', 's', 'm', 'h'."
                         },
router/internal/graphqlmetrics/exporter.go (1)

107-114: Prefer pooling []T (not *[]T) to simplify and avoid pointer indirection

Safer and simpler: pool raw slices; ownership already fixed by deferring put in goroutine.

-    batchBufferPool: &sync.Pool{
-      New: func() any {
-        // Pre-allocate slice with batch size capacity
-        buffer := make([]T, 0, settings.BatchSize)
-        return &buffer
-      },
-    },
+    batchBufferPool: &sync.Pool{
+      New: func() any {
+        return make([]T, 0, settings.BatchSize)
+      },
+    },
@@
-func (e *Exporter[T]) getBatchBuffer() []T {
-  bufferPtr := e.batchBufferPool.Get().(*[]T)
-  buffer := *bufferPtr
-  // Ensure the buffer is empty (should already be, but be defensive)
-  return buffer[:0]
-}
+func (e *Exporter[T]) getBatchBuffer() []T {
+  bufAny := e.batchBufferPool.Get()
+  if bufAny == nil {
+    return make([]T, 0, e.settings.BatchSize)
+  }
+  buf := bufAny.([]T)
+  return buf[:0]
+}
@@
-func (e *Exporter[T]) putBatchBuffer(buffer []T) {
-  // Reset the slice to zero length while keeping capacity
-  buffer = buffer[:0]
-  e.batchBufferPool.Put(&buffer)
-}
+func (e *Exporter[T]) putBatchBuffer(buf []T) {
+  e.batchBufferPool.Put(buf[:0])
+}

Also applies to: 154-161, 163-169

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb6d441 and 841d22a.

📒 Files selected for processing (10)
  • router/demo.config.yaml (2 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_test.go (10 hunks)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/config_test.go (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/graphqlschemausage/schemausage.go (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • router/pkg/config/config_test.go
  • router/pkg/graphqlschemausage/schemausage.go
  • router/pkg/config/testdata/config_full.json
  • router/pkg/config/testdata/config_defaults.json
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/internal/graphqlmetrics/exporter_test.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/config.schema.json
🧬 Code graph analysis (4)
router/internal/graphqlmetrics/exporter_test.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
  • RequestInfo (75-82)
  • RequestInfo (97-97)
  • RequestInfo (112-114)
router/pkg/config/config.go (2)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (26-26)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/pkg/config/config.go (1)

116-119: LGTM: exporter config shape and defaults

Exporter block and env tags look consistent with schema and intended wiring.

Also applies to: 121-126

router/internal/graphqlmetrics/prometheus_sink.go (1)

121-131: Avoid potential nil deref on field entries

TypeFieldMetrics is a slice of pointers; a nil element will panic on Path/TypeNames access.

-		for _, field := range usageInfo.TypeFieldMetrics {
+		for _, field := range usageInfo.TypeFieldMetrics {
+			if field == nil {
+				continue
+			}
 			// Skip fields without valid parent type or path
 			if len(field.Path) == 0 || len(field.TypeNames) < 1 {
 				continue
 			}
⛔ Skipped due to learnings
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

@endigma endigma self-assigned this Nov 20, 2025
@endigma endigma force-pushed the dustin/prometheus-sink-for-gqlmetrics branch from 0cf3876 to 307df0c Compare November 20, 2025 15:13
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
router/internal/exporter/exporter.go (2)

26-33: Tighten shutdown/backoff behavior and simplify the batch buffer pool

Two areas are worth tightening up here:

  1. Shutdown responsiveness under long backoff
    • exportBatchWithRetry uses time.Sleep(sleepDuration) in the retry loop, and Shutdown only calls cancelAllExportRequests() in a defer after waiting for inflightBatches to reach 0. Under large MaxDuration or slow sinks, shutdown can be substantially delayed because those sleeps are not interruptible.
    • You can make retries respect shutdown by:
      • Cancelling exportRequestContext at the start of Shutdown (not only in the defer).
      • Making the backoff wait cancelable:
 func (e *Exporter[T]) exportBatchWithRetry(batch []T) {
@@
-	for retry < e.settings.RetryOptions.MaxRetry {
+	for retry < e.settings.RetryOptions.MaxRetry {
 		retry++
@@
-		// Wait for the specified backoff period
-		sleepDuration := b.Duration()
-		e.logger.Debug("Retrying export after backoff",
+		// Wait for the specified backoff period (cancelable)
+		sleepDuration := b.Duration()
+		e.logger.Debug("Retrying export after backoff",
 			zap.Int("batch_size", len(batch)),
 			zap.Int("retry", retry),
 			zap.Duration("sleep", sleepDuration),
 		)
-
-		time.Sleep(sleepDuration)
+		select {
+		case <-e.exportRequestContext.Done():
+			e.logger.Debug("Export cancelled during backoff")
+			return
+		case <-time.After(sleepDuration):
+		}
@@
 func (e *Exporter[T]) Shutdown(ctx context.Context) error {
 	e.logger.Debug("Shutdown started")
 
-	ticker := time.NewTicker(time.Millisecond * 100)
-	defer func() {
-		ticker.Stop()
-		// Cancel all export requests
-		e.cancelAllExportRequests()
+	// Cancel export requests early to unblock long-running exports/backoff
+	e.cancelAllExportRequests()
+
+	ticker := time.NewTicker(time.Millisecond * 100)
+	defer func() {
+		ticker.Stop()
 		// Close the sink
 		if err := e.sink.Close(ctx); err != nil {
 			e.logger.Error("Error closing sink", zap.Error(err))
 		}
 		e.logger.Debug("Shutdown complete")
 	}()
  2. Batch buffer pool ergonomics
    • The current pool stores *[]T and re-wraps slice headers via getBatchBuffer / putBatchBuffer. The pointer indirection avoids boxing the slice header into an interface on every Put (the pattern staticcheck SA6002 recommends), at the cost of being slightly harder to follow.
    • If you prefer readability over that micro-optimization, the pool can hold []T values directly:
-	batchBufferPool   *sync.Pool // Pool for batch slice buffers to reduce allocations
+	batchBufferPool   *sync.Pool // Pool of []T to reduce allocations
@@
-		batchBufferPool: &sync.Pool{
-			New: func() any {
-				// Pre-allocate slice with batch size capacity
-				buffer := make([]T, 0, settings.BatchSize)
-				return &buffer
-			},
-		},
+		batchBufferPool: &sync.Pool{
+			New: func() any {
+				// Pre-allocate slice with batch size capacity
+				return make([]T, 0, settings.BatchSize)
+			},
+		},
@@
 func (e *Exporter[T]) getBatchBuffer() []T {
-	bufferPtr := e.batchBufferPool.Get().(*[]T)
-	buffer := *bufferPtr
-	// Ensure the buffer is empty (should already be, but be defensive)
-	return buffer[:0]
+	bufAny := e.batchBufferPool.Get()
+	if bufAny == nil {
+		return make([]T, 0, e.settings.BatchSize)
+	}
+	buf := bufAny.([]T)
+	return buf[:0]
 }
@@
 func (e *Exporter[T]) putBatchBuffer(buffer []T) {
-	// Reset the slice to zero length while keeping capacity
-	buffer = buffer[:0]
-	e.batchBufferPool.Put(&buffer)
+	e.batchBufferPool.Put(buffer[:0])
 }
@@
 func (e *Exporter[T]) prepareAndSendBatch(batch []T) {
@@
-	go func() {
+	go func(b []T) {
 		defer e.inflightBatches.Dec()
-		defer e.putBatchBuffer(batch) // Return buffer to pool after export completes
-		e.exportBatchWithRetry(batch)
-	}()
+		defer e.putBatchBuffer(b) // Return buffer to pool after export completes
+		e.exportBatchWithRetry(b)
+	}(batch)

This keeps the concurrent semantics you already fixed (buffer returned only after export completes) while making shutdown behavior more predictable and the pool implementation easier to follow.

Based on learnings

Also applies to: 107-114, 154-170, 230-310, 377-413


122-152: Gate retry validation behind RetryOptions.Enabled to allow clean disablement

As written, validate() rejects zero/negative MaxDuration, Interval, and MaxRetry even when retries are disabled. This makes common patterns like “enabled=false with zero-valued retry fields” invalid, and it diverges from the repo’s existing behavior of not over-validating options that are turned off (e.g., retry transport config). Consider only enforcing these constraints when retries are actually enabled.

 func (e *Exporter[T]) validate() error {
@@
-	if e.settings.ExportTimeout <= 0 {
-		return fmt.Errorf("export timeout must be positive")
-	}
-
-	if e.settings.RetryOptions.MaxDuration <= 0 {
-		return fmt.Errorf("retry max duration must be positive")
-	}
-
-	if e.settings.RetryOptions.Interval <= 0 {
-		return fmt.Errorf("retry interval must be positive")
-	}
-
-	if e.settings.RetryOptions.MaxRetry <= 0 {
-		return fmt.Errorf("retry max retry must be positive")
-	}
+	if e.settings.ExportTimeout <= 0 {
+		return fmt.Errorf("export timeout must be positive")
+	}
+
+	if e.settings.RetryOptions.Enabled {
+		if e.settings.RetryOptions.MaxDuration <= 0 {
+			return fmt.Errorf("retry max duration must be positive")
+		}
+		if e.settings.RetryOptions.Interval <= 0 {
+			return fmt.Errorf("retry interval must be positive")
+		}
+		if e.settings.RetryOptions.MaxRetry <= 0 {
+			return fmt.Errorf("retry max retry must be positive")
+		}
+	}
@@
 	return nil
 }

This lets config set enabled: false with zero defaults without tripping validation, while still protecting enabled retries from bad values.

Based on learnings

router/internal/graphqlmetrics/prometheus_sink.go (1)

39-45: Guard against nil Logger in NewPrometheusSink to avoid panics

cfg.Logger.With(...) will panic if cfg.Logger is nil. This was already raised in a previous review and is still present.

Initialize a local logger with a zap.NewNop() fallback before calling With:

 func NewPrometheusSink(cfg PrometheusSinkConfig) *PrometheusSink {
-	return &PrometheusSink{
-		metricStore:  cfg.MetricStore,
-		logger:       cfg.Logger.With(zap.String("component", "prometheus_sink")),
-		includeOpSha: cfg.IncludeOpSha,
-	}
+	logger := cfg.Logger
+	if logger == nil {
+		logger = zap.NewNop()
+	}
+	return &PrometheusSink{
+		metricStore:  cfg.MetricStore,
+		logger:       logger.With(zap.String("component", "prometheus_sink")),
+		includeOpSha: cfg.IncludeOpSha,
+	}
 }
router/core/router_metrics.go (1)

140-170: Prometheus schema usage export path looks correct; consider documenting hash choice.

The Prometheus-specific exporter mirrors the GraphQL path, is defensively guarded on prometheusMetricsExporter == nil, and uses GetTypeFieldUsageInfoMetrics plus strings.Clone to avoid mutating shared data. Using operationContext.sha256Hash here (vs. HashString() in the GraphQL exporter) aligns with the distinction between original vs. normalized hashes; adding a short comment to that effect would make the intentional difference clearer to future readers.

-			Hash: operationContext.sha256Hash,
+			// Use the original operation hash for Prometheus correlation; GraphQL exporter uses the
+			// normalized analytics hash via HashString().
+			Hash: operationContext.sha256Hash,
🧹 Nitpick comments (6)
router/pkg/metric/config.go (1)

48-53: Consider consolidating duplicate type definitions.

The PrometheusSchemaFieldUsageExporter type is defined identically in both router/pkg/config and router/pkg/metric packages (and also in router-tests/testenv). While this separation may be intentional for package isolation, it creates maintenance overhead when fields need to be updated.

Consider:

  • Defining the type once in a shared package
  • Or using type aliases to reference a single source of truth

This is a minor maintainability concern and can be addressed in a future refactor if desired.

router/pkg/config/config.schema.json (1)

1249-1280: Clarify exporter.interval units and queue_size semantics in schema description

The exporter config itself looks correct and matches PrometheusSchemaFieldUsageExporter, but the descriptions are slightly misleading:

  • Line 1269 sets a minimum of "100ms" while the description lists supported units as only 's', 'm', 'h'. Either mention 'ms' as supported or raise the minimum to 1s to match the text.
  • Line 1263 describes queue_size as “number of schema usage items allowed in queue”; in the generic exporter, QueueSize is actually the number of batches in the internal queue, each up to batch_size. Consider rephrasing to “batches” (or “batches of up to batch_size items”) so users can reason about memory/cardinality correctly.
router-tests/prometheus_improved_test.go (2)

82-84: Consider replacing fixed time.Sleep with a poll/helper to reduce flakiness

All these tests wait for exporter flushes using fixed sleeps (200ms or 100ms). With parallel subtests and CI variance, this can intermittently be too short or unnecessarily long.

Consider a small helper that polls promRegistry.Gather() until the expected metric family appears or a timeout elapses, instead of relying on fixed Sleep durations. This will make the tests more robust to timing jitter while still exercising the exporter behavior.

Also applies to: 166-168, 229-231, 315-317, 359-361, 413-415


177-199: Ordering assumptions in multiple usages of field are counted may be brittle

This test assumes a specific ordering of schemaUsageMetrics when asserting on [0], [1], [2]. With the new aggregation path and Prometheus sink, the series order is effectively an implementation detail and may change.

To avoid order-dependent flakiness, consider locating the three metrics by (field_name, parent_type) label pairs (similar to the fieldTypePairs map used in the first subtest) and asserting the counters on those, instead of relying on positions 0/1/2.

router/core/router.go (1)

33-33: GraphQL metrics exporter wiring looks correct; consider future configurability.

Using graphqlmetrics.NewGraphQLMetricsExporter with exporter.NewDefaultExporterSettings() is a clean migration to the generic exporter and keeps behavior predictable. If you later want to tune batch/queue/interval per deployment, this spot is the natural hook to pass settings derived from config instead of the hardcoded defaults.

Also applies to: 822-833

router/demo.config.yaml (1)

8-25: Demo config matches new telemetry and schema_usage model.

The added graphql_metrics, traffic_shaping.all.retry, and telemetry.metrics.{otlp,prometheus.schema_usage} sections are consistent with the updated config structs and MetricConfigFromTelemetry wiring; enabling prometheus.schema_usage with otlp.enabled: false clearly showcases the new Prometheus sink path. You might optionally add commented-out batch_size, queue_size, and export_timeout under schema_usage.exporter to advertise those knobs in the demo file.

Also applies to: 40-40

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 841d22a and 8e24196.

📒 Files selected for processing (26)
  • router-tests/prometheus_improved_test.go (16 hunks)
  • router-tests/testenv/testenv.go (3 hunks)
  • router/core/context.go (2 hunks)
  • router/core/graph_server.go (4 hunks)
  • router/core/operation_metrics.go (3 hunks)
  • router/core/router.go (4 hunks)
  • router/core/router_config.go (1 hunks)
  • router/core/router_metrics.go (5 hunks)
  • router/demo.config.yaml (2 hunks)
  • router/internal/exporter/exporter.go (1 hunks)
  • router/internal/exporter/exporter_bench_test.go (1 hunks)
  • router/internal/exporter/sink.go (1 hunks)
  • router/internal/graphqlmetrics/exporter.go (0 hunks)
  • router/internal/graphqlmetrics/graphql_exporter.go (1 hunks)
  • router/internal/graphqlmetrics/graphql_exporter_test.go (17 hunks)
  • router/internal/graphqlmetrics/graphql_metrics_sink.go (1 hunks)
  • router/internal/graphqlmetrics/prometheus_exporter.go (1 hunks)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/config_test.go (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/graphqlschemausage/schemausage.go (7 hunks)
  • router/pkg/graphqlschemausage/schemausage_bench_test.go (1 hunks)
  • router/pkg/metric/config.go (2 hunks)
💤 Files with no reviewable changes (1)
  • router/internal/graphqlmetrics/exporter.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • router/core/router_config.go
  • router/pkg/graphqlschemausage/schemausage.go
  • router/pkg/config/config_test.go
  • router/internal/graphqlmetrics/graphql_metrics_sink.go
  • router/internal/graphqlmetrics/graphql_exporter.go
  • router/core/context.go
  • router/pkg/config/testdata/config_full.json
🧰 Additional context used
🧠 Learnings (25)
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
  • router/internal/exporter/exporter.go
  • router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/config.schema.json
  • router/core/router_metrics.go
  • router/pkg/config/testdata/config_defaults.json
  • router/core/operation_metrics.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router/pkg/metric/config.go
  • router/core/operation_metrics.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/internal/graphqlmetrics/graphql_exporter_test.go
  • router-tests/prometheus_improved_test.go
  • router/internal/exporter/exporter.go
  • router/core/router_metrics.go
  • router-tests/testenv/testenv.go
  • router/core/operation_metrics.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/prometheus_improved_test.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.

Applied to files:

  • router/internal/exporter/exporter.go
  • router/core/operation_metrics.go
📚 Learning: 2025-08-28T10:07:27.650Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.

Applied to files:

  • router/internal/exporter/exporter.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().

Applied to files:

  • router/internal/exporter/exporter.go
  • router/core/operation_metrics.go
📚 Learning: 2025-09-17T20:50:58.518Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/internal/expr/retry_expression.go:63-67
Timestamp: 2025-09-17T20:50:58.518Z
Learning: In router/internal/expr/retry_expression.go, the maintainer SkArchon accepts the "silent no-retry" behavior when a retry expression program isn't precompiled (returning false, nil on cache miss). This is considered acceptable because config reloading is intended as a dev feature, making the potential footgun less problematic in the expected usage context.

Applied to files:

  • router/internal/exporter/exporter.go
📚 Learning: 2025-08-26T17:26:47.012Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/retrytransport/retry_transport.go:84-90
Timestamp: 2025-08-26T17:26:47.012Z
Learning: In the Cosmo router retry system, MaxDuration has a default value of 15 seconds in config defaults. When MaxDuration is explicitly set to 0, it represents a user choice to disable retry duration limits entirely, so clamping Retry-After durations to 0 in this case is intentional behavior.

Applied to files:

  • router/internal/exporter/exporter.go
📚 Learning: 2025-08-26T17:26:47.012Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/retrytransport/retry_transport.go:84-90
Timestamp: 2025-08-26T17:26:47.012Z
Learning: In the Cosmo router retry system, MaxDuration has a default value of 10 seconds (envDefault:"10s") in the config. When MaxDuration is explicitly set to 0, it represents a user choice to disable retry duration limits entirely, so clamping Retry-After durations to 0 in this case is intentional behavior.

Applied to files:

  • router/internal/exporter/exporter.go
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.

Applied to files:

  • router/internal/exporter/exporter.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/core/router_metrics.go
  • router/core/router.go
  • router/core/graph_server.go
  • router/internal/graphqlmetrics/prometheus_exporter.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function at lines 571-578, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the computed query hash before any APQ operations occur. There's also a test case that verifies this behavior.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the query body before any APQ operations occur.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router/core/router_metrics.go
  • router/core/operation_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-08-12T13:50:45.964Z
Learnt from: Noroth
Repo: wundergraph/cosmo PR: 2132
File: router-plugin/plugin.go:139-146
Timestamp: 2025-08-12T13:50:45.964Z
Learning: In the Cosmo router plugin system, the plugin framework creates its own logger independently. When creating a logger in NewRouterPlugin, it's only needed for the gRPC server setup (passed via setup.GrpcServerInitOpts.Logger) and doesn't need to be assigned back to serveConfig.Logger.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-09-29T10:28:07.122Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2192
File: router/pkg/config/config.go:1028-1029
Timestamp: 2025-09-29T10:28:07.122Z
Learning: The deprecation strategy for IntrospectionEnabled field in router/pkg/config/config.go is to first mark it as deprecated, then migrate all call sites to use IntrospectionConfig.Enabled, and finally remove the deprecated field. The envDefault tag should be kept during the deprecation period for backward compatibility.

Applied to files:

  • router-tests/testenv/testenv.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router.go
  • router/core/graph_server.go
📚 Learning: 2025-11-19T15:13:57.821Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2273
File: router/core/graphql_handler.go:0-0
Timestamp: 2025-11-19T15:13:57.821Z
Learning: In the Cosmo router (wundergraph/cosmo), error handling follows a two-phase pattern: (1) Prehandler phase handles request parsing, validation, and setup errors using `httpGraphqlError` and `writeOperationError` (in files like graphql_prehandler.go, operation_processor.go, parse_multipart.go, batch.go); (2) Execution phase handles resolver execution errors using `WriteError` in GraphQLHandler.ServeHTTP. Because all `httpGraphqlError` instances are caught in the prehandler before ServeHTTP is invoked, any error type checks for `httpGraphqlError` in the execution-phase WriteError method are unreachable code.

Applied to files:

  • router/core/router.go
  • router/core/graph_server.go
📚 Learning: 2025-09-03T11:38:45.855Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2183
File: router/internal/traceclient/traceclient.go:115-118
Timestamp: 2025-09-03T11:38:45.855Z
Learning: In the Cosmo codebase, the exprContext obtained from getExprContext in TraceInjectingRoundTripper.processConnectionMetrics is expected to always be non-nil as part of the existing system behavior, so defensive nil checks are not needed.

Applied to files:

  • router/core/operation_metrics.go
🧬 Code graph analysis (14)
router/internal/exporter/exporter_bench_test.go (1)
router/internal/exporter/exporter.go (3)
  • ExporterSettings (52-63)
  • RetryOptions (35-40)
  • NewExporter (84-120)
router/pkg/metric/config.go (2)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsageExporter (121-126)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/config/config.go (2)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/internal/graphqlmetrics/graphql_exporter_test.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (20-40)
router/internal/exporter/exporter.go (2)
  • ExporterSettings (52-63)
  • RetryOptions (35-40)
router-tests/prometheus_improved_test.go (4)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • PrometheusSchemaFieldUsageExporter (285-290)
  • Environment (1763-1799)
  • GraphQLRequest (1939-1947)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsageExporter (121-126)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/pkg/otel/attributes.go (2)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
router/internal/exporter/exporter.go (1)
router/internal/exporter/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
router/core/router_metrics.go (5)
router/core/operation_metrics.go (1)
  • OperationMetrics (31-40)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (14-16)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (14-16)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (13)
  • OperationInfo (362-373)
  • OperationInfo (388-388)
  • OperationInfo (403-405)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
  • SchemaInfo (428-435)
  • SchemaInfo (450-450)
  • SchemaInfo (465-467)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
router-tests/testenv/testenv.go (2)
router/pkg/config/config.go (2)
  • PrometheusSchemaFieldUsageExporter (121-126)
  • PrometheusSchemaFieldUsage (115-119)
router/pkg/metric/config.go (2)
  • PrometheusSchemaFieldUsageExporter (48-53)
  • PrometheusSchemaFieldUsage (42-46)
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
router/pkg/graphqlschemausage/schemausage_test.go (1)
  • FakeFactory (527-529)
router/pkg/graphqlschemausage/schemausage.go (4)
  • GetTypeFieldUsageInfo (14-23)
  • GetArgumentUsageInfo (108-125)
  • GetInputUsageInfo (160-172)
  • TypeFieldMetrics (26-26)
router/core/router.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (20-40)
router/internal/exporter/exporter.go (2)
  • NewDefaultExporterSettings (65-78)
  • Exporter (17-33)
router/pkg/config/config.go (3)
  • Prometheus (99-113)
  • PrometheusSchemaFieldUsageExporter (121-126)
  • Metrics (137-142)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/core/graph_server.go (4)
router/internal/graphqlmetrics/prometheus_exporter.go (2)
  • PrometheusMetricsExporter (14-16)
  • NewPrometheusMetricsExporter (20-46)
router/pkg/config/config.go (1)
  • Prometheus (99-113)
router/internal/exporter/exporter.go (3)
  • ExporterSettings (52-63)
  • Exporter (17-33)
  • RetryOptions (35-40)
router/core/router_metrics.go (1)
  • NewRouterMetrics (44-53)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (26-26)
router/core/operation_metrics.go (1)
router/core/router_metrics.go (1)
  • RouterMetrics (15-22)
router/internal/graphqlmetrics/prometheus_exporter.go (3)
router/internal/exporter/exporter.go (3)
  • Exporter (17-33)
  • ExporterSettings (52-63)
  • NewExporter (84-120)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/internal/graphqlmetrics/prometheus_sink.go (2)
  • NewPrometheusSink (40-46)
  • PrometheusSinkConfig (24-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
🔇 Additional comments (22)
router/internal/exporter/sink.go (1)

7-23: Sink interface and error handler look well-shaped for generic exporters

The concurrency-safe Sink[T] contract and SinkErrorHandler callback cleanly match the needs of the generic Exporter[T]; no changes needed here from a correctness or API-design standpoint.

router/internal/exporter/exporter_bench_test.go (1)

13-238: Benchmarks comprehensively cover exporter hot paths

The mock sink and the set of benchmarks (allocation, throughput, batch cycling, buffer growth, and parallel recording) are well-structured and map directly onto the exporter’s key behaviors. Setup is kept outside the measured sections, and Shutdown is called to flush state, which should give useful and stable signals for future optimizations.

router/pkg/graphqlschemausage/schemausage_bench_test.go (1)

17-194: Schema-usage benchmarks realistically exercise the extraction pipeline

The setupBenchmark helper builds a realistic plan and variable set, and each benchmark cleanly focuses on a specific stage (field, argument, input usage, protobuf conversion, and end-to-end). Timer reset and allocation reporting are placed correctly so setup cost isn’t counted. This is a solid basis for tracking allocation and CPU impact of the new exporter wiring.

router/core/operation_metrics.go (1)

32-39: Usage tracking gating between GraphQL and Prometheus exporters looks correct

The added trackUsageInfo / prometheusTrackUsageInfo flags and the reqContext.operation != nil && !SkipLoader guard cleanly separate GraphQL vs Prometheus schema-usage export while avoiding emissions for planning-only operations. The options struct wiring in newOperationMetrics matches the struct fields, so callers can independently toggle both paths.

Also applies to: 77-88, 92-99, 109-116

router/pkg/config/testdata/config_defaults.json (1)

66-71: LGTM! Configuration structure correctly updated.

The replacement of the scalar SampleRate field with a nested Exporter configuration object aligns with the PR's goal of moving to a batch exporter framework. The default values are consistent with the Go struct definitions.

router/pkg/config/config.go (1)

115-126: LGTM! Configuration types properly defined.

The new nested configuration structure maintains type safety and follows the codebase's environment variable naming conventions. The envPrefix propagation ensures clean environment variable names like PROMETHEUS_SCHEMA_FIELD_USAGE_EXPORTER_BATCH_SIZE.

router/pkg/metric/config.go (1)

6-6: LGTM! Import added for new time.Duration fields.

The time package import is correctly added to support the Interval and ExportTimeout duration fields in the exporter configuration.

router/core/graph_server.go (5)

38-40: LGTM! Imports correctly added for new Prometheus exporter.

The imports for the internal exporter and graphqlmetrics packages are necessary for the new Prometheus schema field usage exporter functionality.


523-523: LGTM! Exporter field properly added to graphMux.

The prometheusMetricsExporter field is correctly typed and positioned within the graphMux struct for lifecycle management.


771-775: LGTM! Shutdown logic properly implemented.

The Prometheus exporter shutdown is correctly integrated into the graphMux.Shutdown method with proper nil checking and error handling.


951-958: LGTM! RouterMetrics properly wired with Prometheus exporter.

The prometheusMetricsExporter is correctly passed to NewRouterMetrics alongside the existing gqlMetricsExporter, enabling both export paths as intended.


924-949: Retry option defaults are necessary per exporter validation logic.

The default values for MaxRetry, MaxDuration, and Interval at lines 934-936 are required: the exporter's validate() method (router/internal/exporter/exporter.go:139-149) unconditionally requires all retry option fields to be positive, regardless of the Enabled flag. Although these fields go unused when Enabled=false (early return at line 261), validation still runs, so positive defaults are needed for it to pass.

The implementation is correct and follows the existing benchmark test patterns.

router/internal/graphqlmetrics/prometheus_exporter.go (3)

20-46: LGTM! Constructor properly initializes Prometheus exporter.

The constructor correctly:

  • Creates a PrometheusSink with the provided configuration
  • Sets up a non-retrying error handler (appropriate for local Prometheus metrics)
  • Wraps the generic exporter with a clean API

The comments explaining why retry is disabled are helpful.


51-59: LGTM! Public API methods are well-documented and properly delegate.

The RecordUsage and Shutdown methods provide clear documentation and correctly delegate to the underlying generic exporter. The comments explain behavior clearly, including the synchronous parameter and return value semantics.


1-59: LGTM! Well-designed wrapper for Prometheus metrics export.

This new file provides a clean, focused API for Prometheus schema usage metrics. The design:

  • Properly wraps the generic exporter with Prometheus-specific configuration
  • Uses composition over inheritance effectively
  • Maintains type safety with the generic Exporter[*graphqlmetrics.SchemaUsageInfo]
  • Provides clear documentation
  • Follows established patterns in the codebase

The implementation is straightforward and maintainable.

router-tests/testenv/testenv.go (1)

279-290: Prometheus schema usage exporter wiring in testenv matches production config

The new PrometheusSchemaFieldUsageExporter test type and the construction of promSchemaUsage (including test-friendly defaults and optional overrides) are consistent with router/pkg/metric.PrometheusSchemaFieldUsageExporter and the new exporter-based design. This should keep tests aligned with real configuration behavior while shortening flush intervals.

Also applies to: 1508-1552

router-tests/prometheus_improved_test.go (1)

352-357: The code uses valid Go 1.22+ syntax and does not require changes

The for range N syntax at lines 352-357 and 406-411 is valid Go code. Go 1.22 introduced the ability to range over integers, making for range n syntax valid for iterating n times in loops. The codebase requires Go 1.25 or later, which fully supports this syntax. The suggested replacement with for i := 0; i < N; i++ would be an unnecessary regression to older patterns. The code is correct as written.

Likely an incorrect or invalid review comment.

router/internal/graphqlmetrics/graphql_exporter_test.go (1)

73-109: `for i := range totalItems` over an int is valid Go 1.22+ syntax; the flagged loops need no change.
The review comment is incorrect and should be disregarded.

Go 1.22 introduced support for ranging over integers, enabling the syntax for i := range n where n is an int. This project requires Go 1.25 (per router/go.mod), which fully supports this feature. The code at all specified line ranges is valid Go and compiles without error. The loops correctly iterate from 0 to totalItems-1 as intended.

Likely an incorrect or invalid review comment.

router/core/router.go (2)

842-852: Prometheus schema usage enabling & logging behavior is coherent with deferred exporter creation.

The new block correctly guards schema-usage behavior on PromSchemaFieldUsage.Enabled and defers actual exporter creation to graph mux wiring, while logging include_operation_sha for visibility. This matches the design where each mux has its own metric store and keeps bootstrap light.


2324-2333: Telemetry → Prometheus schema_usage exporter mapping is consistent.

MetricConfigFromTelemetry now fully maps cfg.Metrics.Prometheus.SchemaFieldUsage.{Enabled,IncludeOperationSha,Exporter.*} into rmetric.PrometheusSchemaFieldUsage, matching the config structs in router/pkg/config and router/pkg/metric/config.go. This should preserve schema validation guarantees and remove the old sampling knob cleanly.

router/core/router_metrics.go (2)

15-22: RouterMetrics wiring for GraphQL & Prometheus exporters is sound.

The interface and struct/config changes cleanly separate GraphQL and Prometheus exporters while keeping initialization centralized in NewRouterMetrics. Using exportEnabled for the GraphQL path and prometheusMetricsExporter != nil for PrometheusTrackUsageInfo is a straightforward way to gate the two sinks independently, and the new getters (GQLMetricsExporter, PrometheusMetricsExporter) preserve encapsulation.

Also applies to: 26-52, 59-68, 76-82


84-138: GraphQL schema usage export now robust to missing exporter and preserves cached structures.

The extra gqlMetricsExporter == nil guard ensures ExportSchemaUsageInfo is safe even if called directly from a RouterMetrics implementation. Reusing GetTypeFieldUsageInfoMetrics and the cached argument/input slices while cloning the operation name via strings.Clone respects the comments about not mutating shared state and should keep CPU/memory overhead low for unsampled exports.

