hodgesds (Contributor) commented Dec 1, 2025

Add L2 cluster awareness to improve cache locality by preferring CPUs
within the same cluster before searching the wider LLC domain.

Infrastructure:

  • Add cluster_id to cpu_ctx for per-CPU cluster tracking
  • Add has_clusters flag to topo_config
  • Initialize cluster_id for each CPU during BPF setup
  • Populate cluster IDs from topology in userspace

Implementation:

  • Add pick_idle_cpu_in_cluster() helper to search cluster cpumask
  • Enhance pick_idle_cpu() to try cluster-level before LLC-level
  • Update wakeup paths for interactive tasks to prefer cluster
  • Check same-cluster waker/wakee before wider search

This improves cache locality by keeping related tasks on CPUs sharing
L2 cache, reducing cache misses and improving performance.
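
The cluster-before-LLC search order described above can be sketched roughly as follows. This is a plain Rust model, not the BPF C from the patch: `u64` bitmasks stand in for kernel cpumasks (so at most 64 CPUs), and `CpuCtx::llc_id`, `first_idle_in`, `mask_where`, `example_topology`, and the final "any idle CPU" fallback are assumptions made for illustration. Only `cluster_id`, `has_clusters`, and the name `pick_idle_cpu` come from the description.

```rust
/// Hypothetical per-CPU context. Only `cluster_id` is named in the PR;
/// `llc_id` is assumed here so the LLC fallback can be modeled.
#[derive(Clone, Copy)]
struct CpuCtx {
    cluster_id: u32,
    llc_id: u32,
}

/// Returns the lowest-numbered CPU set in both `idle` and `mask`, if any.
fn first_idle_in(idle: u64, mask: u64) -> Option<u32> {
    let hits = idle & mask;
    if hits == 0 {
        None
    } else {
        Some(hits.trailing_zeros())
    }
}

/// Builds a bitmask of the CPUs for which `pred` holds (supports up to 64 CPUs).
fn mask_where(cpus: &[CpuCtx], pred: impl Fn(&CpuCtx) -> bool) -> u64 {
    cpus.iter()
        .enumerate()
        .filter(|(_, c)| pred(c))
        .fold(0u64, |m, (i, _)| m | (1u64 << i))
}

/// Search order: same L2 cluster as `prev_cpu`, then same LLC, then any idle CPU.
fn pick_idle_cpu(cpus: &[CpuCtx], idle: u64, prev_cpu: usize, has_clusters: bool) -> Option<u32> {
    let prev = cpus[prev_cpu];
    if has_clusters {
        // Narrowest domain first: CPUs sharing prev_cpu's L2 cache.
        let cluster = mask_where(cpus, |c| {
            c.llc_id == prev.llc_id && c.cluster_id == prev.cluster_id
        });
        if let Some(cpu) = first_idle_in(idle, cluster) {
            return Some(cpu);
        }
    }
    // Widen to the whole LLC, then to every CPU.
    let llc = mask_where(cpus, |c| c.llc_id == prev.llc_id);
    first_idle_in(idle, llc).or_else(|| first_idle_in(idle, u64::MAX))
}

/// Assumed example layout: 8 CPUs in one LLC, split into two 4-CPU clusters.
fn example_topology() -> Vec<CpuCtx> {
    (0..8)
        .map(|i| CpuCtx { cluster_id: i / 4, llc_id: 0 })
        .collect()
}
```

With `example_topology()`, a task whose previous CPU is 5 and idle CPUs 2 and 6 lands on CPU 6 when `has_clusters` is set, and on CPU 2 (the first idle CPU in the LLC) when it is not.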

@hodgesds hodgesds force-pushed the p2dq-l2-topo branch 2 times, most recently from 39a308e to 8277399 Compare December 1, 2025 21:58
hodgesds commented Dec 1, 2025

Comparison: Optimized Cluster-Aware vs Baseline

schbench Latency Test (2 threads × 16 messages, 30s) - 5 Run Average

| Metric | Baseline | Optimized Cluster-Aware | Change |
|---|---|---|---|
| Average RPS | 7,801.15 | 7,784.21 | -0.22% ✓ |
| RPS Range | 7,771 - 7,830 (59 RPS) | 7,727 - 7,836 (109 RPS) | 1.8x wider range |
| Std Deviation | ~25 RPS (0.32%) | ~48 RPS (0.61%) | 1.9x |
| Wakeup Latency p99 | 15 µs | 14 µs | -6.7% ✓ |
| Request Latency p99 | 7,720 µs | 7,704 µs | -0.2% ✓ |

Result: Near-identical to baseline in low-contention scenarios, with roughly 1.9x the run-to-run standard deviation.
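
For anyone re-checking the numbers, the Change column is the relative change against baseline, and the percentages next to the standard deviations are the deviation divided by the mean. A minimal sketch of that arithmetic (the helper names `pct_change` and `rel_std_dev` are made up here, not part of the patch):

```rust
/// Relative change from `baseline` to `new`, in percent.
fn pct_change(baseline: f64, new: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}

/// Standard deviation expressed as a percentage of the mean.
fn rel_std_dev(std_dev: f64, mean: f64) -> f64 {
    std_dev / mean * 100.0
}
```

For example, `pct_change(7801.15, 7784.21)` is about -0.22, and `rel_std_dev(48.0, 7784.21)` is about 0.62.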


stress-ng Cache Benchmark (60s, 316 workers)

| Metric | Baseline | Optimized Cluster-Aware | Change |
|---|---|---|---|
| Cache ops/sec | 36,241,353 | 38,130,668 | +5.21% ✓ |
| Cache writes/sec | 2,777,973 | 3,931,893 | +41.5% ✓✓ |

Result: Significant cache throughput improvement, consistent with better L2 locality.


stress-ng CPU Benchmark (60s, 316 workers)

| Metric | Baseline | Optimized Cluster-Aware | Change |
|---|---|---|---|
| Bogo ops/sec | 332,453 | 333,337 | +0.27% ✓ |

Result: No CPU throughput regression.


schbench High Load Test (8 threads × 32 messages, 30s)

| Metric | Baseline | Optimized Cluster-Aware | Change |
|---|---|---|---|
| Average RPS | 31,773 | 31,923 | +0.47% ✓ |
| Wakeup Latency p99 | 22 µs | 21 µs | -4.5% ✓ |
| Request Latency p99 | 9,648 µs | 9,584 µs | -0.66% ✓ |

Result: Slightly better throughput and p99 latencies under high contention.


Key Findings

✅ Wins

  1. Cache Performance: +5.21% cache ops/sec, +41.5% cache writes/sec
  2. High-Load Performance: +0.47% RPS, p99 wakeup latency 22 µs to 21 µs, p99 request latency -0.66%
  3. Low-Load Performance: -0.22% RPS (essentially baseline)
  4. Wakeup Latency: p99 4.5-6.7% lower in both schbench tests (1 µs in each case)
  5. CPU Throughput: Neutral (+0.27%)

⚠️ Minor Considerations

  1. Run-to-run spread: standard deviation in the low-load schbench test is about 1.9x baseline
    - Still small in absolute terms at 0.61% of mean RPS
    - Baseline spread is exceptionally low (0.32%)

hodgesds commented Dec 1, 2025

clang-format messed up some of the BPF code; will push a fix.

@hodgesds hodgesds force-pushed the p2dq-l2-topo branch 4 times, most recently from 67b95fd to c824611 Compare December 2, 2025 01:01

Detect L2 cache domains within LLCs by reading CPU cache topology from
sysfs. This enables schedulers to make cache-aware placement decisions
at a finer granularity than LLC.

Signed-off-by: Daniel Hodges <[email protected]>
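
As a rough illustration of the sysfs-based detection this commit describes: Linux exposes, per CPU, `cache/index*/level` and `cache/index*/shared_cpu_list` under `/sys/devices/system/cpu/cpu<N>/`. CPUs whose level-2 entry lists the same sharing set can be given the same cluster ID. The sketch below assumes that approach; the function names (`parse_cpu_list`, `assign_cluster_ids`, `read_l2_shared_list`) are hypothetical and the actual topology code in the patch may be structured differently.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::path::Path;

/// Parses a sysfs CPU list such as "0-3,8,10-11" into CPU numbers.
/// Returns None if any element is not a number or a lo-hi range.
fn parse_cpu_list(s: &str) -> Option<Vec<usize>> {
    let mut cpus: Vec<usize> = Vec::new();
    for part in s.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.trim().parse().ok()?),
        }
    }
    Some(cpus)
}

/// Assigns one cluster ID per distinct L2 sharing set.
/// `l2_lists` maps a CPU number to the text of its L2 `shared_cpu_list`.
/// CPUs whose list fails to parse are left out of the result.
fn assign_cluster_ids(l2_lists: &BTreeMap<usize, String>) -> BTreeMap<usize, usize> {
    let mut ids: BTreeMap<Vec<usize>, usize> = BTreeMap::new();
    let mut out = BTreeMap::new();
    for (&cpu, list) in l2_lists {
        if let Some(mut set) = parse_cpu_list(list) {
            // Normalize so "2-3" and "3,2" map to the same key.
            set.sort_unstable();
            set.dedup();
            let next = ids.len();
            let id = *ids.entry(set).or_insert(next);
            out.insert(cpu, id);
        }
    }
    out
}

/// Reads the L2 sharing list for one CPU by scanning cache/index*/level for "2".
/// Returns None when sysfs is unavailable or the CPU reports no L2 entry.
fn read_l2_shared_list(cpu: usize) -> Option<String> {
    let dir = format!("/sys/devices/system/cpu/cpu{cpu}/cache");
    for entry in fs::read_dir(Path::new(&dir)).ok()? {
        let path = entry.ok()?.path();
        let level = fs::read_to_string(path.join("level")).ok();
        if level.as_deref().map(str::trim) == Some("2") {
            return fs::read_to_string(path.join("shared_cpu_list")).ok();
        }
    }
    None
}
```

The sketch checks the `level` file rather than assuming `index2` is the L2 entry, since the index numbering is not guaranteed to match the cache level.
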
Drop topo.all_clusters before iterating topo.all_cores to release Arc
references. Clusters hold Arc references to cores, preventing
Arc::into_inner() from succeeding during topology setup.

Signed-off-by: Daniel Hodges <[email protected]>
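
The failure mode this commit fixes can be reproduced in isolation: `Arc::into_inner` returns `None` whenever another `Arc` clone of the same value is still alive, so the extra references held by clusters have to be dropped first. A minimal sketch, with `Core`, `Cluster`, `build`, and `unwrap_cores` as stand-in names rather than the real topology types:

```rust
use std::sync::Arc;

/// Stand-in for a topology core.
struct Core {
    id: usize,
}

/// Stand-in for an L2 cluster that shares ownership of its cores.
struct Cluster {
    cores: Vec<Arc<Core>>,
}

/// Builds 4 cores and 2 clusters; each core ends up with a refcount of 2.
fn build() -> (Vec<Arc<Core>>, Vec<Cluster>) {
    let cores: Vec<Arc<Core>> = (0..4).map(|id| Arc::new(Core { id })).collect();
    let clusters = cores
        .chunks(2)
        .map(|c| Cluster { cores: c.to_vec() })
        .collect();
    (cores, clusters)
}

/// Takes ownership of every core. Yields None as soon as one core still
/// has another live Arc reference.
fn unwrap_cores(cores: Vec<Arc<Core>>) -> Option<Vec<Core>> {
    cores.into_iter().map(Arc::into_inner).collect()
}
```

Calling `unwrap_cores` while the `Vec<Cluster>` from `build()` is still alive yields `None`; calling it after `drop(clusters)` yields the four owned cores.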