scx_p2dq: Add L2 cluster-aware CPU selection #3114

hodgesds · 2025-12-01T21:24:03Z

Add L2 cluster awareness to improve cache locality by preferring CPUs
within the same cluster before searching the wider LLC domain.

Infrastructure:

- Add cluster_id to cpu_ctx for per-CPU cluster tracking
- Add has_clusters flag to topo_config
- Initialize cluster_id for each CPU during BPF setup
- Populate cluster IDs from topology in userspace

Implementation:

- Add pick_idle_cpu_in_cluster() helper to search cluster cpumask
- Enhance pick_idle_cpu() to try cluster-level before LLC-level
- Update wakeup paths for interactive tasks to prefer cluster
- Check same-cluster waker/wakee before wider search

This improves cache locality by keeping related tasks on CPUs sharing
L2 cache, reducing cache misses and improving performance.
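To illustrate the cluster-before-LLC search order described above, here is a minimal userspace sketch in Rust. It is not the BPF code from this PR: `CpuTopo`, `first_idle`, and the `u64` bitmask representation (bit N = CPU N) are assumptions made for the example, and only the `pick_idle_cpu` name and the `has_clusters` flag are borrowed from the description.

```rust
/// Hypothetical per-CPU topology view; each mask is a bitset with bit N = CPU N.
#[derive(Clone, Copy)]
struct CpuTopo {
    /// CPUs sharing this CPU's L2 cache (its cluster).
    cluster_mask: u64,
    /// CPUs sharing this CPU's last-level cache.
    llc_mask: u64,
}

/// Lowest-numbered CPU in `candidates` that is both idle and allowed, if any.
fn first_idle(candidates: u64, idle: u64, allowed: u64) -> Option<u32> {
    let hits = candidates & idle & allowed;
    if hits == 0 {
        None
    } else {
        Some(hits.trailing_zeros())
    }
}

/// Try the previous CPU's L2 cluster first, then fall back to its whole LLC.
fn pick_idle_cpu(prev: CpuTopo, has_clusters: bool, idle: u64, allowed: u64) -> Option<u32> {
    if has_clusters {
        if let Some(cpu) = first_idle(prev.cluster_mask, idle, allowed) {
            return Some(cpu);
        }
    }
    first_idle(prev.llc_mask, idle, allowed)
}
```

With a cluster of CPUs {2, 3} inside an 8-CPU LLC and CPUs 0 and 3 idle, the sketch picks CPU 3 when `has_clusters` is set and CPU 0 otherwise, which is the locality preference the description is after.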

hodgesds · 2025-12-01T22:05:54Z

Comparison: Optimized Cluster-Aware vs Baseline

schbench Latency Test (2 threads × 16 messages, 30s) - 5 Run Average

Metric	Baseline	Optimized Cluster-Aware	Change
Average RPS	7,801.15	7,784.21	-0.22% ✓
RPS Range	7,771 - 7,830 (59 RPS)	7,727 - 7,836 (109 RPS)	1.8x wider range
Std Deviation	~25 RPS (0.32%)	~48 RPS (0.61%)	1.9x higher
Wakeup Latency p99	15 µs	14 µs	-6.7% ✓
Request Latency p99	7,720 µs	7,704 µs	-0.2% ✓

Result: Performance is near-identical to baseline in low-contention scenarios.
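For anyone re-checking the Change column, the percentages in these tables are plain relative changes against baseline. A small sketch (the `pct_change` helper is a name made up for this example):

```rust
/// Relative change from `baseline` to `new`, in percent.
/// Negative values mean `new` is lower than `baseline`.
fn pct_change(baseline: f64, new: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}
```

Applied to the average RPS row, `pct_change(7801.15, 7784.21)` rounds to the -0.22% shown above, and `pct_change(15.0, 14.0)` rounds to the -6.7% in the wakeup latency row.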

stress-ng Cache Benchmark (60s, 316 workers)

Metric	Baseline	Optimized Cluster-Aware	Change
Cache ops/sec	36,241,353	38,130,668	+5.21% ✓
Cache writes/sec	2,777,973	3,931,893	+41.5% ✓✓

Result: Significant cache performance improvement from L2 locality.

stress-ng CPU Benchmark (60s, 316 workers)

Metric	Baseline	Optimized Cluster-Aware	Change
Bogo ops/sec	332,453	333,337	+0.27% ✓

Result: No CPU throughput regression.

schbench High Load Test (8 threads × 32 messages, 30s)

Metric	Baseline	Optimized Cluster-Aware	Change
Average RPS	31,773	31,923	+0.47% ✓
Wakeup Latency p99	22 µs	21 µs	-4.5% ✓
Request Latency p99	9,648 µs	9,584 µs	-0.66% ✓

Result: Slightly improved performance under high contention.

Key Findings

✅ Wins

- Cache Performance: +5.21% cache ops/sec, +41.5% cache writes/sec
- High-Load Performance: +0.47% RPS, lower p99 wakeup and request latency
- Low-Load Performance: -0.22% RPS (essentially baseline)
- Wakeup Latency: p99 is 1 µs lower (4.5-6.7%) in both schbench tests
- CPU Throughput: Neutral (+0.27%)

⚠️ Minor Considerations

Variance: Standard deviation in the low-load schbench test is 1.9x baseline
- Still very acceptable at 0.61% standard deviation
- Baseline has exceptionally low variance (0.32%)

hodgesds · 2025-12-01T22:08:09Z

clang-format messed up some of the BPF code; will push a fix.


Drop topo.all_clusters before iterating topo.all_cores to release Arc references. Clusters hold Arc references to cores, preventing Arc::into_inner() from succeeding during topology setup.

Signed-off-by: Daniel Hodges <[email protected]>
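The ownership issue this commit describes can be reproduced in isolation. The sketch below uses simplified stand-in types (`Core`, `Cluster`, `take_cores`, and `into_inner_fails_while_shared` are hypothetical and not the scx_utils definitions) to show that `Arc::into_inner()` returns `None` while another strong reference exists, and succeeds once the cluster list has been dropped.

```rust
use std::sync::Arc;

/// Simplified stand-in for a topology core.
struct Core {
    id: usize,
}

/// Simplified stand-in for an L2 cluster holding shared references to its cores.
struct Cluster {
    cores: Vec<Arc<Core>>,
}

/// Reclaims ownership of every core. The clusters are dropped first so that
/// each Arc in `all_cores` is the last strong reference to its core.
fn take_cores(all_cores: Vec<Arc<Core>>, all_clusters: Vec<Cluster>) -> Vec<Core> {
    drop(all_clusters);
    all_cores
        .into_iter()
        .map(|core| Arc::into_inner(core).expect("core is still shared"))
        .collect()
}

/// Shows the failure mode: with a cluster still alive, into_inner yields None.
fn into_inner_fails_while_shared() -> bool {
    let core = Arc::new(Core { id: 0 });
    let cluster = Cluster { cores: vec![Arc::clone(&core)] };
    // Two strong references exist here, so ownership cannot be reclaimed.
    let reclaimed = Arc::into_inner(core);
    reclaimed.is_none() && Arc::strong_count(&cluster.cores[0]) == 1
}
```

Dropping the cluster list before consuming the cores is the ordering fix; the alternative of holding `Weak` references in clusters would also avoid the problem but is not what this commit does.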


hodgesds force-pushed the p2dq-l2-topo branch 2 times, most recently from 39a308e to 8277399 Compare December 1, 2025 21:58

hodgesds force-pushed the p2dq-l2-topo branch 4 times, most recently from 67b95fd to c824611 Compare December 2, 2025 01:01

hodgesds added 2 commits December 1, 2025 17:01

scx_utils: Add L2 cluster detection to topology

4cae859

Detect L2 cache domains within LLCs by reading CPU cache topology from sysfs. This enables schedulers to make cache-aware placement decisions at a finer granularity than LLC.

Signed-off-by: Daniel Hodges <[email protected]>
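As a rough illustration of what sysfs-based detection involves: on Linux, each `/sys/devices/system/cpu/cpuN/cache/indexM/` directory exposes a `level` file and a `shared_cpu_list` file such as `0-3,8-11`. The sketch below assumes the caller has already read the L2 entry's `shared_cpu_list` text for each CPU, and groups CPUs with identical lists into one cluster. `parse_cpu_list` and `cluster_ids` are hypothetical helpers written for this example; the actual scx_utils code may be structured differently.

```rust
use std::collections::BTreeMap;

/// Parse a sysfs CPU list such as "0-3,8-11" into individual CPU ids.
/// Returns None if any element is not a number or a `lo-hi` range.
fn parse_cpu_list(s: &str) -> Option<Vec<usize>> {
    let mut cpus = Vec::new();
    for part in s.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let (lo, hi): (usize, usize) = (lo.parse().ok()?, hi.parse().ok()?);
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.parse().ok()?),
        }
    }
    Some(cpus)
}

/// Map each CPU to a cluster id. Input is (cpu, L2 shared_cpu_list text);
/// CPUs whose lists parse to the same set receive the same id, and ids are
/// assigned in order of first appearance. Unparsable entries are skipped.
fn cluster_ids(l2_shared: &[(usize, &str)]) -> BTreeMap<usize, usize> {
    let mut seen: BTreeMap<Vec<usize>, usize> = BTreeMap::new();
    let mut out = BTreeMap::new();
    for &(cpu, list) in l2_shared {
        if let Some(cpus) = parse_cpu_list(list) {
            let next = seen.len();
            let id = *seen.entry(cpus).or_insert(next);
            out.insert(cpu, id);
        }
    }
    out
}
```

A real implementation should select the cache index by reading its `level` file rather than assuming a fixed index number, since the index-to-level mapping is not guaranteed across architectures.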

hodgesds force-pushed the p2dq-l2-topo branch from c824611 to b14b678 Compare December 2, 2025 01:01

hodgesds requested review from arighi, etsal, htejun, likewhatevs and multics69 December 2, 2025 01:01

oxyzenQ approved these changes Dec 2, 2025


hodgesds force-pushed the p2dq-l2-topo branch from b14b678 to 5f7f1db Compare December 2, 2025 17:40

likewhatevs approved these changes Dec 2, 2025
