For torchbench benchmarks with the dynamo backend, the aarch64 Linux nightly wheel is 2x slower than a wheel I built from the same PyTorch commit using the [pytorch/builder build_aarch64_wheel.py script](https://github.com/pytorch/builder/blob/main/aarch64_linux/build_aarch64_wheel.py).

The difference seems to come from the [aarch64_ci_build.sh](https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_ci_build.sh) script used for nightly builds. I suspect the problem is with libomp.

How to reproduce?

```
git clone https://github.com/pytorch/benchmark.git
cd benchmark
# apply this PR: https://github.com/pytorch/benchmark/pull/2187
# setting OMP_NUM_THREADS=16 because I'm using a c7g.4xl instance
OMP_NUM_THREADS=16 python3 run_benchmark.py cpu --model hf_DistilBert --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies,cpu_peak_mem"
```
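
To narrow down the libomp suspicion, it may help to compare which OpenMP runtime each wheel ships and what threading configuration it reports. The sketch below is only an illustration and is not taken from the builder scripts: `find_openmp_libs` and `report_openmp_runtime` are hypothetical helpers, and the assumption that bundled libraries live in `torch/lib` or a sibling `torch.libs` directory may not hold for every wheel.

```python
import os

# Filename prefixes of common OpenMP runtimes: GNU, LLVM, Intel.
# Prefix matching avoids false hits such as libarm_compute.so.
OPENMP_PREFIXES = ("libgomp", "libomp", "libiomp")


def find_openmp_libs(lib_dir):
    """Return sorted shared-library names in lib_dir that look like OpenMP runtimes."""
    if not os.path.isdir(lib_dir):
        return []
    return sorted(
        name
        for name in os.listdir(lib_dir)
        if name.startswith(OPENMP_PREFIXES) and ".so" in name
    )


def report_openmp_runtime():
    """Print bundled OpenMP libraries and the threading info torch reports."""
    import torch  # imported here so the helper above works without torch

    pkg_dir = os.path.dirname(torch.__file__)
    # Assumption: bundled libraries sit in torch/lib or a sibling torch.libs
    for lib_dir in (os.path.join(pkg_dir, "lib"), pkg_dir + ".libs"):
        print(lib_dir, find_openmp_libs(lib_dir))
    # Summary of the OpenMP / intra-op threading settings seen by this build
    print(torch.__config__.parallel_info())
```

Calling `report_openmp_runtime()` once with the nightly wheel installed and once with the locally built wheel, then diffing the output, should show whether the two builds bundle different OpenMP runtimes.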