Workers being limited by RAM bandwidth #2077

Open
Viren6 opened this issue Jun 16, 2024 · 2 comments

@Viren6
Contributor

Viren6 commented Jun 16, 2024

The script from https://github.com/official-monty/montytools/blob/main/BenchNormalization/benchNormToolSF.py was run with SF16.1 on two Ryzen 9 7950X systems (Eco mode off). Both systems had DDR5-6000 RAM, but one had only 1x16GB (Ciekce) whereas the other had 2x16GB (Zuppa). The results (final average benchmark NPS over a minute) are below:

Ciekce (1x16GB):
1 Process: 1869666
32 Processes: 332322

Zuppa (2x16GB):
1 Process: 1814631
32 Processes: 511087

The system with double the memory bandwidth had 54% higher NPS when running 32 processes (511087 vs. 332322).

The script from https://github.com/official-monty/montytools/blob/main/BenchNormalization/benchNormToolMonty.py was run with the Monty chess engine on the same systems, providing a reference that is not limited by RAM bandwidth:

Ciekce (1x16GB):
1 Process: 685668
32 Processes: 365731

Zuppa (2x16GB):
1 Process: 688477
32 Processes: 370616

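Monty keeps about 53% of its single-process NPS when running 32 processes. The same ratio applied to SF16.1 would give roughly 1M NPS with 32 processes, about twice what Zuppa achieves. It is therefore likely that the Zuppa system running SF is still limited by RAM bandwidth, even though dual channel is the highest memory configuration possible on consumer motherboards.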
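The arithmetic behind these comparisons can be reproduced with a short sketch. It assumes the reported figures are per-process NPS; the dictionary layout and the `retention` helper are illustrative names, not part of the benchmark scripts.

```python
# Figures copied from the results above (assumed to be per-process NPS).
sf = {
    "Ciekce": {1: 1869666, 32: 332322},
    "Zuppa": {1: 1814631, 32: 511087},
}
monty = {
    "Ciekce": {1: 685668, 32: 365731},
    "Zuppa": {1: 688477, 32: 370616},
}


def retention(results):
    """Fraction of single-process NPS that is kept when 32 processes run."""
    return results[32] / results[1]


# Relative SF gain of the 2x16GB system over the 1x16GB system at 32 processes.
gain = sf["Zuppa"][32] / sf["Ciekce"][32] - 1.0
print(f"Zuppa vs Ciekce at 32 processes: {gain:+.0%}")

# Projected SF NPS at 32 processes if SF scaled the way Monty does.
projected = {name: sf[name][1] * retention(monty[name]) for name in sf}

for name in sf:
    print(
        f"{name}: SF keeps {retention(sf[name]):.0%}, "
        f"Monty keeps {retention(monty[name]):.0%}, "
        f"projected SF NPS {projected[name]:,.0f}"
    )
```

The projection only says what SF would reach if it scaled like Monty on the same machine; it does not separate memory bandwidth from other shared-resource effects.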

Furthermore, the net in development is 20% larger, and this issue is expected to become even more severe in the future as CPU speeds advance faster than RAM bandwidth.

The only solution I see is to raise the default number of threads for each test from 1 to 2. This may require the fastchess migration to prevent time losses as the TC is scaled down. The alternative of reducing the number of processes to the number of physical cores is still RAM bandwidth limited today, and reducing processes further results in poor CPU utilization.

Additionally, the way we measure the NPS of a worker is invalid. Currently we run one process with a bench and one process with a search on n-1 threads. This doesn't account for the RAM bandwidth limitations discussed above, so the measured NPS is far higher than the real NPS.
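One way to measure NPS under realistic load is to run as many independent bench processes at once as the worker would run games, and average their results. The following is a minimal sketch of that idea, not the method used by the worker or by the linked scripts. It assumes the engine accepts `bench` as a command-line argument and prints a `Nodes/second : N` summary line, as Stockfish does; `engine_path` and the function names are hypothetical.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Matches a bench summary line such as "Nodes/second    : 1234567".
NPS_PATTERN = re.compile(r"Nodes/second\s*:\s*(\d+)")


def parse_nps(bench_output: str) -> int:
    """Extract the nodes-per-second figure from bench output text."""
    match = NPS_PATTERN.search(bench_output)
    if match is None:
        raise ValueError("no 'Nodes/second' line found in bench output")
    return int(match.group(1))


def run_bench(engine_path: str) -> int:
    """Run one bench with default settings and return its NPS."""
    proc = subprocess.run(
        [engine_path, "bench"], capture_output=True, text=True, check=True
    )
    # Search both streams, since the summary may be written to stderr.
    return parse_nps(proc.stdout + proc.stderr)


def loaded_nps(engine_path: str, concurrency: int) -> float:
    """Average per-process NPS with `concurrency` benches running at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(run_bench, [engine_path] * concurrency))
    return sum(results) / len(results)
```

For example, `loaded_nps("./stockfish", 32)` would approximate the per-process NPS of a 32-concurrency worker. The processes do not start and finish at exactly the same time, so the last ones to finish see less contention; repeating the run and averaging reduces that bias.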

@vondele
Member

vondele commented Jun 17, 2024

To verify that the "solution" works, one would also need the results of running 16 processes @ 2 threads (ideally also 8@4, 4@8, and so on) to see the NPS. It would be great if you could collect some data for that.
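The configurations to test are the ways of splitting a fixed thread budget between processes. A small sketch that enumerates them, assuming a budget of 32 hardware threads as on the 7950X (the `splits` helper is an illustrative name):

```python
def splits(total_threads: int) -> list[tuple[int, int]]:
    """Return (processes, threads_per_process) pairs that use every thread."""
    return [
        (total_threads // threads, threads)
        for threads in range(1, total_threads + 1)
        if total_threads % threads == 0
    ]


for processes, threads in splits(32):
    print(f"{processes} process(es) @ {threads} thread(s)")
```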

On the other hand, we should also think about whether there is a way to reduce SF's memory bandwidth needs.

@Viren6
Contributor Author

Viren6 commented Jun 17, 2024

Results of benches (Final Average Benchmark NPS over a minute) with 64MB hash and depth 16 on Ciekce system (1x16GB):

Concurrency: 1
Threads:     1
NPS:         1816044
NPS/Thread:  1816044

Concurrency: 32
Threads:     1
NPS:         316559
NPS/Thread:  316559

Concurrency: 16
Threads:     2
NPS:         893445
NPS/Thread:  446723

Concurrency: 8
Threads:     4
NPS:         2559356
NPS/Thread:  639839

Concurrency: 4
Threads:     8
NPS:         7272827
NPS/Thread:  909103

Concurrency: 2
Threads:     16
NPS:         17664663
NPS/Thread:  1104040

Concurrency: 1
Threads:     32
NPS:         40885107
NPS/Thread:  1277660

Sharing does help a lot, though the curve isn't steep enough for a change in thread count alone to solve the problem. We will need to find some other way to achieve it.
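To make the shape of that curve easier to see, the sketch below derives aggregate NPS and per-thread efficiency from the figures above. It assumes the reported NPS is per process, so the aggregate is concurrency times NPS, and it takes the unloaded single-thread run as the baseline; `summarize` is an illustrative name.

```python
# (concurrency, threads per process, NPS per process), copied from the results above.
runs = [
    (1, 1, 1816044),
    (32, 1, 316559),
    (16, 2, 893445),
    (8, 4, 2559356),
    (4, 8, 7272827),
    (2, 16, 17664663),
    (1, 32, 40885107),
]

baseline = runs[0][2]  # one process, one thread, no contention


def summarize(concurrency, threads, nps):
    """Derive aggregate NPS and per-thread efficiency for one configuration."""
    per_thread = nps / threads
    return {
        "aggregate_nps": concurrency * nps,  # assumes NPS is per process
        "per_thread_nps": per_thread,
        "efficiency": per_thread / baseline,  # share of unloaded 1-thread NPS
    }


for concurrency, threads, nps in runs:
    s = summarize(concurrency, threads, nps)
    print(
        f"{concurrency:>2} x {threads:>2}: "
        f"aggregate {s['aggregate_nps']:>11,}  "
        f"per-thread {s['per_thread_nps']:>10,.0f}  "
        f"efficiency {s['efficiency']:.0%}"
    )
```

Going from 1 to 2 threads per process raises per-thread efficiency from roughly 17% to roughly 25% of the unloaded baseline, which is an improvement but still far from the roughly 70% reached with a single 32-thread process.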
