OpenMP Notes
Starting with the release of FDS 6.1.0, the default version of FDS offers OpenMP parallelisation. Unlike MPI parallelisation, OpenMP does not require splitting the computational domain. However, since OpenMP is a shared-memory parallelisation, it is limited to the resources of one machine, whereas MPI can take advantage of multiple machines connected over a network.
By default, an OpenMP version of FDS should use all of the available processors or cores on a given machine. The number of available "threads" is indicated by FDS at the start of the run. You can just type the name of the executable if you want to see how many threads are available.
Most processors today offer virtual threads, also known as hyperthreading or SMT. So far, all benchmarks performed have shown that hyperthreading is detrimental to OpenMP performance in FDS.
The degree of parallelisation increases with larger cell counts, so larger simulations will see a greater speedup. At some point, however, the performance tops out; on a dual-socket Xeon X5570 this occurred somewhere between 0.5 and 2 million cells. Depending on cache sizes, memory bandwidth, etc., this point may be different for you.
The degree of parallelisation lies somewhere between 40 and 80 percent. According to Amdahl's law, you will therefore see sharply diminishing returns as you add more threads. In most cases your computational efficiency (speedup/threads) will drop below 50 percent once you pass four threads. If you can run two simulations at the same time with four threads each instead of one with eight threads, you will be making better use of your power bill.
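As a rough illustration of why this happens, assume a parallel fraction of 80 percent (an assumed value at the upper end of the range above, used only for this example). Amdahl's law then gives:
{{{
speedup(N) = 1 / ((1 - p) + p/N)        with p = 0.8 (assumed parallel fraction)

N = 4 threads:  speedup = 1 / (0.2 + 0.2) = 2.5    -> efficiency 2.5/4 ~ 63 %
N = 8 threads:  speedup = 1 / (0.2 + 0.1) ~ 3.3    -> efficiency 3.3/8 ~ 42 %
}}}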
When using MPI parallelisation you can also use OpenMP. In that case you will want to limit the number of threads used by each MPI process. With P the number of MPI processes launched per machine, T the number of threads per MPI process, and C the number of physical cores of your machine, you want to hit P*T=C.
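For example, on a machine with 8 physical cores you could run 4 MPI processes with 2 OpenMP threads each, so that P*T = 4*2 = 8 = C. A minimal sketch of such a launch on Linux is shown below; the launcher name (mpiexec), the executable name (fds) and the input file (my_job.fds) are placeholders that depend on your installation:
{{{
export OMP_NUM_THREADS=2
mpiexec -n 4 fds my_job.fds
}}}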
Parallelisation with MPI will always deliver greater speedups than OpenMP given the same number of cores to run on. So if you can safely use MPI (and still obtain valid results) you should do so. If you have additional computational resources you can add OpenMP parallelisation to speed things up further.
To summarize:
- MPI will usually give you a greater speedup
- expect a speedup of two when using four threads
- beyond four threads you won't see much improvement
- don't use hyperthreading; it slows things down
To limit the number of threads, you need to set the environment variable OMP_NUM_THREADS. See below how this works on Linux and Windows.
To limit the number of threads on Linux, just enter
{{{ export OMP_NUM_THREADS=2 }}}
at the command prompt if you want to "give" FDS-OpenMP 2 threads. Then start FDS-OpenMP.
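For example, a complete two-thread session on Linux might look like the following sketch (the executable name fds and the input file my_job.fds are placeholders for your own installation and case):
{{{
export OMP_NUM_THREADS=2
fds my_job.fds
}}}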
To limit the number of threads on Windows, you have to create a new environment variable in your system properties. The name of this new environment variable must be OMP_NUM_THREADS and its value is the number of threads you want to allow. After saving the variable you have to restart your command-line environment (normally no reboot is necessary) for the setting to take effect. Another way to set the number of threads is to type
{{{ set OMP_NUM_THREADS=2 }}}
(for 2 threads) on your Windows command line before you start FDS.
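The corresponding sketch for the Windows command line (again with fds and my_job.fds as placeholders) is:
{{{
set OMP_NUM_THREADS=2
fds my_job.fds
}}}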
== Stacksize Issues ==
To run the OpenMP version of FDS, you usually have to increase the amount of stack memory (RAM) available to each thread of the program. On a Windows computer, go to "System Properties", then "Advanced", then "Environment Variables." Add the new system variable OMP_STACKSIZE with a value of 16M. If FDS-OpenMP does not work, use a higher value for OMP_STACKSIZE (200M seems to be a good value). You can also set OMP_STACKSIZE by typing
{{{ set OMP_STACKSIZE=16M }}}
(for 16M) on your Windows command line before you start FDS.
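On Linux, OMP_STACKSIZE is the same standard OpenMP environment variable and can be set in the same way as OMP_NUM_THREADS; a sketch, assuming the bash shell and the placeholder names used above:
{{{
export OMP_STACKSIZE=16M
fds my_job.fds
}}}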
=== Error Messages ===
If Windows (64-bit system) reports error messages like
- OMP: Error #136: Cannot create thread.
- OMP: System error #8: Not enough storage is available to process this command.

try to reduce your OMP_STACKSIZE value if it is "large" (e.g. 1G). This has solved the problem in some tests.
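For example, on the Windows command line you might drop a large value back to the 200M suggested above (the value itself is case-dependent):
{{{
set OMP_STACKSIZE=200M
}}}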