Kevin McGrattan edited this page Aug 11, 2015 · 13 revisions

Starting with the release of FDS 6.1.0, the default version of FDS offers OpenMP parallelisation. Unlike MPI parallelisation, OpenMP does not require splitting the computational domain. However, because OpenMP is a shared-memory parallelisation, it is limited to the resources of a single machine, whereas MPI can take advantage of multiple machines connected over a network.

By default, an OpenMP version of FDS should use all of the available processors or cores on a given machine. The number of available "threads" is indicated by FDS at the start of the run. You can just type the name of the executable if you want to see how many threads are available.
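Independently of FDS, on Linux you can check how many logical processors the operating system reports with the standard `nproc` command (this count includes hyper-threaded "virtual" cores, which matters for the recommendations below):

```shell
# Count the logical processors the OS reports.
# Note: this includes hyper-threaded cores, not just physical ones.
nproc
```

On a hyper-threaded machine, `nproc` typically reports twice the number of physical cores.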

Limitations and recommended settings

Most processors today offer virtual threads, so-called hyper-threading or SMT. All benchmarks performed so far have shown that hyper-threading is detrimental to OpenMP performance in FDS.

The degree of parallelisation increases with larger cell counts, so larger simulations will see a greater speedup. At some point, however, the performance tops out; on a dual-socket Xeon X5570 this occurred somewhere between 0.5 and 2 million cells. Depending on cache sizes, memory bandwidth, etc., this point may differ on your hardware.

The degree of parallelisation lies somewhere between 40 and 80 percent. According to Amdahl's law, you will therefore see sharply diminishing returns as you add more threads. In most cases your computational efficiency (speedup/threads) will drop below 50 percent once you pass four threads. If you can run two simulations at the same time with four threads each instead of one with eight threads, you will be making better use of your power bill.

When using MPI parallelisation you can also use OpenMP. In that case you will want to limit the number of threads used by each MPI process. With P as the number of MPI processes launched per machine, T as the number of threads per MPI process, and C as the number of physical cores of your machine, you want to hit: P*T=C.
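As a sketch, the P*T=C rule can be evaluated in the shell before launching. The numbers here (8 physical cores, 4 MPI processes per machine) are hypothetical placeholders; substitute your own:

```shell
C=8                # physical cores on this machine (hypothetical)
P=4                # MPI processes launched per machine (hypothetical)
T=$((C / P))       # threads per MPI process, chosen so that P*T=C
export OMP_NUM_THREADS=$T
echo $OMP_NUM_THREADS   # prints 2
```

The exported OMP_NUM_THREADS value is then inherited by each MPI process you launch from that session.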

Parallelisation with MPI will always deliver greater speedups than OpenMP given the same number of cores to run on. So if you can safely use MPI (and still obtain valid results) you should do so. If you have additional computational resources you can add OpenMP parallelisation to speed things up further.

To summarize:

  • MPI will usually give you a greater speedup
  • expect a speedup of two when using four threads
  • beyond four threads you won't see much improvement
  • don't use hyperthreading, it slows things down

Limiting number of threads for OpenMP

To limit the number of threads, you need to set the environment variable OMP_NUM_THREADS. See below for how this works on Linux and Windows.

For Linux, to limit the number of threads to, say, 2, enter

export OMP_NUM_THREADS=2

Note that this only affects the current session. If you want to make it the default, add this command to your shell's start-up script.
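For example, assuming your shell is bash, you can append the line to `~/.bashrc` so that every new session picks it up (other shells use a different start-up file, e.g. `~/.profile` or `~/.zshrc`):

```shell
# Make OMP_NUM_THREADS=2 the default for new bash sessions
# (assumes bash; adjust the file name for other shells).
echo 'export OMP_NUM_THREADS=2' >> ~/.bashrc
```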

For Windows, to limit the number of threads permanently, create a new environment variable called OMP_NUM_THREADS. After saving the variable, restart your command line environment (normally no reboot is necessary). For a single session, you can instead enter

set OMP_NUM_THREADS=2

Stacksize Issues

To run the OpenMP version of FDS, you usually have to increase the amount of stack memory (RAM) available to each thread. On a Windows computer, go to "System Properties", then "Advanced", then "Environment Variables", and add the new system variable OMP_STACKSIZE with a value of 16M. If FDS-OpenMP does not work, use a higher value for OMP_STACKSIZE (200M seems to be a good value). You can also adjust OMP_STACKSIZE for a single session by typing

set OMP_STACKSIZE=16M

(for 16M) on your Windows command line before you start FDS.
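OMP_STACKSIZE is a standard OpenMP environment variable, so on Linux the same setting can be made before launching FDS, mirroring the OMP_NUM_THREADS example above:

```shell
# Per-thread stack size for OpenMP programs; raise the value
# (e.g. to 200M) if FDS-OpenMP fails to start with 16M.
export OMP_STACKSIZE=16M
echo $OMP_STACKSIZE   # prints 16M
```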

Error Messages

If Windows (64-bit system) reports error messages like

  • OMP: Error #136: Cannot create thread.
  • OMP: System error #8: Not enough storage is available to process this command.

try reducing your OMP_STACKSIZE value if it is "large" (e.g. 1G). This has solved the problem in some tests.