
This page is meant as a record of a few measurements I've been conducting with Scalasca on 512 nodes of BG/Q. The measurements refer to 5 volume-source inversions on a 48^3x96 configuration. Note, however, that this is for a very heavy quark mass: the CG reaches a residual of O(1e-19) in 808 iterations.

Scalasca measurements

Ignoring I/O and source preparation (which together amount to more than 25% of the total time, 12% of it spent writing propagators!), we normalize to the time spent in cg_her, which accounts for 74.33% of the total; all percentages below are relative to this. Of this time, 56% is spent applying Qtm_pm_psi, with the usual overheads of the hopping matrix, and the remaining 44% is spent in the linear algebra routines listed below (the sketch after the list shows where each routine enters the CG iteration).

Starting situation

  • 56% Qtm_pm_psi
  • 11.8% scalar_prod_r
  • 9.8% assign_add_mul_r
  • 12.2% assign_mul_add_r_and_square
  • 9.8% assign_add_mul_r (second call)
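
For orientation, below is a minimal, self-contained toy CG in plain C; the comments indicate which step of the iteration each profiled routine roughly corresponds to. The 3x3 system and all code are illustrative only, and the mapping of steps to tmLQCD routine names is my reading of the profile, not taken from the cg_her source.

```c
/* Toy CG in plain C, not the actual tmLQCD cg_her: it only illustrates
 * which step of the iteration each profiled routine corresponds to.
 * The 3x3 SPD system is made up; tmLQCD names appear only in comments. */
#include <stdio.h>

#define N 3

static void matvec(const double A[N][N], const double x[N], double y[N])
{
  /* y = A x: the role played by Qtm_pm_psi / the hopping matrix */
  for (int i = 0; i < N; i++) {
    y[i] = 0.0;
    for (int j = 0; j < N; j++)
      y[i] += A[i][j] * x[j];
  }
}

static double dot(const double a[N], const double b[N])
{
  /* the role played by scalar_prod_r (an MPI_Allreduce in the parallel code) */
  double s = 0.0;
  for (int i = 0; i < N; i++)
    s += a[i] * b[i];
  return s;
}

int main(void)
{
  const double A[N][N] = { {4, 1, 0}, {1, 3, 1}, {0, 1, 2} };
  const double b[N] = {1, 2, 3};
  double x[N] = {0, 0, 0}, r[N], p[N], Ap[N];

  for (int i = 0; i < N; i++) { r[i] = b[i]; p[i] = r[i]; }  /* x0 = 0 */
  double rr = dot(r, r);

  for (int k = 0; k < 100 && rr > 1e-30; k++) {
    matvec(A, p, Ap);                                   /* Qtm_pm_psi */
    double alpha = rr / dot(p, Ap);                     /* scalar_prod_r */

    for (int i = 0; i < N; i++) x[i] += alpha * p[i];   /* assign_add_mul_r */
    for (int i = 0; i < N; i++) r[i] -= alpha * Ap[i];  /* assign_add_mul_r (second call) */

    double rr_new = dot(r, r);
    double beta = rr_new / rr;
    /* residual norm + search-direction update; presumably what the profile
     * attributes to the fused assign_mul_add_r_and_square */
    for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
    rr = rr_new;
  }

  printf("solution: %g %g %g\n", x[0], x[1], x[2]);
  return 0;
}
```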

In the linear algebra routines that involve collectives, the time breaks down approximately as follows (the sketch after this list illustrates the structure behind these three shares):

  • ~20% waiting in MPI_Allreduce
  • ~50% inside the OpenMP parallel section, doing the actual linear algebra
  • ~30% outside the parallel section which, as far as I understand, is usually a good measure of OpenMP overhead

The routines without collectives look similar, with slightly changed percentages due to the absence of MPI_Allreduce.
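
To make these three categories concrete, here is a hypothetical, minimal global dot product combining an OpenMP reduction with an MPI_Allreduce. It is not the actual scalar_prod_r from tmLQCD, only a sketch of the structure that produces the three shares listed above.

```c
/* Hypothetical sketch, not the actual tmLQCD scalar_prod_r: a global dot
 * product built from an OpenMP reduction followed by an MPI_Allreduce,
 * to show where the three timing shares above come from. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

static double global_dot(const double *a, const double *b, int n)
{
  double local = 0.0, global = 0.0;

  /* "parallel section doing linear algebra" (~50% above) */
#pragma omp parallel for reduction(+:local)
  for (int i = 0; i < n; i++)
    local += a[i] * b[i];

  /* Everything around the parallel region (thread fork/join, serial
   * setup) falls into the "outside of parallel section" share (~30%). */

  /* "waiting for MPI_Allreduce" (~20%): all ranks synchronize here, so
   * load imbalance between ranks shows up as waiting time. */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  return global;
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  const int n = 1 << 20;                 /* arbitrary local vector length */
  double *a = malloc(n * sizeof(double));
  double *b = malloc(n * sizeof(double));
  for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  double s = global_dot(a, b, n);
  if (rank == 0) printf("global dot product: %g\n", s);

  free(a);
  free(b);
  MPI_Finalize();
  return 0;
}
```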

A comparison with a pure MPI run will be helpful in elucidating these points. In the following tests, aspects of the linear algebra routines will be modified and their effect quantified here.
