The performance of many parallel applications depends on loop-level parallelism. However, manually parallelizing every loop can degrade parallel performance, because some loops do not scale well to a large number of threads. In addition, the overhead of manually tuning loop parameters can prevent an application from reaching its maximum parallel performance. We illustrate how machine learning techniques can be applied to address these challenges. In the hpxML project, we develop a framework that automatically captures the static and dynamic information of a system and uses this information to tune loop parameters. Specifically, we have designed a novel method for determining the execution policy, chunk size, and prefetching distance of an HPX loop to achieve the best possible performance by feeding static information captured during compilation and dynamic information captured at runtime to our learning model.
The goal of this project is to combine machine learning methods, compiler transformations, and runtime introspection in order to maximize the use of available resources and to minimize the execution time of loops. Its design and implementation consist of several steps, categorized as follows:
- Special Execution Policies and Parameters
- Designing the Learning Model and Feature Extraction
- Learning Model Implementation
We introduce two new HPX execution policies, `par_if` and `make_prefetcher_policy`, and one new HPX execution policy parameter, `adaptive_chunk_size`, which enable the weights gathered by the learning model to be applied to a loop. The two policies instrument executors so that they can consume the weights produced by a binary logistic regression model, which selects the execution policy corresponding to the optimal code path (sequential or parallel), and by a multinomial logistic regression model, which determines an efficient prefetching distance. The execution policy parameter `adaptive_chunk_size` uses a multinomial logistic regression model to determine an efficient chunk size. We have also created a new ClangTool that recognizes these annotated loops and transforms them into equivalent code that instructs the runtime to apply the described regression models. More details can be found in `/ClangTool`.
We use binary and multinomial logistic regression models to select the optimum execution policy, chunk size, and prefetching distance for certain HPX loops based on both static and dynamic information, with the goal of minimizing execution time. More details can be found in `/logisticRegressionModel`.
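As a minimal sketch of how a binary logistic regression decision of this kind can be evaluated at runtime, the snippet below computes a weighted sum of the loop features, passes it through the logistic function, and thresholds the result. The function name `predict_parallel`, the weight layout (bias term first), the 0.5 threshold, and the convention that `true` means "run in parallel" are assumptions made for illustration; they are not taken from the hpxML sources.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of hpxML): returns true when the model
// predicts that the parallel code path should be taken.
// Assumed layout: weights[0] is the bias, weights[i + 1] pairs with features[i].
bool predict_parallel(const std::vector<double>& weights,
                      const std::vector<double>& features)
{
    assert(weights.size() == features.size() + 1);

    // Linear score z = w0 + sum_i w_(i+1) * x_i.
    double z = weights[0];
    for (std::size_t i = 0; i < features.size(); ++i)
        z += weights[i + 1] * features[i];

    // The logistic function maps the score to a probability in (0, 1).
    const double probability = 1.0 / (1.0 + std::exp(-z));

    // Thresholding at 0.5 is equivalent to testing z >= 0.
    return probability >= 0.5;
}
```

In this sketch the weights would come from offline training and the features from static analysis and runtime introspection; only the final thresholding step runs when the loop is reached.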
We have created three new techniques that implement binary and multinomial logistic regression models at runtime:
- Predicting Execution Policy: We propose a new function `seq_par` (`/hpxml/hpx/parallel/seq_or_par.hpp`) that passes the extracted features for a loop that uses `par_if` as its execution policy. In this technique, the Clang compiler automatically adds extra lines to the user's code, as shown below, which allow the runtime system to decide whether to execute a loop sequentially or in parallel based on the return value of `seq_par`.

      Before compilation:
      for_each(par_if, range.begin(), range.end(), lambda);

      After compilation:
      if (seq_par(EXTRACTED_STATICE_DYNAMIC_FEATURES))
          for_each(seq, range.begin(), range.end(), lambda);
      else
          for_each(par, range.begin(), range.end(), lambda);
      ...

  If the output is false, the loop executes sequentially; if the output is true, the loop executes in parallel. This function takes the weights extracted during compilation and the values polled at runtime as inputs. Both static and dynamic features of the loop are considered in this technique.
- Predicting Efficient Chunk Size: We propose a new function `chunk_size_determination` (`/hpxml/hpx/parallel/chunk_size_determination.hpp`) that passes the extracted features for a loop that uses `adaptive_chunk_size` as its execution policy parameter. In this technique, the Clang compiler automatically changes the user's code as shown below, which allows the runtime system to choose an optimum chunk size based on the output of `chunk_size_determination`. Both static and dynamic features of the loop are considered in this technique.

      Before compilation:
      for_each(policy.with(adaptive_chunk_size()), range.begin(), range.end(), lambda);

      After compilation:
      for_each(policy.with(chunk_size_determination(EXTRACTED_STATICE_DYNAMIC_FEATURES)),
               range.begin(), range.end(), lambda);
      ...
- Predicting Efficient Prefetching Distance: We propose a new function `prefetching_distance_determination` (`/hpxml/hpx/parallel/prefetching_distance_determination.hpp`) that passes the extracted features for a loop that uses `make_prefetcher_policy` as its execution policy. In this technique, the Clang compiler automatically changes the user's code as shown below, which allows the runtime system to choose an optimum prefetching distance based on the output of `prefetching_distance_determination`. Both static and dynamic features of the loop are considered in this technique.

      Before compilation:
      for_each(make_prefetcher_policy(policy, prefetching_distance_factor, ...),
               range.begin(), range.end(), lambda);

      After compilation:
      for_each(make_prefetcher_policy(policy,
                   prefetching_distance_determination(EXTRACTED_STATICE_DYNAMIC_FEATURES), ...),
               range.begin(), range.end(), lambda);
      ...
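The chunk size and prefetching distance techniques above both rely on multinomial logistic regression, which picks one of several candidate values. A rough sketch of that selection step, under stated assumptions, might look as follows. The names `predict_class` and `pick_candidate`, the one-weight-vector-per-class layout with the bias first, and the candidate values used in the usage note are all hypothetical and are not taken from the hpxML sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of hpxML): returns the index of the class
// with the highest linear score. Because softmax is monotonic in the score,
// this is also the class with the highest predicted probability.
// Assumed layout: weights[k][0] is the bias of class k,
// weights[k][i + 1] pairs with features[i].
std::size_t predict_class(const std::vector<std::vector<double>>& weights,
                          const std::vector<double>& features)
{
    assert(!weights.empty());

    std::size_t best = 0;
    double best_score = 0.0;
    for (std::size_t k = 0; k < weights.size(); ++k) {
        assert(weights[k].size() == features.size() + 1);

        double score = weights[k][0];
        for (std::size_t i = 0; i < features.size(); ++i)
            score += weights[k][i + 1] * features[i];

        // Keep the first class seen, then any class with a strictly higher score.
        if (k == 0 || score > best_score) {
            best = k;
            best_score = score;
        }
    }
    return best;
}

// Maps the predicted class to a concrete value, for example a chunk size
// or a prefetching distance, taken from a caller-supplied candidate list.
std::size_t pick_candidate(const std::vector<std::size_t>& candidates,
                           const std::vector<std::vector<double>>& weights,
                           const std::vector<double>& features)
{
    assert(candidates.size() == weights.size());
    return candidates[predict_class(weights, features)];
}
```

For instance, a caller could pass an illustrative candidate list such as `{16, 128, 1024}` together with three trained weight vectors and use the returned value as the chunk size or prefetching distance of the loop.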
More details can be found in our recently published paper:
http://stellar.cct.lsu.edu/pubs/khatami_espm2_2017.pdf
- Install HPX (see instructions in `/hpxml`)
- Install Clang 4.0.0 and add our new ClangTool (see instructions in `/ClangTool`)
- Design the learning model (see instructions in `/logisticRegressionModel`)
Don't forget to join our IRC channel `#ste||ar` if you need any help :)