[NNPA] Set device placement by using a JSON file (#2536)
* Load/Save configuration by using JSON files

Signed-off-by: Tung D. Le <[email protected]>

---------

Signed-off-by: Tung D. Le <[email protected]>
tungld authored Oct 4, 2023
1 parent 3757344 commit d6767a5
Showing 15 changed files with 555 additions and 34 deletions.
170 changes: 170 additions & 0 deletions docs/DevicePlacement-NNPA.md
@@ -0,0 +1,170 @@
<!--- SPDX-License-Identifier: Apache-2.0 -->

# Device placement

Device placement is how the compiler places an operation on the CPU or the NNPA.

## Query device placement configuration

There are two ways to know which device an operation is placed on:
- Using `onnx-mlir --EmitONNXIR --maccel=NNPA model.onnx`, or
- Using `onnx-mlir --maccel=NNPA --nnpa-save-device-placement-file=cfg.json model.onnx`.

1. Using `--EmitONNXIR --maccel=NNPA`

When using the `--EmitONNXIR --maccel=NNPA` options, each operation in the generated IR is annotated with a `device` attribute showing which device the operation is placed on. There are three possible values for `device`:
- "": the operation may be on CPU or NNPA depending on optimizations in the compiler.
- "nnpa": the operation is on NNPA.
- "cpu": the operation is on CPU.

Below is an example of the output of `--EmitONNXIR --maccel=NNPA`:
```mlir
%0 = "onnx.Relu"(%arg0) {onnx_node_name = "Relu_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%1 = "onnx.Relu"(%0) {device="cpu", onnx_node_name = "Relu_1"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%2 = "onnx.Relu"(%1) {onnx_node_name = "Relu_2"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%3 = "onnx.Sigmoid"(%2) {device="nnpa", onnx_node_name = "Sigmoid_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
```

2. Using `--nnpa-save-device-placement-file=cfg.json`

This option saves the device placement configuration into a JSON file. It is convenient when users want to know the device placement without interrupting the compilation.

The JSON file contains a list of operation records. Each record includes three key-value pairs whose keys are:
- "device": similar to the `device` attribute of the operation.
- "node_type": the ONNX node type, e.g. `onnx.Conv`, `onnx.MatMul`.
- "onnx_node_name": a string denoting the ONNX node name.

Below is one example of a JSON file:
```json
{
"device_placement": [
{
"device":"nnpa",
"node_type":"onnx.Relu",
"onnx_node_name":"Relu_0"
},
{
"device":"cpu",
"node_type":"onnx.Relu",
"onnx_node_name":"Relu_1"},
{
"device":"nnpa",
"node_type":"onnx.Relu",
"onnx_node_name":"Relu_2"
},
{
"device":"nnpa",
"node_type":"onnx.Sigmoid",
"onnx_node_name":"Sigmoid_0"
}
]
}
```

## Set device placement manually

We allow users to force an operation to run on a specific device. However, at this moment, only placement on the CPU is guaranteed to succeed. In other words, even when `device=nnpa` is specified, it is not guaranteed that the operation will run on the NNPA.

There are two ways to change the device of an operation:
- by editing the output of `--EmitONNXIR --maccel=NNPA` directly and compiling again, or
- by passing a JSON file for device placement to the compiler via `--nnpa-load-device-placement-file=cfg.json`.

The former option is straightforward: just change the value of the `device` attribute of an operation, for example, from `device=nnpa` to `device=cpu`.
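For instance, to force the `Sigmoid_0` operation from the earlier example onto the CPU, one would edit its `device` attribute by hand before recompiling (a minimal illustration based on the IR shown above):
```mlir
%3 = "onnx.Sigmoid"(%2) {device="cpu", onnx_node_name = "Sigmoid_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
```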

For the latter option, users can obtain a template file from `--nnpa-save-device-placement-file` and use it as the starting point for modification.
We use the C++ `std::regex_match` function to match operations against records based on `node_type` and `onnx_node_name`. Both `node_type` and `onnx_node_name` must match.
The JSON file contains a list of records, and operations are matched against the records in order, so the order of the records matters: the device of an operation is set by the first record it matches. Once an operation has matched a record and been assigned a device, it will not be assigned again even if it matches later records. If an operation does not match a record but matches a later one, its device is still set by that later record.
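The following is a minimal C++ sketch of these first-match semantics; `Record`, `Op`, and `matchDevice` are hypothetical names used only for illustration, not the compiler's actual implementation:
```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical structures mirroring one JSON record and one ONNX operation.
struct Record {
  std::string device;       // "cpu", "nnpa", or ""
  std::string nodeType;     // regex, e.g. "onnx.Relu" or "onnx.*"
  std::string onnxNodeName; // regex, e.g. "Relu_(1|2)" or ".*"
};

struct Op {
  std::string nodeType;
  std::string onnxNodeName;
};

// Returns the device of the first record whose two patterns both fully
// match the operation (std::regex_match matches the whole string), or
// std::nullopt if no record matches.
std::optional<std::string> matchDevice(
    const std::vector<Record> &records, const Op &op) {
  for (const Record &r : records) {
    if (std::regex_match(op.nodeType, std::regex(r.nodeType)) &&
        std::regex_match(op.onnxNodeName, std::regex(r.onnxNodeName)))
      return r.device; // first match wins; later records are ignored
  }
  return std::nullopt; // leave the device to the compiler's optimizations
}
```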

Below are some examples for the latter option. Given an input program:
```mlir
func.func @test_load_config_file_all_on_cpu(%arg0: tensor<?x?x?xf32>) -> tensor<?x?x?xf32> {
%0 = "onnx.Relu"(%arg0) {onnx_node_name = "Relu_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%1 = "onnx.Relu"(%0) {onnx_node_name = "Relu_1"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%2 = "onnx.Relu"(%1) {onnx_node_name = "Relu_2"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%3 = "onnx.Sigmoid"(%2) {onnx_node_name = "Sigmoid_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
onnx.Return %3 : tensor<?x?x?xf32>
}
```

1. Schedule all operations to run on CPU:
```json
{
"device_placement": [
{
"device": "cpu",
"node_type": "onnx.*",
"onnx_node_name": ".*"
}
]
}
```
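With this configuration, every operation is assigned device "cpu"; the annotated IR would look roughly as follows:
```mlir
%0 = "onnx.Relu"(%arg0) {device="cpu", onnx_node_name = "Relu_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%1 = "onnx.Relu"(%0) {device="cpu", onnx_node_name = "Relu_1"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%2 = "onnx.Relu"(%1) {device="cpu", onnx_node_name = "Relu_2"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%3 = "onnx.Sigmoid"(%2) {device="cpu", onnx_node_name = "Sigmoid_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
```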

2. Schedule all Relu operations to run on CPU: the three Relu operations are assigned device "cpu", while `Sigmoid_0` is left unassigned for the compiler to decide:
```json
{
"device_placement": [
{
"device": "cpu",
"node_type": "onnx.Relu",
"onnx_node_name": ".*"
}
]
}
```
3. Schedule operations using `onnx_node_name`: here we use a regex to choose only the `Relu_1` and `Relu_2` operations, while an exact match is used for `onnx.Sigmoid`. As a result, `Relu_1` and `Relu_2` get device "cpu", `Sigmoid_0` gets device "nnpa", and `Relu_0` is left unassigned.
```json
{
"device_placement": [
{
"device": "cpu",
"node_type": "onnx.Relu",
"onnx_node_name": "Relu_(1|2)"
},
{
"device": "nnpa",
"node_type": "onnx.Sigmoid",
"onnx_node_name": "Sigmoid_0"
}
]
}
```

4. The first record matches no operation because `std::regex_match` requires a full-string match and no operation has node type exactly `Relu` (the actual node type is `onnx.Relu`). Hence, only `onnx.Sigmoid` is assigned a device.
```json
{
"device_placement": [
{
"device": "cpu",
"node_type": "Relu",
"onnx_node_name": "Relu_(1|2)"
},
{
"device": "cpu",
"node_type": "onnx.Sigmoid",
"onnx_node_name": "Sigmoid_0"
}
]
}
```

5. We have two overlapping records, both matching on `onnx.Relu`. In this case, only the first matched record sets the device. Thus, `Relu_0` and `Relu_1` get device "cpu" by matching the first record, while `Relu_2` gets device "nnpa" by matching the third record.
```json
{
"device_placement": [
{
"device": "cpu",
"node_type": "onnx.Relu",
"onnx_node_name": "Relu_(0|1)"
},
{
"device": "nnpa",
"node_type": "onnx.Sigmoid",
"onnx_node_name": "Sigmoid_0"
},
{
"device": "nnpa",
"node_type": "onnx.Relu",
"onnx_node_name": "Relu_(1|2)"
}
]
}
```
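Under this configuration, the annotated IR would look roughly as follows:
```mlir
%0 = "onnx.Relu"(%arg0) {device="cpu", onnx_node_name = "Relu_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%1 = "onnx.Relu"(%0) {device="cpu", onnx_node_name = "Relu_1"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%2 = "onnx.Relu"(%1) {device="nnpa", onnx_node_name = "Relu_2"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%3 = "onnx.Sigmoid"(%2) {device="nnpa", onnx_node_name = "Sigmoid_0"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
```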
23 changes: 15 additions & 8 deletions src/Accelerators/NNPA/Compiler/NNPACompilerOptions.cpp
@@ -26,14 +26,6 @@ llvm::cl::opt<NNPAEmissionTargetType> nnpaEmissionTarget(
clEnumVal(EmitZNONE, "Do not emit NNPA-related target (default)")),
llvm::cl::init(EmitZNONE), llvm::cl::cat(OnnxMlirOptions));

llvm::cl::list<std::string> execNodesOnCpu{"execNodesOnCpu",
llvm::cl::desc("Comma-separated list of node names in an onnx graph. The "
"specified nodes are forced to run on the CPU instead of "
"using the zDNN. The node name is an optional attribute "
"in onnx graph, which is `onnx_node_name` in ONNX IR."),
llvm::cl::CommaSeparated, llvm::cl::ZeroOrMore,
llvm::cl::cat(OnnxMlirOptions)};

llvm::cl::opt<bool> nnpaClipToDLFloatRange("nnpa-clip-to-dlfloat-range",
llvm::cl::desc("Clip CPU tensors to dlfloat range before stickification to "
"avoid out-of-range. Only clip Softmax inputs at this "
@@ -48,6 +40,21 @@ llvm::cl::opt<bool> nnpaEnableZHighToOnnx("enable-zhigh-to-onnx",
"level. Default is true."),
llvm::cl::init(true), llvm::cl::cat(OnnxMlirOptions));

llvm::cl::opt<std::string> nnpaLoadDevicePlacementFile{
"nnpa-load-device-placement-file",
llvm::cl::desc(
"Load device placement configuration from a JSON file. To "
"have a template for the JSON file, use "
"-save-device-placement-file=cfg.json. Note that we can use regex for "
"string values in the JSON file to match operations. The compiler uses "
"C++ std::regex_match function for matching."),
llvm::cl::init(""), llvm::cl::cat(OnnxMlirOptions)};

llvm::cl::opt<std::string> nnpaSaveDevicePlacementFile{
"nnpa-save-device-placement-file",
llvm::cl::desc("Save device placement configuration to a JSON file."),
llvm::cl::init(""), llvm::cl::cat(OnnxMlirOptions)};

llvm::cl::opt<bool> nnpaEnableZHighPerfModel("enable-zhigh-perf-model",
llvm::cl::desc("Enabling performance cost model to estimate if ONNX "
"operations will be faster on the NNPA or the CPU. Works "
3 changes: 2 additions & 1 deletion src/Accelerators/NNPA/Compiler/NNPACompilerOptions.hpp
@@ -43,10 +43,11 @@ typedef enum {

extern llvm::cl::OptionCategory OnnxMlirOptions;
extern llvm::cl::opt<onnx_mlir::NNPAEmissionTargetType> nnpaEmissionTarget;
extern llvm::cl::list<std::string> execNodesOnCpu;
extern llvm::cl::opt<bool> nnpaClipToDLFloatRange;
extern llvm::cl::opt<bool> nnpaEnableZHighToOnnx;
extern llvm::cl::opt<bool> nnpaEnableZHighPerfModel;
extern llvm::cl::opt<bool> profileZHighIR;
extern llvm::cl::opt<std::string> nnpaLoadDevicePlacementFile;
extern llvm::cl::opt<std::string> nnpaSaveDevicePlacementFile;

} // namespace onnx_mlir
3 changes: 2 additions & 1 deletion src/Accelerators/NNPA/Compiler/NNPACompilerUtils.cpp
@@ -151,7 +151,8 @@ void addPassesNNPA(mlir::OwningOpRef<mlir::ModuleOp> &module,
// LLVM_DEBUG(llvm::dbgs() << "Adding NNPA passes" << std::endl;);
if (emissionTarget >= EmitONNXIR) {
addONNXToMLIRPasses(pm, /*target CPU*/ maccel.empty());
pm.addPass(onnx_mlir::createDevicePlacementPass(nnpaEnableZHighPerfModel));
pm.addPass(onnx_mlir::createDevicePlacementPass(nnpaLoadDevicePlacementFile,
nnpaSaveDevicePlacementFile, nnpaEnableZHighPerfModel));
}

if (emissionTarget >= EmitMLIR) {
