Capacity aware partitioning #22766

yuslepukhin · 2024-11-07T19:52:21Z

Description

Allow users to specify resource constraints per Execution Provider (EP).
Currently, models that do not fit into device memory fail with an error.

This PR lays the groundwork for EP-specific, resource-constrained graph
partitioning, subject to incremental feature additions.

Partitioning in this context means assigning graph nodes to a specific device (Execution Provider)
up to a certain limit that is either automatically inferred or provided by configuration.

In this implementation, we stop assigning nodes to CUDA once we reach the specified memory limit.
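The stopping rule can be illustrated with a minimal sketch. This is not the code in this PR: `NodeCost`, `AssignWithBudget`, and the string EP labels are hypothetical names, and the sketch assumes the nodes arrive in topological order with a known per-node byte estimate.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-node record; not an ONNX Runtime type.
struct NodeCost {
  std::string name;
  std::size_t bytes;  // estimated device memory the node needs
};

// Walks the nodes in order and assigns them to "CUDA" while the running
// total stays within limit_bytes. Once a node does not fit, that node and
// every later node fall back to "CPU".
std::vector<std::string> AssignWithBudget(const std::vector<NodeCost>& nodes,
                                          std::size_t limit_bytes) {
  std::vector<std::string> placement;
  placement.reserve(nodes.size());
  std::size_t used = 0;
  bool budget_exhausted = false;
  for (const NodeCost& node : nodes) {
    // used never exceeds limit_bytes, so the subtraction cannot underflow.
    if (!budget_exhausted && node.bytes <= limit_bytes - used) {
      used += node.bytes;
      placement.push_back("CUDA");
    } else {
      budget_exhausted = true;
      placement.push_back("CPU");
    }
  }
  return placement;
}
```

The sticky `budget_exhausted` flag is a design choice of this sketch: it keeps the device-side portion contiguous instead of packing small later nodes into the leftover budget.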

This allows users to run models on devices with limited memory or other limited resources and
offload parts of the graph to the CPU or other EPs as configured.

The PR also introduces the ability to profile and save resource consumption on a per-node basis.
The results of one or more runs are saved into a CSV file which can then be loaded to assist
partitioning.
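As a rough illustration of loading such a file, here is a sketch that assumes a two-column `node_name,bytes` layout. That layout, and the `LoadNodeStats` name, are assumptions made for the example; the actual columns written by this PR may differ.

```cpp
#include <cstddef>
#include <exception>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Parses lines of the assumed form "node_name,bytes" into a lookup table.
// Lines without a comma or with a non-numeric value (such as a header row)
// are skipped.
std::unordered_map<std::string, std::size_t> LoadNodeStats(std::istream& in) {
  std::unordered_map<std::string, std::size_t> stats;
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
    const std::size_t comma = line.rfind(',');
    if (comma == std::string::npos) continue;
    const std::string name = line.substr(0, comma);
    const std::string value = line.substr(comma + 1);
    try {
      std::size_t pos = 0;
      const unsigned long long bytes = std::stoull(value, &pos);
      if (pos != value.size()) continue;  // trailing characters after the number
      stats[name] = static_cast<std::size_t>(bytes);
    } catch (const std::exception&) {
      continue;  // not a number, for example a header row
    }
  }
  return stats;
}
```

A partitioner could then look up each node by name and use the recorded value in place of a static estimate.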

Model architecture-based partitioning (for example, placing N transformer blocks on the GPU and the embeddings on the CPU) is not implemented in this PR but is planned for the future.

Motivation and Context

We want to allow models to run in constrained environments.

Pending

Annotation assisted partitioning

include/onnxruntime/core/graph/graph_viewer.h

onnxruntime/test/framework/session_state_test.cc

tianleiwu · 2024-11-08T05:38:38Z

How about the intermediate memory usage (workspace) for each node? That is usually unknown during partitioning, and even unknown during inference since op has no interface to tell its workspace size right now. For example, MultiHeadAttention op might call different cuda kernels (flash attention, cutlass fmha, tensorrt fmha kernel or unfused kernel), each has different memory consumption.

yuslepukhin · 2024-11-08T17:37:11Z

> How about the intermediate memory usage (workspace) for each node? That is usually unknown during partitioning, and even unknown during inference since op has no interface to tell its workspace size right now.

This is true. The function currently accounts for initializers and inputs. It cannot account for temporary allocations because those are made at inference time, and partitioning takes place well before kernels are instantiated.
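To make the scope of that estimate concrete, here is a small sketch of a static cost that counts only initializers and inputs. `TensorInfo`, `TensorBytes`, and `EstimateNodeBytes` are hypothetical names, not functions from this PR. The sketch assumes shapes are given as dimension lists where a negative value marks a symbolic dimension, and it ignores integer overflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of a tensor known at partitioning time.
struct TensorInfo {
  std::vector<std::int64_t> dims;  // a negative entry marks an unknown dimension
  std::size_t element_size;        // bytes per element, e.g. 4 for float
};

// Returns the byte size of a tensor, or 0 when any dimension is unknown.
std::size_t TensorBytes(const TensorInfo& t) {
  std::size_t count = 1;
  for (std::int64_t d : t.dims) {
    if (d < 0) return 0;  // cannot be sized statically
    count *= static_cast<std::size_t>(d);
  }
  return count * t.element_size;
}

// Static estimate for one node: initializers plus inputs only.
// Temporary (workspace) allocations are deliberately not represented,
// because they are only known once kernels run.
std::size_t EstimateNodeBytes(const std::vector<TensorInfo>& initializers,
                              const std::vector<TensorInfo>& inputs) {
  std::size_t total = 0;
  for (const TensorInfo& t : initializers) total += TensorBytes(t);
  for (const TensorInfo& t : inputs) total += TensorBytes(t);
  return total;
}
```

Any kernel that allocates workspace at run time will consume more than this estimate, which is the gap raised in the question above.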

The approach of computing memory patterns cannot be taken here since that relies on the presence of a runnable model which we do not have today in a constrained environment.

This PR is still at the experimental stage. I envision that most of the burden will be placed on each EP's GetCapability, since EPs are in the best position to know their constraints. Those constraints may involve more than memory, and there may be additional considerations to account for in the Resource Count.

The simplest way is to add an additional if/else to enumerate the kernels and attempt to infer the amount of temporary space. However, that creates an additional maintenance burden, since we already have plenty of such places in optimizers and elsewhere where we need to make sure that changes to individual kernels are reflected.

However, it would still work in its current form. One can try one setting and then lower it if the consumption is too much.

Another idea would be to run the model beforehand and record the consumption. Then use that trace to set the limit in the constrained environment.
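The "try one setting and then lower it" workflow could be automated along these lines. This is only a sketch: `FindWorkingLimit` is a hypothetical helper, and `try_run` stands in for whatever the caller does to create a session with a given limit and run it, returning false on an out-of-memory failure.

```cpp
#include <cstddef>
#include <functional>

// Halves the limit until try_run succeeds.
// Returns the first limit that works, or 0 if no limit at or above
// min_limit_bytes succeeds.
std::size_t FindWorkingLimit(std::size_t start_limit_bytes,
                             std::size_t min_limit_bytes,
                             const std::function<bool(std::size_t)>& try_run) {
  std::size_t limit = start_limit_bytes;
  while (limit > 0 && limit >= min_limit_bytes) {
    if (try_run(limit)) return limit;
    limit /= 2;  // too much consumption at this setting, lower it and retry
  }
  return 0;
}
```

A recorded trace from a prior run, as suggested above, would let the caller pick a good `start_limit_bytes` instead of searching blindly.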

tianleiwu · 2024-11-11T21:52:42Z

> The function currently accounts for initializers and inputs. It cannot account for temporary allocations

If so, I think the feature is not very helpful for vision models or LLMs because of this limitation.

> Another idea would be to run the model beforehand and record the consumption. Then use that trace to set the limit in the constrained environment.

That's a good idea, and it would be great if we could support that use case.

BTW, a general way to help with capacity constraints would be to let users manually configure the location of initializers and inputs. This could be extended to support offloading initializers to the CPU and loading them onto the GPU only when needed.

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/core/providers/cuda/cuda_execution_provider.cc

onnxruntime/core/framework/graph_partitioner.cc

onnxruntime/test/framework/session_state_test.cc

onnxruntime/core/framework/execution_frame.h

onnxruntime/core/framework/execution_frame.cc

onnxruntime/core/framework/sequential_executor.cc

onnxruntime/core/framework/session_state.h

onnxruntime/core/framework/resource_accountant.cc

+
+void NodeStatsRecorder::ReportNodeStats(const std::string& node_name, const NodeAllocationStats& stats) {
+  std::lock_guard lock(impl_->mut_);
+  auto result = impl_->node_stats_.emplace(node_name, stats);
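The hunk above is cut off after the `emplace` call. One plausible continuation, in line with the commit notes about synchronization and keeping the maximum consumption, is sketched below. `SketchStats` and `SketchRecorder` are hypothetical stand-ins with a single field; they do not reproduce the real `NodeAllocationStats` or `NodeStatsRecorder` from this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stats record with one field.
struct SketchStats {
  std::size_t total_bytes;
};

// Thread-safe recorder that keeps the largest value reported per node.
class SketchRecorder {
 public:
  void Report(const std::string& node_name, const SketchStats& stats) {
    std::lock_guard<std::mutex> lock(mut_);
    auto result = node_stats_.emplace(node_name, stats);
    if (!result.second) {
      // The node was already present: keep the maximum seen across runs.
      SketchStats& existing = result.first->second;
      existing.total_bytes = std::max(existing.total_bytes, stats.total_bytes);
    }
  }

  // Returns the recorded maximum for a node, or 0 if it was never reported.
  std::size_t Get(const std::string& node_name) const {
    std::lock_guard<std::mutex> lock(mut_);
    auto it = node_stats_.find(node_name);
    return it == node_stats_.end() ? 0 : it->second.total_bytes;
  }

 private:
  mutable std::mutex mut_;
  std::unordered_map<std::string, SketchStats> node_stats_;
};
```

Keeping the maximum rather than the last value means that repeated runs with different inputs converge on a conservative per-node figure.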


onnxruntime/test/framework/inference_session_test.cc

Implement GetSizeFromTensorTypeProo

Wire in accounting

Make CUDA EP resource aware and account on assignment

Fix missing accountant for Ort format

Remove redundant functions

Remove unnecessary interface

Fix DML issue, minor fixes

Fix alert

DEMO changes

Implement node memory stats collection

Place container in the session.

Support nested graphs

Add synchronization

Update stats for the max consumption.

Introduce input sizes computation.

include/onnxruntime/core/graph/indexed_sub_graph.h

onnxruntime/test/testdata/transformers/tiny_gpt2_beamsearch_node_stats.txt

onnxruntime/core/graph/graph.cc

onnxruntime/core/framework/sequential_executor.cc

onnxruntime/core/framework/resource_accountant.cc

onnxruntime/core/framework/graph_partitioner.cc

Craigacp · 2025-02-12T23:39:57Z

What's the timeline for allowing more fine-grained op placement after this PR has landed? We're hitting placement issues with tokenization ops that are required to be on CPU, which causes trouble when the rest of the graph is placed on GPU, and we'd like to be able to pin more of the tokenization subgraph onto CPU. More information is given here - #23154.

onnxruntime/test/testdata/transformers/tiny_gpt2_beamsearch_node_stats.txt

onnxruntime/core/providers/cuda/cuda_execution_provider.cc

onnxruntime/core/framework/resource_accountant.cc

onnxruntime/core/framework/sequential_executor.cc

include/onnxruntime/core/framework/resource_accountant.h

yuslepukhin changed the title ~~[DRAFT] Graph aware partitioning~~ [DRAFT] Resource aware partitioning Nov 7, 2024

yuslepukhin changed the title ~~[DRAFT] Resource aware partitioning~~ [DRAFT] Capacity aware partitioning Nov 7, 2024

yuslepukhin commented Nov 7, 2024

include/onnxruntime/core/graph/graph_viewer.h (outdated, resolved)

yuslepukhin commented Nov 7, 2024

include/onnxruntime/core/graph/graph_viewer.h (outdated, resolved)

github-advanced-security bot found potential problems Nov 7, 2024

onnxruntime/test/framework/session_state_test.cc (fixed)

onnxruntime/test/framework/session_state_test.cc (fixed)

onnxruntime/test/framework/session_state_test.cc (fixed)

yuslepukhin force-pushed the yuslepukhin/graph_constrained_paritioning branch 4 times, most recently from 6244735 to b2bb641, January 16, 2025 23:00

github-actions bot reviewed Jan 21, 2025

onnxruntime/core/providers/cuda/cuda_execution_provider.cc (resolved)

github-advanced-security bot found potential problems Jan 23, 2025

onnxruntime/core/framework/graph_partitioner.cc (fixed)

github-advanced-security bot found potential problems Jan 23, 2025

onnxruntime/test/framework/session_state_test.cc (fixed)

onnxruntime/test/framework/session_state_test.cc (fixed)

onnxruntime/test/framework/session_state_test.cc (fixed)

yuslepukhin commented Jan 24, 2025

onnxruntime/core/framework/execution_frame.h (outdated, resolved)

yuslepukhin commented Jan 24, 2025

onnxruntime/core/framework/execution_frame.cc (outdated, resolved)

yuslepukhin commented Jan 24, 2025

onnxruntime/core/framework/sequential_executor.cc (outdated, resolved)

yuslepukhin commented Jan 24, 2025

onnxruntime/core/framework/sequential_executor.cc (outdated, resolved)

yuslepukhin commented Jan 24, 2025

onnxruntime/core/framework/sequential_executor.cc (outdated, resolved)

yuslepukhin commented Jan 24, 2025

onnxruntime/core/framework/session_state.h (resolved)

github-advanced-security bot found potential problems Jan 24, 2025

onnxruntime/test/framework/inference_session_test.cc (fixed)

yuslepukhin force-pushed the yuslepukhin/graph_constrained_paritioning branch from 7596ac5 to cb2277d, January 24, 2025 18:51

yuslepukhin added 2 commits January 27, 2025 14:37

Merge branch 'main' into yuslepukhin/graph_constrained_paritioning

3d805b0

yuslepukhin force-pushed the yuslepukhin/graph_constrained_paritioning branch from 91dfeca to 3d805b0, January 27, 2025 22:38

yuslepukhin added 3 commits January 31, 2025 11:45

Adjust CSV parsing

b8f6b7b

Merge branch 'main' into yuslepukhin/graph_constrained_paritioning

59a6ec7

Tests pass

b1d1467

Make threshold optional

d750406