Feature Request: Support AWS inferentia inf2 instances #8954

Closed · 4 tasks done
virajkanwade opened this issue Aug 9, 2024 · 4 comments
Labels: enhancement (New feature or request), stale

Comments


virajkanwade commented Aug 9, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

AWS inf2 instances (powered by Inferentia2 accelerators) are supposed to provide better inference performance at lower cost. It would be great for llama.cpp to support these instances.

Motivation

#2109
ollama/ollama#6143

Possible Implementation

No response

virajkanwade added the enhancement (New feature or request) label Aug 9, 2024
jeroen-mostert (Contributor) commented:

None of this stuff appears to be open source, and the interface to Amazon's devices doesn't appear to be documented. You're supposed to use their proprietary compiler to generate executables that can use these devices, and Amazon provides things like PyTorch-compatible wrappers to use them from Python.

The high-level C++ APIs only manage loading these executables and feeding them tensors. If you wanted to implement your own inference from scratch and run custom software like llama.cpp, you would need to either reverse engineer the device driver interface (and/or the compiler, possibly using the Python code as a guide), or Amazon would need to release documentation for it. This is not a "bring your own framework" setup, at least not yet. The fact that Amazon manages most of this for you, and only has to worry about compatibility with its own software, probably contributes to why they can offer those cheaper rates (aside from the hardware itself, of course).

guilt (Contributor) commented Aug 13, 2024

This was also discussed in #2109.

I think that getting access to aws-neuronx-dkms (the kernel driver), aws-neuronx-collectives + aws-neuronx-runtime-lib (the runtime), and aws-neuronx-tools (the tools) on an instance where this can be built is essential.

Currently one can compile a NEFF program with the neuronx-cc CLI, and it may help to pre-compile these in a build step for the essential GGML ops, much like the .cu files for CUDA. If the CLI is present, we feed it a set of .hlo/.json/.pb/.proto files, as described in their documentation, to produce .neff files. These pre-compiled files can then be loaded via libnrt and executed (see the sketch below).
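To make the load/execute half concrete, here is an untested sketch in C, assuming the libnrt API roughly as the Neuron Runtime docs describe it. The header path, enum names, and the neuronx-cc flags in the comment are assumptions that may vary by SDK version:

```c
/*
 * Untested sketch: load a precompiled NEFF and prepare to execute it via
 * libnrt. Signatures follow the AWS Neuron Runtime API docs, but the header
 * path and enum names here are assumptions and may differ per SDK version.
 *
 * The NEFF itself would come from something like:
 *   neuronx-cc compile add.hlo --framework XLA --output add.neff
 * (exact flags per the neuronx-cc documentation).
 */
#include <stdio.h>
#include <stdlib.h>
#include <nrt/nrt.h>   // from aws-neuronx-runtime-lib (assumed path)

int main(void) {
    // Read the compiled NEFF into memory.
    FILE *f = fopen("add.neff", "rb");
    if (!f) { perror("add.neff"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *neff = malloc((size_t) size);
    if (fread(neff, 1, (size_t) size, f) != (size_t) size) { return 1; }
    fclose(f);

    // Bring up the runtime (no framework wrapper) and load the model
    // onto a single NeuronCore.
    nrt_init(NRT_FRAMEWORK_TYPE_NO_FW, "2.0", "");
    nrt_model_t *model = NULL;
    nrt_load(neff, (size_t) size, 0 /* start NC */, 1 /* NC count */, &model);

    // Inputs/outputs would be allocated with nrt_tensor_allocate(),
    // grouped into nrt_tensor_set_t, and run via nrt_execute(model, in, out);
    // omitted here because the tensor names depend on the compiled HLO.

    nrt_unload(model);
    nrt_close();
    free(neff);
    return 0;
}
```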

For reference, vLLM handles this by installing the compiler and Torch integration packages (neuronx-cc and torch-neuronx) from the Neuron pip repository, then compiling its kernels from Python to HLO, and eventually to NEFF, for inference. This is described in the vLLM Neuron installation documentation.

If someone is willing to build and maintain this functionality as an easy-to-consume build system + library for compiling and loading arbitrary kernels, and to maintain the ggml_neuronx runtime and the relevant kernels (such as FlashAttention), that would keep this support healthy and up-to-date.

CC: @ggerganov

ggerganov (Owner) commented:

The best way to get things going is to make a PoC of a ggml_neuronx backend that handles some basic operations like addition or multiplication.
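For orientation, the per-op dispatch of such a PoC might look something like the following skeleton. This is illustrative only: the real ggml backend interface (ggml_backend_i, in ggml-backend-impl.h) has many more entry points, and every ggml_neuronx_* name here is invented:

```c
// Illustrative skeleton of where a ggml_neuronx backend could hook in.
// Not linkable as-is: ggml_neuronx_run and the kernel handles are
// hypothetical placeholders for NEFFs preloaded via libnrt (see above).
#include <stdbool.h>
#include "ggml.h"

// Hypothetical handle to a NEFF preloaded via libnrt.
typedef struct ggml_neuronx_kernel ggml_neuronx_kernel;
extern ggml_neuronx_kernel *g_neff_add; // compiled "add" program
extern ggml_neuronx_kernel *g_neff_mul; // compiled "mul" program

// Invented helper: copy src tensors to the device, run the NEFF,
// and copy the result back into dst->data.
bool ggml_neuronx_run(ggml_neuronx_kernel *k, struct ggml_tensor *dst);

// Per-node dispatch for the PoC: only ADD and MUL are offloaded;
// everything else reports "unsupported" so it stays on the CPU backend.
bool ggml_neuronx_compute_forward(struct ggml_tensor *dst) {
    switch (dst->op) {
        case GGML_OP_ADD: return ggml_neuronx_run(g_neff_add, dst);
        case GGML_OP_MUL: return ggml_neuronx_run(g_neff_mul, dst);
        default:          return false;
    }
}
```

A real backend would also implement buffer allocation and graph computation through ggml-backend's interface; a stub like this just validates the NEFF round trip for single ops.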

github-actions bot added the stale label Sep 13, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
