Feature Request: Support AWS inferentia inf2 instances #8954

Closed · 4 tasks done
virajkanwade opened this issue Aug 9, 2024 · 4 comments
Labels: enhancement (New feature or request), stale

Comments


virajkanwade commented Aug 9, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

AWS inf2 instances (powered by Inferentia2 accelerators) are supposed to provide better inference performance at lower cost. It would be great for llama.cpp to support these instances.

Motivation

#2109
ollama/ollama#6143

Possible Implementation

No response

virajkanwade added the enhancement (New feature or request) label Aug 9, 2024
jeroen-mostert (Contributor) commented:

None of this stuff appears to be open source, and the interface to Amazon's devices doesn't appear to be documented. You're supposed to use their proprietary compiler to generate executables that can use these devices, and Amazon provides things like PyTorch-compatible wrappers to use them from Python.

The high-level C++ APIs only manage loading these executables and feeding them tensors. If you wanted to implement your own inference from scratch and run custom software like llama.cpp, you would need to either reverse engineer the device driver interface (and/or the compiler, possibly using the Python code as a guide), or Amazon would need to release documentation for it. This is not a "bring your own framework" setup, at least not yet. The fact that Amazon manages most of this for you, and only has to worry about compatibility with its own software, probably contributes to why they can offer those cheaper rates (aside from the hardware itself, of course).

guilt (Contributor) commented Aug 13, 2024

This was also discussed in #2109.

I think that getting access to aws-neuronx-dkms (the kernel driver), aws-neuronx-collectives + aws-neuronx-runtime-lib (the runtime), and aws-neuronx-tools (the tools) on an instance where this can be built is essential.

Currently one can compile a NEFF program with the neuronx-cc CLI, and it may help to pre-compile these in a build step for the essential GGML ops, much like the .cu files for CUDA. If the CLI is present, we feed it a set of .hlo/.json/.pb/.proto files, as described in their documentation, to produce .neff files. These pre-compiled files can then be loaded via libnrt and executed (see the sketch below).
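To make the load/execute half concrete, here is an untested sketch in C, assuming the libnrt API roughly as the Neuron Runtime docs describe it. The header path, enum names, and the neuronx-cc flags in the comment are assumptions that may vary by SDK version:

```c
/*
 * Untested sketch: load a precompiled NEFF and prepare to execute it via
 * libnrt. Signatures follow the AWS Neuron Runtime API docs, but the header
 * path and enum names here are assumptions and may differ per SDK version.
 *
 * The NEFF itself would come from something like:
 *   neuronx-cc compile add.hlo --framework XLA --output add.neff
 * (exact flags per the neuronx-cc documentation).
 */
#include <stdio.h>
#include <stdlib.h>
#include <nrt/nrt.h>   // from aws-neuronx-runtime-lib (assumed path)

int main(void) {
    // Read the compiled NEFF into memory.
    FILE *f = fopen("add.neff", "rb");
    if (!f) { perror("add.neff"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *neff = malloc((size_t) size);
    if (fread(neff, 1, (size_t) size, f) != (size_t) size) { return 1; }
    fclose(f);

    // Bring up the runtime (no framework wrapper) and load the model
    // onto a single NeuronCore.
    nrt_init(NRT_FRAMEWORK_TYPE_NO_FW, "2.0", "");
    nrt_model_t *model = NULL;
    nrt_load(neff, (size_t) size, 0 /* start NC */, 1 /* NC count */, &model);

    // Inputs/outputs would be allocated with nrt_tensor_allocate(),
    // grouped into nrt_tensor_set_t, and run via nrt_execute(model, in, out);
    // omitted here because the tensor names depend on the compiled HLO.

    nrt_unload(model);
    nrt_close();
    free(neff);
    return 0;
}
```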

For reference, vLLM handles this by installing the compiler and Torch integration packages (neuronx-cc and torch-neuronx) from the Neuron pip repository, then compiling its kernels from Python to HLO, and eventually to NEFF, for inference. This is described in the vLLM Neuron installation documentation.

If someone is willing to build and maintain this functionality as an easy-to-consume build system + library for compiling and loading arbitrary kernels, and to maintain the ggml_neuronx runtime and the relevant kernels (such as FlashAttention), that would keep this support healthy and up-to-date.

CC: @ggerganov

ggerganov (Owner) commented:

The best way to get things going is to make a PoC of a ggml_neuronx backend that handles some basic operations like addition or multiplication.
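For orientation, the per-op dispatch of such a PoC might look something like the following skeleton. This is illustrative only: the real ggml backend interface (ggml_backend_i, in ggml-backend-impl.h) has many more entry points, and every ggml_neuronx_* name here is invented:

```c
// Illustrative skeleton of where a ggml_neuronx backend could hook in.
// Not linkable as-is: ggml_neuronx_run and the kernel handles are
// hypothetical placeholders for NEFFs preloaded via libnrt (see above).
#include <stdbool.h>
#include "ggml.h"

// Hypothetical handle to a NEFF preloaded via libnrt.
typedef struct ggml_neuronx_kernel ggml_neuronx_kernel;
extern ggml_neuronx_kernel *g_neff_add; // compiled "add" program
extern ggml_neuronx_kernel *g_neff_mul; // compiled "mul" program

// Invented helper: copy src tensors to the device, run the NEFF,
// and copy the result back into dst->data.
bool ggml_neuronx_run(ggml_neuronx_kernel *k, struct ggml_tensor *dst);

// Per-node dispatch for the PoC: only ADD and MUL are offloaded;
// everything else reports "unsupported" so it stays on the CPU backend.
bool ggml_neuronx_compute_forward(struct ggml_tensor *dst) {
    switch (dst->op) {
        case GGML_OP_ADD: return ggml_neuronx_run(g_neff_add, dst);
        case GGML_OP_MUL: return ggml_neuronx_run(g_neff_mul, dst);
        default:          return false;
    }
}
```

A real backend would also implement buffer allocation and graph computation through ggml-backend's interface; a stub like this just validates the NEFF round trip for single ops.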

github-actions bot added the stale label Sep 13, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
