Support the use of Model Inference Servers #2519
Replies: 3 comments
-
Hi @zoltan-fedor, first of all, thanks for sharing your impressions. Yes, there are a lot of things we want to improve for scaling nodes in Haystack more independently and supporting different inference servers. While this is a big piece of work, we currently don't see that it requires a massive refactoring of Haystack: the pipeline design is "node based", so it's rather easy to separate out a different reader node and adjust the inference code there to the framework or inference server of your choice. There is already the option to convert a model to ONNX in order to enable access to more hardware optimizations.
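As a rough illustration of the ONNX route mentioned above, here is a minimal sketch. The `FARMReader.convert_to_onnx` helper and its arguments are taken from the Haystack optimization docs as I recall them; treat the exact signature as an assumption and verify it against the Haystack version you have installed.

```python
# Sketch only: export a QA model to ONNX and load it back into a FARMReader.
# The convert_to_onnx helper and its arguments are assumed from the Haystack
# optimization docs; verify them against your installed Haystack version.
from pathlib import Path

from haystack.nodes import FARMReader

onnx_dir = Path("onnx-roberta-base-squad2")

# One-off export of the Hugging Face model to ONNX format.
FARMReader.convert_to_onnx(
    model_name="deepset/roberta-base-squad2",
    output_path=onnx_dir,
)

# Load the exported model; inference now runs through ONNX Runtime,
# which opens the door to more hardware-specific optimizations.
reader = FARMReader(model_name_or_path=onnx_dir)
```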
-
Thanks @julian-risch. This is exactly what I will likely end up doing: writing a custom node that makes REST calls to a remote node, where that remote node runs on nearby (network-latency-wise) custom ML inference hardware that is highly optimized for fast inference.
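For anyone going down the same route, here is a minimal sketch of such a custom node. The endpoint URL, payload shape, and response format are hypothetical placeholders for whatever the remote inference service actually exposes; only the `BaseComponent` subclassing pattern comes from Haystack itself.

```python
# Minimal sketch of a custom Haystack node that delegates inference to a
# remote model server over REST. The endpoint and the payload/response
# shapes are hypothetical; adapt them to your inference server's real API.
import requests

from haystack.nodes.base import BaseComponent


class RemoteReader(BaseComponent):
    outgoing_edges = 1

    def __init__(self, endpoint_url: str = "http://inference-host:8080/qa"):
        super().__init__()
        self.endpoint_url = endpoint_url

    def run(self, query: str, documents: list = None, **kwargs):
        # Forward the query and candidate documents to the remote model server.
        payload = {
            "query": query,
            "documents": [doc.content for doc in (documents or [])],
        }
        response = requests.post(self.endpoint_url, json=payload, timeout=30)
        response.raise_for_status()
        # Assume the service answers with {"answers": [{"answer": ..., "score": ...}, ...]}.
        answers = response.json().get("answers", [])
        return {"answers": answers}, "output_1"

    def run_batch(self, queries: list, documents: list = None, **kwargs):
        # Naive batching: one remote call per query; a real implementation
        # would batch requests to make better use of the inference hardware.
        results = [self.run(query=q, documents=documents)[0] for q in queries]
        return {"results": results}, "output_1"
```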
-
More details on the feature request here: #2869
-
Haystack is great at letting you chain multiple NLP steps together into a pipeline. Unfortunately, we all know how slow inference with some of these large transformer models can be, and once you chain several of them together into a pipeline, it gets even worse.
Also, the future of NLP seems to be heading towards more complicated pipelines and ever larger transformer models, which means inference speed and cost are becoming a bigger and bigger issue as time goes on.
Unfortunately, as of today Haystack does not allow separating model inference out into dedicated model server(s) running on dedicated/custom hardware, even though inference performance and cost is an area where a lot of work is happening across the industry today, and I expect even more in the future. Already today we have custom chips focused on the speed and cost of model inference, like the AWS Inferentia chip, and dedicated inference servers, like Nvidia Triton. Unfortunately, Haystack currently can't be used with either of them cost-effectively, because model inference is baked into the codebase instead of being modular and easily separable into an external service / separate device.
This was already mentioned last year, related to the Reader, in this discussion:
"... you can use a 2Gb RAM server for 99% of any [Haystack] pipelines. It will fit most use-cases. However, the Reader part [or any other transformer model inference] will require very specific hardware/RAM and can't scale. So instead of having to scale the Haystack server, we could simply have an outside service for running our Reader only. This external dedicated service could be connected as in the pipelines.yaml"
Today with Haystack you can flip the `gpu` flag to speed up inference, but that approach has multiple problems.
All in all, I don't believe that the current design supports a future with ever more focus on the inference performance and cost of ever larger NLP models.
Even just scrolling through the discussions in this repo, you will find multiple feature requests related to this.
My recommendation is to refactor Haystack to allow the use of any external model inference service wherever slow model inference takes place within the Haystack codebase (any transformer models, any Readers, QA models, etc.).
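To make that recommendation a bit more tangible, here is one hypothetical shape such a refactoring could take: a small inference-backend abstraction that nodes call instead of running models in-process. None of these names exist in Haystack today; the protocol, the class names, and the endpoint are purely illustrative.

```python
# Hypothetical sketch of a pluggable inference backend; none of these names
# exist in Haystack today. The idea: nodes depend on a small interface, and
# the backend can be local (in-process, as today) or remote (Nvidia Triton,
# an Inferentia-backed service, etc.) without the node code changing.
from typing import List, Protocol

import requests


class InferenceBackend(Protocol):
    def predict(self, texts: List[str]) -> List[dict]:
        """Run model inference on a batch of texts and return raw predictions."""
        ...


class LocalTransformersBackend:
    """Runs inference in-process, roughly like Haystack does today."""

    def __init__(self, model_name: str):
        # Lazy import so remote-only deployments don't need transformers installed.
        from transformers import pipeline

        self._pipe = pipeline("text-classification", model=model_name)

    def predict(self, texts: List[str]) -> List[dict]:
        return self._pipe(texts)


class RemoteHTTPBackend:
    """Delegates inference to an external model server over HTTP."""

    def __init__(self, endpoint_url: str):
        self.endpoint_url = endpoint_url

    def predict(self, texts: List[str]) -> List[dict]:
        # Assumed request/response shape; a real backend would follow the
        # protocol of the concrete inference server (e.g. Triton's HTTP API).
        response = requests.post(self.endpoint_url, json={"inputs": texts}, timeout=30)
        response.raise_for_status()
        return response.json()["predictions"]
```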