Support the use of Model Inference Servers #2519
Replies: 3 comments
-
Hi @zoltan-fedor, first of all, thanks for sharing your impressions. Yes, there are a lot of things we want to improve for scaling nodes in Haystack more independently and supporting different inference servers. While this is a big piece of work, we currently don't see that it requires a massive refactoring of Haystack: the pipeline design is "node based", so it's rather easy to separate out a different reader node and adjust the inference code there to the framework or inference server of your choice. There is already the option to convert a model to ONNX in order to enable access to more hardware optimizations.
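As a rough illustration of the ONNX route mentioned above, here is a minimal sketch. The `FARMReader.convert_to_onnx` helper and its arguments are taken from the Haystack optimization docs as I recall them; treat the exact signature as an assumption and verify it against the Haystack version you have installed.

```python
# Sketch only: export a QA model to ONNX and load it back into a FARMReader.
# The convert_to_onnx helper and its arguments are assumed from the Haystack
# optimization docs; verify them against your installed Haystack version.
from pathlib import Path

from haystack.nodes import FARMReader

onnx_dir = Path("onnx-roberta-base-squad2")

# One-off export of the Hugging Face model to ONNX format.
FARMReader.convert_to_onnx(
    model_name="deepset/roberta-base-squad2",
    output_path=onnx_dir,
)

# Load the exported model; inference now runs through ONNX Runtime,
# which opens the door to more hardware-specific optimizations.
reader = FARMReader(model_name_or_path=onnx_dir)
```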
-
Thanks @julian-risch. This is exactly what I will likely end up doing: writing a custom node that makes REST calls to a remote node, where that remote node runs on nearby (network-latency-wise) custom ML inference hardware that is highly optimized for fast inference.
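For anyone going down the same route, here is a minimal sketch of such a custom node. The endpoint URL, payload shape, and response format are hypothetical placeholders for whatever the remote inference service actually exposes; only the `BaseComponent` subclassing pattern comes from Haystack itself.

```python
# Minimal sketch of a custom Haystack node that delegates inference to a
# remote model server over REST. The endpoint and the payload/response
# shapes are hypothetical; adapt them to your inference server's real API.
import requests

from haystack.nodes.base import BaseComponent


class RemoteReader(BaseComponent):
    outgoing_edges = 1

    def __init__(self, endpoint_url: str = "http://inference-host:8080/qa"):
        super().__init__()
        self.endpoint_url = endpoint_url

    def run(self, query: str, documents: list = None, **kwargs):
        # Forward the query and candidate documents to the remote model server.
        payload = {
            "query": query,
            "documents": [doc.content for doc in (documents or [])],
        }
        response = requests.post(self.endpoint_url, json=payload, timeout=30)
        response.raise_for_status()
        # Assume the service answers with {"answers": [{"answer": ..., "score": ...}, ...]}.
        answers = response.json().get("answers", [])
        return {"answers": answers}, "output_1"

    def run_batch(self, queries: list, documents: list = None, **kwargs):
        # Naive batching: one remote call per query; a real implementation
        # would batch requests to make better use of the inference hardware.
        results = [self.run(query=q, documents=documents)[0] for q in queries]
        return {"results": results}, "output_1"
```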
-
More details on the feature request here: #2869
-
Haystack is great at letting you chain multiple NLP steps together into a pipeline. Unfortunately, we all know how slow inference with some of these large transformer models can be, and once you chain several of them together into a pipeline, it gets even worse.
Also, the future of NLP seems to be heading towards more complicated pipelines and ever larger transformer models, which means inference speed and cost are becoming a bigger and bigger issue as time goes on.
Unfortunately, as of today Haystack does not allow separating model inference out into dedicated model server(s) running on dedicated/custom hardware, even though inference performance and cost is an area where a lot of work is happening across the industry today, and I expect even more in the future. Already today we have custom chips focused on the speed and cost of model inference, like the AWS Inferentia chip, and dedicated inference servers, like Nvidia Triton. Unfortunately, Haystack currently can't be used with either of them cost-effectively, because model inference is baked into the codebase instead of being modular and easily separable into an external service / separate device.
This was already mentioned last year, related to the Reader, in this discussion:
"... you can use a 2Gb RAM server for 99% of any [Haystack] pipelines. It will fit most use-cases. However, the Reader part [or any other transformer model inference] will require very specific hardware/RAM and can't scale. So instead of having to scale the Haystack server, we could simply have an outside service for running our Reader only. This external dedicated service could be connected as in the pipelines.yaml"
Today with Haystack you can flip the `gpu` flag to speed up inference, but that approach has multiple problems.
All in all, I don't believe that the current design supports a future with ever more focus on the inference performance and cost of ever larger NLP models.
Even just scrolling through the discussions in this repo, you will find multiple feature requests related to this.
My recommendation is to refactor Haystack to allow the use of any external model inference service wherever slow model inference takes place within the Haystack codebase (any transformer models, any Readers, QA models, etc.).
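To make that recommendation a bit more tangible, here is one hypothetical shape such a refactoring could take: a small inference-backend abstraction that nodes call instead of running models in-process. None of these names exist in Haystack today; the protocol, the class names, and the endpoint are purely illustrative.

```python
# Hypothetical sketch of a pluggable inference backend; none of these names
# exist in Haystack today. The idea: nodes depend on a small interface, and
# the backend can be local (in-process, as today) or remote (Nvidia Triton,
# an Inferentia-backed service, etc.) without the node code changing.
from typing import List, Protocol

import requests


class InferenceBackend(Protocol):
    def predict(self, texts: List[str]) -> List[dict]:
        """Run model inference on a batch of texts and return raw predictions."""
        ...


class LocalTransformersBackend:
    """Runs inference in-process, roughly like Haystack does today."""

    def __init__(self, model_name: str):
        # Lazy import so remote-only deployments don't need transformers installed.
        from transformers import pipeline

        self._pipe = pipeline("text-classification", model=model_name)

    def predict(self, texts: List[str]) -> List[dict]:
        return self._pipe(texts)


class RemoteHTTPBackend:
    """Delegates inference to an external model server over HTTP."""

    def __init__(self, endpoint_url: str):
        self.endpoint_url = endpoint_url

    def predict(self, texts: List[str]) -> List[dict]:
        # Assumed request/response shape; a real backend would follow the
        # protocol of the concrete inference server (e.g. Triton's HTTP API).
        response = requests.post(self.endpoint_url, json={"inputs": texts}, timeout=30)
        response.raise_for_status()
        return response.json()["predictions"]
```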