HuggingBento: A Bento-flavoured distro running Hugging Face Transformers #108

gregfurman · 2024-08-24T17:37:30Z

What is this?

Creates a distribution of Bento for use with NLP pipelines. It uses the knights-analytics/hugot library to run Hugging Face pipelines with ONNX models from Go.

It introduces three new components:

nlp_classify_text for text classification pipelines (processor_text_classifier.go)
nlp_classify_tokens for token and NER classification pipelines (processor_token_classifier.go)
nlp_extract_features for feature extraction pipelines (processor_feature_extractor.go)

Since there is a lot of config overlap between all of these processors, a single processor.go file defines config that is shared amongst all processor types.

All three use a shared ONNX Runtime session that is atomically initialised when one or more HuggingBento processors are created. The session is required to interact with the underlying ONNX Runtime library, and only one session can exist at a time (which took some work in the integration tests to keep runs from being flaky).
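A minimal sketch of the "single shared session" idea, assuming a sync.Once guard. The session type, newSession, and getSession are hypothetical stand-ins and are not taken from this PR or from hugot; the actual implementation may initialise the session differently.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical stand-in for the shared ONNX Runtime session.
type session struct {
	id int
}

var (
	sessionOnce sync.Once
	shared      *session
	sessionErr  error
	created     int // counts how many times the constructor actually ran
)

// newSession is a placeholder constructor; a real one would load the
// ONNX Runtime dynamic library and set up the session.
func newSession() (*session, error) {
	created++
	return &session{id: created}, nil
}

// getSession returns the process-wide session, creating it on first use only.
// Every later call (from any processor) receives the same pointer and error.
func getSession() (*session, error) {
	sessionOnce.Do(func() {
		shared, sessionErr = newSession()
	})
	return shared, sessionErr
}

func main() {
	// Simulate three processors being constructed concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := getSession(); err != nil {
				fmt.Println("session error:", err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("sessions created:", created) // prints "sessions created: 1"
}
```

With a guard like this, tests that construct several processors in one process all share the same session rather than racing to create their own.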

Building HuggingBento

Note: the Go build tag huggingbento ensures that the files in this distro are only compiled when the tag is specified.

Docker

Run the command below to build a new image locally (without using any cached layers).

docker build --platform=linux/amd64 -f resources/huggingbento/Dockerfile -t warpstreamlabs/huggingbento:latest --no-cache .

Binary

Follow the local installation instructions in resources/huggingbento/README.md to get the required external dependencies (C bindings for the tokenizer and the ONNX Runtime dynamic library).
Build with make huggingbento

Testing

Integration Tests

Running the integration tests locally requires the dependencies listed above. There is a test for each of the

Steps to manually test

Once completed, create a Bento config with the following content in config.yaml:

input:
  generate:
    interval: '@every 10s'
    batch_size: 5
    mapping: root = "Japanese Bento boxes taste amazing!"

pipeline:
  processors:
    - nlp_classify_text:
        pipeline_name: classify-incoming-data
        problem_type: multiLabel
        enable_model_download: true
        model_download_options:
          model_repository: KnightsAnalytics/distilbert-base-uncased-finetuned-sst-2-english

# Out: [{"Label":"NEGATIVE","Score":0.00014481653},{"Label":"POSITIVE","Score":0.99985516}]
# ...

This loads a processor that classifies the sentiment of text using the KnightsAnalytics/distilbert-base-uncased-finetuned-sst-2-english model. It also downloads the model and relevant files from the Hugging Face repository.
Once the download has completed, you should get 1-2 batches of identical output: [{"Label":"NEGATIVE","Score":0.00014481653},{"Label":"POSITIVE","Score":0.99985516}].
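For anyone consuming this output downstream, here is a small sketch of decoding it in Go and picking the highest-scoring label. The classification type and topLabel helper are hypothetical names for illustration; only the JSON shape is taken from the output above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// classification mirrors one element of the processor output shown above.
type classification struct {
	Label string  `json:"Label"`
	Score float64 `json:"Score"`
}

// topLabel decodes a JSON array of classifications and returns the entry
// with the highest score.
func topLabel(raw []byte) (classification, error) {
	var results []classification
	if err := json.Unmarshal(raw, &results); err != nil {
		return classification{}, err
	}
	if len(results) == 0 {
		return classification{}, fmt.Errorf("no classifications in output")
	}
	best := results[0]
	for _, r := range results[1:] {
		if r.Score > best.Score {
			best = r
		}
	}
	return best, nil
}

func main() {
	raw := []byte(`[{"Label":"NEGATIVE","Score":0.00014481653},{"Label":"POSITIVE","Score":0.99985516}]`)
	best, err := topLabel(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(best.Label) // prints "POSITIVE"
}
```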

TODO

Fix GitHub workflow for release and testing this in CI
Write a specific component guide for usage, like with serverless
Implement pipeline for zero shot evaluation
Perhaps run tests inside docker-compose to allow for local testing.
Better describe the fields of each component (i.e. NewStringAnnotatedEnumField)
Make a generate/ directory for generating the ONNX Runtime and Hugging Face bindings for any OS/ARCH combination, like ollama, which has multiple generate scripts.


internal/impl/huggingface/processor_feature_extraction.go

jem-davies · 2024-09-04T20:31:34Z

internal/impl/huggingface/processor_text_classification.go

+          model_repository: "KnightsAnalytics/distilbert-base-uncased-finetuned-sst-2-english"
+
+
+# In: "This meal tastes like old boots."


I think it would be better if you could provide the Hugging Face processors with a Bloblang mapping for the input. You could keep it this way, but I would assume that the incoming data is in JSON format, so the user would have to know to apply a mapping to it or use a branch processor.

This is how the http processor works: it effectively requires you to use it with a branch processor, but I think that is harder for a new user to understand than a Bloblang mapping field.

resources/huggingbento/README.md

jem-davies · 2024-09-05T20:47:28Z

resources/huggingbento/install.sh

+#!/bin/bash
+
+ONNXRUNTIME_VERSION=${ONNXRUNTIME_VERSION:-"1.18.0"} 
+DEPENDENCY_DEST=${DEPENDENCY_DEST:-"/usr/lib"} 


Set this to /usr/lib/local on macOS, like the README.md?

The script assumes Linux, where the above would work. Not sure if changing the default to the macOS path would confuse people more. Perhaps I'll add a comment mentioning this.

jem-davies · 2024-09-05T20:51:03Z

website/docs/components/processors/nlp_classify_text.md

+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+:::caution BETA


I think I would mark it as experimental, because having it at BETA limits what can be changed outside of a major release. What do you think?

Yeah that makes sense. Will change.

jem-davies · 2024-09-08T17:34:53Z

website/docs/components/processors/nlp_classify_text.md

I think there needs to be a way of including these processors somewhere else so that they don't appear on the website, i.e. moved to somewhere like serverless.

What about adding an admonition at the top of the docs saying this is only available in the huggingbento distro? I think generating the docs into a new location is doable, but it could end up being more trouble than it's worth if a text block would suffice. Thoughts?


RJKeevil · 2024-11-27T17:09:13Z

Please see knights-analytics/hugot#59 for how to remove the ORT dependency for Hugot. Would love to see this happen!

huggingbento: Create Bento distro for usage with Hugging Face Transformer NLP pipelines

bd3b7d6

gregfurman requested a review from jem-davies August 24, 2024 17:37

gregfurman self-assigned this Aug 24, 2024

jem-davies reviewed Sep 8, 2024


gregfurman and others added 2 commits September 16, 2024 18:23

huggingbento: Grammar and phrasing improvements

654e5a4

Co-authored-by: Jem Davies <[email protected]>

huggingbento: Fix some bad function naming

f04fd51

Co-authored-by: Jem Davies <[email protected]>

jem-davies force-pushed the main branch 4 times, most recently from cca1170 to dc98f1c Compare November 7, 2024 18:49
