ONNX? #55

Open
altunenes opened this issue Nov 14, 2024 · 5 comments

@altunenes

I've been working with the emotion2vec model and trying to convert it to ONNX format for deployment purposes. The current implementation is great for PyTorch users, but having ONNX support would enable broader deployment options.

I tried converting the model using torch.onnx.export with various approaches (a rough sketch of the wrapper approach follows this list):

Direct conversion of the AutoModel
Creating a wrapper around the model components
Implementing custom forward passes
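
For reference, the wrapper approach looks roughly like this; the attribute path into AutoModel (auto.model) and the single-waveform forward signature are assumptions that may differ between FunASR versions, and the forward body is exactly where the dimension and masking problems appear:

import torch
from funasr import AutoModel

# Thin wrapper so torch.onnx.export sees a single-tensor forward
class Emotion2VecWrapper(torch.nn.Module):
    def __init__(self, inner: torch.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, num_samples) of raw 16 kHz mono audio
        return self.inner(waveform)

auto = AutoModel(model="iic/emotion2vec_base_finetuned")
wrapper = Emotion2VecWrapper(auto.model).eval()

dummy = torch.randn(1, 16000)  # one second of dummy audio for tracing
torch.onnx.export(
    wrapper, dummy, "emotion2vec.onnx",
    input_names=["waveform"], output_names=["output"],
    dynamic_axes={"waveform": {1: "num_samples"}},
    opset_version=17,
)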

Main challenges encountered:

Dimension mismatches in the conv1d layers
Issues with the masking mechanism
Difficulties preserving the complete model architecture
Problems with tensor handling between components

Could you please provide guidance on the correct architecture for ONNX conversion, including an example of the expected tensor dimensionality through the model? I have converted torchvision models to ONNX before, but audio models seem a bit more complicated to me :/

Thank you very much for your work, it works really nicely!

also see:
modelscope/FunASR#1690

@ddlBoJack
Owner

We did not provide an ONNX model. Contributions are welcome :)

@thewh1teagle

thewh1teagle commented Dec 5, 2024

We did not provide an ONNX model. Contributions are welcome :)

I'm currently working to understand the model's inputs and outputs. Could you provide detailed information to help others add ONNX support? Specifically, I need the exact input and output details.
Thanks.

Update: this is the example I was able to run with this repo

'''
Using the emotion representation model
rec_result only contains {'feats'}
	granularity="utterance": {'feats': [*768]}
	granularity="frame": {feats: [T*768]}
 
python main.py
'''

from funasr import AutoModel
import json
from collections import OrderedDict

# Load the finetuned emotion recognition model
model = AutoModel(model="iic/emotion2vec_base_finetuned")
mapper = ["angry", "disgusted", "fearful", "happy", "neutral", "other", "sad", "surprised", "unknown"]
wav_file = "audio.wav"
rec_result = model.generate(wav_file, granularity="utterance")
scores = rec_result[0]['scores']

# Prepare the result mapping with emotions and their probabilities
result = {emotion: float(prob) for emotion, prob in zip(mapper, scores)}
# Sort the result in descending order of probability
sorted_result = OrderedDict(sorted(result.items(), key=lambda item: item[1], reverse=True))
print(json.dumps(sorted_result, indent=4))

I didn't find any working example in the repo and had to play with it.
I guess we should better understand how funasr/models/emotion2vec/model.py works.
Basically, we need to understand (a small probing sketch follows this list):

  1. what the expected length and format of the audio segment (wav) is
  2. which features we need to extract from it
  3. how to pass these features to the model
  4. how to parse the output back into labels and probabilities
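
One way to start answering these questions empirically is to probe the FunASR pipeline itself. Here is a small sketch; the 'feats' shapes follow the docstring above and the 'scores' field follows the example above, while everything else is an assumption:

from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_base_finetuned")

for granularity in ("utterance", "frame"):
    res = model.generate("audio.wav", granularity=granularity)[0]
    feats = res["feats"]
    # 'feats' is expected to be 768-dim per utterance, or T x 768 per frame
    print(granularity, getattr(feats, "shape", None))
    if "scores" in res:
        # number of classes the classification head produces
        print("num scores:", len(res["scores"]))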

@oddpxl

oddpxl commented Dec 8, 2024

I second this - it would be great to understand the details required to make an ONNX model. Much appreciated, @ddlBoJack, if you can help us out!

@ddlBoJack
Owner

Thank you for contributing an ONNX model of emotion2vec.

  1. There is no limit on the length of the audio, because the prediction is the output of the pooling layer.
  2. The audio should be a 16 kHz, single-channel wav.
  3. I'm not quite sure what you mean by "features". For the finetuned model, the raw wav can be used for the forward pass, without needing to extract mel or fbank features.
  4. You can refer to the implementation in FunASR for the label mapping.
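
Putting those four answers together, a minimal ONNX-side inference sketch might look like the following. Everything model-specific here is an assumption rather than something provided by this repo: the file name emotion2vec.onnx, a single float32 input of shape (1, num_samples) holding raw 16 kHz mono audio, and a single output of shape (1, 9) with class scores in the label order used earlier in this thread.

import numpy as np
import onnxruntime as ort
import soundfile as sf

labels = ["angry", "disgusted", "fearful", "happy", "neutral",
          "other", "sad", "surprised", "unknown"]

# 16 kHz single-channel wav (answer 2); resample/downmix beforehand if needed
wav, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000 and wav.ndim == 1, "expected 16 kHz mono audio"

session = ort.InferenceSession("emotion2vec.onnx")
input_name = session.get_inputs()[0].name
scores = session.run(None, {input_name: wav[None, :]})[0][0]

# Answer 4: map scores back to labels (apply softmax only if the model emits raw logits)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
for label, p in sorted(zip(labels, probs), key=lambda x: x[1], reverse=True):
    print(f"{label}: {p:.3f}")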

@thewh1teagle

thewh1teagle commented Dec 10, 2024

  1. There is no limit on the length of the audio, because the prediction is the output of the pooling layer.
  2. The audio should be a 16 kHz, single-channel wav.
  3. I'm not quite sure what you mean by "features". For the finetuned model, the raw wav can be used for the forward pass, without needing to extract mel or fbank features.
  4. You can refer to the implementation in FunASR for the label mapping.

Cool, I didn't know that the input can be a 16 kHz wav directly.
Which finetuned model should I use?

I tried to convert the .pt file to ONNX, but it is missing some metadata. I guess I need the PyTorch class that represents the model. Where can I find it?
As for the output, what are its dimensions, so I can convert it back to labels?
It would be awesome if you could provide as much info as you can about the input (wav -> model) and the output (some matrix -> labels), assuming I'm completely dumb. Thanks!
