ONNX? #55

Open
altunenes opened this issue Nov 14, 2024 · 5 comments

@altunenes

I've been working with the emotion2vec model and trying to convert it to ONNX format for deployment purposes. The current implementation is great for PyTorch users, but having ONNX support would enable broader deployment options.

I tried converting the model using torch.onnx.export with various approaches (a rough sketch of the wrapper approach follows this list):

Direct conversion of the AutoModel
Creating a wrapper around the model components
Implementing custom forward passes
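
For reference, the wrapper approach looks roughly like this; the attribute path into AutoModel (auto.model) and the single-waveform forward signature are assumptions that may differ between FunASR versions, and the forward body is exactly where the dimension and masking problems appear:

import torch
from funasr import AutoModel

# Thin wrapper so torch.onnx.export sees a single-tensor forward
class Emotion2VecWrapper(torch.nn.Module):
    def __init__(self, inner: torch.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, num_samples) of raw 16 kHz mono audio
        return self.inner(waveform)

auto = AutoModel(model="iic/emotion2vec_base_finetuned")
wrapper = Emotion2VecWrapper(auto.model).eval()

dummy = torch.randn(1, 16000)  # one second of dummy audio for tracing
torch.onnx.export(
    wrapper, dummy, "emotion2vec.onnx",
    input_names=["waveform"], output_names=["output"],
    dynamic_axes={"waveform": {1: "num_samples"}},
    opset_version=17,
)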

Main challenges encountered:

Dimension mismatches in the conv1d layers
Issues with the masking mechanism
Difficulties preserving the complete model architecture
Problems with tensor handling between components

Could you please provide guidance on the correct architecture for ONNX conversion, including an example of the expected tensor dimensionality through the model? I have converted torchvision models to ONNX before, but audio models seem a bit more complicated to me :/

Thank you very much for your work, it works really nicely!

also see:
modelscope/FunASR#1690

@ddlBoJack
Owner

We did not provide an ONNX model. Contributions are welcome :)

@thewh1teagle

thewh1teagle commented Dec 5, 2024

We did not provide an ONNX model. Contributions are welcome :)

I'm currently working to understand the model's inputs and outputs. Could you provide detailed information to help others add ONNX support? Specifically, I need the exact input and output details.
Thanks.

Update: this is the example I was able to run with this repo

'''
Using the emotion representation model
rec_result only contains {'feats'}
	granularity="utterance": {'feats': [*768]}
	granularity="frame": {feats: [T*768]}
 
python main.py
'''

from funasr import AutoModel
import json
from collections import OrderedDict

# Load the finetuned emotion recognition model
model = AutoModel(model="iic/emotion2vec_base_finetuned")
mapper = ["angry", "disgusted", "fearful", "happy", "neutral", "other", "sad", "surprised", "unknown"]
wav_file = "audio.wav"
rec_result = model.generate(wav_file, granularity="utterance")
scores = rec_result[0]['scores']

# Prepare the result mapping with emotions and their probabilities
result = {emotion: float(prob) for emotion, prob in zip(mapper, scores)}
# Sort the result in descending order of probability
sorted_result = OrderedDict(sorted(result.items(), key=lambda item: item[1], reverse=True))
print(json.dumps(sorted_result, indent=4))

I didn't find any working example in the repo and had to play with it.
I guess we should better understand how funasr/models/emotion2vec/model.py works.
Basically, we need to understand (a small probing sketch follows this list):

  1. what the expected length and format of the audio segment (wav) is
  2. which features we need to extract from it
  3. how to pass these features to the model
  4. how to parse the output back into labels and probabilities
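
One way to start answering these questions empirically is to probe the FunASR pipeline itself. Here is a small sketch; the 'feats' shapes follow the docstring above and the 'scores' field follows the example above, while everything else is an assumption:

from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_base_finetuned")

for granularity in ("utterance", "frame"):
    res = model.generate("audio.wav", granularity=granularity)[0]
    feats = res["feats"]
    # 'feats' is expected to be 768-dim per utterance, or T x 768 per frame
    print(granularity, getattr(feats, "shape", None))
    if "scores" in res:
        # number of classes the classification head produces
        print("num scores:", len(res["scores"]))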

@oddpxl

oddpxl commented Dec 8, 2024

I second this - it would be great to understand the details required to make an ONNX model. Much appreciated, @ddlBoJack, if you can help us out!

@ddlBoJack
Owner

Thank you for contributing an ONNX model of emotion2vec.

  1. There is no limit on the length of the audio, because the prediction is the output of the pooling layer.
  2. The audio should be a 16 kHz, single-channel wav.
  3. I'm not quite sure what you mean by "features". For the finetuned model, the raw wav can be used for the forward pass, without needing to extract mel or fbank features.
  4. You can refer to the implementation in FunASR for the label mapping.
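
Putting those four answers together, a minimal ONNX-side inference sketch might look like the following. Everything model-specific here is an assumption rather than something provided by this repo: the file name emotion2vec.onnx, a single float32 input of shape (1, num_samples) holding raw 16 kHz mono audio, and a single output of shape (1, 9) with class scores in the label order used earlier in this thread.

import numpy as np
import onnxruntime as ort
import soundfile as sf

labels = ["angry", "disgusted", "fearful", "happy", "neutral",
          "other", "sad", "surprised", "unknown"]

# 16 kHz single-channel wav (answer 2); resample/downmix beforehand if needed
wav, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000 and wav.ndim == 1, "expected 16 kHz mono audio"

session = ort.InferenceSession("emotion2vec.onnx")
input_name = session.get_inputs()[0].name
scores = session.run(None, {input_name: wav[None, :]})[0][0]

# Answer 4: map scores back to labels (apply softmax only if the model emits raw logits)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
for label, p in sorted(zip(labels, probs), key=lambda x: x[1], reverse=True):
    print(f"{label}: {p:.3f}")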

@thewh1teagle

thewh1teagle commented Dec 10, 2024

  1. There is no limit on the length of the audio, because the prediction is the output of the pooling layer.
  2. The audio should be a 16 kHz, single-channel wav.
  3. I'm not quite sure what you mean by "features". For the finetuned model, the raw wav can be used for the forward pass, without needing to extract mel or fbank features.
  4. You can refer to the implementation in FunASR for the label mapping.

Cool, I didn't know that the input can be a 16 kHz wav directly.
Which finetuned model should I use?

I tried to convert the .pt file to ONNX, but it is missing some metadata. I guess I need the PyTorch class that represents the model. Where can I find it?
As for the output, what are its dimensions, so I can convert it back to labels?
It would be awesome if you could provide as much info as you can about the input (wav -> model) and the output (some matrix -> labels), assuming I'm completely dumb. Thanks!
