Hello,
I see that nvidia/nemotron-3.5-asr-streaming-0.6b handles language IDs as follows: num_prompt features representing a one-hot encoded language ID are concatenated to each time step of the encoder output, and the result then goes through a projection module.
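To make sure I am describing it correctly, here is a minimal sketch of my understanding in plain Python. This is not the actual model code: the function names, the sizes, and the random projection weights are all made up for illustration, and I am assuming the projection maps back to the encoder width.

```python
import random

def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def concat_prompt(encoder_out, lang_index, num_prompts):
    """Append the same one-hot language vector to every encoder frame."""
    prompt = one_hot(lang_index, num_prompts)
    return [frame + prompt for frame in encoder_out]

def project(frames, weight):
    """Linear projection of each frame; `weight` has shape (d_out, d_in)."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]

random.seed(0)
T, d_model, num_prompts = 4, 8, 3  # hypothetical sizes, not the real config

# Stand-in for the encoder output: T frames of d_model features each.
encoder_out = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(T)]

# Stand-in for the projection module: (d_model + num_prompts) -> d_model.
weight = [[random.gauss(0, 1) for _ in range(d_model + num_prompts)]
          for _ in range(d_model)]

fused = concat_prompt(encoder_out, lang_index=1, num_prompts=num_prompts)
projected = project(fused, weight)

print(len(fused[0]), len(projected[0]))
```

With this scheme the language information is present at every frame, so it does not depend on any decoder state.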
I wonder why that approach was chosen over the standard approach for transformer decoders, which is to pass a language ID token along with the BOS token to the decoder.
This is the approach used in Whisper and Canary, and I don't see why it couldn't be used with an RNN-T decoder.
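For comparison, this is roughly what I have in mind for the RNN-T case: feed BOS and a language token through the prediction network before decoding starts, so its state is already conditioned on the language. Again only a sketch under my own assumptions: the vocabulary, the special-token names, and `toy_step` (a stand-in for a real LSTM step) are hypothetical.

```python
# Hypothetical vocabulary; the special-token names are made up for illustration.
VOCAB = {"<blank>": 0, "<bos>": 1, "<|en|>": 2, "<|de|>": 3, "hello": 4}

def prime_prediction_network(step_fn, init_state, lang):
    """Run BOS and the language token through the prediction network
    and return the resulting state, to be used as the decoding start state."""
    state = init_state
    for token in (VOCAB["<bos>"], VOCAB[f"<|{lang}|>"]):
        state = step_fn(token, state)
    return state

def toy_step(token, state):
    """Stand-in for one recurrent step: it only records the token history."""
    return state + [token]

state = prime_prediction_network(toy_step, [], "de")
print(state)
```

Here the language only enters through the prediction network's initial state, instead of being attached to every encoder frame.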
Thanks