
planning: Ichigo Transcription #90

Open · dan-menlo opened this issue Oct 18, 2024 · 3 comments

dan-menlo (Contributor) commented Oct 18, 2024

Goal

  • The Ichigo Demo should transcribe the user's audio message
  • Likely driven by Engineering
  • Makes data storage easier (i.e. we can train over the stored transcripts)
  • The Whisper Encoder is already in the project (i.e. reuse the matching Decoder)
  • Will not affect latency, since transcription runs as post-processing (see the sketch below)
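A minimal sketch of the post-processing point, assuming an async serving loop; generate_answer, transcribe, and store_transcript are hypothetical placeholders, not Ichigo's actual functions:

import asyncio

# Hypothetical handler: the user-facing answer is produced first; the
# transcription job is scheduled afterwards, off the hot path, so it adds
# no latency to the response.
async def handle_audio_message(audio_path: str) -> str:
    answer = await generate_answer(audio_path)       # user-facing path
    asyncio.create_task(log_transcript(audio_path))  # fire-and-forget post-processing
    return answer

async def log_transcript(audio_path: str) -> None:
    # Run the (blocking) Whisper decode in a worker thread, then persist the
    # result so it can later be used as training data.
    text = await asyncio.to_thread(transcribe, audio_path)
    await store_transcript(audio_path, text)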

dan-menlo converted this from a draft issue on Oct 18, 2024
dan-menlo added this to the Ichigo v0.4 milestone on Oct 18, 2024
dan-menlo changed the title from "epic: Ichigo transcription" to "epic: Ichigo Transcription" on Oct 18, 2024
dan-menlo changed the title from "epic: Ichigo Transcription" to "planning: Ichigo Transcription" on Oct 18, 2024
tikikun (Collaborator) commented Nov 11, 2024

@nguyenhoangthuan99 you can pick this up if you like: just extract the embedding from the encoder and forward it to Whisper for transcription.

jrohsc commented Nov 25, 2024

Hi, how can I do transcription in a Colab notebook? Whenever I give it question audio, it only generates the answer to the question.

PodsAreAllYouNeed commented

> Hi, how can I do transcription in a Colab notebook? Whenever I give it question audio, it only generates the answer to the question.

I've prepared a Colab demo with a transcription example here: https://colab.research.google.com/drive/1req3ByqKS1vVPF_iGD1sNE2DzvMo7Jd0?usp=sharing

The relevant function for transcription is this:

import torch
import torchaudio

# vq_model and device are defined in earlier cells of the notebook.
def audio_to_text(audio_path, target_bandwidth=1.5, device=device):
    # Lazily load the Whisper decoder weights onto the target device.
    vq_model.ensure_whisper(device)
    wav, sr = torchaudio.load(audio_path)
    # Whisper expects 16 kHz input; resample if the file uses another rate.
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        # Encode the waveform into VQ codes, then decode the codes back to text.
        codes = vq_model.encode_audio(wav.to(device))
        transcript = vq_model.decode_text(codes[0])
    return transcript[0].text
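
Usage is then a single call; the file name below is only a placeholder:

# Transcribe a local audio file (any sample rate; it is resampled to 16 kHz).
print(audio_to_text("sample_question.wav"))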

Status: Investigating
6 participants