Add dedicated transcription interface for audio-to-text models #92

Open
@keithrbennett

Description

Current Behavior

The README currently shows audio transcription support through the chat interface:

# Analyze audio recordings
chat.ask 'Describe this meeting', with: { audio: 'meeting.wav' }

However, this doesn't work. The library includes specific transcription models (gpt-4o-transcribe, gpt-4o-mini-transcribe) but attempting to use them results in errors. These models are distinct from audio conversation models (gpt-4o-audio-preview) and text-to-speech models (gpt-4o-mini-tts).

  • Using the chat interface fails because transcription models aren't chat models:

    chat = RubyLLM.chat(model: 'gpt-4o-transcribe')
    chat.ask('Transcribe this', with: { audio: 'audio.mp3' })
    # Error: This is not a chat model and thus not supported in the v1/chat/completions endpoint
  • No dedicated transcription method exists:

    RubyLLM.transcribe('audio.mp3', model: 'gpt-4o-transcribe')
    # Error: undefined method 'transcribe' for module RubyLLM

Desired Behavior

Add a dedicated transcription interface consistent with other RubyLLM operations:

# Simple usage
transcription = RubyLLM.transcribe('audio.mp3', model: 'gpt-4o-transcribe')
puts transcription.text

# With options
transcription = RubyLLM.transcribe('audio.mp3',
  model: 'gpt-4o-transcribe',
  language: 'en',  # Optional language hint
  prompt: 'This is a technical discussion'  # Optional context
)

This would:

  1. Provide a consistent interface for audio transcription
  2. Support different transcription models
  3. Match the pattern of other RubyLLM operations (chat, paint, embed)
  4. Allow for future expansion to other providers' transcription models
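As a starting point, here is a minimal sketch of what such a method could look like. Everything in it is an assumption, not RubyLLM's actual internals: the `Transcription` struct and the injectable `client:` keyword are hypothetical, and a real implementation would default the client to the configured provider instead of requiring it.

```ruby
# Hypothetical sketch only -- RubyLLM has no transcribe method yet.
module RubyLLM
  # Minimal value object so callers can write `transcription.text`.
  Transcription = Struct.new(:text, :model, keyword_init: true)

  # `client` is any object responding to `transcribe(parameters:)` and
  # returning a hash with a 'text' key (e.g. a thin wrapper around a
  # provider's /v1/audio/transcriptions endpoint). Injecting it keeps the
  # method provider-agnostic and testable without network calls.
  def self.transcribe(path, model:, client:, language: nil, prompt: nil)
    params = { model: model }
    params[:language] = language if language # optional language hint
    params[:prompt]   = prompt   if prompt   # optional context

    File.open(path, 'rb') do |file|
      response = client.transcribe(parameters: params.merge(file: file))
      Transcription.new(text: response['text'], model: model)
    end
  end
end
```

Keeping the provider call behind a small interface like this is what would allow point 4 above (other providers' transcription models) without changing the public method.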

Current Workaround

Until this feature is implemented, users can call the OpenAI client (via the ruby-openai gem) directly for transcription:

require 'openai'  # ruby-openai gem

# Transcribe an audio file by calling OpenAI's transcription endpoint directly.
# Assumes the gem has already been configured with an API key.
def transcribe_audio(path = 'audio.mp3')
  client = OpenAI::Client.new
  File.open(path, 'rb') do |file|
    transcription = client.audio.transcribe(
      parameters: {
        model: 'gpt-4o-transcribe',
        file: file
      }
    )
    transcription['text']
  end
end

Documentation

The README needs to be updated to remove the misleading example of audio support through the chat interface. Instead, it should document the new dedicated transcription interface, making it clear that audio processing is a separate operation from chat, similar to how image generation (paint) and embeddings are handled.
