Description
Add dedicated transcription interface for audio-to-text models
Current Behavior
The README currently shows audio transcription support through the chat interface:
# Analyze audio recordings
chat.ask 'Describe this meeting', with: { audio: 'meeting.wav' }
However, this doesn't work. The library includes specific transcription models (gpt-4o-transcribe, gpt-4o-mini-transcribe), but attempting to use them results in errors. These models are distinct from audio conversation models (gpt-4o-audio-preview) and text-to-speech models (gpt-4o-mini-tts).
- Using the chat interface fails because transcription models aren't chat models:

chat = RubyLLM.chat(model: 'gpt-4o-transcribe')
chat.ask('Transcribe this', with: { audio: 'audio.mp3' })
# Error: This is not a chat model and thus not supported in the v1/chat/completions endpoint
- No dedicated transcription method exists:

RubyLLM.transcribe('audio.mp3', model: 'gpt-4o-transcribe')
# Error: undefined method 'transcribe' for module RubyLLM
Desired Behavior
Add a dedicated transcription interface consistent with other RubyLLM operations:
# Simple usage
transcription = RubyLLM.transcribe('audio.mp3', model: 'gpt-4o-transcribe')
puts transcription.text
# With options
transcription = RubyLLM.transcribe('audio.mp3',
  model: 'gpt-4o-transcribe',
  language: 'en', # Optional language hint
  prompt: 'This is a technical discussion' # Optional context
)
This would:
- Provide a consistent interface for audio transcription
- Support different transcription models
- Match the pattern of other RubyLLM operations (chat, paint, embed)
- Allow for future expansion to other providers' transcription models
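For illustration, one possible shape for such a helper is sketched below. This is only a sketch, not a proposal for RubyLLM's internals: it leans on the ruby-openai client for brevity (a real implementation would presumably go through RubyLLM's own provider layer), and the Transcription result struct and method signature are hypothetical.

require 'openai' # ruby-openai gem

module RubyLLM
  # Hypothetical result object; a real implementation would likely return
  # a richer response type consistent with chat/paint/embed results.
  Transcription = Struct.new(:text, :model, keyword_init: true)

  # Sketch of a module-level helper mirroring RubyLLM.chat / paint / embed.
  def self.transcribe(audio_path, model:, language: nil, prompt: nil)
    File.open(audio_path, 'rb') do |file|
      params = { model: model, file: file }
      params[:language] = language if language
      params[:prompt]   = prompt   if prompt

      response = OpenAI::Client.new.audio.transcribe(parameters: params)
      Transcription.new(text: response['text'], model: model)
    end
  end
end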
Current Workaround
Until this feature is implemented, users need to use the OpenAI client directly for transcription:
require 'openai' # ruby-openai gem

def transcribe_audio
  client = OpenAI::Client.new

  # Transcription models are only reachable through the audio transcription
  # endpoint, not chat completions.
  File.open('audio.mp3', 'rb') do |file|
    transcription = client.audio.transcribe(
      parameters: {
        model: 'gpt-4o-transcribe',
        file: file
      }
    )
    transcription['text']
  end
end
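The optional language and prompt hints from the desired interface can also be passed through this workaround, assuming the transcription endpoint accepts them for the chosen model; the ruby-openai client forwards whatever is in the parameters hash:

client = OpenAI::Client.new
text = File.open('audio.mp3', 'rb') do |file|
  client.audio.transcribe(
    parameters: {
      model: 'gpt-4o-transcribe',
      file: file,
      language: 'en', # optional language hint
      prompt: 'This is a technical discussion' # optional context
    }
  )['text']
end
puts text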
Documentation
The README needs to be updated to remove the misleading example of audio support through the chat interface. Instead, it should document the new dedicated transcription interface, making it clear that audio processing is a separate operation from chat, similar to how image generation (paint) and embeddings are handled.