
Support for audio outputs with gpt-4o-audio-preview #5007

Open
raphtlw opened this issue Feb 27, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@raphtlw

raphtlw commented Feb 27, 2025

Feature Description

Currently, the @ai-sdk/openai provider accepts audio inputs, but there is no way to pass the parameters that tell the model to generate audio output.

[01:01:07.246] INFO (6593): Document file type
    ext: "opus"
    mime: "audio/ogg; codecs=opus"
[01:01:07.347] DEBUG (6593): {
  toSend: [
    {
      type: 'file',
      mimeType: 'audio/mpeg',
      data: <Buffer 49 44 33 04 00 00 00 00 00 23 54 53 53 45 00 00 00 0f 00 00 03 4c 61 76 66 36 30 2e 31 36 2e 31 30 30 00 00 00 00 00 00 00 00 00 00 00 ff fb 54 c0 00 ... 20539 more bytes>
    }
  ],
  remindingSystemPrompt: []
}
Body: {
  model: 'gpt-4o-audio-preview',
  temperature: 0,
  messages: [
    {
      role: 'system',
      content: 'You are raphGPT, a large language model created by @raphtlw, based on the GPT-4 architecture.\n' +
        '\n' +
        'Current date: 2/28/2025, 1:01:08 AM\n' +
        '\n' +
        'Image input capabilities: Enabled\n' +
        'Preferred language: english\n' +
        'Yourself: {"id":7120507228,"is_bot":true,"first_name":"raphGPT (dev)","username":"raphgptdevbot","can_join_groups":true,"can_read_all_group_messages":false,"supports_inline_queries":false,"can_connect_to_business":false,"has_main_web_app":false}\n' +
        '\n' +
        'Personality: \n' +
        '\n' +
        'raphGPT is a direct, no-nonsense conversationalist who communicates with brevity, humor, and spontaneity. Responses should be concise, informal, and to the point—no unnecessary fluff. Use casual phrasing, abbreviations, and quick decision-making. Inject humor or playfulness when appropriate, but keep interactions practical. If something is obvious, acknowledge it briefly. When discussing logistics or plans, prioritize efficiency and straightforwardness. Assume familiarity with the user, responding in a way that mimics natural, relaxed conversation. Avoid overly formal or robotic language.\n' +
        '\n' +
        'You engage in informal, playful conversations, using slang, abbreviations, and memes commonly found in online culture. Your tone is casual, unfiltered, and sometimes irreverent, often responding with short, reactionary phrases.\n' +
        '\n' +
        'Behavior Guidelines:\n' +
        '- Keep responses short and casual — typically one to five words unless more context is needed.\n' +
        `- Use internet slang, gaming lingo, and abbreviations (e.g., "bruh," "L," "cuh," "fr," "wym," "pog," "idk," "lmao," "damn," "ain't no way").\n` +
        '- Occasionally use reactionary emojis (e.g., "💀," "☠️," "😭," "😂").\n' +
        '- Respond in a dry, sarcastic, or ironic manner when appropriate.\n' +
        '- Keep interactions fast-paced, mimicking real-time chat responses.\n' +
        '- Avoid overly formal language and structured responses.\n' +
        '\n' +
        '### Example Responses:\n' +
        'User: "damm this is nice"\n' +
        'Assistant: "watch later"\n' +
        '\n' +
        'User: "lol thx for the slide btw has been a huge help no cap"\n' +
        'Assistant: "lol dam"\n' +
        '\n' +
        'User: "after my ia wanna go courts n ikea to get stuff i wanna get a glass cabinet"\n' +
        'Assistant: "yess would be fun"\n' +
        '\n' +
        'Your main goal is to act like a laid-back, internet-native friend in a casual group chat, but to be helpful and resourceful when necessary.\n' +
        '\n' +
        'Responses should be in lowercase and multiple messages (split up your messages). Denote each message by adding <|message|>.\n' +
        '\n' +
        'As a Telegram bot, users may send video messages also known as telebubbles. You can read PDF documents, and accept ZIP files. ZIP inputs will be unpacked and passed as message inputs.\n' +
        "If a query requires the users' location, Telegram supports location sharing, you can ask them.\n" +
        'If you need to access files for coding tasks, run read_file tool. Use it conservatively as it may overload the context length.\n' +
        "Conserve output tokens as much as possible. Don't produce unnecessary content.\n" +
        'When processing receipts, extract the most important bits of information, in structured format, preferably JSON.\n' +
        '\n' +
        'Always try to answer in the preferred language, even if they use another.\n'
    },
    { role: 'system', content: '' },
    {
      role: 'user',
      content: [
        {
          type: 'input_audio',
          input_audio: {
            data: 'SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2Z...
      properties: {
            walletAddressOrSignature: { type: 'string' },
            instruction: {
              type: 'string',
              description: 'Natural language instruction describing what you want from the address or signature'
            }
          },
          required: [ 'walletAddressOrSignature', 'instruction', [length]: 2 ],
          additionalProperties: false,
          '$schema': 'http://json-schema.org/draft-07/schema#'
        }
      }
    },
    [length]: 12
  ],
  tool_choice: 'auto'
}

I specified the audio inputs following the documentation on file inputs.
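
For reference, the audio is attached as a file part like this (mirroring the toSend entry in the log above; audioBuffer is just a placeholder for the decoded bytes):

messages.push({
  role: "user",
  content: [
    {
      type: "file",
      mimeType: "audio/mpeg",
      data: audioBuffer, // Buffer containing the mp3 bytes
    },
  ],
});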

The output did not include audio, so I tried including the audio parameters under providerOptions:

const {
  text: finalResponse,
  response,
  usage,
  ...rest
} = await generateText({
  model: openai("gpt-4o-audio-preview"),
  // system: ...
  messages,
  maxSteps: 5,
  providerOptions: {
    openai: {
      modalities: ["text", "audio"],
      audio: { voice: "alloy", format: "mp3" },
    },
  },
});

But the output did not include audio tokens, nor did it contain any audio in the message completions.
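
For reference, these are the fields the raw Chat Completions request needs for audio output (names per OpenAI's API reference; the values are just the ones I want to send):

const audioParams = {
  modalities: ["text", "audio"],            // ask for both text and audio in the response
  audio: { voice: "alloy", format: "mp3" }, // voice and container format for the generated audio
};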

Use Cases

We could have better support for #3176 discussed in #646 if we had a generateAudio/streamAudio function.

At the moment, though, I would just love to be able to have the model produce audio output.
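
Something along these lines is what I have in mind (purely a hypothetical API shape, nothing like this exists in the SDK today):

const { text, audio } = await generateAudio({
  model: openai("gpt-4o-audio-preview"),
  messages,
  providerOptions: {
    openai: { audio: { voice: "alloy", format: "mp3" } },
  },
});
// audio.data would be the base64-encoded mp3, audio.transcript the spoken text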

Additional context

No response

@raphtlw raphtlw added the enhancement New feature or request label Feb 27, 2025
@raphtlw
Author

raphtlw commented Feb 27, 2025

I wrote a temporary fix for this:

openai.ts

import logger from "@/bot/logger";
import { createOpenAI } from "@ai-sdk/openai";
import {
  CoreMessage,
  generateText,
  GenerateTextOnStepFinishCallback,
  GenerateTextResult,
  IDGenerator,
  JSONValue,
  LanguageModel,
  LanguageModelResponseMetadata,
  LanguageModelUsage,
  LanguageModelV1CallOptions,
  Message,
  ProviderMetadata,
  TelemetrySettings,
  ToolCallRepairFunction,
  ToolChoice,
  ToolSet,
} from "ai";
import assert from "assert";
import OpenAI from "openai";
import { inspect } from "util";

type CallSettings = {
  /**
Maximum number of tokens to generate.
   */
  maxTokens?: number;
  /**
Temperature setting. This is a number between 0 (almost no randomness) and
1 (very random).

It is recommended to set either `temperature` or `topP`, but not both.

@default 0
   */
  temperature?: number;
  /**
Nucleus sampling. This is a number between 0 and 1.

E.g. 0.1 would mean that only tokens with the top 10% probability mass
are considered.

It is recommended to set either `temperature` or `topP`, but not both.
   */
  topP?: number;
  /**
Only sample from the top K options for each subsequent token.

Used to remove "long tail" low probability responses.
Recommended for advanced use cases only. You usually only need to use temperature.
   */
  topK?: number;
  /**
Presence penalty setting. It affects the likelihood of the model to
repeat information that is already in the prompt.

The presence penalty is a number between -1 (increase repetition)
and 1 (maximum penalty, decrease repetition). 0 means no penalty.
   */
  presencePenalty?: number;
  /**
Frequency penalty setting. It affects the likelihood of the model
to repeatedly use the same words or phrases.

The frequency penalty is a number between -1 (increase repetition)
and 1 (maximum penalty, decrease repetition). 0 means no penalty.
   */
  frequencyPenalty?: number;
  /**
Stop sequences.
If set, the model will stop generating text when one of the stop sequences is generated.
Providers may have limits on the number of stop sequences.
   */
  stopSequences?: string[];
  /**
The seed (integer) to use for random sampling. If set and supported
by the model, calls will generate deterministic results.
   */
  seed?: number;
  /**
Maximum number of retries. Set to 0 to disable retries.

@default 2
   */
  maxRetries?: number;
  /**
Abort signal.
   */
  abortSignal?: AbortSignal;
  /**
Additional HTTP headers to be sent with the request.
Only applicable for HTTP-based providers.
   */
  headers?: Record<string, string | undefined>;
};

/**
Prompt part of the AI function options.
It contains a system message, a simple text prompt, or a list of messages.
 */
type Prompt = {
  /**
System message to include in the prompt. Can be used with `prompt` or `messages`.
   */
  system?: string;
  /**
A simple text prompt. You can either use `prompt` or `messages` but not both.
 */
  prompt?: string;
  /**
A list of messages. You can either use `prompt` or `messages` but not both.
   */
  messages?: Array<CoreMessage> | Array<Omit<Message, "id">>;
};

// Local copy of the AI SDK's internal Output specification shape,
// used for the `experimental_output` parameter below.
interface Output<OUTPUT, PARTIAL> {
  readonly type: "object" | "text";
  injectIntoSystemPrompt(options: {
    system: string | undefined;
    model: LanguageModel;
  }): string | undefined;
  responseFormat: (options: {
    model: LanguageModel;
  }) => LanguageModelV1CallOptions["responseFormat"];
  parsePartial(options: { text: string }):
    | {
        partial: PARTIAL;
      }
    | undefined;
  parseOutput(
    options: {
      text: string;
    },
    context: {
      response: LanguageModelResponseMetadata;
      usage: LanguageModelUsage;
    },
  ): OUTPUT;
}

export type GenerateTextParams<
  TOOLS extends ToolSet,
  OUTPUT = never,
  OUTPUT_PARTIAL = never,
> = CallSettings &
  Prompt & {
    /**
The language model to use.
 */
    model: LanguageModel;
    /**
The tools that the model can call. The model needs to support calling tools.
*/
    tools?: TOOLS;
    /**
The tool choice strategy. Default: 'auto'.
 */
    toolChoice?: ToolChoice<TOOLS>;
    /**
Maximum number of sequential LLM calls (steps), e.g. when you use tool calls. Must be at least 1.

A maximum number is required to prevent infinite loops in the case of misconfigured tools.

By default, it's set to 1, which means that only a single LLM call is made.
 */
    maxSteps?: number;
    /**
Generate a unique ID for each message.
 */
    experimental_generateMessageId?: IDGenerator;
    /**
When enabled, the model will perform additional steps if the finish reason is "length" (experimental).

By default, it's set to false.
 */
    experimental_continueSteps?: boolean;
    /**
Optional telemetry configuration (experimental).
 */
    experimental_telemetry?: TelemetrySettings;
    /**
Additional provider-specific options. They are passed through
to the provider from the AI SDK and enable provider-specific
functionality that can be fully encapsulated in the provider.
*/
    providerOptions?: Record<string, Record<string, JSONValue>>;
    /**
@deprecated Use `providerOptions` instead.
 */
    experimental_providerMetadata?: ProviderMetadata;
    /**
Limits the tools that are available for the model to call without
changing the tool call and result types in the result.
 */
    experimental_activeTools?: Array<keyof TOOLS>;
    /**
Optional specification for parsing structured outputs from the LLM response.
 */
    experimental_output?: Output<OUTPUT, OUTPUT_PARTIAL>;
    /**
A function that attempts to repair a tool call that failed to parse.
 */
    experimental_repairToolCall?: ToolCallRepairFunction<TOOLS>;
    /**
Callback that is called when each step (LLM call) is finished, including intermediate steps.
*/
    onStepFinish?: GenerateTextOnStepFinishCallback<TOOLS>;
    /**
     * Internal. For test use only. May change without notice.
     */
    _internal?: {
      generateId?: IDGenerator;
      currentDate?: () => Date;
    };
  };

export const generateAudio = async <
  TOOLS extends ToolSet,
  OUTPUT = never,
  OUTPUT_PARTIAL = never,
>(
  args: Omit<GenerateTextParams<TOOLS, OUTPUT, OUTPUT_PARTIAL>, "model">,
): Promise<
  GenerateTextResult<TOOLS, OUTPUT> & {
    audio: OpenAI.Chat.Completions.ChatCompletionAudio | null;
  }
> => {
  let rawOutput: string | undefined;

  const customFetch: typeof globalThis.fetch = async (url, options) => {
    logger.debug(url, "Requesting URL");

    if (options) {
      const { body, ...rest } = options;
      logger.debug(rest, "Options");

      if (typeof body === "string") {
        const openaiBody = JSON.parse(body);
        logger.debug(`Body: ${inspect(openaiBody, true, 10, true)}`);

        // Re-serialize the request with the audio parameters that the SDK
        // currently has no way to pass through.
        options.body = JSON.stringify({
          ...openaiBody,
          modalities: ["text", "audio"],
          audio: { voice: "alloy", format: "mp3" },
        });

        logger.debug(options.body);
      }
    }

    try {
      const response = await fetch(url, options);
      // Clone so the SDK can still consume the body, and keep the raw text
      // around to extract the audio part from later.
      rawOutput = await response.clone().text();

      logger.debug(rawOutput, "Raw OpenAI model output");

      return response;
    } catch (error) {
      console.error("Fetch error:", error);
      throw error;
    }
  };

  // Inject custom fetch into OpenAI model
  const openaiAudio = createOpenAI({
    compatibility: "compatible",
    fetch: customFetch,
    name: "openai",
  });

  const generateResult = await generateText({
    ...args,
    model: openaiAudio("gpt-4o-audio-preview"),
  });

  assert(rawOutput, "OpenAI output required at this stage!");

  // The generateText result does not expose the audio part, so read it from
  // the raw completion instead.
  const rawJson: OpenAI.Chat.Completions.ChatCompletion = JSON.parse(rawOutput);
  const audio = rawJson.choices[0].message.audio ?? null;

  return { ...generateResult, audio };
};

It is a simple generateText wrapper that intercepts the call to OpenAI, adds the necessary parameters for audio generation, and returns the audio alongside the usual result.
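
Usage looks roughly like this (the output path and the messages variable are just examples; audio.data is base64 in whatever format was requested above):

import { writeFile } from "node:fs/promises";

const { text, audio } = await generateAudio({ messages, maxSteps: 5 });

if (audio) {
  // audio.transcript is the spoken text, audio.data the base64-encoded mp3
  await writeFile("reply.mp3", Buffer.from(audio.data, "base64"));
}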

Feel free to tweak it however you wish.
