OpenAI Chat Completion


Introduction

This guide demonstrates how to use the SAP AI SDK for Java to perform chat completion tasks using OpenAI models deployed on SAP AI Core.

Prerequisites

Before using the AI Core module, ensure that you have met all the general requirements outlined in the README.md. Additionally, include the necessary Maven dependency in your project.

Maven Dependencies

Add the following dependency to your pom.xml file:

```xml
<dependencies>
    <dependency>
        <groupId>com.sap.ai.sdk.foundationmodels</groupId>
        <artifactId>openai</artifactId>
        <version>${ai-sdk.version}</version>
    </dependency>
</dependencies>
```

See an example pom in our Spring Boot application

Usage

In addition to the prerequisites above, we assume you have already set up the following to carry out the examples in this guide:

  • A Deployed OpenAI Model in SAP AI Core
    • Refer to How to deploy a model to AI Core for setup instructions.

    • Example deployed model from the AI Core /deployments endpoint
      ```json
      {
        "id": "d123456abcdefg",
        "deploymentUrl": "https://api.ai.region.aws.ml.hana.ondemand.com/v2/inference/deployments/d123456abcdefg",
        "configurationId": "12345-123-123-123-123456abcdefg",
        "configurationName": "gpt-35-turbo",
        "scenarioId": "foundation-models",
        "status": "RUNNING",
        "statusMessage": null,
        "targetStatus": "RUNNING",
        "lastOperation": "CREATE",
        "latestRunningConfigurationId": "12345-123-123-123-123456abcdefg",
        "ttl": null,
        "details": {
          "scaling": {
            "backendDetails": null,
            "backend_details": {}
          },
          "resources": {
            "backendDetails": null,
            "backend_details": {
              "model": {
                "name": "gpt-35-turbo",
                "version": "latest"
              }
            }
          }
        },
        "createdAt": "2024-07-03T12:44:22Z",
        "modifiedAt": "2024-07-16T12:44:19Z",
        "submissionTime": "2024-07-03T12:44:51Z",
        "startTime": "2024-07-03T12:45:56Z",
        "completionTime": null
      }
      ```
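Before sending requests, it can be useful to confirm that the deployment's `status` field reports `RUNNING`. As a hypothetical illustration (this helper is not part of the SAP AI SDK), a naive status check on the raw JSON payload could look like this; a real client should use a proper JSON parser instead of a regex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeploymentStatusCheck {
    // Illustrative only: pulls the top-level "status" value out of a
    // /deployments JSON payload with a simple regex.
    static String extractStatus(String deploymentJson) {
        Matcher m = Pattern.compile("\"status\"\\s*:\\s*\"([A-Z]+)\"").matcher(deploymentJson);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    public static void main(String[] args) {
        String json = "{ \"id\": \"d123456abcdefg\", \"status\": \"RUNNING\", \"targetStatus\": \"RUNNING\" }";
        System.out.println(extractStatus(json)); // prints "RUNNING"
    }
}
```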

Simple chat completion

```java
OpenAiChatCompletionOutput result =
    OpenAiClient.forModel(GPT_35_TURBO)
        .withSystemPrompt("You are a helpful AI")
        .chatCompletion("Hello World! Why is this phrase so famous?");

String resultMessage = result.getContent();
```

Message history

```java
var systemMessage =
    new OpenAiChatSystemMessage().setContent("You are a helpful assistant");
var userMessage =
    new OpenAiChatUserMessage().addText("Hello World! Why is this phrase so famous?");
var request =
    new OpenAiChatCompletionParameters().addMessages(systemMessage, userMessage);

OpenAiChatCompletionOutput result =
    OpenAiClient.forModel(GPT_35_TURBO).chatCompletion(request);

String resultMessage = result.getContent();
```

See an example in our Spring Boot application

Chat Completion with Custom Model

```java
// Reuses the request built in the message-history example above
OpenAiChatCompletionOutput result =
    OpenAiClient.forModel(new OpenAiModel("model")).chatCompletion(request);
```

Stream chat completion

It's possible to pass a stream of chat completion delta elements, e.g. from the application backend to the frontend in real-time.

Asynchronous Streaming

This is a blocking example for streaming and printing directly to the console:

```java
String msg = "Can you give me the first 100 numbers of the Fibonacci sequence?";

OpenAiClient client = OpenAiClient.forModel(GPT_35_TURBO);

// try-with-resources on stream ensures the connection will be closed
try (Stream<String> stream = client.streamChatCompletion(msg)) {
    stream.forEach(
        deltaString -> {
            System.out.print(deltaString);
            System.out.flush();
        });
}
```
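The try-with-resources pattern works because `java.util.stream.Stream` implements `AutoCloseable`: closing the stream runs its `onClose` handlers, which is the mechanism by which an underlying resource such as an HTTP connection gets released. A minimal stdlib-only sketch of that behavior (the flag is illustrative, not SDK code):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

public class StreamCloseDemo {
    public static void main(String[] args) {
        AtomicBoolean closed = new AtomicBoolean(false);

        // onClose registers a handler that runs when the stream is closed,
        // mirroring how a client would release its underlying connection.
        try (Stream<String> stream =
                Stream.of("Hello", " ", "World").onClose(() -> closed.set(true))) {
            stream.forEach(System.out::print);
        }

        System.out.println();
        System.out.println("closed=" + closed.get()); // prints "closed=true"
    }
}
```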

Aggregating Total Output

The following example is non-blocking and demonstrates how to aggregate the complete response. Any asynchronous library can be used, such as the classic Thread API.

```java
var message = "Can you give me the first 100 numbers of the Fibonacci sequence?";

OpenAiChatMessage.OpenAiChatUserMessage userMessage =
    new OpenAiChatMessage.OpenAiChatUserMessage().addText(message);
OpenAiChatCompletionParameters requestParameters =
    new OpenAiChatCompletionParameters().addMessages(userMessage);

OpenAiClient client = OpenAiClient.forModel(GPT_35_TURBO);
var totalOutput = new OpenAiChatCompletionOutput();

// Prepare the stream before starting the thread to handle any initialization exceptions
Stream<OpenAiChatCompletionDelta> stream =
    client.streamChatCompletionDeltas(requestParameters);

var streamProcessor =
    new Thread(
        () -> {
          // try-with-resources ensures the stream is closed after processing
          try (stream) {
            stream.peek(totalOutput::addDelta).forEach(System.out::println);
          }
        });

streamProcessor.start(); // Start processing in a separate thread (non-blocking)
streamProcessor.join(); // Wait for the thread to finish (blocking)

// Access aggregated information from total output
Integer tokensUsed = totalOutput.getUsage().getCompletionTokens();
System.out.println("Tokens used: " + tokensUsed);
```
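The same produce-in-a-thread, aggregate, then join pattern can be exercised with plain stdlib types. In this stand-in sketch (illustrative names, not SDK code), a `StringBuilder` plays the role of the aggregated output and a fixed `Stream<String>` plays the role of the delta stream:

```java
import java.util.stream.Stream;

public class AggregateDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the delta stream returned by streamChatCompletionDeltas
        Stream<String> deltas = Stream.of("1", ", 1", ", 2", ", 3", ", 5");

        StringBuilder total = new StringBuilder(); // stand-in for the aggregated output

        Thread streamProcessor = new Thread(() -> {
            try (deltas) { // try-with-resources closes the stream after processing
                deltas.forEach(total::append);
            }
        });

        streamProcessor.start(); // non-blocking
        streamProcessor.join();  // wait for the thread to finish

        System.out.println(total); // prints "1, 1, 2, 3, 5"
    }
}
```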

Spring Boot example

Please find an example in our Spring Boot application. It shows the usage of Spring Boot's ResponseBodyEmitter to stream the chat completion delta messages to the frontend in real-time.