feat: invoke anthropic SDK to call Claude (#197)
* feat: invoke anthropic SDK

* chore: set response format for extract

* fix: do not throw if waitUntilNetworkIdle failed in aiAction

* fix: timeout config for Puppeteer

* chore: add instruction for connectivity test
yuyutaotao authored Dec 23, 2024
1 parent 229ffb0 commit f3d46b5
Showing 10 changed files with 274 additions and 104 deletions.
20 changes: 19 additions & 1 deletion apps/site/docs/en/model-provider.md
@@ -40,7 +40,7 @@ export MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON='{"apiVersion": "2024-11-01-previe

## Choose a model other than `gpt-4o`

-We find that `gpt-4o` performs the best for Midscene at this moment. The other known supported models are: `gemini-1.5-pro`, `qwen-vl-max-latest`, `doubao-vision-pro-32k`
+We find that `gpt-4o` performs best for Midscene at this moment. The other known supported models are `claude-3-opus-20240229`, `gemini-1.5-pro`, `qwen-vl-max-latest`, and `doubao-vision-pro-32k`.

If you want to use other models, please follow these steps:

@@ -49,6 +49,18 @@ If you want to use other models, please follow these steps:
3. If the new model does not perform well, try short, clear prompts (or roll back to the previous model). See [Prompting Tips](./prompting-tips.html) for more details.
4. Remember to follow the terms of use of each model.

## Example: Using `claude-3-opus-20240229` from Anthropic

When `MIDSCENE_USE_ANTHROPIC_SDK=1` is set, Midscene will use the Anthropic SDK (`@anthropic-ai/sdk`) to call the model.

Configure the environment variables:

```bash
export MIDSCENE_USE_ANTHROPIC_SDK=1
export ANTHROPIC_API_KEY="....."
export MIDSCENE_MODEL_NAME="claude-3-opus-20240229"
```

## Example: Using `gemini-1.5-pro` from Google

Configure the environment variables:
@@ -80,3 +92,9 @@ export OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
export OPENAI_API_KEY="..."
export MIDSCENE_MODEL_NAME="ep-202....."
```

## Troubleshooting LLM Service Connectivity Issues

To troubleshoot connectivity issues, use the `connectivity-test` folder in our example project: [https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test](https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test)

Put your `.env` file in the `connectivity-test` folder, and run the test with `npm i && npm run test`.
36 changes: 27 additions & 9 deletions apps/site/docs/zh/model-provider.md
@@ -37,7 +37,7 @@ export MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON='{"apiVersion": "2024-11-01-previe

## Choose a model other than `gpt-4o`

-We find that `gpt-4o` performs best for Midscene at this moment. The other known supported models are: `qwen-vl-max-latest` (Qwen), `gemini-1.5-pro`, `doubao-vision-pro-32k` (Doubao)
+We find that `gpt-4o` performs best for Midscene at this moment. The other known supported models are: `claude-3-opus-20240229`, `gemini-1.5-pro`, `qwen-vl-max-latest` (Qwen), and `doubao-vision-pro-32k` (Doubao).

If you want to use other models, please follow these steps:

@@ -46,24 +46,36 @@ export MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON='{"apiVersion": "2024-11-01-previe
3. If the new model does not perform well, try short, clear prompts (or roll back to the previous model). See [Prompting Tips](./prompting-tips.html) for more details.
4. Remember to follow the terms of use of each model.

-## Example: Using `gemini-1.5-pro` from Google
+## Example: Using `qwen-vl-max-latest` from Alibaba Cloud

Configure the environment variables:

```bash
-export OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai"
-export OPENAI_API_KEY="....."
-export MIDSCENE_MODEL_NAME="gemini-1.5-pro"
+export OPENAI_API_KEY="sk-..."
+export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
+export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
```

-## Example: Using `qwen-vl-max-latest` from Alibaba Cloud
+## Example: Using `claude-3-opus-20240229` from Anthropic

When `MIDSCENE_USE_ANTHROPIC_SDK=1` is set, Midscene will use the Anthropic SDK (`@anthropic-ai/sdk`) to call the model.

Configure the environment variables:

```bash
-export OPENAI_API_KEY="sk-..."
-export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
-export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
+export MIDSCENE_USE_ANTHROPIC_SDK=1
+export ANTHROPIC_API_KEY="....."
+export MIDSCENE_MODEL_NAME="claude-3-opus-20240229"
```

## Example: Using `gemini-1.5-pro` from Google

Configure the environment variables:

```bash
export OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai"
export OPENAI_API_KEY="....."
export MIDSCENE_MODEL_NAME="gemini-1.5-pro"
```

## Example: Using `doubao-vision-pro-32k` (Doubao) from Volcano Engine
@@ -77,3 +89,9 @@ export OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
export OPENAI_API_KEY="..."
export MIDSCENE_MODEL_NAME="ep-202....."
```

## Troubleshooting LLM Service Connectivity Issues

To troubleshoot LLM service connectivity issues, use the `connectivity-test` folder in our example project: [https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test](https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test)

Put your `.env` file in the `connectivity-test` folder, then run `npm i && npm run test` to inspect the problem.
1 change: 1 addition & 0 deletions packages/midscene/package.json
@@ -37,6 +37,7 @@
"prepublishOnly": "npm run build"
},
"dependencies": {
"@anthropic-ai/sdk": "0.33.1",
"@azure/identity": "4.5.0",
"@midscene/shared": "workspace:*",
"dirty-json": "0.9.2",
21 changes: 11 additions & 10 deletions packages/midscene/src/ai-model/common.ts
@@ -1,11 +1,13 @@
import assert from 'node:assert';
import { MIDSCENE_MODEL_TEXT_ONLY, getAIConfig } from '@/env';
import type { AIUsageInfo } from '@/types';

import type {
ChatCompletionContentPart,
ChatCompletionSystemMessageParam,
ChatCompletionUserMessageParam,
} from 'openai/resources';
-import { callToGetJSONObject, preferOpenAIModel } from './openai';
+import { callToGetJSONObject, checkAIConfig } from './openai';

export type AIArgs = [
ChatCompletionSystemMessageParam,
@@ -24,17 +26,16 @@ export async function callAiFn<T>(options: {
AIActionType: AIActionType;
}): Promise<{ content: T; usage?: AIUsageInfo }> {
const { msgs, AIActionType: AIActionTypeValue } = options;
-  if (preferOpenAIModel('openAI')) {
-    const { content, usage } = await callToGetJSONObject<T>(
-      msgs,
-      AIActionTypeValue,
-    );
-    return { content, usage };
-  }
-
-  throw Error(
-    'Cannot find OpenAI config. You should set it before using. https://midscenejs.com/model-provider.html',
-  );
+  assert(
+    checkAIConfig(),
+    'Cannot find config for AI model service. You should set it before using. https://midscenejs.com/model-provider.html',
+  );
+
+  const { content, usage } = await callToGetJSONObject<T>(
+    msgs,
+    AIActionTypeValue,
+  );
+  return { content, usage };
}

export function transformUserMessages(msgs: ChatCompletionContentPart[]) {
127 changes: 100 additions & 27 deletions packages/midscene/src/ai-model/openai/index.ts
@@ -1,5 +1,6 @@
import assert from 'node:assert';
import { AIResponseFormat, type AIUsageInfo } from '@/types';
import { Anthropic } from '@anthropic-ai/sdk';
import {
DefaultAzureCredential,
getBearerTokenProvider,
@@ -10,6 +11,7 @@ import OpenAI, { AzureOpenAI } from 'openai';
import type { ChatCompletionMessageParam } from 'openai/resources';
import { SocksProxyAgent } from 'socks-proxy-agent';
import {
ANTHROPIC_API_KEY,
MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON,
MIDSCENE_AZURE_OPENAI_SCOPE,
MIDSCENE_DANGEROUSLY_PRINT_ALL_CONFIG,
@@ -18,6 +20,7 @@ import {
MIDSCENE_MODEL_NAME,
MIDSCENE_OPENAI_INIT_CONFIG_JSON,
MIDSCENE_OPENAI_SOCKS_PROXY,
MIDSCENE_USE_ANTHROPIC_SDK,
MIDSCENE_USE_AZURE_OPENAI,
OPENAI_API_KEY,
OPENAI_BASE_URL,
@@ -31,10 +34,11 @@ import { findElementSchema } from '../prompt/element_inspector';
import { planSchema } from '../prompt/planning';
import { assertSchema } from '../prompt/util';

-export function preferOpenAIModel(preferVendor?: 'coze' | 'openAI') {
+export function checkAIConfig(preferVendor?: 'coze' | 'openAI') {
if (preferVendor && preferVendor !== 'openAI') return false;
if (getAIConfig(OPENAI_API_KEY)) return true;
if (getAIConfig(MIDSCENE_USE_AZURE_OPENAI)) return true;
if (getAIConfig(ANTHROPIC_API_KEY)) return true;

return Boolean(getAIConfig(MIDSCENE_OPENAI_INIT_CONFIG_JSON));
}
@@ -50,8 +54,11 @@ export function getModelName() {
return modelName;
}

-async function createOpenAI() {
-  let openai: OpenAI | AzureOpenAI;
+async function createChatClient(): Promise<{
+  completion: OpenAI.Chat.Completions;
+  style: 'openai' | 'anthropic';
+}> {
+  let openai: OpenAI | AzureOpenAI | undefined;
const extraConfig = getAIConfigInJson(MIDSCENE_OPENAI_INIT_CONFIG_JSON);

const socksProxy = getAIConfig(MIDSCENE_OPENAI_SOCKS_PROXY);
@@ -65,7 +72,7 @@ async function createOpenAI() {
httpAgent: socksAgent,
...extraConfig,
dangerouslyAllowBrowser: true,
-    });
+    }) as OpenAI;
} else if (getAIConfig(MIDSCENE_USE_AZURE_OPENAI)) {
// sample code: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples/cookbook/simpleCompletionsPage/app.js
const scope = getAIConfig(MIDSCENE_AZURE_OPENAI_SCOPE);
@@ -87,7 +94,7 @@ async function createOpenAI() {
...extraConfig,
...extraAzureConfig,
});
-  } else {
+  } else if (!getAIConfig(MIDSCENE_USE_ANTHROPIC_SDK)) {
openai = new OpenAI({
baseURL: getAIConfig(OPENAI_BASE_URL),
apiKey: getAIConfig(OPENAI_API_KEY),
@@ -97,7 +104,7 @@ });
});
}

-  if (getAIConfig(MIDSCENE_LANGSMITH_DEBUG)) {
+  if (openai && getAIConfig(MIDSCENE_LANGSMITH_DEBUG)) {
if (ifInBrowser) {
throw new Error('langsmith is not supported in browser');
}
@@ -106,7 +113,30 @@
openai = wrapOpenAI(openai);
}

-  return openai;
if (typeof openai !== 'undefined') {
return {
completion: openai.chat.completions,
style: 'openai',
};
}

// Anthropic
if (getAIConfig(MIDSCENE_USE_ANTHROPIC_SDK)) {
const apiKey = getAIConfig(ANTHROPIC_API_KEY);
assert(apiKey, 'ANTHROPIC_API_KEY is required');
openai = new Anthropic({
apiKey,
}) as any;
}

if (typeof openai !== 'undefined' && (openai as any).messages) {
return {
completion: (openai as any).messages,
style: 'anthropic',
};
}

throw new Error('Openai SDK or Anthropic SDK is not initialized');
}

export async function call(
@@ -115,32 +145,74 @@ export async function call(
| OpenAI.ChatCompletionCreateParams['response_format']
| OpenAI.ResponseFormatJSONObject,
): Promise<{ content: string; usage?: AIUsageInfo }> {
-  const openai = await createOpenAI();
+  const { completion, style } = await createChatClient();
const shouldPrintTiming =
typeof getAIConfig(MIDSCENE_DEBUG_AI_PROFILE) === 'string';
if (getAIConfig(MIDSCENE_DANGEROUSLY_PRINT_ALL_CONFIG)) {
console.log(allAIConfig());
}

const startTime = Date.now();
const model = getModelName();
-  const completion = await openai.chat.completions.create({
-    model,
-    messages,
-    response_format: responseFormat,
-    temperature: 0.1,
-    stream: false,
-    // betas: ['computer-use-2024-10-22'],
-  } as any);
-  shouldPrintTiming &&
-    console.log(
-      'Midscene - AI call',
-      model,
-      completion.usage,
-      `${Date.now() - startTime}ms`,
-    );
-  const { content } = completion.choices[0].message;
-  assert(content, 'empty content');
-  return { content, usage: completion.usage };
+  let content: string | undefined;
+  let usage: OpenAI.CompletionUsage | undefined;
+  const commonConfig = {
+    temperature: 0.1,
+    stream: false,
+    max_tokens: 3000,
+  };
+  if (style === 'openai') {
+    const result = await completion.create({
+      model,
+      messages,
+      response_format: responseFormat,
+      ...commonConfig,
+      // betas: ['computer-use-2024-10-22'],
+    } as any);
+    shouldPrintTiming &&
+      console.log(
+        'Midscene - AI call',
+        model,
+        result.usage,
+        `${Date.now() - startTime}ms`,
+      );
+    content = result.choices[0].message.content!;
+    assert(content, 'empty content');
+    usage = result.usage;
} else if (style === 'anthropic') {
const convertImageContent = (content: any) => {
if (content.type === 'image_url') {
const imgBase64 = content.image_url.url;
assert(imgBase64, 'image_url is required');
return {
source: {
type: 'base64',
media_type: imgBase64.includes('data:image/png;base64,')
? 'image/png'
: 'image/jpeg',
data: imgBase64.split(',')[1],
},
type: 'image',
};
}
return content;
};

const result = await completion.create({
model,
system: 'You are a versatile professional in software UI automation',
messages: messages.map((m) => ({
role: 'user',
content: Array.isArray(m.content)
? (m.content as any).map(convertImageContent)
: m.content,
})),
response_format: responseFormat,
...commonConfig,
} as any);
content = (result as any).content[0].text as string;
assert(content, 'empty content');
usage = result.usage;
}

return { content: content || '', usage };
}

export async function callToGetJSONObject<T>(
@@ -166,13 +238,14 @@ export async function callToGetJSONObject<T>(
case AIActionType.EXTRACT_DATA:
//TODO: Currently the restriction type can only be a json subset of the constraint, and the way the extract api is used needs to be adjusted to limit the user's data to this as well
// targetResponseFormat = extractDataSchema;
responseFormat = { type: AIResponseFormat.JSON };
break;
case AIActionType.PLAN:
responseFormat = planSchema;
break;
}

-  if (model === 'gpt-4o-2024-05-13') {
+  if (model === 'gpt-4o-2024-05-13' || !responseFormat) {
responseFormat = { type: AIResponseFormat.JSON };
}
}
6 changes: 6 additions & 0 deletions packages/midscene/src/env.ts
@@ -21,6 +21,9 @@ export const MIDSCENE_AZURE_OPENAI_SCOPE = 'MIDSCENE_AZURE_OPENAI_SCOPE';
export const MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON =
'MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON';

export const MIDSCENE_USE_ANTHROPIC_SDK = 'MIDSCENE_USE_ANTHROPIC_SDK';
export const ANTHROPIC_API_KEY = 'ANTHROPIC_API_KEY';

// @deprecated
export const OPENAI_USE_AZURE = 'OPENAI_USE_AZURE';

@@ -54,6 +57,9 @@ const allConfigFromEnv = () => {
'https://cognitiveservices.azure.com/.default',
[MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON]:
process.env[MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON] || undefined,
[MIDSCENE_USE_ANTHROPIC_SDK]:
process.env[MIDSCENE_USE_ANTHROPIC_SDK] || undefined,
[ANTHROPIC_API_KEY]: process.env[ANTHROPIC_API_KEY] || undefined,
};
};
