feat: update the Azure OpenAI integration, add instruction for other models (#193)
yuyutaotao authored Dec 20, 2024
1 parent 48fa8a5 commit f1b73b2
Showing 15 changed files with 437 additions and 41 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -52,6 +52,7 @@ jspm_packages/

# dotenv environment variables file
.env
.env.*

# next.js build output
.next
4 changes: 3 additions & 1 deletion .vscode/settings.json
@@ -8,10 +8,12 @@
"AITEST",
"Aliyun",
"aweme",
"doubao",
"douyin",
"httpbin",
"iconfont",
"qwen",
"taobao"
"taobao",
"Volcengine"
]
}
10 changes: 3 additions & 7 deletions apps/site/docs/en/faq.md
@@ -2,9 +2,7 @@

## Can Midscene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'"

Midscene is an automation assistance SDK with a key feature of action stability — ensuring the same actions are performed in each run. To maintain this stability, we encourage you to provide detailed instructions to help the AI understand each step of your task.

If you require a 'goal-to-task' AI planning tool, you can develop one based on Midscene.
No. Midscene is an automation assistance SDK with a key feature of action stability — ensuring the same actions are performed in each run. To maintain this stability, we encourage you to provide detailed instructions to help the AI understand each step of your task.

Related Docs: [Prompting Tips](./prompting-tips.html)

@@ -16,11 +14,9 @@ There are some limitations with Midscene. We are still working on them.
2. LLM is not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability.
3. Since we use JavaScript to retrieve items from the page, the elements inside the iframe cannot be accessed.

## Which LLM should I choose ?

Midscene needs a multimodal Large Language Model (LLM) to understand the UI. Currently, we find that OpenAI's GPT-4o performs much better than others.
## Can I use a model other than `gpt-4o`?

You can [customize model and provider](./model-provider.html) if needed.
Yes. You can [customize model and provider](./model-provider.html) if needed.

## About the token cost

45 changes: 40 additions & 5 deletions apps/site/docs/en/model-provider.md
@@ -30,13 +30,36 @@ export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"ke
export MIDSCENE_OPENAI_SOCKS_PROXY="socks5://127.0.0.1:1080"
```

Note:
## Using Azure OpenAI Service

- Always choose a model that supports vision input.
- Currently, the known supported models are: `gpt-4o`, `qwen-vl-max-latest`, `gemini-1.5-pro`
- Please follow the terms of use of each model.
```bash
export MIDSCENE_USE_AZURE_OPENAI=1
export MIDSCENE_AZURE_OPENAI_SCOPE="https://cognitiveservices.azure.com/.default"
export MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON='{"apiVersion": "2024-11-01-preview", "endpoint": "...", "deployment": "..."}'
```
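
For reference, here is a rough sketch of how these variables could map onto an Azure client, mirroring the token-based setup this commit adds to the SDK. The scope comment, endpoint, and deployment values below are placeholders, not real configuration:

```typescript
import {
  DefaultAzureCredential,
  getBearerTokenProvider,
} from '@azure/identity';
import { AzureOpenAI } from 'openai';

// No API key: credentials are resolved from the Azure environment
// (managed identity, Azure CLI login, etc.).
const credential = new DefaultAzureCredential();
const tokenProvider = getBearerTokenProvider(
  credential,
  'https://cognitiveservices.azure.com/.default', // MIDSCENE_AZURE_OPENAI_SCOPE
);

// Fields taken from MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON (placeholders here).
const client = new AzureOpenAI({
  azureADTokenProvider: tokenProvider,
  apiVersion: '2024-11-01-preview',
  endpoint: 'https://<your-resource>.openai.azure.com/',
  deployment: '<your-deployment>',
});
```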

## Choose a model other than `gpt-4o`

We find that `gpt-4o` performs the best for Midscene at this moment. The other known supported models are: `gemini-1.5-pro`, `qwen-vl-max-latest`, `doubao-vision-pro-32k`

If you want to use other models, please follow these steps:

## Example: Using `qwen-vl-max-latest` service from Aliyun
1. Choose a model that supports image input (a.k.a. multimodal model).
2. Find out how to call it with an OpenAI SDK compatible endpoint. Usually you should set `OPENAI_BASE_URL`, `OPENAI_API_KEY`, and `MIDSCENE_MODEL_NAME` (see the optional sketch after this list).
3. If you find it not working well after changing the model, try using short and clear prompts (or roll back to the previous model). See more details in [Prompting Tips](./prompting-tips.html).
4. Remember to follow the terms of use of each model.
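
As an optional sanity check (not part of the official steps), you can confirm that your chosen endpoint really accepts image input before wiring it into Midscene. This sketch assumes the `openai` npm package and uses placeholder values:

```typescript
import OpenAI from 'openai';

// Reads the same variables Midscene uses.
const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
});

async function checkImageInput() {
  const res = await client.chat.completions.create({
    model: process.env.MIDSCENE_MODEL_NAME || 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Describe this image in one sentence.' },
          // Any small screenshot encoded as a data URL (placeholder below).
          { type: 'image_url', image_url: { url: 'data:image/png;base64,....' } },
        ],
      },
    ],
  });
  console.log(res.choices[0].message.content);
}

checkImageInput().catch(console.error);
```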

## Example: Using `gemini-1.5-pro` from Google

Configure the environment variables:

```bash
export OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai"
export OPENAI_API_KEY="....."
export MIDSCENE_MODEL_NAME="gemini-1.5-pro"
```

## Example: Using `qwen-vl-max-latest` from Aliyun

Configure the environment variables:

@@ -45,3 +68,15 @@ export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
```

## Example: Using `doubao-vision-pro-32k` from Volcengine

Create an inference point first: https://console.volcengine.com/ark/region:ark+cn-beijing/endpoint

Configure the environment variables:

```bash
export OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
export OPENAI_API_KEY="..."
export MIDSCENE_MODEL_NAME="ep-202....."
```
6 changes: 2 additions & 4 deletions apps/site/docs/zh/faq.md
@@ -16,11 +16,9 @@ Midscene has some limitations, and we are still working to improve them.
2. Stability risk: even GPT-4o cannot guarantee a correct answer 100% of the time. Following [Prompting Tips](./prompting-tips) can help improve SDK stability.
3. Limited element access: since we use JavaScript to extract elements from the page, elements inside iframes cannot be accessed.

## Which LLM model should I choose
## Can I use a model other than `gpt-4o`

Midscene needs a multimodal large language model that can understand user interfaces. Currently, we find that OpenAI's GPT-4o performs best, far better than other models.

You can [customize the model and provider](./model-provider.html) as needed.
Yes. You can [customize the model and provider](./model-provider.html).

## About token cost

47 changes: 40 additions & 7 deletions apps/site/docs/zh/model-provider.md
@@ -17,9 +17,6 @@ export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
# Optional, if you want to change the base URL
export OPENAI_BASE_URL="https://..."

# Optional, if you want to use the Azure OpenAI service
export OPENAI_USE_AZURE="true"

# Optional, if you want to specify the model name
export MIDSCENE_MODEL_NAME='qwen-vl-max-lates';

@@ -30,12 +27,36 @@ export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"ke
export MIDSCENE_OPENAI_SOCKS_PROXY="socks5://127.0.0.1:1080"
```

Notes:
## Configuration for the Azure OpenAI service

```bash
export MIDSCENE_USE_AZURE_OPENAI=1
export MIDSCENE_AZURE_OPENAI_SCOPE="https://cognitiveservices.azure.com/.default"
export MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON='{"apiVersion": "2024-11-01-preview", "endpoint": "...", "deployment": "..."}'
```

## Using a model other than `gpt-4o`

We find that `gpt-4o` performs best at the moment. The other known supported models are: `qwen-vl-max-latest` (Qwen), `gemini-1.5-pro`, `doubao-vision-pro-32k` (Doubao)

If you want to use another model, follow these steps:

1. Choose a model that supports visual input (i.e. a multimodal model).
2. Find out how to call it through an OpenAI SDK compatible endpoint; model providers usually offer one. You need to configure `OPENAI_BASE_URL`, `OPENAI_API_KEY`, and `MIDSCENE_MODEL_NAME`.
3. If the new model does not work well, try short and clear prompts (or roll back to the previous model). See [Prompting Tips](./prompting-tips.html) for more details.
4. Follow the terms of use of each model.

## Example: using Google's `gemini-1.5-pro` model

Configure the environment variables:

- Always choose a model that supports visual input. Currently, the known supported models are: `gpt-4o`, `qwen-vl-max-latest` (Qwen), `gemini-1.5-pro`
- Follow the terms of use of each model
```bash
export OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai"
export OPENAI_API_KEY="....."
export MIDSCENE_MODEL_NAME="gemini-1.5-pro"
```

## Example: using the `qwen-vl-max-latest` model deployed on Aliyun
## Example: using Aliyun's `qwen-vl-max-latest` model

Configure the environment variables:

@@ -44,3 +65,15 @@ export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
```

## Example: using Volcengine's Doubao model `doubao-vision-pro-32k`

An inference endpoint needs to be configured before calling: https://console.volcengine.com/ark/region:ark+cn-beijing/endpoint

Configure the environment variables:

```bash
export OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
export OPENAI_API_KEY="..."
export MIDSCENE_MODEL_NAME="ep-202....."
```
2 changes: 2 additions & 0 deletions packages/midscene/package.json
@@ -37,7 +37,9 @@
"prepublishOnly": "npm run build"
},
"dependencies": {
"@azure/identity": "4.5.0",
"@midscene/shared": "workspace:*",
"dirty-json": "0.9.2",
"openai": "4.57.1",
"optional": "0.1.4",
"socks-proxy-agent": "8.0.4"
48 changes: 43 additions & 5 deletions packages/midscene/src/ai-model/openai/index.ts
@@ -1,16 +1,24 @@
import assert from 'node:assert';
import { AIResponseFormat, type AIUsageInfo } from '@/types';
import {
DefaultAzureCredential,
getBearerTokenProvider,
} from '@azure/identity';
import { ifInBrowser } from '@midscene/shared/utils';
import dJSON from 'dirty-json';
import OpenAI, { AzureOpenAI } from 'openai';
import type { ChatCompletionMessageParam } from 'openai/resources';
import { SocksProxyAgent } from 'socks-proxy-agent';
import {
MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON,
MIDSCENE_AZURE_OPENAI_SCOPE,
MIDSCENE_DANGEROUSLY_PRINT_ALL_CONFIG,
MIDSCENE_DEBUG_AI_PROFILE,
MIDSCENE_LANGSMITH_DEBUG,
MIDSCENE_MODEL_NAME,
MIDSCENE_OPENAI_INIT_CONFIG_JSON,
MIDSCENE_OPENAI_SOCKS_PROXY,
MIDSCENE_USE_AZURE_OPENAI,
OPENAI_API_KEY,
OPENAI_BASE_URL,
OPENAI_USE_AZURE,
@@ -26,6 +34,7 @@ import { assertSchema } from '../prompt/util';
export function preferOpenAIModel(preferVendor?: 'coze' | 'openAI') {
if (preferVendor && preferVendor !== 'openAI') return false;
if (getAIConfig(OPENAI_API_KEY)) return true;
if (getAIConfig(MIDSCENE_USE_AZURE_OPENAI)) return true;

return Boolean(getAIConfig(MIDSCENE_OPENAI_INIT_CONFIG_JSON));
}
@@ -47,14 +56,37 @@ async function createOpenAI() {

const socksProxy = getAIConfig(MIDSCENE_OPENAI_SOCKS_PROXY);
const socksAgent = socksProxy ? new SocksProxyAgent(socksProxy) : undefined;

if (getAIConfig(OPENAI_USE_AZURE)) {
// this is deprecated
openai = new AzureOpenAI({
baseURL: getAIConfig(OPENAI_BASE_URL),
apiKey: getAIConfig(OPENAI_API_KEY),
httpAgent: socksAgent,
...extraConfig,
dangerouslyAllowBrowser: true,
});
} else if (getAIConfig(MIDSCENE_USE_AZURE_OPENAI)) {
// sample code: https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/openai/openai/samples/cookbook/simpleCompletionsPage/app.js
const scope = getAIConfig(MIDSCENE_AZURE_OPENAI_SCOPE);

assert(
!ifInBrowser,
'Azure OpenAI is not supported in browser with Midscene.',
);
const credential = new DefaultAzureCredential();

assert(scope, 'MIDSCENE_AZURE_OPENAI_SCOPE is required');
const tokenProvider = getBearerTokenProvider(credential, scope);

const extraAzureConfig = getAIConfigInJson(
MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON,
);
openai = new AzureOpenAI({
azureADTokenProvider: tokenProvider,
...extraConfig,
...extraAzureConfig,
});
} else {
openai = new OpenAI({
baseURL: getAIConfig(OPENAI_BASE_URL),
@@ -157,12 +189,18 @@ export async function callToGetJSONObject<T>(
let jsonContent = safeJsonParse(response.content);
if (jsonContent) return { content: jsonContent, usage: response.usage };

jsonContent = extractJSONFromCodeBlock(response.content);
const cleanJsonString = extractJSONFromCodeBlock(response.content);
try {
return { content: JSON.parse(jsonContent), usage: response.usage };
} catch {
throw Error(`failed to parse json response: ${response.content}`);
}
jsonContent = JSON.parse(cleanJsonString);
} catch {}
if (jsonContent) return { content: jsonContent, usage: response.usage };

try {
jsonContent = dJSON.parse(cleanJsonString);
} catch {}
if (jsonContent) return { content: jsonContent, usage: response.usage };

throw Error(`failed to parse json response: ${response.content}`);
}

export function extractJSONFromCodeBlock(response: string) {
1 change: 1 addition & 0 deletions packages/midscene/src/ai-model/openai/types.d.ts
@@ -0,0 +1 @@
declare module 'dirty-json';
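
The parsing change above tries strict `JSON.parse` first and only then falls back to `dirty-json`, which accepts relaxed syntax (for example, single-quoted strings) that strict JSON rejects. Below is a simplified, standalone illustration of that fallback chain, not the actual module code:

```typescript
import dJSON from 'dirty-json';

// Simplified sketch of the fallback chain used in the SDK change above.
function parseModelJSON(raw: string): unknown {
  // 1. Strip an optional markdown code fence around the payload.
  const match = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const text = match ? match[1] : raw;

  // 2. Strict parse first.
  try {
    return JSON.parse(text);
  } catch {}

  // 3. Tolerant parse for slightly malformed model output.
  try {
    return dJSON.parse(text);
  } catch {}

  throw new Error(`failed to parse json response: ${raw}`);
}
```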
10 changes: 2 additions & 8 deletions packages/midscene/src/ai-model/prompt/planning.ts
@@ -52,7 +52,7 @@ You are a versatile professional in software UI automation. Your outstanding con
- All the actions you composed MUST be based on the page context information you get.
- Trust the "What have been done" field about the task (if any), don't repeat actions in it.
- Respond only with valid JSON. Do not write an introduction or summary.
- Respond only with valid JSON. Do not write an introduction or summary or markdown prefix like \`\`\`json\`.
- If you cannot plan any action at all (i.e. empty actions array), set reason in the \`error\` field.
## About the \`actions\` field
@@ -140,7 +140,6 @@ By viewing the page screenshot and description, you should consider this and out
* The "English" option button is not shown in the screenshot now, it means it may only show after the previous actions are finished. So the last action will have a \`null\` value in the \`locate\` field.
* The task cannot be accomplished (because we cannot see the "English" option now), so a \`furtherPlan\` field is needed.
\`\`\`json
{
"actions":[
{
@@ -171,8 +170,6 @@ By viewing the page screenshot and description, you should consider this and out
"whatHaveDone": "Click the language switch button and wait 1s"
}
}
\`\`\`
## Example #2 : Tolerate the error situation only when the instruction is an "if" statement
@@ -181,7 +178,6 @@ If the user says "If there is a popup, close it", you should consider this and o
* By viewing the page screenshot and description, you cannot find the popup, so the condition is falsy.
* The instruction itself is an "if" statement, it means the user can tolerate this situation, so you should leave a \`FalsyConditionStatement\` action.
\`\`\`json
{
"actions": [{
"thought": "There is no popup on the page",
Expand All @@ -192,18 +188,15 @@ If the user says "If there is a popup, close it", you should consider this and o
"taskWillBeAccomplished": true,
"furtherPlan": null
}
\`\`\`
For contrast, if the user says "Close the popup" in this situation, you should consider this and output the JSON:
\`\`\`json
{
"actions": [],
"error": "The instruction and page context are irrelevant, there is no popup on the page",
"taskWillBeAccomplished": true,
"furtherPlan": null
}
\`\`\`
## Example #3 : When task is accomplished, don't plan more actions
@@ -224,6 +217,7 @@ When the user ask to "Wait 4s", you should consider this:
## Bad case #1 : Missing \`prompt\` in the 'Locate' field; Missing \`furtherPlan\` field when the task won't be accomplished
Wrong output:
{
"actions":[
{
17 changes: 16 additions & 1 deletion packages/midscene/src/env.ts
@@ -11,11 +11,19 @@ export const MIDSCENE_OPENAI_SOCKS_PROXY = 'MIDSCENE_OPENAI_SOCKS_PROXY';
export const OPENAI_API_KEY = 'OPENAI_API_KEY';
export const OPENAI_BASE_URL = 'OPENAI_BASE_URL';
export const MIDSCENE_MODEL_TEXT_ONLY = 'MIDSCENE_MODEL_TEXT_ONLY';
export const OPENAI_USE_AZURE = 'OPENAI_USE_AZURE';

export const MIDSCENE_CACHE = 'MIDSCENE_CACHE';
export const MATCH_BY_POSITION = 'MATCH_BY_POSITION';
export const MIDSCENE_REPORT_TAG_NAME = 'MIDSCENE_REPORT_TAG_NAME';

export const MIDSCENE_USE_AZURE_OPENAI = 'MIDSCENE_USE_AZURE_OPENAI';
export const MIDSCENE_AZURE_OPENAI_SCOPE = 'MIDSCENE_AZURE_OPENAI_SCOPE';
export const MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON =
'MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON';

// @deprecated
export const OPENAI_USE_AZURE = 'OPENAI_USE_AZURE';

const allConfigFromEnv = () => {
return {
[MIDSCENE_OPENAI_INIT_CONFIG_JSON]:
@@ -39,6 +47,13 @@ const allConfigFromEnv = () => {
process.env[MIDSCENE_REPORT_TAG_NAME] || undefined,
[MIDSCENE_OPENAI_SOCKS_PROXY]:
process.env[MIDSCENE_OPENAI_SOCKS_PROXY] || undefined,
[MIDSCENE_USE_AZURE_OPENAI]:
process.env[MIDSCENE_USE_AZURE_OPENAI] || undefined,
[MIDSCENE_AZURE_OPENAI_SCOPE]:
process.env[MIDSCENE_AZURE_OPENAI_SCOPE] ||
'https://cognitiveservices.azure.com/.default',
[MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON]:
process.env[MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON] || undefined,
};
};
