chore: merge main branch

web-infra-dev · Dec 30, 2024 · a028b3c
2 parents dfb15aa + 8042bcc
commit a028b3c

Showing 100 changed files with 3,587 additions and 9,516 deletions.
diff --git a/.github/ISSUE_TEMPLATE/llm-connectivity-issue---模型连接错误.md b/.github/ISSUE_TEMPLATE/llm-connectivity-issue---模型连接错误.md
@@ -0,0 +1,28 @@
+---
+name: LLM Connectivity Issue / 模型连接错误
+about: How to solve the LLM connectivity problem
+title: "[Connectivity]"
+labels: ''
+assignees: ''
+
+---
+
+## Read this before opening an issue
+
+How to choose and config a model: https://midscenejs.com/model-provider.html
+
+Use this project to check the connection: https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test
+
+## If the error persists, tell us the following information
+
+- Where you are using Midscene.js (Chrome extension, YAML with CLI, Puppeteer, …)
+
+- The version of Midscene.js or Extension
+
+- The error message
+
+- The model name and endpoint (if they can be made public)
+
+## Security Check
+
+Do NOT include your API key in your issue! Revoke it immediately if it has already been leaked in your issue.
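
Review note on the security check above: a minimal sketch of how a reporter might scrub an error message before pasting it into an issue. `redactApiKeys` is a hypothetical helper, not part of Midscene, and the `sk-` prefix pattern is an assumption taken from the placeholder keys used elsewhere in these docs. Keys from other providers may look different and would not be caught.

```typescript
// Hypothetical helper: mask anything that looks like an `sk-` style API key.
// The pattern is an assumption based on the `sk-...` placeholders in the docs.
function redactApiKeys(text: string): string {
  // `\b` keeps words such as "task-list" from being treated as a key.
  return text.replace(/\bsk-[A-Za-z0-9_-]+/g, 'sk-***');
}
```

Masking is only a convenience for sharing logs. A key that has already been posted should still be revoked, as the template says.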
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -22,7 +22,7 @@ jobs:
       with:
         ref: ${{ github.event.inputs.branch }}
     - name: Pushing to the protected branch 'protected'
-      uses: CasperWA/push-protected@v2
+      uses: zhoushaw/push-protected@v2
       with:
         token: ${{ secrets.PUSH_TO_PROTECTED_BRANCH }}
         branch: ${{ github.event.inputs.branch }}

diff --git a/.gitignore b/.gitignore
@@ -52,6 +52,7 @@ jspm_packages/
 
 # dotenv environment variables file
 .env
+.env.*
 
 # next.js build output
 .next

diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -4,5 +4,16 @@
   },
   "editor.defaultFormatter": "biomejs.biome",
   "editor.formatOnSave": true,
-  "cSpell.words": ["AITEST", "aweme", "douyin", "httpbin", "iconfont", "taobao"]
+  "cSpell.words": [
+    "AITEST",
+    "Aliyun",
+    "aweme",
+    "doubao",
+    "douyin",
+    "httpbin",
+    "iconfont",
+    "qwen",
+    "taobao",
+    "Volcengine"
+  ]
 }
diff --git a/README.ja.md b/README.ja.md
@@ -42,6 +42,7 @@ Midscene.jsは、自然言語を使用してページを制御し、アサーシ
 * [YAML形式の自動化スクリプトを使用する](https://midscenejs.com/automate-with-scripts-in-yaml.html)
 * [Puppeteerとの統合](https://midscenejs.com/integrate-with-puppeteer.html)
 * [Playwrightとの統合](https://midscenejs.com/integrate-with-playwright.html)
+* [モデルとサービスプロバイダーのカスタマイズ](https://midscenejs.com/model-provider.html)
 
 ## ライセンス
 

diff --git a/README.md b/README.md
@@ -46,6 +46,14 @@ Midscene.js is an AI-powered automation SDK can control the page, perform assert
 * [Automate with Scripts in YAML](https://midscenejs.com/automate-with-scripts-in-yaml.html)
 * [Integrate with Puppeteer](https://midscenejs.com/integrate-with-puppeteer.html)
 * [Integrate with Playwright](https://midscenejs.com/integrate-with-playwright.html)
+* [Customize Model and Provider](https://midscenejs.com/model-provider.html)
+
+## Community
+
+* [Lark Group](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=291q2b25-e913-411a-8c51-191e59aab14d)
+
+
+  <img src="https://github.com/user-attachments/assets/7c132fbf-37a7-4005-8fb1-59342efdf9b2" alt="lark group link" width="300" />
 
 ## License
 

diff --git a/README.zh.md b/README.zh.md
@@ -31,7 +31,7 @@ Midscene.js 是一个由 AI 驱动的自动化 SDK，能够使用自然语言对
 - **自然语言互动 👆**：只需描述你的步骤，Midscene 会为你规划和操作用户界面
 - **理解UI、JSON格式回答 🔍**：你可以提出关于数据格式的要求，然后得到 JSON 格式的预期回应。
 - **直观断言 🤔**：用自然语言表达你的断言，AI 会理解并处理。
-- **开箱即用的LLM 🪓**：使用公开的多模态大语言模型（ 如GPT-4o ），无需任何定制训练。
+- **开箱即用的LLM 🪓**：支持使用公开的多模态大语言模型（ 如 GPT-4o ），无需任何定制训练。
 - **可视化报告 🎞️**：通过我们的测试报告和 Playground，你可以轻松理解和调试整个过程。
 - **全新体验 🔥**：体验全新的自动化开发世界，尽情享受吧！
 
@@ -43,6 +43,13 @@ Midscene.js 是一个由 AI 驱动的自动化 SDK，能够使用自然语言对
 * [使用 YAML 格式的自动化脚本](https://midscenejs.com/zh/automate-with-scripts-in-yaml.html)
 * [集成到 Puppeteer](https://midscenejs.com/zh/integrate-with-puppeteer.html)
 * [集成到 Playwright](https://midscenejs.com/zh/integrate-with-playwright.html)
+* [自定义模型和服务商](https://midscenejs.com/zh/model-provider.html)
+
+## 社区
+
+* [飞书群](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=291q2b25-e913-411a-8c51-191e59aab14d)
+
+  <img src="https://github.com/user-attachments/assets/7c132fbf-37a7-4005-8fb1-59342efdf9b2" alt="lark group link" width="300" />
 
 
 ## 授权许可

diff --git a/apps/site/docs/en/automate-with-scripts-in-yaml.mdx b/apps/site/docs/en/automate-with-scripts-in-yaml.mdx
@@ -40,7 +40,7 @@ or you can use a `.env` file to store the configuration
 OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
 ```
 
-or you may [customize model provider](./model-provider.html)
+or you may [customize model and provider](./model-provider.html)
 
 ## Start
 

diff --git a/apps/site/docs/en/faq.md b/apps/site/docs/en/faq.md
@@ -2,9 +2,7 @@
 
 ## Can Midscene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'"
 
-Midscene is an automation assistance SDK with a key feature of action stability — ensuring the same actions are performed in each run. To maintain this stability, we encourage you to provide detailed instructions to help the AI understand each step of your task.
-
-If you require a 'goal-to-task' AI planning tool, you can develop one based on Midscene.
+No. Midscene is an automation assistance SDK with a key feature of action stability — ensuring the same actions are performed in each run. To maintain this stability, we encourage you to provide detailed instructions to help the AI understand each step of your task.
 
 Related Docs: [Prompting Tips](./prompting-tips.html)
 
@@ -16,11 +14,9 @@ There are some limitations with Midscene. We are still working on them.
 2. LLM is not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability.
 3. Since we use JavaScript to retrieve items from the page, the elements inside the iframe cannot be accessed.
 
-## Which LLM should I choose ?
-
-Midscene needs a multimodal Large Language Model (LLM) to understand the UI. Currently, we find that OpenAI's  GPT-4o performs much better than others.
+## Can I use a model other than `gpt-4o`?
 
-You can [customize model provider](./model-provider.html) if needed.
+Yes. You can [customize model and provider](./model-provider.html) if needed.
 
 ## About the token cost
 

diff --git a/apps/site/docs/en/index.mdx b/apps/site/docs/en/index.mdx
@@ -6,8 +6,6 @@ Introducing Midscene.js, an innovative SDK designed to bring joy back to automat
 
 Midscene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. You can simply describe the interaction steps or expected data formats, and the AI will handle the execution for you.
 
-Currently, the model we are using by default is the OpenAI GPT-4o model, while you can customize it to a different model if needed.
-
 <div style={{"width": "100%", "display": "flex", justifyContent: "center"}}>
   <iframe
     style={{"maxWidth": "100%", "width": "800px", "height": "450px"}}
@@ -67,3 +65,11 @@ Midscene will provide a visual report after each run. With this report, you can
 ## Just you and model provider, no third-party services
 
 ⁠Midscene.js is an open-source project (GitHub: [Midscene](https://github.com/web-infra-dev/midscene/)) under the MIT license. You can run it in your own environment. All data gathered from pages will be sent directly to OpenAI or the custom model provider according to your configuration. Therefore, only you and the model provider will have access to the data. No third-party platform will access the data.
+
+## Customize Model
+
+The default model is currently OpenAI GPT-4o, but you can [switch to a different multimodal model](./model-provider.html) if needed.
+
+## Start with Chrome Extension
+
+To quickly experience the main features of Midscene, you can use the [Chrome Extension](./quick-experience.html). It allows you to use Midscene on any webpage without writing any code.
diff --git a/apps/site/docs/en/integrate-with-playwright.mdx b/apps/site/docs/en/integrate-with-playwright.mdx
@@ -12,7 +12,7 @@ you can check the demo project of Playwright here: [https://github.com/web-infra
 
 ## Preparation
 
-Config the OpenAI API key, or [customize model provider](./model-provider.html)
+Configure the OpenAI API key, or [customize model and provider](./model-provider.html)
 
 ```bash
 # replace with your own

diff --git a/apps/site/docs/en/integrate-with-puppeteer.mdx b/apps/site/docs/en/integrate-with-puppeteer.mdx
@@ -7,11 +7,13 @@ import { PackageManagerTabs } from '@theme';
 
 :::info Demo Project
 you can check the demo project of Puppeteer here: [https://github.com/web-infra-dev/midscene-example/blob/main/puppeteer-demo](https://github.com/web-infra-dev/midscene-example/blob/main/puppeteer-demo)
+
+There is also a demo of Puppeteer with Vitest: [https://github.com/web-infra-dev/midscene-example/tree/main/puppeteer-with-vitest-demo](https://github.com/web-infra-dev/midscene-example/tree/main/puppeteer-with-vitest-demo)
 :::
 
 ## Preparation
 
-Config the OpenAI API key, or [customize model provider](./model-provider.html)
+Configure the OpenAI API key, or [customize model and provider](./model-provider.html)
 
 ```bash
 # replace with your own

diff --git a/apps/site/docs/en/model-provider.md b/apps/site/docs/en/model-provider.md
@@ -1,8 +1,8 @@
-# Customize Model Provider
+# Customize Model and Provider
 
-Midscene uses the OpenAI SDK as the default AI service. You can customize the configuration using environment variables.
+Midscene uses the OpenAI SDK to call AI services. You can customize the configuration using environment variables. All the configs can also be used in the [Chrome Extension](./quick-experience.html).
 
-There are the main configs, in which `OPENAI_API_KEY` is required.
+These are the main configs, in which `OPENAI_API_KEY` is required.
 
 Required:
 
@@ -21,11 +21,83 @@ export OPENAI_BASE_URL="https://..."
 export OPENAI_USE_AZURE="true"
 
 # if you want to specify a model name other than gpt-4o
-export MIDSCENE_MODEL_NAME='claude-3-opus-20240229';
+export MIDSCENE_MODEL_NAME='qwen-vl-max-latest';
 
 # if you want to pass customized JSON data to the `init` process of OpenAI SDK
 export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"key": "value"}}'
 
 # if you want to use proxy. Midscene uses `socks-proxy-agent` under the hood.
 export MIDSCENE_OPENAI_SOCKS_PROXY="socks5://127.0.0.1:1080"
+
+# if you want to specify the max tokens for the model
+export OPENAI_MAX_TOKENS=2048
+```
+
+## Using Azure OpenAI Service
+
+```bash
+export MIDSCENE_USE_AZURE_OPENAI=1
+export MIDSCENE_AZURE_OPENAI_SCOPE="https://cognitiveservices.azure.com/.default"
+export MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON='{"apiVersion": "2024-11-01-preview", "endpoint": "...", "deployment": "..."}'
+```
+
+## Choose a model other than `gpt-4o`
+
+We find that `gpt-4o` currently performs best for Midscene. Other models known to work are `claude-3-opus-20240229`, `gemini-1.5-pro`, `qwen-vl-max-latest`, and `doubao-vision-pro-32k`.
+
+If you want to use other models, please follow these steps:
+
+1. Choose a model that supports image input (a.k.a. multimodal model).
+2. Find out how to call it through an OpenAI SDK compatible endpoint. Usually you need to set `OPENAI_BASE_URL`, `OPENAI_API_KEY`, and `MIDSCENE_MODEL_NAME`.
+3. If it does not work well after you change the model, try shorter and clearer prompts (or roll back to the previous model). See more details in [Prompting Tips](./prompting-tips.html).
+4. Remember to follow the terms of use of each model.
+
+## Example: Using `claude-3-opus-20240229` from Anthropic
+
+When `MIDSCENE_USE_ANTHROPIC_SDK=1` is set, Midscene uses the Anthropic SDK (`@anthropic-ai/sdk`) to call the model.
+
+Configure the environment variables:
+
+```bash
+export MIDSCENE_USE_ANTHROPIC_SDK=1
+export ANTHROPIC_API_KEY="....."
+export MIDSCENE_MODEL_NAME="claude-3-opus-20240229"
+```
+
+## Example: Using `gemini-1.5-pro` from Google
+
+Configure the environment variables:
+
+```bash
+export OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai"
+export OPENAI_API_KEY="....."
+export MIDSCENE_MODEL_NAME="gemini-1.5-pro"
 ```
+
+## Example: Using `qwen-vl-max-latest` from Aliyun
+
+Configure the environment variables:
+
+```bash
+export OPENAI_API_KEY="sk-..."
+export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
+export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
+```
+
+## Example: Using `doubao-vision-pro-32k` from Volcengine
+
+Create an inference endpoint first: https://console.volcengine.com/ark/region:ark+cn-beijing/endpoint
+
+Configure the environment variables:
+
+```bash
+export OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
+export OPENAI_API_KEY="..."
+export MIDSCENE_MODEL_NAME="ep-202....."
+```
+
+## Troubleshooting LLM Service Connectivity Issues
+
+If you want to troubleshoot connectivity issues, you can use the 'connectivity-test' folder in our example project: [https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test](https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test)
+
+Put your `.env` file in the `connectivity-test` folder, and run the test with `npm i && npm run test`.
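
Review note on the model-provider docs above: the examples all follow the same pattern of picking an SDK, an API key, an optional base URL, and a model name from environment variables. The sketch below illustrates that pattern only. `ModelConfig` and `resolveModelConfig` are hypothetical names, not Midscene's internal API, and the validation rules (for example, requiring `ANTHROPIC_API_KEY` when the Anthropic SDK is selected) are assumptions inferred from the examples.

```typescript
// Illustrative shape only; this is not Midscene's internal config type.
interface ModelConfig {
  sdk: 'openai' | 'anthropic';
  apiKey: string;
  baseURL: string | undefined;
  model: string;
}

type Env = Record<string, string | undefined>;

// Hypothetical resolver mirroring the documented variables.
function resolveModelConfig(env: Env): ModelConfig {
  // The docs describe `gpt-4o` as the default model name.
  // Simplification: the same default is applied even on the Anthropic path.
  const model = env.MIDSCENE_MODEL_NAME || 'gpt-4o';

  if (env.MIDSCENE_USE_ANTHROPIC_SDK === '1') {
    const anthropicKey = env.ANTHROPIC_API_KEY;
    if (!anthropicKey) {
      // Assumption: the key is mandatory on this path.
      throw new Error('ANTHROPIC_API_KEY is required when MIDSCENE_USE_ANTHROPIC_SDK=1');
    }
    return { sdk: 'anthropic', apiKey: anthropicKey, baseURL: undefined, model };
  }

  const openaiKey = env.OPENAI_API_KEY;
  if (!openaiKey) {
    throw new Error('OPENAI_API_KEY is required');
  }
  // OPENAI_BASE_URL is optional; it points at an OpenAI-compatible endpoint.
  return { sdk: 'openai', apiKey: openaiKey, baseURL: env.OPENAI_BASE_URL, model };
}
```

In a Node.js script this could be called as `resolveModelConfig(process.env)`. Taking the environment as a parameter keeps the function easy to exercise with the example values from each provider section.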
diff --git a/apps/site/docs/en/quick-experience.mdx b/apps/site/docs/en/quick-experience.mdx
@@ -1,4 +1,4 @@
-# Experience By Chrome Extension
+# Quick Experience by Chrome Extension
 
 Midscene.js provides a Chrome extension. By using it, you can quickly experience the main features of Midscene on any webpage, without needing to set up a code project.
 
@@ -20,7 +20,7 @@ Start the extension (may be folded by default), setup the config by pasting the
 OPENAI_API_KEY="sk-replace-by-your-own"
 ```
 
-You can also paste the configuration as described in [customize model provider](./model-provider.html) here.
+You can also paste the configuration as described in [customize model and provider](./model-provider.html) here.
 
 ## Start experiencing
 
@@ -43,4 +43,10 @@ After experiencing, you may want to write some code to integrate Midscene. There
 
 * Extension fails to run and shows 'Cannot access a chrome-extension:// URL of different extension'
 
-Make sure you are using the Midscene extension on a normal http(s):// page. If the error persists, it's mainly due to conflicts with other extensions injecting `<iframes />` into the page. Try disabling the suspicious plugins and refresh.
+This is mainly caused by conflicts with other extensions that inject `<iframe />` or `<script />` into the page. Try disabling the suspicious plugins and refreshing the page.
+
+To find the suspicious plugins:
+
+1. Open the DevTools of the page and find the `<script>` or `<iframe>` with a URL like `chrome-extension://{ID-of-the-suspicious-plugin}/...`.
+2. Copy the ID from the URL, open chrome://extensions/, find the plugin with the same ID, and disable it.
+3. Refresh the page, try again.
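
Review note on the troubleshooting steps above: step 2 amounts to pulling the ID out of a `chrome-extension://` URL. A minimal sketch of that extraction follows. `extractExtensionId` is a hypothetical helper name, and the pattern simply takes whatever sits between the scheme and the next `/`, `?`, or `#` without validating the ID format.

```typescript
// Hypothetical helper: return the ID part of a chrome-extension:// URL,
// or null when the URL uses a different scheme.
function extractExtensionId(url: string): string | null {
  const match = /^chrome-extension:\/\/([^/?#]+)/.exec(url.trim());
  return match?.[1] ?? null;
}
```

The returned ID is the value to look for on the chrome://extensions/ page.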
diff --git a/apps/site/docs/zh/automate-with-scripts-in-yaml.mdx b/apps/site/docs/zh/automate-with-scripts-in-yaml.mdx
@@ -40,7 +40,7 @@ export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
 OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
 ```
 
-或 [自定义模型服务](./model-provider.html)
+或 [自定义模型和服务商](./model-provider.html)
 
 ## 开始
 

diff --git a/apps/site/docs/zh/faq.md b/apps/site/docs/zh/faq.md
@@ -16,11 +16,9 @@ Midscene 存在一些局限性，我们仍在努力改进。
 2. 稳定性风险：即使是 GPT-4o 也无法确保 100% 返回正确答案。遵循 [编写提示词的技巧](./prompting-tips) 可以帮助提高 SDK 稳定性。
 3. 元素访问受限：由于我们使用 JavaScript 从页面提取元素，所以无法访问 iframe 内部的元素。
 
-## 选用那个 LLM 模型？
+## 能否选用 `gpt-4o` 以外的其他模型？
 
-Midscene 需要一个能够理解用户界面的多模态大型语言模型。目前，我们发现 OpenAI 的 GPT-4o 表现最好，远超其它模型。
-
-你可以根据需要[自定义模型服务](./model-provider.html)。
+可以。你可以[自定义模型和服务商](./model-provider.html)。
 
 ## 关于 token 成本
 

diff --git a/apps/site/docs/zh/index.mdx b/apps/site/docs/zh/index.mdx
@@ -1,13 +1,11 @@
 # Midscene.js - AI 加持，带来愉悦的 UI 自动化体验
 
-UI 自动化太难维护了。UI 自动化脚本里往往到处都是选择器，比如 `#ids`、`data-test`、`.selectors`。在需要重构的时候，这可能会让人感到非常头疼，尽管在这种情况下，理论上UI自动化应该能够发挥作用。
+UI 自动化太难维护了。UI 自动化脚本里往往到处都是选择器，比如 `#ids`、`data-test`、`.selectors`。在需要重构的时候，这可能会让人感到非常头疼，尽管在这种情况下，UI 自动化应该能够发挥作用。
 
 我们在这里推出 Midscene.js，助你重拾编码的乐趣。
 
 Midscene.js 采用了多模态大语言模型（LLM），能够直观地“理解”你的用户界面并执行必要的操作。你只需描述交互步骤或期望的数据格式，AI 就能为你完成任务。
 
-目前我们默认选择的是 OpenAI GPT-4o 作为模型，你也可以自定义为其他模型。
-
 <video src="/introduction/Midscene.mp4" controls/>
 
 ## 通过 AI 执行交互、提取数据和断言
@@ -22,7 +20,7 @@ Midscene.js 采用了多模态大语言模型（LLM），能够直观地“理
 
 ```typescript
 // 👀 输入关键字，执行搜索
-// 注：尽管这是一个英文页面，你也可以用中文指令控制它
+// 尽管这是一个英文页面，你也可以用中文指令控制它
 await ai('在搜索框输入 "Headphones" ，敲回车');
 
 // 👀 找到列表里耳机相关的信息
@@ -55,4 +53,12 @@ console.log("headphones in stock", items);
 
 ## 直连模型端，无需三方服务
 
-Midscene.js 是一个采用 MIT 许可证的开源项目 (GitHub: [Midscene](https://github.com/web-infra-dev/midscene/)) 。项目代码运行在用户的自有环境中，所有从页面收集的数据会依照用户的配置，直接传送到 OpenAI 或指定的自定义模型。因此，数据仅用户和指定的模型服务商可访问，任何第三方平台均无法获取这些数据。
+Midscene.js 是一个采用 MIT 许可证的开源项目 (GitHub: [Midscene](https://github.com/web-infra-dev/midscene/)) 。项目代码运行在用户的自有环境中，所有从页面收集的数据会依照用户的配置，直接传送到 OpenAI 或指定的自定义模型。因此，数据仅用户和指定的模型服务商可访问，任何第三方平台均无法获取这些数据。
+
+## 自定义模型
+
+目前我们默认选择的是 OpenAI GPT-4o 作为模型，你也可以[自定义为其他多模态模型](./model-provider.html)。
+
+## 从 Chrome 插件开始快速体验
+
+通过使用 Midscene.js Chrome 插件，你可以快速在任意网页上体验 Midscene 的主要功能，而无需编写任何代码。请参照文档 [通过 Chrome 插件快速体验](./quick-experience.html) 进行安装和配置。
diff --git a/apps/site/docs/zh/integrate-with-playwright.mdx b/apps/site/docs/zh/integrate-with-playwright.mdx
@@ -13,7 +13,7 @@ import { PackageManagerTabs } from '@theme';
 
 ## 准备工作
 
-配置 OpenAI API Key，或 [自定义模型服务](./model-provider.html)
+配置 OpenAI API Key，或 [自定义模型和服务商](./model-provider.html)
 
 ```bash
 # 更新为你自己的 Key

diff --git a/apps/site/docs/zh/integrate-with-puppeteer.mdx b/apps/site/docs/zh/integrate-with-puppeteer.mdx
@@ -7,11 +7,13 @@ import { PackageManagerTabs } from '@theme';
 
 :::info 样例项目
 你可以在这里看到向 Puppeteer 集成的样例项目：[https://github.com/web-infra-dev/midscene-example/blob/main/puppeteer-demo](https://github.com/web-infra-dev/midscene-example/blob/main/puppeteer-demo)
+
+这里还有一个 Puppeteer 和 Vitest 结合的样例项目：[https://github.com/web-infra-dev/midscene-example/tree/main/puppeteer-with-vitest-demo](https://github.com/web-infra-dev/midscene-example/tree/main/puppeteer-with-vitest-demo)
 :::
 
 ## 准备工作
 
-配置 OpenAI API Key，或 [自定义模型服务](./model-provider.html)
+配置 OpenAI API Key，或 [自定义模型和服务商](./model-provider.html)
 
 ```bash
 # 更新为你自己的 Key