feat: update planning prompt

web-infra-dev · Dec 16, 2024 · 2346a80
1 parent b4debd5
commit 2346a80

Showing 3 changed files with 21 additions and 17 deletions.
diff --git a/packages/midscene/src/ai-model/prompt/planning.ts b/packages/midscene/src/ai-model/prompt/planning.ts
@@ -18,7 +18,7 @@ const quickAnswerFormat = () => {
 
   const sample = matchByPosition
     ? '{"position": { x: 100, y: 200 }}'
-    : '{"id": "14562"}';
+    : '{"id": "c81c4e9a33"}';
 
   return {
     description,
@@ -41,18 +41,18 @@ You are a versatile professional in software UI automation. Your outstanding con
 ## Workflow
 
 1. Receive the user's element description, screenshot, and instruction.
-2. Decompose the user's task into a sequence of actions, and place it in the \`actions\` field. There are different types of actions (Tap / Hover / Input / KeyboardPress / Scroll / Error / Sleep). Please refer to the "About the action" section below.
-3. Precisely locate the target element if it's already shown in the screenshot, put the location info in the \`locate\` field.
-4. Consider whether a task will be accomplished after all the actions
+2. Decompose the user's task into a sequence of actions, and place it in the \`actions\` field. There are different types of actions (Tap / Hover / Input / KeyboardPress / Scroll / Error / Sleep). The "About the action" section below will give you more details.
+3. Precisely locate the target element if it's already shown in the screenshot, put the location info in the \`locate\` field of the action.
+4. If some target elements are not shown in the screenshot, consider the user's instruction not feasible on this page. Follow the next steps.
+5. Consider whether the user's instruction will be accomplished after all the actions
  - If yes, set \`taskWillBeAccomplished\` to true
- - If no, don't plan more actions by closing the array. Get ready to reevaluate the task. Some talented people like you will handle this. Give them a clear description of what has been done and what to do next. Put your new plan in the \`furtherPlan\` field. Refer to the "How to compose the \`taskWillBeAccomplished\` and \`furtherPlan\` fields" section for more details.
+ - If no, don't plan more actions by closing the array. Get ready to reevaluate the task. Some talented people like you will handle this. Give them a clear description of what has been done and what to do next. Put your new plan in the \`furtherPlan\` field. The "How to compose the \`taskWillBeAccomplished\` and \`furtherPlan\` fields" section will give you more details.
 
 ## Constraints
 
 - All the actions you composed MUST be based on the page context information you get.
 - Trust the "What have been done" field about the task (if any), don't repeat actions in it.
-- Some elements may be shown after some actions are finished, consider this as a normal situation.
-- If you cannot plan any actions, consider the page content is irrelevant to the task. Put the error message in the \`error\` field.
+- If you cannot plan any actions at all, consider the page content is irrelevant to the task. Put the error message in the \`error\` field.
 
 ## About the \`actions\` field
 
@@ -61,9 +61,9 @@ You are a versatile professional in software UI automation. Your outstanding con
 The \`locate\` param is commonly used in the \`param\` field of the action, means to locate the target element to perform the action, it follows the following scheme:
 
 type LocateParam = {
-  "id": string, // the id of the element found. If its not on the page, locate should be null
+  "id": string, // the id of the element found. It should either be the id marked with a rectangle in the screenshot or the id described in the description.
   prompt?: string // the description of the element to find. It can only be omitted when locate is null.
-} | null
+} | null // If it's not on the page, the LocateParam should be null
 
 ### Supported actions
 
@@ -102,7 +102,7 @@ Please return the result in JSON format as follows:
       "type": "Tap",
       "param": null,
       "locate": {
-        {"id": "14562"},
+        {"id": "c81c4e9a33"},
         prompt: "the search bar"
       } | null,
     },
@@ -124,7 +124,7 @@ ${samplePageDescription}
 By viewing the page screenshot and description, you should consider this and output the JSON:
 
 * The main steps should be: tap the switch button, sleep, and tap the 'English' option 
-* The language switch button is shown in the screenshot. By checking the page screenshot, you can locate its ID by the coordinates and context information.
+* The language switch button is shown in the screenshot, but it's not marked with a rectangle. So we have to use the page description to find the element. By carefully checking the context information (coordinates, attributes, content, etc.), you can find the element.
 * The "English" option button is not shown in the screenshot now, it means it may only show after the previous actions are finished. So the last action will have a \`null\` value in the \`locate\` field. 
 * The task cannot be accomplished (because we cannot see the "English" option now), so a \`furtherPlan\` field is needed.
 

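The prompt change above asks the model to return an `id` that appears either on a rectangle label or in the page description, and `null` when the element is not on the page. A caller could check that contract before executing a plan. The following is a minimal sketch under stated assumptions: `PlannedAction`, `findUnknownLocateIds`, and the sample plan are hypothetical names for illustration, not part of the midscene codebase; only the `LocateParam` shape is taken from the diff.

```typescript
// Shape copied from the prompt in the diff: null means "not on the page".
type LocateParam = { id: string; prompt?: string } | null;

// Hypothetical action shape for this sketch; real params are omitted.
interface PlannedAction {
  type: string;
  param: unknown;
  locate: LocateParam;
}

// Returns every locate id in the plan that the page description does not contain.
// Actions with a null locate are skipped, since null is a valid answer.
function findUnknownLocateIds(
  actions: PlannedAction[],
  knownIds: Set<string>,
): string[] {
  return actions
    .map((action) => action.locate)
    .filter((locate): locate is NonNullable<LocateParam> => locate !== null)
    .map((locate) => locate.id)
    .filter((id) => !knownIds.has(id));
}

// Ids taken from the sample page description in this commit.
const knownIds = new Set(['c81c4e9a33', '5a29bf6419bd']);

// Mirrors the worked example: tap, sleep, then a tap whose target is not visible yet.
const samplePlan: PlannedAction[] = [
  { type: 'Tap', param: null, locate: { id: 'c81c4e9a33', prompt: 'the language switch button' } },
  { type: 'Sleep', param: null, locate: null },
  { type: 'Tap', param: null, locate: null },
];

const unknownIds = findUnknownLocateIds(samplePlan, knownIds);
```

With the old numeric sample id `14562`, the same check would flag the id as unknown, which is one reason to keep the prompt samples in the same hash-like format as real ids.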
diff --git a/packages/midscene/src/ai-model/prompt/util.ts b/packages/midscene/src/ai-model/prompt/util.ts
@@ -203,16 +203,17 @@ export function elementByPosition(
 
 export const samplePageDescription = `
 The size of the page: 1280 x 720
+Some of the elements are marked with a rectangle in the screenshot, some are not.
 
-JSON description of the elements in screenshot:
-id=1231: {
-  "markerId": 2, // The number indicated by the boxed label in the screenshot
+JSON description of all the elements in screenshot:
+id=c81c4e9a33: {
+  "markerId": 2, // The number indicated by the rectangle label in the screenshot
   "attributes":  // Attributes of the element
     {"data-id":"@submit s0","class":".gh-search","aria-label":"搜索","nodeType":"IMG", "src": "image_url"},
   "rect": { "left": 16, "top": 378, "width": 89, "height": 16 } // Position of the element in the page
 }
 
-id=459308: {
+id=5a29bf6419bd: {
   "content": "获取优惠券",
   "attributes": { "nodeType": "TEXT" },
   "rect": { "left": 32, "top": 332, "width": 70, "height": 18 }
@@ -244,7 +245,7 @@ export async function describeUserPage<
   const idElementMap: Record<string, ElementType> = {};
   elementsInfo.forEach((item) => {
     idElementMap[item.id] = item;
-    // sometimes GPT will mess up the indexId and id, we use indexId as a backup
+    // accept indexId/markerId as a backup
     if ((item as any).indexId) {
       idElementMap[(item as any).indexId] = item;
     }
@@ -267,12 +268,13 @@ export async function describeUserPage<
   return {
     description: `
 The size of the page: ${describeSize({ width, height })}
+Some of the elements are marked with a rectangle in the screenshot, some are not.
 
 ${
   // if match by id, use the description of the element
   getAIConfig(MATCH_BY_POSITION)
     ? ''
-    : `Json description of the page elements:\n${contentList}`
+    : `Json description of all the page elements:\n${contentList}`
 }
 `,
     elementById(id: string) {
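
The backup lookup in `describeUserPage` can be read in isolation as follows. This is a sketch, not the project's implementation: `ElementInfo` and `buildIdElementMap` are hypothetical names, and treating `indexId` as a number is an assumption, since the diff only shows it accessed through `any`.

```typescript
// Hypothetical element shape; only `id` and `indexId` mirror names used in the diff.
interface ElementInfo {
  id: string;
  indexId?: number; // assumed numeric label drawn on the screenshot
  content?: string;
}

// Builds a lookup keyed by element id, with indexId registered as a fallback key.
function buildIdElementMap(elements: ElementInfo[]): Record<string, ElementInfo> {
  const map: Record<string, ElementInfo> = {};
  for (const item of elements) {
    map[item.id] = item;
    // Same truthiness check as the diff: an indexId of 0 or undefined adds no fallback key.
    if (item.indexId) {
      map[String(item.indexId)] = item;
    }
  }
  return map;
}

const idElementMap = buildIdElementMap([
  { id: 'c81c4e9a33', indexId: 2 },
  { id: '5a29bf6419bd', content: '获取优惠券' },
]);
```

Because the fallback keys share one namespace with the real ids, hash-like ids such as `c81c4e9a33` are less likely to collide with a numeric label than the earlier numeric sample ids.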

diff --git a/packages/web-integration/tests/unit-test/web-extractor.test.ts b/packages/web-integration/tests/unit-test/web-extractor.test.ts
@@ -73,6 +73,8 @@ describe(
       const item2 = filterTargetElement(content2);
       expect(item2).toBeDefined();
       expect(item2?.id).toBe(item?.id);
+
+      await reset();
     });
 
     it('check screenshot size - 1x', async () => {