diff --git a/README.md b/README.md index da3fec700..c9dbd223c 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,8 @@ See also: - https://github.com/OpenAdaptAI/pynput - https://github.com/OpenAdaptAI/atomacos -# OpenAdapt: AI-First Process Automation with Large Multimodal Models (LMMs). +# OpenAdapt: Open Source Generative Process Automation. +## AI-First Process Automation with Large Multimodal Models (LMMs). **OpenAdapt** is the **open** source software **adapt**er between Large Multimodal Models (LMMs) and traditional desktop and web Graphical User Interfaces (GUIs). @@ -35,9 +36,8 @@ with the power of Large Multimodal Modals (LMMs) by: - Recording screenshots and associated user input - Aggregating and visualizing user input and recordings for development - Converting screenshots and user input into tokenized format -- Generating synthetic input via transformer model completions -- Generating task trees by analyzing recordings (work-in-progress) -- Replaying synthetic input to complete tasks (work-in-progress) +- Generating and replaying synthetic input via transformer model completions +- Generating process graphs by analyzing recording logs (work-in-progress) The goal is similar to that of [Robotic Process Automation](https://en.wikipedia.org/wiki/Robotic_process_automation), @@ -165,37 +165,6 @@ pointing the cursor and left or right clicking, as described in this [open issue](https://github.com/OpenAdaptAI/OpenAdapt/issues/145) -### Capturing Browser Events - -To capture (record) browser events in Chrome, follow these steps: - -1. Go to: [Chrome Extension Page](chrome://extensions/) - -2. Enable `Developer mode` (located at the top right): - -![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/c97eb9fb-05d6-465d-85b3-332694556272) - -3. Click `Load unpacked` (located at the top left). - -![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/00c8adf5-074a-4655-b132-fd87644007fc) - -4. 
Select the `chrome_extension` directory: - -![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/71610ed3-f8d4-431a-9a22-d901127b7b0c) - -5. You should see the following confirmation, indicating that the extension is loaded: - -![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/7ee19da9-37e0-448f-b9ab-08ef99110e85) - -6. Set the flag to `true` if it is currently `false`: - -![image](https://github.com/user-attachments/assets/8eba24a3-7c68-4deb-8fbe-9d03cece1482) - -7. Start recording. Once recording begins, navigate to the Chrome browser, browse some pages, and perform a few clicks. Then, stop the recording and let it complete successfully. - -8. After recording, check the `openadapt.db` table `browser_event`. It should contain all your browser activity logs. You can verify the data's correctness using the `sqlite3` CLI or an extension like `SQLite Viewer` in VS Code to open `data/openadapt.db`. - - ### Visualize Quickly visualize the latest recording you created by running the following command: @@ -243,6 +212,7 @@ Other replay strategies include: - [`StatefulReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/stateful.py): Early proof-of-concept which uses the OpenAI GPT-4 API with prompts constructed via OS-level window data. - (*)[`VisualReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py): Uses [Fast Segment Anything Model (FastSAM)](https://github.com/CASIA-IVA-Lab/FastSAM) to segment active window. - (*)[`VanillaReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/vanilla.py): Assumes the model is capable of directly reasoning on states and actions accurately. With future frontier models, we hope that this script will suddenly work a lot better. 
+- (*)[`BrowserReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/browser.py): Uses the browser extension to read the visible DOM, and refers to recorded browser events to identify target elements. The (*) prefix indicates strategies which accept an "instructions" parameter that is used to modify the recording, e.g.: @@ -253,6 +223,22 @@ python -m openadapt.replay VanillaReplayStrategy --instructions "calculate 9-8" See https://github.com/OpenAdaptAI/OpenAdapt/tree/main/openadapt/strategies for a complete list. More ReplayStrategies coming soon! (see [Contributing](#Contributing)). +### Browser integration + +To record browser events in Google Chrome (required by the `BrowserReplayStrategy`), follow these steps: + +1. Go to your Chrome extensions page by entering [chrome://extensions](chrome://extensions/) in your address bar. + +2. Enable `Developer mode` (located at the top right). + +3. Click `Load unpacked` (located at the top left). + +4. Select the `chrome_extension` directory in the OpenAdapt repo. + +5. Make sure the Chrome extension is enabled (the switch to the right of the OpenAdapt extension widget is turned on). + +6. Set the `RECORD_BROWSER_EVENTS` flag to `true` in `openadapt/data/config.json`. + ## Features ### State-of-the-art GUI understanding via [Segment Anything in High Quality](https://github.com/SysCV/sam-hq): @@ -306,13 +292,6 @@ We're looking forward to your contributions. 
Let's build the future 🚀 ## Contributing -### Notable Works-in-progress (incomplete, see https://github.com/OpenAdaptAI/OpenAdapt/pulls and https://github.com/OpenAdaptAI/OpenAdapt/issues/ for more) - -- [Video Recording Hardware Acceleration](https://github.com/OpenAdaptAI/OpenAdapt/issues/570) (help wanted) -- [Audio Narration](https://github.com/OpenAdaptAI/OpenAdapt/pull/346) (help wanted) -- [Chrome Extension](https://github.com/OpenAdaptAI/OpenAdapt/pull/364) (help wanted) -- [Gemini Vision](https://github.com/OpenAdaptAI/OpenAdapt/issues/551) (help wanted) - ### Replay Problem Statement Our goal is to automate the task described and demonstrated in a `Recording`. diff --git a/chrome_extension/background.js b/chrome_extension/background.js index a747b8669..24e6203fb 100644 --- a/chrome_extension/background.js +++ b/chrome_extension/background.js @@ -1,33 +1,28 @@ /** * @file background.js - * @description Creates a new background script that listens for messages from the content script - * and sends them to a WebSocket server. -*/ + * @description Background script that maintains the current mode and communicates with content scripts. + */ let socket; +let currentMode = null; // Maintain the current mode here let timeOffset = 0; // Global variable to store the time offset -/* - * TODO: - * Ideally we read `WS_SERVER_PORT`, `WS_SERVER_ADDRESS` and - * `RECONNECT_TIMEOUT_INTERVAL` from config.py, - * or it gets passed in somehow. -*/ +/* + * Note: these need to match the corresponding values in config[.defaults].json + */ let RECONNECT_TIMEOUT_INTERVAL = 1000; // ms let WS_SERVER_PORT = 8765; let WS_SERVER_ADDRESS = "localhost"; let WS_SERVER_URL = "ws://" + WS_SERVER_ADDRESS + ":" + WS_SERVER_PORT; - function socketSend(socket, message) { console.log({ message }); socket.send(JSON.stringify(message)); } - /* * Function to connect to the WebSocket server. 
-*/ + */ function connectWebSocket() { socket = new WebSocket(WS_SERVER_URL); @@ -38,11 +33,34 @@ function connectWebSocket() { socket.onmessage = function(event) { console.log("Message from server:", event.data); const message = JSON.parse(event.data); + + // Handle mode messages + if (message.type === 'SET_MODE') { + currentMode = message.mode; // Update the current mode + console.log(`Mode set to: ${currentMode}`); + + // Send the mode to all active tabs + chrome.tabs.query( + { + active: true, + }, + function(tabs) { + tabs.forEach(function(tab) { + chrome.tabs.sendMessage(tab.id, message, function(response) { + if (chrome.runtime.lastError) { + console.error("Error sending message to content script in tab " + tab.id, chrome.runtime.lastError.message); + } else { + console.log("Message sent to content script in tab " + tab.id, response); + } + }); + }); + } + ); + } }; socket.onclose = function(event) { console.log("WebSocket connection closed", event); - // Reconnect after 5 seconds if the connection is lost setTimeout(connectWebSocket, RECONNECT_TIMEOUT_INTERVAL); }; @@ -66,3 +84,32 @@ chrome.runtime.onMessage.addListener((message, sender, sendResponse) => { sendResponse({ status: "WebSocket connection not open" }); } }); + +/* Listen for tab activation */ +chrome.tabs.onActivated.addListener((activeInfo) => { + // Send current mode to the newly active tab if it's not null + if (currentMode) { + const message = { type: 'SET_MODE', mode: currentMode }; + chrome.tabs.sendMessage(activeInfo.tabId, message, function(response) { + if (chrome.runtime.lastError) { + console.error("Error sending message to content script in tab " + activeInfo.tabId, chrome.runtime.lastError.message); + } else { + console.log("Message sent to content script in tab " + activeInfo.tabId, response); + } + }); + } +}); + +/* Listen for tab updates to handle new pages or reloading */ +chrome.tabs.onUpdated.addListener((tabId, changeInfo, tab) => { + if (changeInfo.status === 'complete' && 
currentMode) { + const message = { type: 'SET_MODE', mode: currentMode }; + chrome.tabs.sendMessage(tabId, message, function(response) { + if (chrome.runtime.lastError) { + console.error("Error sending message to content script in tab " + tabId, chrome.runtime.lastError.message); + } else { + console.log("Message sent to content script in tab " + tabId, response); + } + }); + } +}); diff --git a/chrome_extension/content.js b/chrome_extension/content.js index a08daabb8..f9cb163cf 100644 --- a/chrome_extension/content.js +++ b/chrome_extension/content.js @@ -1,4 +1,116 @@ const DEBUG = true; + +if (!DEBUG) { + console.debug = function() {}; +} + +let currentMode = "idle"; // Default mode is 'idle' +let recordListenersAttached = false; // Track if record listeners are currently attached +let replayObserversAttached = false; // Track if replay observers are currently attached + +// Listen for messages from the background script or Python process +chrome.runtime.onMessage.addListener((message, sender, sendResponse) => { + console.log("Received message:", message); + if (message.type === 'SET_MODE') { + currentMode = message.mode; + console.log(`Mode set to: ${currentMode}`); + + // Attach or detach listeners based on mode + if (currentMode === 'record') { + if (!recordListenersAttached) attachRecordListeners(); + if (replayObserversAttached) disconnectReplayObservers(); // Detach replay observers if needed + } else if (currentMode === 'replay') { + debounceSendVisibleHTML('setmode'); + if (!replayObserversAttached) attachReplayObservers(); + if (recordListenersAttached) detachRecordListeners(); // Detach record listeners if needed + } else if (currentMode === 'idle') { + if (recordListenersAttached) detachRecordListeners(); + if (replayObserversAttached) disconnectReplayObservers(); + } + } +}); + +// Attach event listeners for recording mode +function attachRecordListeners() { + if (!recordListenersAttached) { + attachUserEventListeners(); + 
attachInstrumentationEventListeners(); + recordListenersAttached = true; + } +} + +// Attach user-generated event listeners +function attachUserEventListeners() { + console.log("attachUserEventListeners()"); + const eventsToCapture = ['click', 'keydown', 'keyup']; + + eventsToCapture.forEach(eventType => { + document.body.addEventListener(eventType, handleUserEvent, true); + }); +} + +// Attach instrumentation event listeners +function attachInstrumentationEventListeners() { + console.log("attachInstrumentationEventListeners()"); + const eventsToCapture = ['mousedown', 'mouseup', 'mousemove']; + + eventsToCapture.forEach(eventType => { + document.body.addEventListener(eventType, trackMouseEvent, true); + }); +} + +// Detach all event listeners for recording mode +function detachRecordListeners() { + const eventsToCapture = [ + 'click', 'keydown', 'keyup', 'mousedown', 'mouseup', 'mousemove' + ]; + + eventsToCapture.forEach(eventType => { + document.body.removeEventListener(eventType, handleUserEvent, true); + document.body.removeEventListener(eventType, trackMouseEvent, true); + }); + + recordListenersAttached = false; +} + +// Attach observers for replay mode +function attachReplayObservers() { + if (!replayObserversAttached) { + setupIntersectionObserver(); + setupMutationObserver(); + setupScrollAndResizeListeners(); + replayObserversAttached = true; + } +} + +// Disconnect observers for replay mode +function disconnectReplayObservers() { + if (window.intersectionObserverInstance) { + window.intersectionObserverInstance.disconnect(); + } + if (window.mutationObserverInstance) { + window.mutationObserverInstance.disconnect(); + } + window.removeEventListener('scroll', handleScrollEvent, { passive: true }); + window.removeEventListener('resize', handleResizeEvent, { passive: true }); + + replayObserversAttached = false; +} + +// Handle scroll events +function handleScrollEvent(event) { + debounceSendVisibleHTML(event.type); +} + +// Handle resize events +function 
handleResizeEvent(event) { + debounceSendVisibleHTML(event.type); +} + +/* + * Record + */ + const RETURN_FULL_DOCUMENT = false; const MAX_COORDS = 3; const SET_SCREEN_COORDS = false; @@ -123,19 +235,25 @@ function sendMessageToBackgroundScript(message) { } function generateElementIdAndBbox(element) { + console.debug(`[generateElementIdAndBbox] Processing element: ${element.tagName}`); + // ignore invisible elements if (!isVisible(element)) { + console.debug(`[generateElementIdAndBbox] Element is not visible: ${element.tagName}`); return; } // set id if (!elementIdMap.has(element)) { const newId = `elem-${elementIdCounter++}`; + console.debug(`[generateElementIdAndBbox] Generated new ID: ${newId} for element: ${element.tagName}`); elementIdMap.set(element, newId); idToElementMap.set(newId, element); // Reverse mapping element.setAttribute('data-id', newId); } + // TODO: store bounding boxes in a map instead of in DOM attributes + // set client bbox let { top, left, bottom, right } = element.getBoundingClientRect(); let bboxClient = `${top},${left},${bottom},${right}`; @@ -143,6 +261,7 @@ function generateElementIdAndBbox(element) { // set screen bbox if (SET_SCREEN_COORDS) { + // XXX TODO: support in replay mode, or remove altogether ({ top, left, bottom, right } = getScreenCoordinates(element)); if (top == null) { // not enough data points to get screen coordinates @@ -214,17 +333,17 @@ function cleanDomTree(node) { } } -function getVisibleHtmlString() { +function getVisibleHTMLString() { const startTime = performance.now(); // Step 1: Instrument the live DOM with data-id and data-bbox attributes instrumentLiveDomWithBbox(); if (RETURN_FULL_DOCUMENT) { - const visibleHtmlDuration = performance.now() - startTime; - console.log({ visibleHtmlDuration }); - const visibleHtmlString = document.body.outerHTML; - return { visibleHtmlString, visibleHtmlDuration }; + const visibleHTMLDuration = performance.now() - startTime; + console.log({ visibleHTMLDuration }); + const 
visibleHTMLString = document.body.outerHTML; + return { visibleHTMLString, visibleHTMLDuration }; } // Step 2: Clone the body @@ -234,12 +353,12 @@ function getVisibleHtmlString() { cleanDomTree(clonedBody); // Step 4: Serialize the modified clone to a string - const visibleHtmlString = clonedBody.outerHTML; + const visibleHTMLString = clonedBody.outerHTML; - const visibleHtmlDuration = performance.now() - startTime; - console.log({ visibleHtmlDuration }); + const visibleHTMLDuration = performance.now() - startTime; + console.debug({ visibleHTMLDuration }); - return { visibleHtmlString, visibleHtmlDuration }; + return { visibleHTMLString, visibleHTMLDuration }; } /** @@ -277,20 +396,20 @@ function validateCoordinates(event, eventTarget, attrType, coordX, coordY) { } } -function handleUserGeneratedEvent(event) { +function handleUserEvent(event) { const eventTarget = event.target; const eventTargetId = generateElementIdAndBbox(eventTarget); const timestamp = Date.now() / 1000; // Convert to Python-compatible seconds - const { visibleHtmlString, visibleHtmlDuration } = getVisibleHtmlString(); + const { visibleHTMLString, visibleHTMLDuration } = getVisibleHTMLString(); const eventData = { type: 'USER_EVENT', eventType: event.type, targetId: eventTargetId, timestamp: timestamp, - visibleHtmlString, - visibleHtmlDuration, + visibleHTMLString, + visibleHTMLDuration, }; if (event instanceof KeyboardEvent) { @@ -324,7 +443,7 @@ function attachUserEventListeners() { ]; eventsToCapture.forEach(eventType => { - document.body.addEventListener(eventType, handleUserGeneratedEvent, true); + document.body.addEventListener(eventType, handleUserEvent, true); }); } @@ -339,6 +458,118 @@ function attachInstrumentationEventListeners() { }); } -// Initial setup -attachUserEventListeners(); -attachInstrumentationEventListeners(); +/* + * Replay + */ + +let debounceTimeoutId = null; // Timeout ID for debouncing +const DEBOUNCE_DELAY = 10; + +function setupIntersectionObserver() { + const 
observer = new IntersectionObserver(handleIntersection, { + root: null, // Use the viewport as the root + threshold: 0 // Consider an element visible if any part of it is in view + }); + + document.querySelectorAll('*').forEach(element => observer.observe(element)); +} + +function handleIntersection(entries) { + let shouldSendUpdate = false; + entries.forEach(entry => { + if (entry.isIntersecting) { + shouldSendUpdate = true; + } + }); + if (shouldSendUpdate) { + debounceSendVisibleHTML('intersection'); + } +} + +function setupMutationObserver() { + const observer = new MutationObserver(handleMutations); + observer.observe(document.body, { + childList: true, + // XXX this results in continuous DOM_EVENT messages on some websites (e.g. ChatGPT) + subtree: true, + attributes: true + }); +} + +function handleMutations(mutationsList) { + const startTime = performance.now(); // Capture start time for the instrumentation + console.debug(`[handleMutations] Start handling ${mutationsList.length} mutations at ${startTime}`); + + let shouldSendUpdate = false; + + for (const mutation of mutationsList) { + console.debug(`[handleMutations] Mutation type: ${mutation.type}, target: ${mutation.target.tagName}`); + for (const node of mutation.addedNodes) { + if (node.nodeType === Node.ELEMENT_NODE) { + console.debug(`[handleMutations] Added node: ${node.tagName}`); + + // Uncommenting this freezes some websites (e.g. ChatGPT). + // It should not be necessary to call this here since it is also called in + // getVisibleHTMLString. 
+ //generateElementIdAndBbox(node); // Generate a new ID and bbox for the added node + + if (isVisible(node)) { + shouldSendUpdate = true; + break; // Exit the loop early + } + } + } + if (shouldSendUpdate) break; // Exit outer loop if update is needed + + for (const node of mutation.removedNodes) { + console.log(`[handleMutations] Removed node: ${node.tagName}`); + if (node.nodeType === Node.ELEMENT_NODE && idToElementMap.has(node.getAttribute('data-id'))) { + shouldSendUpdate = true; + break; // Exit the loop early + } + } + if (shouldSendUpdate) break; // Exit outer loop if update is needed + } + + const endTime = performance.now(); + console.debug(`[handleMutations] Finished handling mutations. Duration: ${endTime - startTime}ms`); + + if (shouldSendUpdate) { + debounceSendVisibleHTML('mutation'); + } +} + +function debounceSendVisibleHTML(eventType) { + // Clear the previous timeout, if any + if (debounceTimeoutId) { + clearTimeout(debounceTimeoutId); + } + + console.debug(`[debounceSendVisibleHTML] Debouncing visible HTML send for event: ${eventType}`); + // Set a new timeout + debounceTimeoutId = setTimeout(() => { + sendVisibleHTML(eventType); + }, DEBOUNCE_DELAY); +} + +function sendVisibleHTML(eventType) { + console.debug(`Handling DOM event: ${eventType}`); + const timestamp = Date.now() / 1000; // Convert to Python-compatible seconds + + const { visibleHTMLString, visibleHTMLDuration } = getVisibleHTMLString(); + + const eventData = { + type: 'DOM_EVENT', + eventType: eventType, + timestamp: timestamp, + visibleHTMLString, + visibleHTMLDuration, + }; + + sendMessageToBackgroundScript(eventData); +} + +function setupScrollAndResizeListeners() { + window.addEventListener('scroll', handleScrollEvent, { passive: true }); + window.addEventListener('resize', handleResizeEvent, { passive: true }); +} diff --git a/openadapt/browser.py b/openadapt/browser.py index aa8f1cd6b..c8894bcae 100644 --- a/openadapt/browser.py +++ b/openadapt/browser.py @@ -1,16 +1,17 @@ 
"""Utilities for working with BrowserEvents.""" from statistics import mean, median, stdev +import json -from bs4 import BeautifulSoup from copy import deepcopy from dtaidistance import dtw, dtw_ndim -from loguru import logger from sqlalchemy.orm import Session as SaSession from tqdm import tqdm import numpy as np +import websockets.sync.server from openadapt import models, utils +from openadapt.custom_logger import logger from openadapt.db import crud # action to browser @@ -79,6 +80,18 @@ ] +def set_browser_mode( + mode: str, websocket: websockets.sync.server.ServerConnection +) -> None: + """Send a message to the browser extension to set the mode.""" + logger.info(f"{type(websocket)=}") + VALID_MODES = ("idle", "record", "replay") + assert mode in VALID_MODES, f"{mode=} not in {VALID_MODES=}" + message = json.dumps({"type": "SET_MODE", "mode": mode}) + logger.info(f"sending {message=}") + websocket.send(message) + + def add_screen_tlbr(browser_events: list[models.BrowserEvent]) -> None: """Computes and adds the 'data-tlbr-screen' attribute for each element. 
@@ -96,29 +109,17 @@ def add_screen_tlbr(browser_events: list[models.BrowserEvent]) -> None: # Iterate over the events in reverse order for event in reversed(browser_events): - message = event.message - - event_type = message.get("eventType") - if event_type != "click": - continue - - visible_html_string = message.get("visibleHtmlString") - if not visible_html_string: - logger.warning("No visible HTML data available for event.") + try: + soup, target_element = event.parse() + except AssertionError as exc: + logger.warning(exc) continue - # Parse the visible HTML using BeautifulSoup - soup = BeautifulSoup(visible_html_string, "html.parser") - - # Fetch the target element using its data-id - target_id = message.get("targetId") - target_element = soup.find(attrs={"data-id": target_id}) - if not target_element: - logger.warning(f"No target element found for targetId: {target_id}") continue # Extract coordMappings from the message + message = event.message coord_mappings = message.get("coordMappings", {}) x_mappings = coord_mappings.get("x", {}) y_mappings = coord_mappings.get("y", {}) @@ -195,7 +196,7 @@ def add_screen_tlbr(browser_events: list[models.BrowserEvent]) -> None: target_element["data-tlbr-screen"] = new_screen_coords # Write the updated element back to the message - message["visibleHtmlString"] = str(soup) + message["visibleHTMLString"] = str(soup) logger.info("Finished processing all browser events for screen coordinates.") @@ -235,7 +236,7 @@ def identify_and_log_smallest_clicked_element( Args: browser_event: The browser event containing the click details. 
""" - visible_html_string = browser_event.message.get("visibleHtmlString") + visible_html_string = browser_event.message.get("visibleHTMLString") message_id = browser_event.message.get("id") logger.info("*" * 10) logger.info(f"{message_id=}") @@ -246,8 +247,7 @@ def identify_and_log_smallest_clicked_element( logger.warning("No visible HTML data available for click event.") return - # Parse the visible HTML using BeautifulSoup - soup = BeautifulSoup(visible_html_string, "html.parser") + soup = utils.parse_html(visible_html_string, "html.parser") target_element = soup.find(attrs={"data-id": target_id}) target_area = None if not target_element: diff --git a/openadapt/events.py b/openadapt/events.py index 5866a3656..c90edeba4 100644 --- a/openadapt/events.py +++ b/openadapt/events.py @@ -180,7 +180,6 @@ def make_parent_event( children = extra.get("children", []) browser_events = [child.browser_event for child in children if child.browser_event] if browser_events: - assert len(browser_events) <= 1, len(browser_events) browser_event = browser_events[0] event_dict["browser_event"] = browser_event diff --git a/openadapt/models.py b/openadapt/models.py index 1df82c45e..faf72bca0 100644 --- a/openadapt/models.py +++ b/openadapt/models.py @@ -8,6 +8,7 @@ import io import sys +from bs4 import BeautifulSoup from oa_pynput import keyboard from PIL import Image, ImageChops import numpy as np @@ -147,6 +148,14 @@ class ActionEvent(db.Base): "available_segment_descriptions", sa.String, ) + _active_browser_element = sa.Column( + "active_browser_element", + sa.String, + ) + _available_browser_elements = sa.Column( + "available_browser_elements", + sa.String, + ) mouse_button_name = sa.Column(sa.String) mouse_pressed = sa.Column(sa.Boolean) key_name = sa.Column(sa.String) @@ -193,6 +202,7 @@ def __init__(self, **kwargs: dict) -> None: for key, value in properties.items(): setattr(self, key, value) + # TODO: rename "available" to "target" @property def 
available_segment_descriptions(self) -> list[str]: """Gets the available segment descriptions.""" @@ -210,6 +220,53 @@ def available_segment_descriptions(self, value: list[str]) -> None: value ) + @property + def active_browser_element(self) -> BeautifulSoup | None: + if not self._active_browser_element: + return None + return utils.parse_html(self._active_browser_element) + + @active_browser_element.setter + def active_browser_element(self, value: BeautifulSoup) -> None: + if not value: + logger.warning(f"{value=}") + return + self._active_browser_element = str(value) + + @property + def available_browser_elements(self) -> BeautifulSoup | None: + # https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree + # The value True matches every tag it can. This code finds all the tags in the + # document, but none of the text strings + if not self._available_browser_elements: + return None + return utils.parse_html(self._available_browser_elements) + + @available_browser_elements.setter + def available_browser_elements(self, value: BeautifulSoup | None) -> None: + if not value: + logger.warning(f"{value=}") + return + try: + self._available_browser_elements = str(value) + except Exception as exc: + # something mysterious is going on, because this works: + # self._available_browser_elements = value + # and so does this: + # self._available_browser_elements = 'foo' + # but sometimes this: + # self._available_browser_elements = value + # produces: + # 'NoneType' object is not callable + # apparently, so does this: + # BeautifulSoup(soup.prettify()) + # XXX TODO: fix this + #logger.error(exc) + #self._available_browser_elements = '?' 
+ #return self.available_browser_elements + import ipdb; ipdb.set_trace() + foo = 1 + children = sa.orm.relationship("ActionEvent") # TODO: replacing the above line with the following two results in an error: # AttributeError: 'list' object has no attribute '_sa_instance_state' @@ -482,6 +539,8 @@ def to_prompt_dict(self) -> dict[str, Any]: Returns: dictionary containing relevant properties from the ActionEvent. """ + if self.active_browser_element: + import ipdb; ipdb.set_trace() action_dict = deepcopy( { key: val @@ -497,10 +556,20 @@ def to_prompt_dict(self) -> dict[str, Any]: for key in ("mouse_x", "mouse_y", "mouse_dx", "mouse_dy"): if key in action_dict: del action_dict[key] + # TODO XXX: add target_segment_description? + + # Manually add properties to the dictionary if self.available_segment_descriptions: action_dict["available_segment_descriptions"] = ( self.available_segment_descriptions ) + if self.active_browser_element: + action_dict["active_browser_element"] = str(self.active_browser_element) + if self.available_browser_elements: + action_dict["available_browser_elements"] = str(self.available_browser_elements) + + if self.active_browser_element: + import ipdb; ipdb.set_trace() return action_dict @@ -649,10 +718,10 @@ def __str__(self) -> str: # Create a copy of the message to avoid modifying the original message_copy = copy.deepcopy(self.message) - # Truncate the visibleHtmlString in the copied message if it exists - if "visibleHtmlString" in message_copy: - message_copy["visibleHtmlString"] = utils.truncate_html( - message_copy["visibleHtmlString"], max_len=100 + # Truncate the visibleHTMLString in the copied message if it exists + if "visibleHTMLString" in message_copy: + message_copy["visibleHTMLString"] = utils.truncate_html( + message_copy["visibleHTMLString"], max_len=100 ) # Get all attributes except 'message' @@ -668,6 +737,41 @@ def __str__(self) -> str: # Return the complete representation including the truncated message return 
f"BrowserEvent({base_repr}, message={message_copy})" + def parse(self) -> tuple[BeautifulSoup, BeautifulSoup | None]: + """Parses the visible HTML and optionally extracts the target element. + + This method processes the browser event to parse the visible HTML and, + if the event type is "click", extracts the target HTML element that was + clicked. + + Returns: + A tuple containing: + - BeautifulSoup: The parsed soup of the visible HTML. + - BeautifulSoup | None: The target HTML element if the event type is + "click"; otherwise, None. + + Raises: + AssertionError: If the necessary data is missing. + """ + message = self.message + + visible_html_string = message.get("visibleHTMLString") + assert visible_html_string, "Cannot parse without visibleHTMLString" + + # Parse the visible HTML using BeautifulSoup + soup = BeautifulSoup(visible_html_string, "html.parser") + + event_type = message.get("eventType") + target_element = None + + if event_type == "click": + # Fetch the target element using its data-id + target_id = message.get("targetId") + target_element = soup.find(attrs={"data-id": target_id}) + assert target_element, f"No target element found for targetId: {target_id}" + + return soup, target_element + # # TODO: implement # @classmethod # def get_active_browser_event( diff --git a/openadapt/record.py b/openadapt/record.py index 27eb9e578..eef25c7c8 100644 --- a/openadapt/record.py +++ b/openadapt/record.py @@ -24,6 +24,7 @@ from pympler import tracker import av +from openadapt.browser import set_browser_mode from openadapt.build_utils import redirect_stdout_stderr from openadapt.custom_logger import logger from openadapt.models import Recording @@ -1192,24 +1193,27 @@ def read_browser_events( """ utils.set_start_time(recording.timestamp) + # set the browser mode + set_browser_mode("record", websocket) + logger.info("Starting Reading Browser Events ...") while not terminate_processing.is_set(): - for message in websocket: - if not message: - continue - - 
timestamp = utils.get_timestamp() - - data = json.loads(message) - - event_q.put( - Event( - timestamp, - "browser", - {"message": data}, - ) + try: + message = websocket.recv(0.01) + except TimeoutError: + continue + timestamp = utils.get_timestamp() + data = json.loads(message) + event_q.put( + Event( + timestamp, + "browser", + {"message": data}, ) + ) + + set_browser_mode("idle", websocket) @logger.catch diff --git a/openadapt/strategies/__init__.py b/openadapt/strategies/__init__.py index fecc7c056..916843f3b 100644 --- a/openadapt/strategies/__init__.py +++ b/openadapt/strategies/__init__.py @@ -5,6 +5,7 @@ # flake8: noqa from openadapt.strategies.base import BaseReplayStrategy +from openadapt.strategies.browser import BrowserReplayStrategy # disabled because importing is expensive # from openadapt.strategies.demo import DemoReplayStrategy diff --git a/openadapt/strategies/base.py b/openadapt/strategies/base.py index ea7fb68c5..aac897504 100644 --- a/openadapt/strategies/base.py +++ b/openadapt/strategies/base.py @@ -21,6 +21,7 @@ def __init__( self, recording: models.Recording, max_frame_times: int = MAX_FRAME_TIMES, + include_a11y_data: bool = True, ) -> None: """Initialize the BaseReplayStrategy. 
@@ -34,6 +35,7 @@ def __init__( self.screenshots = [] self.window_events = [] self.frame_times = [] + self.include_a11y_data = include_a11y_data @abstractmethod def get_next_action_event( @@ -67,7 +69,10 @@ def run(self) -> None: continue self.screenshots.append(screenshot) - window_event = models.WindowEvent.get_active_window_event() + window_event = models.WindowEvent.get_active_window_event( + # TODO: rename + include_window_data=self.include_a11y_data, + ) self.window_events.append(window_event) try: action_event = self.get_next_action_event( diff --git a/openadapt/strategies/browser-extended.py b/openadapt/strategies/browser-extended.py new file mode 100644 index 000000000..bd37786c0 --- /dev/null +++ b/openadapt/strategies/browser-extended.py @@ -0,0 +1,412 @@ +""" +Implements a replay strategy for browser recordings. + +TODO: +- re-use approach from visual.py: segment each screenshot, prompt for descriptions +""" + +from pprint import pformat +from threading import Event, Thread +import json +import queue + +from bs4 import BeautifulSoup +from websockets.sync.server import ServerConnection + +from openadapt import adapters, config, models, utils, strategies +from openadapt.custom_logger import logger + +# Define ws_server_instance at the top scope +ws_server_instance = None + +# Define a whitelist of essential attributes +WHITELIST_ATTRIBUTES = [ + 'id', 'class', 'href', 'src', 'alt', 'name', 'type', 'value', 'title', 'data-*', 'aria-*' +] + + +class BrowserReplayStrategy(strategies.BaseReplayStrategy): + """ReplayStrategy using HTML and replay instructions.""" + + def __init__( + self, + recording: models.Recording, + instructions: str, + ) -> None: + """Initialize the BrowserReplayStrategy. + + Args: + recording (models.Recording): The recording object. + instructions (str): Natural language instructions for how recording + should be replayed. 
+ """ + super().__init__(recording, include_a11y_data=False) + self.event_q = queue.Queue() + self.terminate_processing = Event() + self.recent_visible_html = "" + add_browser_elements(recording.processed_action_events) + self.browser_event_reader = Thread( + target=run_browser_event_server, + args=(self.event_q, self.terminate_processing), + ) + self.browser_event_reader.start() + + self.instructions = instructions + self.action_history = [] + self.modified_actions = self.apply_replay_instructions( + recording.processed_action_events, + instructions + ) + # Ensure browser elements are set for modified actions + add_browser_elements(self.modified_actions) + self.action_event_idx = 0 + + def get_recent_visible_html(self) -> str: + """Get the most recent visible DOM from the event queue. + + Returns: + str: The most recent visible DOM. + """ + num_messages_read = 0 + while not self.event_q.empty(): + event = self.event_q.get() + num_messages_read += 1 + self.recent_visible_html = event.data["message"]["visibleHTMLString"] + + if num_messages_read: + logger.info(f"{num_messages_read=} {len(self.recent_visible_html)=}") + return self.recent_visible_html + + def get_next_action_event( + self, + screenshot: models.Screenshot, + window_event: models.WindowEvent, + ) -> models.ActionEvent | None: + """Get the next ActionEvent for replay. + + Args: + screenshot (models.Screenshot): The screenshot object. + window_event (models.WindowEvent): The window event object. + + Returns: + models.ActionEvent or None: The next ActionEvent for replay or None + if there are no more events. + """ + # First, try the direct approach based on planned sequence. + try: + action = self._execute_planned_action( + screenshot=screenshot, + current_window_event=window_event + ) + if action: + return action + except Exception as e: + logger.warning(f"Direct generation approach failed: {e}") + + # Fallback to the planning approach if the direct approach fails. 
+ try: + action = self._generate_next_action_plan( + screenshot=screenshot, + window_event=window_event, + recorded_actions=self.recording.processed_action_events, + replayed_actions=self.action_history, + instructions=self.instructions, + ) + return action + except Exception as e: + logger.error(f"Planning approach also failed: {e}") + return None + + def _execute_planned_action( + self, + screenshot: models.Screenshot, + current_window_event: models.WindowEvent, + ) -> models.ActionEvent | None: + """Try to execute the next planned action assuming it matches reality. + + Args: + screenshot (models.Screenshot): The current state screenshot. + current_window_event (models.WindowEvent): The current state window data. + + Returns: + models.ActionEvent or None: The next action event if the planned target exists. + """ + if self.action_event_idx >= len(self.modified_actions): + return None # No more actions to replay. + + planned_action = self.modified_actions[self.action_event_idx] + self.action_event_idx += 1 + + # Find target element in the current DOM. + recent_visible_html = self.get_recent_visible_html() + soup, target_element = self._find_element_in_dom(planned_action, recent_visible_html) + + if target_element: + planned_action.active_browser_element = target_element + self.action_history.append(planned_action) + return planned_action + else: + raise ValueError("Target element not found in the current DOM.") + + def _generate_next_action_plan( + self, + screenshot: models.Screenshot, + window_event: models.WindowEvent, + recorded_actions: list[models.ActionEvent], + replayed_actions: list[models.ActionEvent], + instructions: str, + ) -> models.ActionEvent | None: + """Fallback method to dynamically plan the next action event. + + Args: + screenshot (models.Screenshot): The current state screenshot. + window_event (models.WindowEvent): The current state window data. + recorded_actions (list[models.ActionEvent]): List of action events from the recording. 
+ replayed_actions (list[models.ActionEvent]): List of actions produced during current replay. + instructions (str): Proposed modifications in natural language instructions. + + Returns: + models.ActionEvent or None: The next action event if successful, otherwise None. + """ + prompt_adapter = adapters.get_default_prompt_adapter() + system_prompt = utils.render_template_from_file("prompts/system.j2") + prompt = utils.render_template_from_file( + "prompts/generate_action_event--browser.j2", # Updated template file name + current_window=window_event.to_prompt_dict(), + recorded_actions=[action.to_prompt_dict() for action in recorded_actions], + replayed_actions=[action.to_prompt_dict() for action in replayed_actions], + replay_instructions=instructions, + ) + + content = prompt_adapter.prompt( + prompt, + system_prompt=system_prompt, + images=[screenshot.image], + ) + action_dict = utils.parse_code_snippet(content) + logger.info(f"{action_dict=}") + if not action_dict: + return None + + return models.ActionEvent.from_dict(action_dict) + + def apply_replay_instructions( + self, + action_events: list[models.ActionEvent], + replay_instructions: str, + ) -> list[models.ActionEvent]: + """Modify the given ActionEvents according to the given replay instructions. + + Args: + action_events: list of action events to be modified in place. + replay_instructions: instructions for how action events should be modified. + + Returns: + list[models.ActionEvent]: The modified list of action events. 
+ """ + action_dicts = [action.to_prompt_dict() for action in action_events] + actions_dict = {"actions": action_dicts} + system_prompt = utils.render_template_from_file("prompts/system.j2") + prompt = utils.render_template_from_file( + "prompts/apply_replay_instructions--browser.j2", # Updated template file name + actions=actions_dict, + replay_instructions=replay_instructions, + # TODO: remove + exceptions=[], + ) + print(prompt) + import ipdb; ipdb.set_trace() + prompt_adapter = adapters.get_default_prompt_adapter() + content = prompt_adapter.prompt(prompt, system_prompt=system_prompt) + content_dict = utils.parse_code_snippet(content) + + try: + action_dicts = content_dict["actions"] + except TypeError as exc: + logger.warning(exc) + action_dicts = content_dict # OpenAI sometimes returns a list of dicts directly. + + modified_actions = [] + for action_dict in action_dicts: + action = models.ActionEvent.from_dict(action_dict) + modified_actions.append(action) + return modified_actions + + def _find_element_in_dom(self, planned_action: models.ActionEvent, html: str): + """Locate the target element in the current HTML DOM. + + Args: + planned_action (models.ActionEvent): The planned action with target element info. + html (str): The current HTML content. + + Returns: + Tuple[BeautifulSoup, Tag or None]: Parsed HTML and the target element or None. + """ + soup = BeautifulSoup(html, 'html.parser') + target_selector = planned_action.active_browser_element # Assuming selector or similar identifier is used. + target_element = soup.select_one(target_selector) # Simplify finding elements. 
+ + return soup, target_element + + def __del__(self) -> None: + """Clean up resources and log action history.""" + self.terminate_processing.set() + action_history_dicts = [action.to_prompt_dict() for action in self.action_history] + logger.info(f"action_history=\n{pformat(action_history_dicts)}") + + +def clean_html_attributes(element: BeautifulSoup) -> str: + """Retain only essential attributes from an HTML element based on a whitelist. + + Args: + element: A BeautifulSoup tag element. + + Returns: + A string representing the cleaned HTML element. + """ + whitelist_attrs = [] + + # Go through each attribute in the element and keep only whitelisted ones + for attr_name, attr_value in element.attrs.items(): + if attr_name in WHITELIST_ATTRIBUTES or attr_name.startswith('data-') or attr_name.startswith('aria-'): + whitelist_attrs.append((attr_name, attr_value)) + else: + logger.debug(f"Removing attribute from <{element.name}>: {attr_name}='{attr_value}'") + + # Update the element with only whitelisted attributes + element.attrs = dict(whitelist_attrs) + return str(element) + + +def filter_and_clean_html(soup: BeautifulSoup) -> str: + """Filter out irrelevant elements, clean attributes, and log removed elements. + + Args: + soup: BeautifulSoup object of the parsed HTML. + + Returns: + A string representing the cleaned HTML. 
+ """ + # Define relevant elements for action replay + relevant_tags = ['a', 'button', 'div', 'span', 'input', 'img', 'form', 'iframe'] + relevant_elements = [] + + # Find relevant elements and log removal of irrelevant ones + for el in soup.find_all(): + if el.name in relevant_tags: + relevant_elements.append(el) + else: + logger.debug(f"Removing element <{el.name}> with attributes: {el.attrs}") + + # Clean each relevant element + cleaned_elements = [clean_html_attributes(el) for el in relevant_elements] + + # Recreate a simplified HTML structure with only the cleaned elements + return ''.join(cleaned_elements) + + +def add_browser_elements(action_events: list) -> None: + """Set the ActionEvent.active_browser_element where appropriate and log actions. + + Args: + action_events: list of ActionEvents to modify in-place. + """ + action_browser_tups = [ + (action, action.browser_event) + for action in action_events + if action.browser_event + ] + for action, browser in action_browser_tups: + soup, target_element = browser.parse() + if not target_element: + logger.warning(f"{target_element=}") + continue + + # Convert BeautifulSoup object to cleaned HTML strings + action.active_browser_element = clean_html_attributes(target_element) + action.available_browser_elements = filter_and_clean_html(soup) + + # Verify the cleaned elements + assert action.active_browser_element, action.active_browser_element + assert action.available_browser_elements, action.available_browser_elements + + import ipdb; ipdb.set_trace() + foo = 2 + + +def run_browser_event_server( + event_q: queue.Queue, + terminate_processing: Event, +) -> None: + """Run the browser event server. + + Params: + event_q: A queue for adding browser events. + terminate_processing: An event to signal the termination of the process. 
+ + Returns: + None + """ + global ws_server_instance + + def run_server() -> None: + global ws_server_instance + with ServerConnection( + lambda ws: read_browser_events(ws, event_q, terminate_processing), + config.BROWSER_WEBSOCKET_SERVER_IP, + config.BROWSER_WEBSOCKET_PORT, + max_size=config.BROWSER_WEBSOCKET_MAX_SIZE, + ) as server: + ws_server_instance = server + logger.info("WebSocket server started") + server.serve_forever() + + server_thread = Thread(target=run_server) + server_thread.start() + terminate_processing.wait() + logger.info("Termination signal received, shutting down server") + + if ws_server_instance: + ws_server_instance.shutdown() + + server_thread.join() + + +def read_browser_events( + websocket: ServerConnection, + event_q: queue.Queue, + terminate_processing: Event, +) -> None: + """Read browser events and add them to the event queue. + + Params: + websocket: The websocket object. + event_q: A queue for adding browser events. + terminate_processing: An event to signal the termination of the process. 
+ + Returns: + None + """ + set_browser_mode("replay", websocket) + utils.set_start_time() + logger.info("Starting Reading Browser Events ...") + + try: + while not terminate_processing.is_set(): + try: + message = websocket.recv(0.01) + except TimeoutError: + continue + timestamp = utils.get_timestamp() + logger.info(f"{len(message)=}") + data = json.loads(message) + assert data["type"] == "DOM_EVENT", data["type"] + event_q.put( + models.BrowserEvent( + timestamp=timestamp, + message=data, + ) + ) + finally: + set_browser_mode("idle", websocket) + diff --git a/openadapt/strategies/browser.py b/openadapt/strategies/browser.py new file mode 100644 index 000000000..f6771f856 --- /dev/null +++ b/openadapt/strategies/browser.py @@ -0,0 +1,407 @@ +""" +TODO: +- re-use approach from visual.py: segment each screenshot, prompt for descriptions +""" + +from pprint import pformat +from threading import Event, Thread +import json +import queue + +from bs4 import BeautifulSoup +from websockets.sync.server import ServerConnection + +from openadapt import adapters, config, models, utils, strategies +from openadapt.custom_logger import logger + +# Define ws_server_instance at the top scope +ws_server_instance = None + +class BrowserReplayStrategy(strategies.BaseReplayStrategy): + """ReplayStrategy using HTML and replay instructions.""" + + def __init__( + self, + recording: models.Recording, + instructions: str, + ) -> None: + """Initialize the BrowserReplayStrategy. + + Args: + recording (models.Recording): The recording object. + instructions (str): Natural language instructions for how recording + should be replayed. 
+ """ + super().__init__(recording, include_a11y_data=False) + self.event_q = queue.Queue() + self.terminate_processing = Event() + self.recent_visible_html = "" + add_browser_elements(recording.processed_action_events) + self.browser_event_reader = Thread( + target=run_browser_event_server, + args=(self.event_q, self.terminate_processing), + ) + self.browser_event_reader.start() + + self.instructions = instructions + self.action_history = [] + self.modified_actions = self.apply_replay_instructions( + recording.processed_action_events, + instructions + ) + # Ensure browser elements are set for modified actions + add_browser_elements(self.modified_actions) + self.action_event_idx = 0 + + def get_recent_visible_html(self) -> str: + """Get the most recent visible DOM from the event queue. + + Returns: + str: The most recent visible DOM. + """ + num_messages_read = 0 + while not self.event_q.empty(): + event = self.event_q.get() + num_messages_read += 1 + self.recent_visible_html = event.data["message"]["visibleHTMLString"] + + if num_messages_read: + logger.info(f"{num_messages_read=} {len(self.recent_visible_html)=}") + return self.recent_visible_html + + def get_next_action_event( + self, + screenshot: models.Screenshot, + window_event: models.WindowEvent, + ) -> models.ActionEvent | None: + """Get the next ActionEvent for replay. + + Args: + screenshot (models.Screenshot): The screenshot object. + window_event (models.WindowEvent): The window event object. + + Returns: + models.ActionEvent or None: The next ActionEvent for replay or None + if there are no more events. + """ + # First, try the direct approach based on planned sequence. + try: + action = self._execute_planned_action( + screenshot=screenshot, + current_window_event=window_event + ) + if action: + return action + except Exception as e: + logger.warning(f"Direct generation approach failed: {e}") + + # Fallback to the planning approach if the direct approach fails. 
+ try: + action = self._generate_next_action_plan( + screenshot=screenshot, + window_event=window_event, + recorded_actions=self.recording.processed_action_events, + replayed_actions=self.action_history, + instructions=self.instructions, + ) + return action + except Exception as e: + logger.error(f"Planning approach also failed: {e}") + return None + + def _execute_planned_action( + self, + screenshot: models.Screenshot, + current_window_event: models.WindowEvent, + ) -> models.ActionEvent | None: + """Try to execute the next planned action assuming it matches reality. + + Args: + screenshot (models.Screenshot): The current state screenshot. + current_window_event (models.WindowEvent): The current state window data. + + Returns: + models.ActionEvent or None: The next action event if the planned target exists. + """ + if self.action_event_idx >= len(self.modified_actions): + return None # No more actions to replay. + + planned_action = self.modified_actions[self.action_event_idx] + self.action_event_idx += 1 + + # Find target element in the current DOM. + recent_visible_html = self.get_recent_visible_html() + soup, target_element = self._find_element_in_dom(planned_action, recent_visible_html) + + if target_element: + planned_action.active_browser_element = target_element + self.action_history.append(planned_action) + return planned_action + else: + raise ValueError("Target element not found in the current DOM.") + + def _generate_next_action_plan( + self, + screenshot: models.Screenshot, + window_event: models.WindowEvent, + recorded_actions: list[models.ActionEvent], + replayed_actions: list[models.ActionEvent], + instructions: str, + ) -> models.ActionEvent | None: + """Fallback method to dynamically plan the next action event. + + Args: + screenshot (models.Screenshot): The current state screenshot. + window_event (models.WindowEvent): The current state window data. + recorded_actions (list[models.ActionEvent]): List of action events from the recording. 
+ replayed_actions (list[models.ActionEvent]): List of actions produced during current replay. + instructions (str): Proposed modifications in natural language instructions. + + Returns: + models.ActionEvent or None: The next action event if successful, otherwise None. + """ + prompt_adapter = adapters.get_default_prompt_adapter() + system_prompt = utils.render_template_from_file("prompts/system.j2") + prompt = utils.render_template_from_file( + "prompts/generate_action_event--browser.j2", # Updated template file name + current_window=window_event.to_prompt_dict(), + recorded_actions=[action.to_prompt_dict() for action in recorded_actions], + replayed_actions=[action.to_prompt_dict() for action in replayed_actions], + replay_instructions=instructions, + ) + + content = prompt_adapter.prompt( + prompt, + system_prompt=system_prompt, + images=[screenshot.image], + ) + action_dict = utils.parse_code_snippet(content) + logger.info(f"{action_dict=}") + if not action_dict: + return None + + return models.ActionEvent.from_dict(action_dict) + + def apply_replay_instructions( + self, + action_events: list[models.ActionEvent], + replay_instructions: str, + ) -> list[models.ActionEvent]: + """Modify the given ActionEvents according to the given replay instructions. + + Args: + action_events: list of action events to be modified in place. + replay_instructions: instructions for how action events should be modified. + + Returns: + list[models.ActionEvent]: The modified list of action events. 
+ """ + action_dicts = [action.to_prompt_dict() for action in action_events] + actions_dict = {"actions": action_dicts} + system_prompt = utils.render_template_from_file("prompts/system.j2") + prompt = utils.render_template_from_file( + "prompts/apply_replay_instructions--browser.j2", # Updated template file name + actions=actions_dict, + replay_instructions=replay_instructions, + # TODO: remove + exceptions=[], + ) + print(prompt) + import ipdb; ipdb.set_trace() + prompt_adapter = adapters.get_default_prompt_adapter() + content = prompt_adapter.prompt(prompt, system_prompt=system_prompt) + content_dict = utils.parse_code_snippet(content) + + try: + action_dicts = content_dict["actions"] + except TypeError as exc: + logger.warning(exc) + action_dicts = content_dict # OpenAI sometimes returns a list of dicts directly. + + modified_actions = [] + for action_dict in action_dicts: + action = models.ActionEvent.from_dict(action_dict) + modified_actions.append(action) + return modified_actions + + def _find_element_in_dom(self, planned_action: models.ActionEvent, html: str): + """Locate the target element in the current HTML DOM. + + Args: + planned_action (models.ActionEvent): The planned action with target element info. + html (str): The current HTML content. + + Returns: + Tuple[BeautifulSoup, Tag or None]: Parsed HTML and the target element or None. + """ + soup = BeautifulSoup(html, 'html.parser') + target_selector = planned_action.active_browser_element # Assuming selector or similar identifier is used. + target_element = soup.select_one(target_selector) # Simplify finding elements. 
+ + return soup, target_element + + def __del__(self) -> None: + """Clean up resources and log action history.""" + self.terminate_processing.set() + action_history_dicts = [action.to_prompt_dict() for action in self.action_history] + logger.info(f"action_history=\n{pformat(action_history_dicts)}") + + +# Define a whitelist of essential attributes +WHITELIST_ATTRIBUTES = [ + 'id', 'class', 'href', 'src', 'alt', 'name', 'type', 'value', 'title', 'data-*', 'aria-*' +] + +def clean_html_attributes(element: BeautifulSoup) -> str: + """Retain only essential attributes from an HTML element based on a whitelist. + + Args: + element: A BeautifulSoup tag element. + + Returns: + A string representing the cleaned HTML element. + """ + whitelist_attrs = [] + + # Go through each attribute in the element and keep only whitelisted ones + for attr_name, attr_value in element.attrs.items(): + if attr_name in WHITELIST_ATTRIBUTES or attr_name.startswith('data-') or attr_name.startswith('aria-'): + whitelist_attrs.append((attr_name, attr_value)) + else: + logger.debug(f"Removing attribute from <{element.name}>: {attr_name}='{attr_value}'") + + # Update the element with only whitelisted attributes + element.attrs = dict(whitelist_attrs) + return str(element) + +def filter_and_clean_html(soup: BeautifulSoup) -> str: + """Filter out irrelevant elements, clean attributes, and log removed elements. + + Args: + soup: BeautifulSoup object of the parsed HTML. + + Returns: + A string representing the cleaned HTML. 
+ """ + # Define relevant elements for action replay + relevant_tags = ['a', 'button', 'div', 'span', 'input', 'img', 'form', 'iframe'] + relevant_elements = [] + + # Find relevant elements and log removal of irrelevant ones + for el in soup.find_all(): + if el.name in relevant_tags: + relevant_elements.append(el) + else: + logger.debug(f"Removing element <{el.name}> with attributes: {el.attrs}") + + # Clean each relevant element + cleaned_elements = [clean_html_attributes(el) for el in relevant_elements] + + # Recreate a simplified HTML structure with only the cleaned elements + return ''.join(cleaned_elements) + +def add_browser_elements(action_events: list) -> None: + """Set the ActionEvent.active_browser_element where appropriate and log actions. + + Args: + action_events: list of ActionEvents to modify in-place. + """ + action_browser_tups = [ + (action, action.browser_event) + for action in action_events + if action.browser_event + ] + for action, browser in action_browser_tups: + soup, target_element = browser.parse() + if not target_element: + logger.warning(f"{target_element=}") + continue + + # Convert BeautifulSoup object to cleaned HTML strings + action.active_browser_element = clean_html_attributes(target_element) + action.available_browser_elements = filter_and_clean_html(soup) + + # Verify the cleaned elements + assert action.active_browser_element, action.active_browser_element + assert action.available_browser_elements, action.available_browser_elements + + import ipdb; ipdb.set_trace() + foo = 2 + + +def run_browser_event_server( + event_q: queue.Queue, + terminate_processing: Event, +) -> None: + """Run the browser event server. + + Params: + event_q: A queue for adding browser events. + terminate_processing: An event to signal the termination of the process. 
+ + Returns: + None + """ + global ws_server_instance + + def run_server() -> None: + global ws_server_instance + with ServerConnection( + lambda ws: read_browser_events(ws, event_q, terminate_processing), + config.BROWSER_WEBSOCKET_SERVER_IP, + config.BROWSER_WEBSOCKET_PORT, + max_size=config.BROWSER_WEBSOCKET_MAX_SIZE, + ) as server: + ws_server_instance = server + logger.info("WebSocket server started") + server.serve_forever() + + server_thread = Thread(target=run_server) + server_thread.start() + terminate_processing.wait() + logger.info("Termination signal received, shutting down server") + + if ws_server_instance: + ws_server_instance.shutdown() + + server_thread.join() + + +def read_browser_events( + websocket: ServerConnection, + event_q: queue.Queue, + terminate_processing: Event, +) -> None: + """Read browser events and add them to the event queue. + + Params: + websocket: The websocket object. + event_q: A queue for adding browser events. + terminate_processing: An event to signal the termination of the process. 
+ + Returns: + None + """ + set_browser_mode("replay", websocket) + utils.set_start_time() + logger.info("Starting Reading Browser Events ...") + + try: + while not terminate_processing.is_set(): + try: + message = websocket.recv(0.01) + except TimeoutError: + continue + timestamp = utils.get_timestamp() + logger.info(f"{len(message)=}") + data = json.loads(message) + assert data["type"] == "DOM_EVENT", data["type"] + event_q.put( + models.BrowserEvent( + timestamp=timestamp, + message=data, + ) + ) + finally: + set_browser_mode("idle", websocket) + diff --git a/openadapt/utils.py b/openadapt/utils.py index 82279c0d4..3949094a6 100644 --- a/openadapt/utils.py +++ b/openadapt/utils.py @@ -16,6 +16,7 @@ import threading import time +from bs4 import BeautifulSoup from jinja2 import Environment, FileSystemLoader from PIL import Image, ImageEnhance from posthog import Posthog @@ -992,6 +993,48 @@ def truncate_html(html_str: str, max_len: int) -> str: return html_str +def parse_html(html: str, parser: str = "html.parser") -> BeautifulSoup: + # Parse the visible HTML using BeautifulSoup + soup = BeautifulSoup(html, parser) + return soup + + +# XXX TODO: +#import html2text +def get_html_prompt(html: str, convert_to_markdown: bool = False) -> str: + """Convert an HTML string to a processed version suitable for LLM prompts. + + Args: + html: The input HTML string. + convert_to_markdown: If True, converts the HTML to Markdown. Defaults to False. + + Returns: + A string with preserved semantic structure and interactable elements. + If convert_to_markdown is True, the string is in Markdown format. 
+ """ + # Parse HTML with BeautifulSoup + soup = BeautifulSoup(html, 'html.parser') + + # Remove non-interactive and unnecessary elements + for tag in soup(['style', 'script', 'noscript', 'meta', 'head', 'iframe']): + tag.decompose() + + assert not convert_to_markdown, "poetry add html2text" + if convert_to_markdown: + # Initialize html2text converter + converter = html2text.HTML2Text() + converter.ignore_links = False # Keep all links + converter.ignore_images = False # Keep all images + converter.body_width = 0 # Preserve original width without wrapping + + # Convert the cleaned HTML to Markdown + markdown = converter.handle(str(soup)) + return markdown + + # Return processed HTML as a string if Markdown conversion is not required + return str(soup) + + class WrapStdout: """Class to be used a target for multiprocessing.Process."""
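Review note on the `get_html_prompt` helper added to `openadapt/utils.py` in the hunk above: it drops `style`, `script`, `noscript`, `meta`, `head` and `iframe` subtrees before the HTML is placed in a prompt. The sketch below illustrates the same filtering idea using only the standard library's `html.parser`, as a stand-in for BeautifulSoup's `decompose()`; it is not the code in this diff. The names `TagStripper`, `strip_non_interactive` and `VOID_TAGS` are hypothetical, and unlike BeautifulSoup this version also drops comments and doctype declarations.

```python
from html.parser import HTMLParser

# Same tag list that get_html_prompt passes to tag.decompose().
REMOVED_TAGS = {"style", "script", "noscript", "meta", "head", "iframe"}
# Void elements have no closing tag, so they must not open a skipped region.
VOID_TAGS = {"meta", "br", "img", "input", "hr", "link"}


class TagStripper(HTMLParser):
    """Hypothetical helper: re-emit HTML while dropping REMOVED_TAGS subtrees."""

    def __init__(self) -> None:
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts: list[str] = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never change the skip depth.
        if tag not in REMOVED_TAGS and self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in REMOVED_TAGS:
            if tag not in VOID_TAGS and self.skip_depth > 0:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_non_interactive(html: str) -> str:
    """Return html with REMOVED_TAGS elements and their contents removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


if __name__ == "__main__":
    sample = (
        '<html><head><meta charset="utf-8"><title>T</title></head>'
        '<body><script>var a = "<b>";</script>'
        '<a href="/x">Go &amp; see</a><button id="b">OK</button></body></html>'
    )
    print(strip_non_interactive(sample))
```

A depth counter is used rather than a boolean flag so that removed elements nested inside each other (for example a `script` inside `head`) close correctly.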