feat: add Linux desktop automation support via AT-SPI2 #356
Add Linux as a first-class platform using AT-SPI2 accessibility framework via node-gtk for accessibility tree snapshots. This mirrors the macOS desktop automation approach using accessibility snapshots.

New files:
- src/platforms/linux/atspi-bridge.ts: Core AT-SPI2 bridge using node-gtk with lazy loading, recursive tree traversal (max 1500 nodes, depth 12)
- src/platforms/linux/role-map.ts: AT-SPI2 role normalization (~100 roles mapped to existing snapshot type conventions)
- src/platforms/linux/snapshot.ts: Snapshot entry point with surface, scope, depth, and interactive-only filtering support
- src/platforms/linux/devices.ts: Local device discovery for Linux
- src/platforms/linux/node-gtk.d.ts: Type declarations for node-gtk

Integration:
- Extended Platform type with 'linux', backend union with 'linux-atspi'
- Wired snapshot into dispatch.ts and snapshot-capture.ts
- Added Linux device discovery to dispatch-resolve.ts
- Added stub interactor (input actions deferred to Phase 3)
- Added 'linux' to CLI --platform flag
- node-gtk added as optional dependency (only installs on Linux)

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
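The bounded traversal mentioned above (max 1500 nodes, depth 12) can be sketched roughly as follows. This is a minimal illustration, not the bridge's actual code: the `TreeNode` and `FlatNode` shapes and the `flattenTree` name are hypothetical, and treating the root as depth 0 with depth 12 inclusive is an assumption.

```typescript
// Hypothetical node shapes; the real AT-SPI2 bridge types are not shown in this PR text.
interface TreeNode {
  role: string;
  name: string;
  children: TreeNode[];
}

interface FlatNode {
  index: number;
  depth: number;
  role: string;
  name: string;
}

const MAX_NODES = 1500; // limit quoted in the commit message
const MAX_DEPTH = 12; // limit quoted in the commit message (inclusive here by assumption)

// Depth-first walk that stops emitting nodes once either budget is exhausted.
function flattenTree(
  root: TreeNode,
  maxNodes: number = MAX_NODES,
  maxDepth: number = MAX_DEPTH,
): FlatNode[] {
  const out: FlatNode[] = [];
  const visit = (node: TreeNode, depth: number): void => {
    if (out.length >= maxNodes || depth > maxDepth) return;
    out.push({ index: out.length, depth, role: node.role, name: node.name });
    for (const child of node.children) {
      if (out.length >= maxNodes) break; // no point descending further
      visit(child, depth + 1);
    }
  };
  visit(root, 0);
  return out;
}
```

Capping both node count and depth keeps the snapshot size predictable even when an application exposes a very wide or very deep accessibility tree.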
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d880e00b1
src/core/dispatch-resolve.ts
Outdated
    const devices: DeviceInfo[] = [];
    try {
      devices.push(...(await listLinuxDevices()));
Preserve Android auto-selection before Linux host device
This new discovery order prepends the synthetic local Linux device ahead of Android candidates in the path taken when no `--platform` flag is given, which changes default targeting on Linux hosts. Because resolveDevice keeps discovery order when multiple booted physical devices are equally valid, a connected Android phone (kind: device) is now commonly displaced by the Linux host entry, so existing commands that relied on implicit Android selection start resolving to Linux instead.
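One way to address this concern is to keep discovery order otherwise intact but always place the synthetic Linux host entry last. The sketch below is illustrative only: the `DeviceInfo` fields and the `orderForAutoSelection` name are assumptions, not the repository's actual types or functions.

```typescript
// Hypothetical device shape; only `kind: device` is mentioned in the review comment.
interface DeviceInfo {
  platform: "ios" | "android" | "macos" | "linux";
  id: string;
  kind: "device" | "emulator" | "simulator";
}

// Stable reorder: preserve discovery order, but move Linux host entries to the end
// so they never displace an equally valid Android or Apple candidate.
function orderForAutoSelection(devices: DeviceInfo[]): DeviceInfo[] {
  const others = devices.filter((d) => d.platform !== "linux");
  const linuxHosts = devices.filter((d) => d.platform === "linux");
  return others.concat(linuxHosts);
}
```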
src/core/interactors.ts
Outdated
    case 'linux':
      return {
        open: () => {
          throw new AppError('UNSUPPORTED_OPERATION', 'open not yet supported on Linux');
        },
Gate unsupported Linux commands in capability checks
The Linux interactor path is mostly stubbed with UNSUPPORTED_OPERATION throws, but Linux was added as a platform without a Linux-specific capability matrix, so command support checks still treat Linux like Android and allow commands such as open, click, and fill to proceed until they fail at runtime here. This creates false-positive support signals and inconsistent behavior across handlers that rely on isCommandSupportedOnDevice for early rejection.
- Extract SnapshotBackend type alias to replace repeated string union across 5 files (snapshot.ts, snapshot-capture.ts, session-replay-heal.ts, interaction.test.ts)
- Remove duplicate scope/interactive/depth filtering from linux/snapshot.ts; let the existing buildSnapshotState pipeline handle it, same as Android
- Extract isDesktopBackend() helper in snapshot-capture.ts to consolidate the "skip mobile semantics" pattern for macos-helper and linux-atspi
- Collapse 17 repetitive throw statements in Linux interactor stubs into a linuxStub() factory function

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
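A stub factory of the kind described in the last bullet might look like the sketch below. The `AppError` class here is a stand-in written for this example (the real one lives elsewhere in the repository), and the exact `linuxStub` signature and the interactor keys are assumptions; only the error code and message format come from the reviewed diff.

```typescript
// Stand-in for the project's AppError; the real class is not shown in this PR text.
class AppError extends Error {
  code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
    this.name = "AppError";
  }
}

// Returns a function that always throws UNSUPPORTED_OPERATION for the named operation.
function linuxStub(operation: string): () => never {
  return () => {
    throw new AppError("UNSUPPORTED_OPERATION", `${operation} not yet supported on Linux`);
  };
}

// Each stubbed action becomes a one-liner instead of a repeated throw block.
const linuxInteractor = {
  open: linuxStub("open"),
  tap: linuxStub("tap"),
  fill: linuxStub("fill"),
};
```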
Add xdotool/ydotool input actions (tap, swipe, scroll, type, fill, right/middle click, long press, double click), screenshot capture via grim/scrot, and app lifecycle management (open, close, back, home). Wire Linux interactors with real implementations and fix device discovery order so Linux doesn't displace Android in auto-selection. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
- Extract linux-env.ts with cached display server + input tool detection so every action avoids repeated `which` lookups
- Add moveTo/clickButton/sendKey helpers to eliminate repeated mousemove boilerplate across 5 mouse actions
- Make scrollLinux respect amount/pixels options instead of hardcoded scroll count
- Have backLinux/homeLinux reuse sendKey instead of duplicating tool detection

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
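The caching idea in the first bullet can be illustrated with a small memoized resolver. This is a sketch under assumptions: `createInputToolResolver` and the injected `isOnPath` probe are hypothetical names, and the probe stands in for whatever `which`-style lookup linux-env.ts actually performs.

```typescript
type InputTool = "xdotool" | "ydotool";
type PathProbe = (binary: string) => boolean; // hypothetical stand-in for a `which` lookup

// Wraps the probe so tool detection runs at most once; later calls reuse the cached answer.
function createInputToolResolver(isOnPath: PathProbe): () => InputTool | null {
  let cached: InputTool | null | undefined; // undefined means "not probed yet"
  return () => {
    if (cached === undefined) {
      cached = isOnPath("xdotool") ? "xdotool" : isOnPath("ydotool") ? "ydotool" : null;
    }
    return cached;
  };
}
```

Injecting the probe also makes the detection logic testable without touching the real PATH.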
Add GitHub Actions workflow that boots a virtual X11 display (Xvfb), installs AT-SPI2 accessibility tooling and xdotool, opens gnome-calculator, takes screenshots, and captures an accessibility snapshot. Screenshots are uploaded as artifacts for visual verification. Also adds 'linux' to replay script metadata platforms and a test:replay:linux script to package.json. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
The replay runner requires an active session before any commands can run. Move the screenshot after the open command that creates the session. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
- Add gobject-introspection, libcairo2-dev, build-essential for node-gtk native compilation
- Split AT-SPI2 registry start into its own step so it picks up DBUS_SESSION_BUS_ADDRESS from GITHUB_ENV
- Set GTK_A11Y=atspi, GTK_MODULES=gail:atk-bridge, NO_AT_BRIDGE=0 to ensure GTK apps expose their accessibility tree on headless CI
- Set GSETTINGS_BACKEND=memory to avoid dconf failures
- Add node-gtk verification step to catch build failures early

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
pnpm install silently skips failed optional dependency builds and the pnpm cache may not include the native binary. Force a rebuild after install to ensure the node-gtk .node binding is compiled against the system GI/cairo headers. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
pnpm rebuild doesn't trigger node-pre-gyp properly for optional deps. Run node-pre-gyp install --fallback-to-build --update-binary directly inside the node-gtk package directory to force compilation when no prebuilt binary exists for the current Node ABI (v127 / Node 22). https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
node-gtk is a native C++ addon that requires compilation against specific Node ABI versions and GObject Introspection headers. This proved unreliable on CI (no prebuilt binaries for Node 22 ABI v127, silent optional dep build failures, pnpm cache staleness).

Replace it with a Python helper script (atspi-dump.py) that uses PyGObject, the reference GObject Introspection consumer. python3-gi is trivially installable on any Linux distro with no compilation step. The Node bridge spawns `python3 atspi-dump.py` and parses JSON output.

- Remove node-gtk from optionalDependencies
- Remove node-gtk.d.ts type stub
- Add atspi-dump.py (~200 lines) doing the same tree traversal
- Rewrite atspi-bridge.ts to use subprocess instead of in-process GI
- Simplify CI workflow: no more native build deps or rebuild steps

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
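The parsing half of that bridge could be sketched as below. The JSON envelope (`nodes` and `error` keys), the `RawSnapshotNode` fields, and the `parseDumpOutput` name are all assumptions made for illustration; the PR text does not show the script's actual output schema. Spawning the subprocess is left out so the example stays self-contained.

```typescript
// Hypothetical node fields; the real RawSnapshotNode type is defined elsewhere in the repo.
interface RawSnapshotNode {
  index: number;
  role: string;
  label: string | null;
  depth: number;
}

// Parses the helper's stdout, assuming either {"nodes": [...]} or {"error": "..."}.
function parseDumpOutput(stdout: string): RawSnapshotNode[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(stdout);
  } catch {
    throw new Error("atspi-dump.py did not return valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("unexpected payload from atspi-dump.py");
  }
  const payload = parsed as { error?: unknown; nodes?: unknown };
  if (typeof payload.error === "string") throw new Error(payload.error);
  if (!Array.isArray(payload.nodes)) throw new Error("payload has no nodes array");
  return payload.nodes.map((raw: unknown, index: number) => {
    const n = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;
    return {
      index,
      role: typeof n["role"] === "string" ? (n["role"] as string) : "unknown",
      // Coerce missing or non-string labels to null rather than leaking undefined.
      label: typeof n["label"] === "string" ? (n["label"] as string) : null,
      depth: typeof n["depth"] === "number" ? (n["depth"] as number) : 0,
    };
  });
}
```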
python3-gi, gir1.2-atspi-2.0, at-spi2-core, and dbus-x11 are already present on Ubuntu GitHub Actions runners. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
- Allow --surface desktop and --surface frontmost-app on Linux (previously only macOS could use --surface)
- Add unit tests for atspi-bridge (9 tests: JSON parsing, role normalization, null coercion, error handling, arg forwarding)
- Add unit tests for role-map (3 tests: common roles, case normalization, PascalCase fallback)
- Improve .py script path resolution (walk upward instead of hardcoded relative paths)
- CI replay test now asserts snapshot contains calculator UI nodes via is-exists

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
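The three role-map behaviours named in the tests (common roles, case normalization, PascalCase fallback) can be sketched like this. The table below is a tiny illustrative subset: the "color chooser" to "Dialog" entry is mentioned in a later commit message, while "push button" to "Button" and the `normalizeRole` name are assumptions.

```typescript
// Illustrative subset of the role table; the real map covers roughly 100 roles.
const ROLE_MAP = new Map<string, string>([
  ["push button", "Button"], // assumed mapping
  ["color chooser", "Dialog"], // mapping stated in a later commit message
]);

function normalizeRole(raw: string): string {
  const key = raw.trim().toLowerCase(); // case normalization
  const mapped = ROLE_MAP.get(key);
  if (mapped !== undefined) return mapped;
  // PascalCase fallback for unmapped roles, e.g. "drawing area" becomes "DrawingArea".
  return key
    .split(/[\s_-]+/)
    .filter((word) => word.length > 0)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join("");
}
```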
Document the shared schema, traversal rules, surface semantics, and normalized role types that all snapshot backends (Swift, Python, Android) must conform to. This serves as the single source of truth when adding or modifying platform backends. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
pnpm-lock.yaml still referenced node-gtk after it was removed from package.json, causing pnpm install --frozen-lockfile to fail in CI. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
- atspi-dump.py: use ctx dict for traversal limits instead of globals, fix rect filter (width/height <= 0 should use `or`), add surface validation
- input-actions.ts: make sendKey scancodes required to prevent silent no-op on ydotool, fix ydotool longPress/swipe to use click --down/--up
- app-lifecycle.ts: use pkill -x (exact match) instead of pkill -f
- linux-env.ts: emit diagnostic warning when falling back to xdotool on Wayland

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
Ubuntu runners may not have at-spi2-core, python3-gi, gir1.2-atspi-2.0, or dbus-x11 pre-installed. Install them explicitly instead of assuming they exist. Also make the verify step's tree dump non-fatal since no apps are running at that point. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
The selector parser tokenizes on whitespace, so `role=push button` was split into two tokens causing a parse failure. Use single quotes inside the selector: `role='push button'`. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
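To illustrate why quoting fixes the parse failure, here is a minimal whitespace tokenizer that treats quoted runs as part of a single token. It is a sketch, not the project's selector parser: `tokenizeSelector` is a hypothetical name, and stripping the quote characters from the resulting token is an assumption.

```typescript
// Splits on whitespace, except inside single or double quotes.
function tokenizeSelector(input: string): string[] {
  const tokens: string[] = [];
  let current = "";
  let quote: string | null = null;
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (quote !== null) {
      if (ch === quote) quote = null; // closing quote ends the quoted run
      else current += ch; // whitespace inside quotes is kept
    } else if (ch === "'" || ch === '"') {
      quote = ch;
    } else if (/\s/.test(ch)) {
      if (current !== "") {
        tokens.push(current);
        current = "";
      }
    } else {
      current += ch;
    }
  }
  if (quote !== null) throw new Error("unterminated quote in selector");
  if (current !== "") tokens.push(current);
  return tokens;
}
```

With this tokenizer, `role=push button` yields two tokens, while `role='push button'` yields one.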
appName is not a valid selector key. The supported keys are: id, role, text, label, value, visible, hidden, editable, selected, enabled, hittable. Simplified to use label and role only. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
- snapshot.ts: emit diagnostic warning when menubar surface is requested on Linux (falls back to desktop silently otherwise)
- SNAPSHOT_CONTRACT.md: fix unmapped role example to use a role that isn't actually mapped (was "color chooser" which maps to Dialog)

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
P0: Add explicit Linux capability matrix with 3-way platform routing (Apple/Linux/Android) in isCommandSupportedOnDevice. Linux now correctly blocks unsupported commands (clipboard, rotate, scrollIntoView, etc.) at capability level rather than throwing at runtime. Includes tests.

P0: Expand Linux CI to run typecheck + unit tests before smoke tests. Add AT-SPI2 registry health probe with fail-fast on missing registry.

P1: Harden atspi-dump.py: arg parsing now produces JSON errors on bad int values, and a top-level catch wraps unexpected exceptions in JSON.

P1: Add 10s per-action timeout to xdotool/ydotool input commands to prevent indefinite hangs.

P1: Tighten smoke test selectors to calculator-specific signals (digit labels) instead of generic role='push button'.

P2: Document Linux surface mapping, supported commands, and known limitations in SNAPSHOT_CONTRACT.md.

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
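The 3-way routing in the first P0 item might be structured along these lines. Everything in the table is illustrative: only the Linux column for `clipboard`, `rotate`, and `scrollIntoView` reflects this commit message, the Apple and Android values are placeholders, and the function signature shown for `isCommandSupportedOnDevice` is an assumption.

```typescript
type Platform = "ios" | "macos" | "android" | "linux";
type Family = "apple" | "linux" | "android";

// Illustrative matrix; Apple/Android values are placeholders, not the project's real table.
const CAPABILITIES = new Map<string, Record<Family, boolean>>([
  ["snapshot", { apple: true, android: true, linux: true }],
  ["click", { apple: true, android: true, linux: true }],
  ["clipboard", { apple: true, android: true, linux: false }],
  ["rotate", { apple: true, android: true, linux: false }],
  ["scrollIntoView", { apple: true, android: true, linux: false }],
]);

// Linux gets its own family instead of falling through to the Android rules.
function familyOf(platform: Platform): Family {
  if (platform === "linux") return "linux";
  if (platform === "android") return "android";
  return "apple";
}

function isCommandSupportedOnDevice(command: string, device: { platform: Platform }): boolean {
  const row = CAPABILITIES.get(command);
  return row !== undefined && row[familyOf(device.platform)];
}
```

Rejecting at the capability level lets handlers fail early with a consistent message rather than reaching an interactor that throws at runtime.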
Linux snapshots were bypassing snapshotInteractiveOnly and snapshotDepth filtering that macOS-helper gets via shapeDesktopSurfaceSnapshot. Route Linux through the same function so snapshot -i and --depth flags work. Renamed shapeMacOsSurfaceSnapshot → shapeDesktopSurfaceSnapshot since it's now shared between macOS and Linux desktop backends. https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
- app-lifecycle.ts: emit diagnostic on fire-and-forget app launch failure instead of silently swallowing errors
- linux-env.ts: make xdotool on Wayland a hard error instead of a broken fallback (xdotool doesn't work on Wayland)
- atspi-bridge.ts: increase Python subprocess timeout from 15s to 30s for safety on slow/loaded systems with large a11y trees

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
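The hard-error behaviour for Wayland can be sketched as follows. The function names and error wording are hypothetical, and the detection heuristic (checking `WAYLAND_DISPLAY`, `XDG_SESSION_TYPE`, and `DISPLAY`) is an assumption about how linux-env.ts might identify the display server, not a description of its actual logic.

```typescript
type DisplayServer = "x11" | "wayland";
type InputTool = "xdotool" | "ydotool";

// Assumed heuristic based on standard session environment variables.
function detectDisplayServer(env: Record<string, string | undefined>): DisplayServer | null {
  if (env["WAYLAND_DISPLAY"] || env["XDG_SESSION_TYPE"] === "wayland") return "wayland";
  if (env["DISPLAY"]) return "x11";
  return null;
}

// Never hands back xdotool for a Wayland session; a missing ydotool is a hard error.
function selectInputTool(server: DisplayServer, installed: string[]): InputTool {
  const has = (tool: string): boolean => installed.indexOf(tool) !== -1;
  if (server === "wayland") {
    if (has("ydotool")) return "ydotool";
    throw new Error("ydotool is required on Wayland; xdotool does not work there");
  }
  if (has("xdotool")) return "xdotool";
  throw new Error("xdotool is required on X11");
}
```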
Selectors:
- Add appname and windowtitle as selector keys for desktop platforms. Both macOS and Linux snapshots already populate these fields; they are now usable in selector expressions (e.g., "label=OK appname=Calc"). Keys are case-insensitive.

Clipboard:
- Implement readLinuxClipboard/writeLinuxClipboard using xclip/xsel (X11) or wl-copy/wl-paste (Wayland) with descriptive TOOL_MISSING errors. Enable clipboard in Linux capability matrix. 7 unit tests.

Input action tests:
- Add 18 unit tests covering xdotool and ydotool code paths: press, right/middle click, double click, sendKey, type, scroll, swipe, focus, fill. Tests mock runCmd and verify correct tool + args.

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
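Case-insensitive key handling for selector terms could look like the sketch below. The key list combines the keys named in an earlier commit message with the two added here; `parseSelectorTerm` is a hypothetical name, and treating every term as `key=value` is an assumption (the real grammar may also accept bare flags).

```typescript
// Keys listed in the commit messages, including the new desktop-only ones.
const SELECTOR_KEYS = [
  "id", "role", "text", "label", "value", "visible", "hidden",
  "editable", "selected", "enabled", "hittable", "appname", "windowtitle",
];

interface SelectorTerm {
  key: string;
  value: string;
}

// Lowercases the key before validation so "AppName=Calc" and "appname=Calc" are equivalent.
function parseSelectorTerm(token: string): SelectorTerm {
  const eq = token.indexOf("=");
  if (eq <= 0) throw new Error(`expected key=value, got "${token}"`);
  const key = token.slice(0, eq).toLowerCase();
  if (SELECTOR_KEYS.indexOf(key) === -1) throw new Error(`unknown selector key "${key}"`);
  return { key, value: token.slice(eq + 1) };
}
```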
…_info helper

Avoid repeated `which` calls on every screenshot/clipboard operation by caching the resolved tool on first use, matching the input-action pattern. Extract duplicated app_name/pid retrieval in atspi-dump.py into get_app_info.

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT
Add Linux as a first-class desktop automation platform using the AT-SPI2 accessibility framework, mirroring the macOS approach with accessibility snapshots.
Architecture
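Uses a Python subprocess (`atspi-dump.py`) with PyGObject bindings to capture the AT-SPI2 accessibility tree. This avoids native compilation issues (node-gtk) and leverages the reference GObject Introspection consumer that's available on all Linux desktops.

New platform files (`src/platforms/linux/`)
- `atspi-dump.py`: Python helper that traverses the AT-SPI2 tree and emits JSON
- `atspi-bridge.ts`: Spawns the Python helper and parses its JSON output into `RawSnapshotNode[]`
- `role-map.ts`: AT-SPI2 role normalization
- `snapshot.ts`: Snapshot entry point; maps `SessionSurface` to AT-SPI2 `SnapshotSurface`
- `input-actions.ts`: Input actions via `xdotool` (X11) / `ydotool` (Wayland)
- `screenshot.ts`: Screenshot capture via `scrot`/`grim` with fallback chains
- `app-lifecycle.ts`: App lifecycle management (open, close, back, home)
- `linux-env.ts`: Cached display server and input tool detection
- `devices.ts`: Local device discovery for Linux

Integration changes
- Extended `Platform` type with `'linux'`, `SnapshotBackend` with `'linux-atspi'`
- `isDesktopBackend()` helper consolidating macOS + Linux treatment
- Surface support on Linux (`app`, `desktop`, `frontmost-app`)
- Added `'linux'` to replay metadata platforms and CLI `--platform` flag

Tests & CI
- Unit tests for `atspi-bridge.ts` (9 tests) and `role-map.ts` (3 tests)
- GitHub Actions workflow (`.github/workflows/linux.yml`): Xvfb + D-Bus + AT-SPI2 registry

Documentation
- `src/platforms/SNAPSHOT_CONTRACT.md`: Cross-platform traversal contract documenting output schema, traversal rules, surface semantics, and normalized role types

https://claude.ai/code/session_01H9hrmueNF5pcBM8JeX81mT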