This roadmap defines what OpenAdapt-ML is today and what will be built next, with concrete implementation targets. It is written to guide both human contributors and autonomous coding agents.
OpenAdapt-ML provides:
- Canonical trajectory schema (Session → Episode → Step → Observation + Action)
- Synthetic UI generators (currently hardened login scenario)
- Next-action SFT dataset builder (strict CLICK/TYPE/DONE DSL)
- Model adapters (Qwen3-VL, Qwen2.5-VL, LoRA-enabled)
- Training loop (simple LoRA SFT)
- Offline evaluation (action accuracy, coord error, click hit rate)
- Runtime policy (regex-parsed Thought/Action output)
This stack is correct but minimal. The next steps expand scale, generality, and real-world usefulness.
This section is the canonical list of what to build, in order, with crisp acceptance criteria.
Why
Episode success rate was 0% across ALL models (base, fine-tuned, and API) despite fine-tuned models achieving up to 100% click hit rate. This was a critical evaluation bug that masked the true performance of the system.
Root Causes Identified
- Missing TYPE and WAIT action parsing (BUG)
  - `AgentPolicy._parse_action()` in `runtime/policy.py` only handled `CLICK` and `DONE`
  - `TYPE(text="...")` and `WAIT()` actions fell through to `Action(type="failed")`
  - Evidence from logs: 0% accuracy on TYPE (64 steps) and WAIT (32 steps)
- Overly strict episode success criterion
  - Any single action type mismatch fails the entire 7-step episode
  - Even with perfect CLICK accuracy, TYPE/WAIT failures guarantee 0% episode success
Fixes Implemented
- ✅ Added `_TYPE_RE` regex pattern: `r'TYPE\(text="([^"\\]*(?:\\.[^"\\]*)*)"\)'`
- ✅ Added `_WAIT_RE` regex pattern: `r"\bWAIT\s*\(\s*\)"`
- ✅ Updated `_parse_action()` to handle all four DSL actions
- ✅ Added `tests/test_action_parsing.py` with comprehensive regex and parsing tests
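A minimal sketch of the updated parsing logic, using the regexes above. The `CLICK` pattern and the simplified dict return type are assumptions for illustration; the real `_parse_action()` in `runtime/policy.py` returns `Action` objects.

```python
import re

# Patterns for the four DSL actions; the TYPE pattern tolerates escaped quotes and backslashes.
_CLICK_RE = re.compile(r"CLICK\(x=([0-9.]+),\s*y=([0-9.]+)\)")  # assumed CLICK argument format
_TYPE_RE = re.compile(r'TYPE\(text="([^"\\]*(?:\\.[^"\\]*)*)"\)')
_WAIT_RE = re.compile(r"\bWAIT\s*\(\s*\)")
_DONE_RE = re.compile(r"\bDONE\s*\(\s*\)")


def parse_action(text: str) -> dict:
    """Parse raw Thought/Action model output into a simple action dict (sketch)."""
    if m := _CLICK_RE.search(text):
        return {"type": "click", "x": float(m.group(1)), "y": float(m.group(2))}
    if m := _TYPE_RE.search(text):
        # Undo DSL escaping so \" and \\ round-trip correctly.
        return {"type": "type", "text": re.sub(r"\\(.)", r"\1", m.group(1))}
    if _WAIT_RE.search(text):
        return {"type": "wait"}
    if _DONE_RE.search(text):
        return {"type": "done"}
    return {"type": "failed", "raw": text}
```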
Acceptance Criteria
- ✅ `AgentPolicy._parse_action()` correctly parses all DSL actions: CLICK, TYPE, WAIT, DONE
- ✅ TYPE action text is properly unescaped (handles `\"` and `\\`)
- ✅ Unit tests cover all action types and edge cases
- ✅ Re-run evaluation to measure true episode success rate
Post-Fix Evaluation Results (2B Fine-tuned)
| Metric | Before Fix | After Fix | Change |
|---|---|---|---|
| action_type_accuracy | 0.2545 | 0.4330 | +70% |
| mean_coord_error | 0.0138 | 0.0112 | -19% |
| click_hit_rate | 0.9737 | 1.0000 | Perfect |
| episode_success_rate | 0.0 | 0.0 | No change |
Impact
The parser fix was confirmed successful:
- Click hit rate improved to 100% (was 97.4%)
- Action type accuracy improved by 70% (0.25 → 0.43)
Episode success rate remains 0% because this is now a model learning problem, not a parsing problem. The model is not predicting the correct action type sequences (e.g., predicting CLICK when ground truth is TYPE). With 43% action type accuracy, roughly 3 of 7 steps match per episode, which is insufficient for complete episode success.
Next Steps
The remaining episode success issue requires:
- Analysis of per-action-type accuracy to identify which types the model struggles with
- Potential improvements to training data, loss weighting, or training duration
- See Priority 1 (batching/schedulers) for training infrastructure improvements
Why
Current Qwen3 trainer enforces batch_size=1, blocking GPU throughput and scaling.
Build Targets
- True batching in `QwenVLAdapter.prepare_inputs` (see the sketch after this list)
  - Accept `list[dict]` batch input.
  - Use `processor.apply_chat_template([...], padding=True, truncation=True)` for multi-sample tokenization.
  - Compute assistant-only labels per sample.
  - Ensure correct padding masks and label alignment.
- Learning rate schedulers
  - Add `lr_scheduler_type: [linear, cosine, none]`.
  - Compute warmup steps from `warmup_ratio`.
- Run-directory logging
  - Every training run creates `runs/<timestamp>_<config>/` with:
    - Config snapshot
    - Step-wise loss JSONL
    - Optional periodic eval metrics
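As a rough illustration, batched input preparation could look like the sketch below. The function name, the sample fields (`messages`, `prompt_messages`), and the assumption of right-padding are illustrative rather than the actual `QwenVLAdapter` code, and `apply_chat_template` keyword support may vary across transformers versions.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss


def prepare_batch(processor, samples: list[dict]) -> dict:
    """Tokenize a batch of chat samples with padding and assistant-only labels (sketch).

    Each sample is assumed to carry `messages` (full conversation) and
    `prompt_messages` (the same conversation without the final assistant turn).
    NOTE: image handling is omitted; the real adapter also passes images through
    the processor.
    """
    # Full conversations, right-padded to the longest sample in the batch.
    full = processor.apply_chat_template(
        [s["messages"] for s in samples],
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
    )
    # Prompt-only conversations, used to locate where each assistant answer begins.
    prompt = processor.apply_chat_template(
        [s["prompt_messages"] for s in samples],
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
        add_generation_prompt=True,
    )

    labels = full["input_ids"].clone()
    labels[full["attention_mask"] == 0] = IGNORE_INDEX   # ignore padding tokens
    prompt_lengths = prompt["attention_mask"].sum(dim=1)
    for i, plen in enumerate(prompt_lengths.tolist()):
        labels[i, :plen] = IGNORE_INDEX                   # ignore prompt tokens
    full["labels"] = labels
    return full
```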
Acceptance Criteria
- Qwen3-VL trains with `per_device_train_batch_size` > 1.
- Loss curve stable.
- Configurable schedulers functional.
- Each run produces a self-contained log directory.
Why
We need a clean, reproducible, public example that demonstrates LoRA fine-tuning improving GUI grounding.
Build Targets
- Stable eval JSON schema
  - Versioned output containing: metrics, run metadata, backend, config path.
- Golden benchmark results
  - Commit eval outputs for:
    - Qwen3-VL-2B base vs LoRA-FT
    - Qwen3-VL-8B base vs LoRA-FT
- Plotting upgrade ✅ (implemented and exceeded)
  - Comprehensive multi-model comparison plots with legends
  - Color-coded bars: blue (Qwen 2B/8B), orange (Claude API), red (GPT API)
  - Hatching patterns: solid (base/pretrained), diagonal stripes (fine-tuned)
  - Four key metrics per plot: action type accuracy, coord error, click hit rate, episode success
  - Supports arbitrary model combinations (base vs FT, offline vs API, comprehensive comparisons)
- Documentation page
  - `docs/qwen_login_experiment.md` describing:
    - Scenario
    - Training setup
    - Evaluation metrics
    - LoRA improvement plots
Acceptance Criteria
- Running `uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark --config configs/qwen3vl_synthetic_dev.yaml --out-dir experiments/qwen_login/2b_dev` completes without error on a supported environment (e.g. CUDA GPU or Apple Silicon / CPU) using the documented config.
- The command above produces at least:
  - `experiments/qwen_login/2b_dev/eval/eval_base.json`
  - `experiments/qwen_login/2b_dev/eval/eval_ft.json`
  - `experiments/qwen_login/2b_dev/plots/base_vs_ft.png`
- Each eval JSON contains a top-level `metrics` object with: `num_episodes`, `num_steps`, `action_type_accuracy`, `mean_coord_error`, `coord_error_count`, `episode_success_rate`, `click_hit_rate`.
- For the hardened 2B dev config, `action_type_accuracy_ft - action_type_accuracy_base` is non-negative and typically >= 0.20 (LoRA does not regress vs. base).
- Documentation of the login benchmark is linked from the README.
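For illustration only, an eval JSON meeting these criteria might look like the following (shown as a Python literal). Only the `metrics` keys are mandated above; `schema_version`, `backend`, and `config_path` are assumed field names for the versioning and run metadata, and all numeric values are placeholders.

```python
example_eval = {
    "schema_version": 1,                                  # hypothetical version field
    "backend": "qwen3vl-2b-lora",                         # run metadata (illustrative)
    "config_path": "configs/qwen3vl_synthetic_dev.yaml",
    "metrics": {
        "num_episodes": 16,
        "num_steps": 112,
        "action_type_accuracy": 0.43,
        "mean_coord_error": 0.011,
        "coord_error_count": 48,
        "episode_success_rate": 0.0,
        "click_hit_rate": 1.0,
    },
}
```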
Why
Today the system only tests login. A second scenario demonstrates robustness and multi-task capacity.
Build Targets
- Settings Panel Generator
  - Multiple toggles.
  - Save/Cancel buttons.
  - Layout jitter + decoys like login.
- Scenario mixing
  - Extend `generate_synthetic_sessions` with:
    - `scenario: ["login", "settings", "mixed"]`
    - `workflow_id` tagging
- Multi-scenario training configs
  - `qwen3vl_multi_scenario.yaml`
- Cross-scenario evaluation matrix
  - Train on: login-only, settings-only, mixed.
  - Eval on both scenarios.
  - Produce generalization heatmaps (see the sketch after this list).
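One possible way to render the generalization heatmap with matplotlib; all accuracy values below are placeholders, not measured results.

```python
import matplotlib.pyplot as plt
import numpy as np

# Rows: training mix, columns: held-out eval scenario. Placeholder accuracies only.
train_sets = ["login-only", "settings-only", "mixed"]
eval_sets = ["login", "settings"]
accuracy = np.array([
    [0.80, 0.35],
    [0.30, 0.75],
    [0.78, 0.74],
])

fig, ax = plt.subplots(figsize=(4, 3))
im = ax.imshow(accuracy, vmin=0.0, vmax=1.0, cmap="viridis")
ax.set_xticks(range(len(eval_sets)), eval_sets)
ax.set_yticks(range(len(train_sets)), train_sets)
ax.set_xlabel("eval scenario")
ax.set_ylabel("training data")
for i in range(len(train_sets)):
    for j in range(len(eval_sets)):
        ax.text(j, i, f"{accuracy[i, j]:.2f}", ha="center", va="center", color="white")
fig.colorbar(im, ax=ax, label="action_type_accuracy")
fig.tight_layout()
fig.savefig("generalization_heatmap.png", dpi=150)
```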
Acceptance Criteria
- Synthetic generator produces both scenarios deterministically.
- Eval matrix visualizes cross-scenario performance.
- Mixed model shows measurable generalization:
  - On held-out settings episodes, a model trained on login+settings achieves at least 0.05 higher `action_type_accuracy` than a login-only model, and symmetrically for settings-only vs mixed.
Why
Synthetic-only data is useful for unit tests; real-world workflows are the end goal.
Status: DONE
Implementation uses openadapt-capture recordings (the modern capture tool) rather than the deprecated legacy OpenAdapt database.
Completed
- `openadapt_ml/ingest/capture.py` ingestion module
  - Maps openadapt-capture recordings → Episode/Step/Action
  - Extracts screenshots from video or the `screenshots/` directory
  - Maps events to CLICK/TYPE/DONE actions (see the sketch after this list)
  - Goals derived from directory name or specified via `--goal`
- Training integration
  - `--capture` flag in `train.py` to train on real recordings
  - Viewer and comparison tools work with real captures
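A simplified sketch of the event-to-action mapping performed during ingestion. The event field names below are assumptions about the openadapt-capture format; the actual logic lives in `openadapt_ml/ingest/capture.py`.

```python
def event_to_action(event: dict) -> dict | None:
    """Map one raw capture event to a canonical action dict (simplified sketch)."""
    if event["type"] == "mouse_click":
        # Normalize pixel coordinates to [0, 1] relative to the screenshot size.
        return {
            "type": "click",
            "x": event["x"] / event["screen_width"],
            "y": event["y"] / event["screen_height"],
        }
    if event["type"] == "key_sequence":
        return {"type": "type", "text": event["text"]}
    if event["type"] == "session_end":
        return {"type": "done"}
    return None  # unmodeled events (mouse moves, scrolls, ...) are skipped
```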
Acceptance Criteria
- Real openadapt-capture recordings load cleanly into canonical schema. ✓
- Training pipeline works end-to-end with captures. ✓
Why
Before introducing cloud orchestration, we want a clean way to run the same
benchmarks against hosted VLM APIs.
Status
Implementation complete:
- Configuration System
  - Pydantic-settings based configuration (`openadapt_ml/config.py`)
  - `.env` file support for API key management (`.env.example` provided)
  - Priority chain: explicit parameter > `.env` settings > environment variables > raise error
  - API keys: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`
- API Adapters
  - `ApiVLMAdapter` (`openadapt_ml/models/api_adapter.py`) wraps:
    - Anthropic Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`)
    - OpenAI GPT-5.1 (`gpt-5.1`)
  - Inference-only adapters implementing the `generate()` method (see the sketch after this list)
- CLI Integration
  - `scripts/eval_policy.py` supports `--backend claude` / `--backend openai`
  - `scripts/run_qwen_login_benchmark.py` supports `--include-claude`, `--include-openai`, or `--include-all-apis`
- Visualization
  - Comprehensive comparison plots with legends (`plot_eval_metrics.py`)
  - Color-coded bars: blue (Qwen 2B/8B), orange (Claude), red (GPT)
  - Hatching patterns: solid (base/pretrained), diagonal stripes (fine-tuned)
  - All evaluation plots support multi-model comparison
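A hedged sketch of what the Claude-backed `generate()` path could look like. Prompt construction and the `sample` keys (`prompt`, `image_path`) are simplified assumptions; the real adapter lives in `openadapt_ml/models/api_adapter.py`.

```python
import base64

import anthropic


class ClaudeAdapterSketch:
    """Inference-only adapter sketch: send goal + screenshot, return raw model text."""

    def __init__(self, model: str = "claude-sonnet-4-5-20250929") -> None:
        # API key is read from ANTHROPIC_API_KEY (or injected via the config system).
        self.client = anthropic.Anthropic()
        self.model = model

    def generate(self, sample: dict) -> str:
        with open(sample["image_path"], "rb") as f:
            image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
        response = self.client.messages.create(
            model=self.model,
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                    {"type": "text", "text": sample["prompt"]},
                ],
            }],
        )
        # Return the raw text so the runtime policy can parse the Thought/Action DSL.
        return response.content[0].text
```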
Acceptance Criteria (all met)
- ✅ `ApiVLMAdapter` can be dropped into `AgentPolicy` without code changes
- ✅ Local API eval CLI produces metrics JSONs compatible with `plot_eval_metrics.py`
- ✅ `ApiVLMAdapter` implements `generate(sample: dict) -> str` and returns the raw model text (no post-processing beyond what the remote API already performs)
- ✅ Configuration system with `.env` support and clear priority chain
- ✅ Comprehensive comparison plots with legends for multi-model evaluation
Future Extensions (optional)
- Add support for additional API providers as needed (e.g., Gemini, other Claude/GPT versions)
- Provider-specific configuration options (temperature, top_p, etc.)
- Richer logging for API calls (token usage, latency metrics)
Why
Lambda is useful for lightweight compute orchestration and API-backed
inference, but not for GPU training.
Build Targets (stretch)
- Lambda inference endpoint (see the handler sketch after this list)
  - Input: `{goal, image_s3_uri}`.
  - Lambda:
    - Downloads image.
    - Builds SFT-style prompt.
    - Calls API-backed adapter.
    - Returns parsed Action JSON.
- Synthetic generation Lambda (optional)
  - Parallel generation of synthetic batches → S3.
- Training orchestration Lambda (optional)
  - Trigger ECS/SageMaker GPU jobs from configs.
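A rough handler sketch under these constraints. The event shape, the S3 URI parsing, and the reuse of the adapter and parser sketches from earlier in this roadmap are all illustrative, not a committed design.

```python
import json

import boto3

s3 = boto3.client("s3")


def handler(event: dict, context) -> dict:
    """AWS Lambda entry point: goal + screenshot S3 URI in, parsed action out (sketch)."""
    body = json.loads(event["body"])
    goal = body["goal"]
    bucket, key = body["image_s3_uri"].removeprefix("s3://").split("/", 1)

    # Download the screenshot; no model weights are loaded inside the Lambda.
    local_path = "/tmp/" + key.rsplit("/", 1)[-1]
    s3.download_file(bucket, key, local_path)

    adapter = ClaudeAdapterSketch()                       # API-backed adapter sketch above
    raw = adapter.generate({"prompt": goal, "image_path": local_path})
    action = parse_action(raw)                            # DSL parser sketch above

    return {"statusCode": 200, "body": json.dumps({"action": action, "raw": raw})}
```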
Non-goal
- No local model loading in Lambda (no GPUs, slow cold starts).
Acceptance Criteria (stretch)
- Public Lambda endpoint returns a structured `Action` for any uploaded screenshot.
- Adapters work interchangeably: Qwen local vs API remote.
Build Targets
- CI (GitHub Actions)
  - `uv sync`.
  - `pytest`.
  - `ruff` lint.
- Critical tests
  - Action parser regex.
  - Adapter `prepare_inputs` (mock tokens).
  - Metric correctness tests.
- Style consistency
  - Enforced `ruff` + `black`.
  - `CONTRIBUTING.md` updated.
Acceptance Criteria
- Every PR triggers CI pipeline.
- Adapters + metrics covered by unit tests.
This is the order coding agents should follow unless explicitly overridden:
- Priority 0: Fix Episode Success Rate ✅ (parsing fix DONE, but 0% success persists)
- Priority 0.1: Validate prompts on known benchmark ⚠️ NEW
  - Test on one OSWorld or WebVoyager task to compare against published numbers
  - Ensure prompts and action extraction are working correctly
  - Reference: TTI repo (`scripts/prompts/create_prompt_json.py`)
- Priority 0.2: Establish upper bound with larger models ⚠️ NEW
  - Prompt Qwen 32B and frontier APIs (Claude, GPT) on synthetic benchmark
  - If larger models also fail, the problem is in prompts/action format
  - If larger models succeed, smaller models need more training data or better architecture
- Priority 0.3: Achieve >0% episode success ⚠️ BLOCKING
  - This is the gate for all other work
  - Without task completion, all other metrics are noise
- Priority 1: Batching + schedulers + logging.
- Priority 2: Publishable login benchmark (only after >0% success).
- Priority 3: Second synthetic scenario + generalization.
- Priority 4: Real-data ingestion + eval.
- Priority 5a: API adapter + local CLI ✅ (DONE).
- Priority 5b: Lambda orchestration (stretch).
- Priority 6: CI + tests + repo hygiene (continuous).
We skipped the essential first step: validating prompts work on known benchmarks before fine-tuning. The correct order is:
prompts → API baselines → base model comparison → fine-tuning
See docs/internal/vision-notes.md for expert feedback details.
These rules are explicit so agents behave predictably and avoid breaking core contracts:
- DSL stability
  - Do not change the DSL grammar (`CLICK`, `TYPE`, `WAIT`, `DONE`) or argument names without:
    - updating all adapters and the runtime parser, and
    - extending parser tests to cover the new forms.
  - Backward-incompatible changes must bump a `dsl_version` field wherever it is serialized.
- Schema stability
  - Always use the canonical schema (`Session` / `Episode` / `Step` / `Observation` / `Action`).
  - Do not rename these types or their core fields; extensions must be additive (new optional fields) rather than destructive.
- Adapter contract
  - All VLM backends must implement the `BaseVLMAdapter` interface (`prepare_inputs`, `compute_loss`, `generate`); a sketch of this interface appears after this list.
  - Do not change method signatures; add new behavior via kwargs or new helper methods instead.
- Synthetic scenario invariants
  - All new scenarios must use:
    - Layout jitter.
    - At least one decoy element.
    - Deterministic random seeds for reproducible benchmarks.
- Eval invariants
  - All new eval CLIs must reuse the existing trajectory-matching metrics (action type accuracy, coord error, episode success rate, click hit rate), or extend them in a strictly additive way.
  - Policies and adapters must not rewrite or normalize DSL text (no JSON wrapping, added prefixes like `Action: CLICK(...)`, or whitespace rewriting) beyond strict parsing into an `Action`; the original output must be preserved in logs / `Action.raw`.
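For reference, the adapter contract can be pictured roughly as the abstract interface below. Exact signatures in the codebase may differ, so treat this as a sketch rather than the definitive API.

```python
from abc import ABC, abstractmethod


class BaseVLMAdapterSketch(ABC):
    """Sketch of the contract every VLM backend implements."""

    @abstractmethod
    def prepare_inputs(self, samples: list[dict]) -> dict:
        """Convert canonical samples into model-ready tensors (tokens, images, labels)."""

    @abstractmethod
    def compute_loss(self, batch: dict):
        """Run a forward pass and return the training loss for a prepared batch."""

    @abstractmethod
    def generate(self, sample: dict) -> str:
        """Produce raw Thought/Action text for a single sample (inference)."""
```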