Skip to content

Conversation

@SimonHeybrock
Copy link
Member

@SimonHeybrock SimonHeybrock commented Dec 11, 2025

Summary

Implements a workflow status and control widget that replaces the old job-based UI with a workflow-centric interface. Each workflow is displayed as a collapsible card showing status, timing, controls, and configuration staging area.

Closes #563

Example

image

Notes

This will probably fully replace the dropdown-based workflow creation widget in the sidebar and likely also the JobStatusListWidget (first tab).

Key Features

WorkflowStatusWidget - Per-workflow control panel:

  • Color-coded status badge (active/stopped/error/pending/paused)
  • Timing information (time since start, last heartbeat)
  • Action buttons: Configure, Start, Stop, Reset
  • Staging area with config toolbars grouped by identical configurations
  • Commit button appears when staged config differs from active config
  • Output chips showing available workflow outputs
  • Stale status detection via heartbeat timeout

WorkflowStatusListWidget - Manages multiple workflow widgets:

  • Expand/collapse all buttons for bulk operations
  • Persists expand/collapse state per workflow
  • Scrollable container for many workflows

Architecture Enhancements

Widget Lifecycle Subscriptions

Added WidgetLifecycleCallbacks to JobOrchestrator for cross-session synchronization. Widgets subscribe to events (on_staged_changed, on_workflow_committed, on_workflow_stopped) so all browser sessions stay synchronized when any session modifies state.

Staging Transactions

Added staging_transaction() context manager to batch multiple staging operations (e.g., clearing and re-staging configs) into a single notification. This prevents N+1 UI rebuilds when replacing all staged configs.

Backend Status as Source of Truth

Widget uses JobService status as authoritative state, rather than maintaining parallel frontend state. This ensures consistency across sessions and prevents drift between UI and backend.

Documentation

Added .claude/rules/dashboard-widgets.md documenting the cross-session synchronization pattern to prevent future widgets from breaking multi-session support.

SimonHeybrock and others added 18 commits December 13, 2025 05:45
Implements a new collapsible card widget for displaying and controlling
workflow status. Each workflow shows:
- Header with title, status badge, timing info, and reset/stop buttons
- Configuration section with toolbars grouped by identical configs
- Output chips showing available workflow outputs
- Commit button when staged config differs from active config

Key features:
- Expand/collapse via button in header
- Gear button opens config modal, remove button unstages sources
- Yellow border indicates modified configs, yellow background for unconfigured
- Incremental status/timing updates to avoid flicker

Also adds get_workflow_registry() method to JobOrchestrator for widget access.

Prompt: Please think through #563. job-list-widget.png shows the existing
JobStatus(List)Widget which this work will eventually replace. We should keep
working ideas and design (buttons should follow other toolbars, as mentioned
in issue). Please come up with a suggestion (drawing, or simple html mockup?).
As me one question at a time about unclear things. Avoid going into any details
that are not relevant for an initial implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
The configuration modal was committing workflows immediately instead of
staging changes for review. This caused two issues:
- Widget didn't rebuild to show staged config changes
- Removing/adding sources via modal had no visible effect until manual commit

Changes:
- Add WorkflowController.stage_workflow() to stage configs without committing
- Add WorkflowStatusListWidget.rebuild_widget() to trigger display updates
- Update create_workflow_adapter() to use stage_workflow in callback
- Modal now stages changes and rebuilds widget on success
- User reviews staged changes and clicks "Commit & Restart" to apply

This restores the intended staging/review workflow:
1. Configure sources via modal → staged (not committed)
2. Review changes in widget (yellow border shows modifications)
3. Commit when ready via "Commit & Restart" button

Original prompt: Please have a look at the latest commit, we just added
WorkflowStatusWidget. screenshot-latest.png shows the current state. A couple
of things are not working yet:
- Removing a source name from an existing config does not update the row, nor
  do we get a row with unconfigured source names.
- Configuring unconfigured sources does not add a new row.
Essentially, configuring does not seem to have any effect.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Add ability to stop active workflows from the UI, addressing the issue where
workflows appear active but backend is not running. Users can now manually
clear false active states and regain control.

Changes:
- Add JobOrchestrator.stop_workflow() to send stop commands and clear local state
- Add WorkflowController.stop_workflow() wrapper for controller layer
- Implement WorkflowStatusWidget._on_stop_click() to handle stop button clicks
- Add comprehensive test suite (6 tests) covering stop workflow behavior
- Preserve staged configuration after stop for easy restart
- Persist stopped state to config store

The stop button:
- Sends stop commands to backend (works even if backend is down)
- Clears local active job state immediately
- Preserves staged config for restart
- Updates UI to show stopped state

This unblocks stuck workflows where JobOrchestrator believes job is active
but no backend status messages arrive (e.g., backend not running).

Phase 2 will address the root cause by changing state determination to rely
on JobStatus from backend rather than local JobOrchestrator state.

Original prompt: "Help me understand how WorkflowStatusWidget determines the job
state. Currently I see cases where it displays ACTIVE | Starting... for job where
the backend is not up. My guess is that of JobOrchestrator has a "current" configured
job it is considered as ACTIVE, even if we do not have a status heartbeat from the
backend? And there is no mechanism that auto-starts if the backend is available."

Follow-up: "Yes, please implement phase 1. I support JobOrchestrator also needs to
notify PlotOrchestrator that the workflow has stopped (so it can remove the plot)?"
Decision: Keep current behavior (plots remain visible after stop) for now.
…ruth (Phase 2)

Introduce SCHEDULED state to distinguish between workflows where commands
have been sent vs. workflows actually running on the backend. This fixes
the issue where workflows showed "ACTIVE | Starting..." indefinitely when
the backend was not running.

State determination now follows this logic:
- STOPPED: No job configured in orchestrator
- SCHEDULED: Job configured but no backend status messages received
- ACTIVE/ERROR/WARNING/PAUSED: Based on backend JobStatus messages

Changes:
- Add SCHEDULED state to WorkflowWidgetStyles.STATUS_COLORS (blue)
- Modify _get_workflow_status() to check backend status as source of truth
- Update _get_timing_text() to show "Waiting for backend..." for SCHEDULED
- Add 5 tests covering SCHEDULED state transitions and behavior

Benefits:
- Accurate state: UI reflects actual backend state, not frontend beliefs
- No false active state: Backend down = SCHEDULED, not ACTIVE
- Clear feedback: User knows workflow is waiting for backend
- Unblocks restart: Stop button works to clear SCHEDULED state

This addresses the root cause identified in Phase 1 where JobOrchestrator's
local state (.current) was treated as the source of truth even when no
backend was running.

Phase 2.5 will add heartbeat timeout mechanism to auto-transition ACTIVE
to SCHEDULED when backend stops responding.

Original issue: "I see cases where it displays ACTIVE | Starting... for job
where the backend is not up. My guess is that JobOrchestrator has a 'current'
configured job it is considered as ACTIVE, even if we do not have a status
heartbeat from the backend?"
Implement automatic detection of stale backend status to handle cases where
the backend stops responding (crashes, network issues, etc.). Workflows now
automatically transition from ACTIVE to SCHEDULED when status heartbeats
exceed the timeout threshold (60 seconds).

Changes:
- Add timestamp tracking in JobService.status_updated() for each status update
- Add JobService.is_status_stale() method to check if status is older than timeout
- Modify WorkflowStatusWidget to check staleness when determining workflow state
- Add STATUS_HEARTBEAT_TIMEOUT_NS constant (60 seconds)
- Add 3 tests covering heartbeat timeout behavior

Behavior:
- Fresh status (< 60s old): Treated as valid, shows ACTIVE/ERROR/WARNING/PAUSED
- Stale status (> 60s old): Treated as missing, shows SCHEDULED
- Missing status: Treated as SCHEDULED

This completes the state management improvements:
- Phase 1: Stop button to manually clear false active states
- Phase 2: Use backend status as source of truth (SCHEDULED state)
- Phase 2.5: Auto-detect stale heartbeats and transition to SCHEDULED

Benefits:
- Automatic recovery from backend crashes/restarts
- Clear feedback when backend becomes unresponsive
- No manual intervention needed to detect backend failures
- Configurable timeout threshold for different deployment scenarios

Original request: "Yes, please implement phase 1. [...] Then summarize our plan
for phase 2 to remind me."
Follow-up: "Sounds good, please implement. Commit when tested, then proceed with
phase 2.5."
Change workflow state from SCHEDULED to PENDING to more accurately reflect
what's actually happening: we send commands to Kafka and wait for backend
response, but we don't actively reschedule or retry.

Rationale:
- SCHEDULED implies active rescheduling mechanism (which we don't have)
- PENDING accurately describes "command sent, waiting for response"
- Backend execution depends on Kafka message persistence and consumer config
- We rely on Kafka's queue, not our own scheduling logic

Changes:
- Rename 'scheduled' to 'pending' in STATUS_COLORS (blue color unchanged)
- Update all docstrings: SCHEDULED → PENDING
- Update all test names and assertions
- Update comments mentioning "scheduled"

No behavioral changes - purely semantic clarification.

The state flow remains:
STOPPED → PENDING → ACTIVE
         (command sent,  (backend
          waiting)        confirmed)

Discussion context: "I wonder if SCHEDULED is correct when backend is not
reachable. After all, we never try to re-schedule, right? So should the UI
change? Or should we keep this and (as a future improvement) add a mechanism
that tries to re-schedule every once in a while?"

Decision: Rename to PENDING as it's more accurate. Future enhancement could
add auto-retry mechanism and introduce true SCHEDULED state if needed.
Add "Expand all" and "Collapse all" buttons to the workflow status list
header for quickly expanding or collapsing all workflow cards at once.

- Add set_expanded() method to WorkflowStatusWidget for programmatic control
- Add header row with expand/collapse all buttons aligned to the right
- Follow existing button styling patterns (light type, border, hover effect)
- Add tests for expand/collapse all functionality

Prompt: Please enhance WorkflowStatusListWidget by adding "Expand all" and
"Collapse all" buttons. Follow existing styling.

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Replace full widget rebuilds with targeted property updates for smoother
expand/collapse transitions:

- Add _update_expand_state() to update only button, visibility, and border
- Store references to _expand_btn, _header, _body for direct updates
- Use pn.io.hold() in expand/collapse all to batch DOM changes

Prompt: This (and also open/close of individual cards) is a bit "janky".
Can we update in a smarter way? Or would a pn.io.hold help?

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove WorkflowController dependency from WorkflowStatusWidget and
WorkflowStatusListWidget. The widget now uses JobOrchestrator directly
for stop/commit/config operations.

Changes:
- Revert workflow_controller.py to main (remove stage_workflow and
  stop_workflow methods added in this branch)
- Remove workflow_controller parameter from widget constructors
- Widget now calls orchestrator.stop_workflow() directly
- Add TODO comments where _build_widget() calls remain (to be replaced
  with lifecycle subscription callbacks in Phase 2)
- Update reduction.py and test fixtures

Phase 2 (TODO): Add lifecycle subscriptions to JobOrchestrator for
cross-session synchronization.

Prompt: "Please look through @docs/developer/plans/workflow-status-widget-architecture-fix.md - please perform the revert, and also strip all "bad"/"offending" code (e.g. in the widget) that needs to replaced (do not implement new bits, leave a TODO), or methods that will no longer be needed."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Implement the widget lifecycle subscription mechanism in JobOrchestrator
so that WorkflowStatusWidget instances across different browser sessions
stay synchronized when workflow state changes.

Changes:
- Add WidgetLifecycleCallbacks dataclass with on_staged_changed,
  on_workflow_committed, and on_workflow_stopped callbacks
- Add subscribe_to_widget_lifecycle() and unsubscribe_from_widget_lifecycle()
  methods to JobOrchestrator
- Add notification calls in stage_config(), clear_staged_configs(),
  commit_workflow(), and stop_workflow()
- Update WorkflowStatusWidget to subscribe to lifecycle events and rebuild
  via callback instead of direct _build_widget() calls after actions
- Add 10 tests for the new subscription mechanism

This completes Phase 2 of the WorkflowStatusWidget architecture fix as
described in docs/developer/plans/workflow-status-widget-architecture-fix.md

Prompt: Please think through @docs/developer/plans/workflow-status-widget-architecture-fix.md and perform the remaining work.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The WorkflowStatusWidget already rebuilds automatically via its subscription
to the JobOrchestrator's lifecycle events. When the modal's start_callback
invokes start_workflow() → commit_workflow(), all subscribed widgets are
notified and rebuild.

The manual rebuild_widget() call was redundant and caused a double rebuild.

This change:
- Removes the manual rebuild_widget(workflow_id) call
- Replaces with explanatory comment about automatic subscription-based rebuild
- Improves performance by eliminating the duplicate rebuild

Related to workflow-status-widget-architecture-fix.md Phase 2.
Adds a cleanup() method to properly unsubscribe from JobOrchestrator
lifecycle events when the widget is destroyed. This prevents memory
leaks from lingering subscriptions in dynamic widget scenarios.

Changes:
- Add cleanup() method to WorkflowStatusWidget
- Unsubscribes from orchestrator widget lifecycle events
- Documents that JobService cleanup isn't available (pre-existing limitation)
- Add test to verify cleanup unsubscribes correctly

The cleanup() method should be called when the widget is removed from
the UI. In typical usage (widgets live entire session), this isn't
critical, but it's good practice for proper resource management.

Note: JobService lacks an unregister mechanism, so that subscription
cannot be cleaned up. This is a pre-existing issue affecting all
widgets using JobService.

Related to workflow-status-widget-architecture-fix.md review feedback.
Adds JobOrchestrator.replace_staged_configs() to replace all staged configs
in a single operation with a single notification to subscribers. This
significantly reduces UI rebuilds during workflow start.

Before this change, start_workflow triggered N+2 widget rebuilds:
- clear_staged_configs() → rebuild 1
- stage_config() for source 1 → rebuild 2
- stage_config() for source 2 → rebuild 3
- ...
- stage_config() for source N → rebuild N+1
- commit_workflow() → rebuild N+2

After this change, it triggers only 2 rebuilds:
- replace_staged_configs() → rebuild 1 (all sources)
- commit_workflow() → rebuild 2

Changes:
- Add replace_staged_configs() method to JobOrchestrator
- Update WorkflowController.start_workflow() to use replace_staged_configs()
- Add 3 tests verifying single notification, config replacement, and copying

This improves performance and reduces UI flicker when starting workflows
with multiple sources.

Related to workflow-status-widget-architecture-fix.md review feedback.
…nism

Adds staging_transaction() context manager to JobOrchestrator to batch multiple
staging operations with a single notification. This replaces the bespoke
replace_staged_configs() method with a more general, future-proof pattern.

Benefits:
- Handles any combination of staging operations, not just specific use cases
- New operations automatically benefit from batched notifications
- Follows standard Python context manager pattern
- Cleaner API with fewer methods to maintain

Changes:
- Add staging_transaction() context manager to JobOrchestrator
- Support nested transactions for the same workflow
- Reject attempts to nest transactions for different workflows
- Update _notify_staged_changed() to defer notifications during transactions
- Update WorkflowController.start_workflow() to use transaction
- Update WorkflowStatusWidget._on_remove_click() to use transaction
- Remove deprecated replace_staged_configs() method
- Replace 3 replace_staged_configs tests with 4 comprehensive transaction tests

This fixes the N+1 notification problem in both start_workflow (N+2 → 2
notifications) and _on_remove_click (N+1 → 1 notification), and prevents
similar issues in future code.

Original prompt: Have a look at the latest commit and related code. Think about whether JobOrchestrator should have a transaction mechanism instead of some bespoke methods.

Follow-up: Yes, please make a todo list and implement!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@SimonHeybrock SimonHeybrock force-pushed the workflow-control-widget branch from d6f5183 to cb5c6fc Compare December 13, 2025 04:45
SimonHeybrock and others added 5 commits December 13, 2025 05:46
- Add create_workflow_adapter() to JobOrchestrator for creating workflow
  configuration adapters directly, eliminating need for WorkflowController
  in the WorkflowStatusWidget flow

- WorkflowStatusWidget now handles configuration modal internally:
  - Creates adapter via orchestrator.create_workflow_adapter()
  - Manages its own modal container (following PlotGridTabs pattern)
  - Removes dependency on injected on_configure callback

- Simplify reduction.py by removing:
  - on_configure_workflow callback
  - _workflow_config_modal_container
  - _cleanup_workflow_config_modal method

- Update tests to reflect new widget API

This change aligns WorkflowStatusWidget with the JobOrchestrator-centric
architecture, making the widget more self-contained and removing an
unnecessary layer of indirection through WorkflowController.

Prompt: on_configure_workflow in reduction.py seems misplaced. Can we move
this functionality into JobOrchestrator, without using WorkflowController
at all?

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@SimonHeybrock SimonHeybrock marked this pull request as ready for review December 13, 2025 05:41
@github-project-automation github-project-automation bot moved this to In progress in Development Board Dec 13, 2025
@SimonHeybrock SimonHeybrock moved this from In progress to Selected in Development Board Dec 13, 2025
SimonHeybrock and others added 7 commits December 15, 2025 05:00
The test assumed widgets start expanded, but commit ff55f1a changed
this to start collapsed. Update test to first expand widgets before
testing that collapse_all() works.

Prompt: "I think the lest commit broke tests/dashboard/widgets/workflow_status_widget_test.py::TestWorkflowStatusListWidget::test_collapse_all_collapses_all_widgets"
Update ConfigurationState schema to store per-source params and
aux_source_names instead of a single shared config. This enables
WorkflowStatusWidget to correctly manage different configurations
for disjoint subsets of sources.

Changes:
- Add JobConfigState model for per-source configuration
- Update ConfigurationState to use jobs: dict[str, JobConfigState]
- Add set_selected_sources() method to ConfigurationAdapter for
  scoping adapters to source subsets
- Update initial_source_names, initial_parameter_values, and
  initial_aux_source_names to respect scoped sources
- Update JobOrchestrator._load_configs_from_store to load per-source
  configs from new schema
- Update _save_workflow_config and _get_config_state to use new schema
- Update WorkflowController.get_workflow_config similarly
- Remove hasattr check in WorkflowStatusWidget._on_gear_click

The "expand on load" mechanism (applying single params to all sources)
is no longer needed as each source now has its own config stored.

Prompt: Please think through JobOrchestrator.create_workflow_adapter -
find it suspicious that it uses 'first_job_config = next(iter(
staged_jobs.values()))'. WorkflowStatusWidget allows for managing
different configs for disjoint subsets of source names. How is it
maintaining/getting the correctly configured adapter/ConfigurationWidget
for each?

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The start_callback in create_workflow_adapter was clearing ALL staged
configs before staging the selected sources. This caused independently
configured sources to lose their configs when another source was
configured via a scoped adapter.

Remove the clear_staged_configs() call so that stage_config() only
updates the selected sources, preserving configs for other sources.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Wrap _build_widget in pn.io.hold() to batch DOM updates into a single
browser transaction. Previously, the clear() followed by extend() caused
a visible empty state between operations.

Prompt: In workflow_status_widget.py there is significant "flicker" when
updating the config of a workflow. Any idea why?

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Wrap the Apply button handler in pn.io.hold() to batch all DOM updates:
config staging, modal close, and container cleanup into a single browser
transaction.

Prompt: Better, but there is still some odd flicker when closing the modal.
No flicker when "cancelling", but when "applying" it seems to whole screen
flickers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Inverts the data model so ConfigurationState represents a single source's
configuration (what JobConfigState was), and JobOrchestrator manages the
collection. This removes the mismatch where ConfigurationAdapter could only
work with one source's config but received a bundle of all sources.

Key changes:
- ConfigurationState now has params/aux_source_names (single source)
- Removed JobConfigState class (merged into ConfigurationState)
- Removed set_selected_sources() - replaced by initial_source_names param
- create_workflow_adapter() accepts selected_sources parameter directly
- Orchestrator/controller fall back to staged sources for initial selection

The persistence format remains unchanged ({"jobs": {...}}) - only the
Pydantic model semantics changed.

Prompt: "Please help me think through the changes in configuration_adapter.py
relating to JobOrchestrator in this branch. I wonder about the reasoning
behind making ConfigurationState wrap the config of multiple jobs."

Follow-up: User suggested inverting the design so JobOrchestrator stores the
collection and ConfigurationAdapter works with single-source config directly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@SimonHeybrock
Copy link
Member Author

We want to deploy this today and will thus merge, but review is still needed!

@SimonHeybrock SimonHeybrock merged commit 0d4df9c into main Dec 15, 2025
4 checks passed
@SimonHeybrock SimonHeybrock deleted the workflow-control-widget branch December 15, 2025 08:14
@github-project-automation github-project-automation bot moved this from Selected to Done in Development Board Dec 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Workflow status widget

1 participant