Two-node topology (TNA/TNF) workflow automation for OpenShift edge deployments.
Install via Claude Code's plugin system:
/plugin marketplace add openshift-eng/edge-tooling
/plugin install two-node
podmanfor running the MCP server containerJIRA_USERNAMEenvironment variable set with your Red Hat emailJIRA_API_TOKENenvironment variable set with a Jira API token
The plugin includes an .mcp.json that automatically configures the mcp-atlassian MCP server.
- Claude Code session open at the Two-Node Toolbox (TNT) repo (
two-node-toolbox/deploy/ortwo-node-toolbox/deploy/openshift-clusters/). Running from any other directory will result in an error. - EC2 instance running with
make inventorycompleted - EC2 configured (
./configure) and SSH-accessible - Pull secret at
deploy/openshift-clusters/roles/dev-scripts/install-dev/files/pull-secret.json(relative to repo root)
Create OCPEDGE stories for TNF RHEL verification tickets, link them to the RHEL bugs, and set the required components (Two Node Fencing, QE, RHEL-Verification).
# Auto-discover untested tickets
/two-node:create-rhel-stories
# Dry run (preview without changes)
/two-node:create-rhel-stories --dry-run
# Specific tickets
/two-node:create-rhel-stories RHEL-12345 RHEL-12346 RHEL-12347
# JQL query
/two-node:create-rhel-stories jql:project = RHEL AND component = "resource-agents" AND status != Closed
Features:
- Auto-discovery of untested TNF resource-agents RHEL tickets
- Clone expansion across the full clone tree
- Dry-run mode for previewing without modifying Jira
- Closed story handling (creates new stories for untested tickets)
- Subtask creation (verification + automation)
Verify a RHEL resource-agents bug fix on a TNF cluster. Given a JIRA ID, the skill fetches the full bug context from Jira (title, z-stream, upstream PR, linked OCPEDGE tracking ticket, test instructions), then walks through the verification workflow.
/two-node:verify-rhel-bugfix RHEL-157145
/two-node:verify-rhel-bugfix https://issues.redhat.com/browse/RHEL-157145
Workflow:
- Gather info from Jira — Fetches the RHEL ticket, follows links to upstream OCPBUGS bug (extracts PR, commit, author), finds OCPEDGE tracking ticket and sprint. Only asks the user for the RPM location and verification type.
- Check cluster — Runs
verify-cluster.shto check OCP version, RHCOS, pacemaker status, etcd health, and current RPM version - Patch nodes — Runs
patch-nodes.shto distribute the RPM, applyrpm-ostree override replace -C, reboot, and verify - Run test — Code-only verification (grep for fix code) or functional test (shutdown/fence/standby scenarios)
- Generate report — Produces a Markdown JIRA comment with environment, fix details, test results, and conclusion. Optionally posts it to the RHEL ticket via MCP.
Scripts (in scripts/):
verify-cluster.sh— Cluster health check (OCP, nodes, pcs, etcd, RPM versions)patch-nodes.sh— RPM patching with persistent override + reboot + verificationcollect-logs.sh— Collect pacemaker/etcd logs from both nodes
Prerequisites:
- SSH access to a hypervisor running a TNF cluster
- RPM with the fix downloaded locally (typically
~/Downloads/) HYPERVISORenv var ortwo-node-toolbox/submodule available for auto-detection
Diagnose TNF cluster issues (shutdown/recovery, etcd, fencing, network, operators). Gathers live cluster state via SSH, analyzes it against a knowledge base of 7 bare metal test scenarios, and reports findings with severity classification. Read-only — never modifies cluster state.
/two-node:cluster-diagnostic
/two-node:cluster-diagnostic validate "cordon, drain, then shutdown -h 1 on each node"
/two-node:cluster-diagnostic recovery-guide standby
/two-node:cluster-diagnostic recovery-guide full-shutdown
Modes:
- diagnose (default) — SSH to cluster, gather pcs/etcd/corosync/OCP state, analyze, report
- validate — Check a proposed procedure against known failure modes
- recovery-guide — Step-by-step recovery for:
standby,full-shutdown,single-node,network-partition,power-outage,rolling-restart,after-recovery,connectivity,etcd-nospace,pending-csr,split-brain,stale-data,partition-fencing-failure
Access patterns (auto-detected):
- Dev-scripts:
HYPERVISORenv var ortwo-node-toolbox/submodule - Bare metal:
NODE_0andNODE_1env vars (direct SSH to nodes) - Optional:
BMC_0/BMC_1/BMC_USER/BMC_PASSfor Redfish power state - Optional:
KUBECONFIGfor OCP-layer diagnostics (nodes, operators, CSRs, events) - Optional:
SSH_KEYfor SSH private key path (defaults to~/.ssh/id_rsa)
Environment file: Create ~/.tnf-cluster.env to avoid exposing credentials
in the conversation and to persist settings across sessions:
export NODE_0=10.1.155.141
export NODE_1=10.1.155.142
export BMC_0=<bmc-0-hostname>
export BMC_1=<bmc-1-hostname>
export BMC_USER=<bmc-username>
export BMC_PASS=<bmc-password>
export KUBECONFIG=<path-to-kubeconfig>Then run diagnostics with: source ~/.tnf-cluster.env && bash scripts/diagnose-cluster.sh
Scripts: diagnose-cluster.sh (in scripts/)
Acknowledgements: etcd split-brain detection, CIB recovery attribute checks, and stale data recovery scenarios were adapted from etcd troubleshooting documentation authored by Fonta and Carlo in two-node-toolbox and validated on bare metal (2026-05-28).
Automated OpenShift bug reproduction for Two-Node with Arbiter (TNA) and Two-Node with Fencing (TNF) topologies.
/two-node:bug-reproducer OCPBUGS-66217
One argument: a Jira issue key. The skill handles everything else:
- Bug Analysis -- Fetches the bug from Jira (description + comments), detects topology (arbiter or fencing), classifies bug category, extracts reproduction steps, detects install method (IPI/agent/kcli), and determines the OCP version. Stops if the bug is a test issue (not a product bug) or if the dev-scripts environment cannot reproduce the conditions.
- Cluster Deployment -- Updates the dev-scripts config, uploads day-0 manifests if needed, and runs the Ansible deployment playbook. Monitors deployment every 10 minutes for early failure detection. Cleans and retries on failure (with user approval).
- Cluster Ready -- Waits for all nodes Ready, MCPs updated, and COs healthy. Detects during-install bugs. Applies day-1 manifests if needed.
- Bug Reproduction -- Executes the reproduction steps extracted from the Jira bug on the healthy cluster. This is the core phase for most bugs (post-install steps like pcs commands, node reboots, backup/restore, oc apply, etc.).
- Log Collection -- Collects category-targeted logs (etcd, fencing, MCO, NTO, networking, etc.), rsyncs locally, and generates a findings report.
The cluster is always left running after the skill completes so the user can SSH in and inspect.
Supported topologies:
- arbiter -- Two-Node with Arbiter (TNA): 2 masters + 1 arbiter node
- fencing -- Two-Node with Fencing (TNF): 2 masters with BMC-based fencing
Output:
- Logs saved to
/tmp/two-node-bug-reproduce-<BUG_ID>/ - Findings report written to
docs/<bug-id-lowercase>-findings.mdin the TNT repo (e.g.,docs/ocpbugs-66217-findings.md) - Cluster left running for manual inspection
lucaconsalvi, nhamza