Skip to content

Add GitHub issue creation on QA deployment failure#4

Merged
JWhitleyWork merged 1 commit into
mainfrom
fixes-for-objective-running
Jun 22, 2026
Merged

Add GitHub issue creation on QA deployment failure#4
JWhitleyWork merged 1 commit into
mainfrom
fixes-for-objective-running

Conversation

@shaur-k

@shaur-k shaur-k commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Problem

QA-deployment failures only surfaced via the Slack webhook — no durable record, manual triage to file anything. Two paths fail: notify-crash.py (systemd ExecStopPost on non-zero exit) and cd_objective_lib.py._fail() (objective-runner timeout / rosbridge failure).

Approach

Open a deduplicated GitHub issue alongside the existing Slack post, at both fire points. New shared bin/notify_lib.py holds slack_post() + github_issue(); both scripts delegate to it (also de-duplicates the previously-copied Slack logic).

github_issue() dedups by exact title within the qa-deployment-failure label — a recurrence bumps an occurrence counter, appends a version | machine | time | reason row, and comments, rather than opening duplicates (same model as moveit_pro's auto-flake-issue.yaml). Installed package version comes from dpkg.

Why machine-side direct REST (not repository_dispatch → workflow): a fine-grained PAT scoped to Issues: Read and write on one repo is the narrowest credential that can file an issue. repository_dispatch would require Contents: write — broader blast radius on a higher-exposure QA host.

Safety / hardening

notify_lib is best-effort and never raises (it runs on the service-stop path): every GitHub call and the github_issue boundary are guarded; slack_post too. MOVEIT_CD_ISSUE_REPO is validated against an owner/repo allowlist (blocks ../../user endpoint steering), pagination is capped (MAX_ISSUE_PAGES), per-call timeout is 10s (well under systemd TimeoutStopSec), and table cells are sanitized against Markdown injection.

Opt-in

PAT read from MOVEIT_CD_GITHUB_TOKEN; unset → issue creation skipped (non-QA machines opt out automatically). Provisioned in /etc/default/moveit-pro (crash path) and the CI user's ~/.bashrc (SSH-launched objective path) — see README.

Tests

test/test_notify_lib.py (stdlib unittest, no network): pure helpers (_next_link incl. comma-in-URL, _sanitize_cell, build_payload) + dedup branches (create / bump / re-seed) + invalid-repo reject + never-raise containment. python3 -m unittest discover -s test → 12 pass. pre-commit clean.

Note

This branch also carries the prior commit d3ae417 (fix: cd objective library not properly calling do_objective), which the notifier changes build on.

Deferred security-review P2s (with cause): Slack webhook host-locking (root-owned config; locking to hooks.slack.com breaks valid webhook variants) and a redundant top-level try in notify-crash.main() (helpers already enforce never-raise internally).

Copilot AI review requested due to automatic review settings June 22, 2026 18:16
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

A new shared Python module bin/notify_lib.py provides best-effort Slack posting and deduplicated GitHub issue creation for deployment failures. Both bin/notify-crash.py and example_scripts/cd_objective_lib.py are refactored to import from it. install.sh deploys notify_lib.py alongside the existing shared library, a unit test suite is added, and README/QUICK_START docs are updated to document both notification channels.

Changes

Shared Slack + GitHub Issue Notifier

Layer / File(s) Summary
notify_lib.py: module setup, utilities, and Slack posting
bin/notify_lib.py
Adds module docstring, configuration constants, installed_version(), build_payload(), _next_link(), _sanitize_cell(), and slack_post() as a non-raising Slack webhook caller that skips when SLACK_WEBHOOK_URL is unset.
notify_lib.py: GitHub issue deduplication logic
bin/notify_lib.py
Adds _gh_request() authenticated REST helper, _gh_list_open_issues() with pagination, and github_issue()/_do_github_issue() which create a new issue or increment an occurrence counter and append a table row and comment on repeat failures.
notify-crash.py refactored to use notify_lib
bin/notify-crash.py
Replaces inline webhook setup and three removed functions with get_crash_info(unit) that reads systemctl show output with timeout handling. main() now calls slack_post and github_issue for both real crashes and --dry-run/--send test modes.
cd_objective_lib.py refactored to use notify_lib
example_scripts/cd_objective_lib.py
Removes inline Slack HTTP implementation; imports from notify_lib. Introduces _current_objective global and _failure_title() for per-objective GitHub issue titles. _fail() now calls github_issue(). _current_objective is set at three execution points for correct issue context.
Installation, tests, and documentation
install.sh, test/test_notify_lib.py, README.md, QUICK_START.md
install.sh adds notify_lib.py to the shared-libraries deployment. Unit tests cover _next_link, _sanitize_cell, build_payload, and all github_issue dedup paths (create, update, invalid repo, exception suppression, counter re-seeding). README rewrites the failure-notifications section for both Slack and GitHub PAT configuration with dedup and occurrence tracking; QUICK_START mentions GitHub issue creation in troubleshooting.

🔕 Pre-merge checks override applied

The pre-merge checks have been overridden successfully. You can now proceed with the merge.

Overridden by @JWhitleyWork via checkbox on 2026-06-22T19:11:59.359Z.

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Human Review Check ❌ Error [IGNORED] PR handles GitHub PAT tokens and Slack webhooks (auth/secrets criteria) and modifies infrastructure/deployment configuration via systemd ExecStopPost and environment files. This PR requires review by a requested human reviewer. After review, a non-author requested reviewer should override this pre-merge check.
✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The pull request description comprehensively explains the problem, approach, implementation details, and testing, directly corresponding to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds durable, deduplicated GitHub issue creation (alongside existing Slack notifications) when QA deployments fail, covering both the systemd crash hook and the objective-runner failure path via a new shared notification library.

Changes:

  • Added bin/notify_lib.py with shared Slack posting and GitHub issue dedup/update logic.
  • Updated bin/notify-crash.py and example_scripts/cd_objective_lib.py to use the shared helpers and file issues on failures.
  • Added unit tests for the pure helpers and GitHub issue dedup branches; updated install/docs to deploy and configure the new behavior.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
bin/notify_lib.py New shared Slack + GitHub issue helper library with dedup/update logic.
bin/notify-crash.py Uses shared helper to post Slack + create/update GitHub issue on non-zero exit.
example_scripts/cd_objective_lib.py Uses shared helper to notify Slack + create/update GitHub issue on objective failures.
install.sh Installs notify_lib.py alongside cd_objective_lib.py into /usr/lib/moveit-pro-scripts/.
test/test_notify_lib.py Adds network-free unit tests for pagination parsing, sanitization, payload building, and dedup branches.
README.md Documents new GitHub issue notifier behavior and configuration variables.
QUICK_START.md Updates troubleshooting guidance to mention GitHub issue creation on failure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread example_scripts/cd_objective_lib.py Outdated
@shaur-k shaur-k marked this pull request as ready for review June 22, 2026 18:38
@shaur-k shaur-k force-pushed the fixes-for-objective-running branch from eecda49 to ffedaf5 Compare June 22, 2026 18:44

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Line 82: The wording at line 82 in README.md incorrectly states that
notifications are "silently skipped" when variables are unset, but the actual
runtime behavior of slack_post() and github_issue() emits skip notices to
stderr. Update the documentation on line 82 to accurately describe that skip
notices are emitted to stderr when variables are unset, so the wording reflects
the true behavior of the notifier functions and sets correct troubleshooting
expectations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6ac40fd7-cb2b-414e-b7c6-0e87c2d7fa5c

📥 Commits

Reviewing files that changed from the base of the PR and between 5232133 and eecda49.

📒 Files selected for processing (7)
  • QUICK_START.md
  • README.md
  • bin/notify-crash.py
  • bin/notify_lib.py
  • example_scripts/cd_objective_lib.py
  • install.sh
  • test/test_notify_lib.py

Comment thread README.md

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bin/notify_lib.py`:
- Line 117: The urllib.request.urlopen call on this line returns a response
object that is not being closed, which can lead to resource leaks. Refactor this
line to use a context manager (with statement) around the urlopen call to ensure
the response is automatically closed, following the same pattern that is already
correctly implemented in the _gh_request() function.
- Around line 290-298: The regex patterns in both the count_match search and the
re.sub call are not anchored, allowing them to match the text **Occurrences:**
anywhere in the body instead of just the canonical counter line. If this pattern
appears earlier in the issue body (such as in a failure reason), the wrong
occurrence will be replaced. Anchor both regex patterns (in the search with
r"\*\*Occurrences:\*\*\s*(\d+)" and in the sub pattern) to match only at the
intended location by using a line or string anchor (such as $ for end-of-string
or anchoring to a specific boundary) to ensure the canonical counter line is the
one being updated.
- Around line 98-100: The dry_run print statement in the webhook URL logging
exposes the full webhook secret to logs, which is a security risk. Instead of
printing the actual webhook value directly using the webhook variable in the
f-string, modify the print statement to only indicate whether the webhook is
configured or not, without revealing the actual secret value. Check if webhook
exists and print a generic message like "SLACK_WEBHOOK_URL is set" or
"SLACK_WEBHOOK_URL not set" rather than printing the webhook variable itself.
- Around line 306-326: The code ignores the return value of the first
_gh_request() call (the PATCH request that updates the issue body) and
unconditionally proceeds to post a comment and print "Updated", even if the
PATCH failed. Capture the return value of the first _gh_request() call, check if
it was successful (verify it is not None), and only execute the second
_gh_request() call (POST to add the comment) and the print statement if the
PATCH succeeded. This ensures the occurrence table stays in sync with GitHub and
prevents posting comments when the update fails.

In `@bin/notify-crash.py`:
- Around line 87-92: The github_issue() call on line 91 uses dry_run=not
send_test, which disables dry-run mode when send_test is True, allowing test
notifications to create or bump the real crash tracking issue. Fix this by
either modifying the title parameter to include a test-specific prefix (like
"TEST: ") when this is a test run, or by changing the github_issue() call to
always pass dry_run=True instead of dry_run=not send_test when in test mode.
This ensures test notifications do not pollute the real issue tracking stream.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 22976c87-17df-4688-afa6-d8d2d8012b4e

📥 Commits

Reviewing files that changed from the base of the PR and between eecda49 and ffedaf5.

📒 Files selected for processing (7)
  • QUICK_START.md
  • README.md
  • bin/notify-crash.py
  • bin/notify_lib.py
  • example_scripts/cd_objective_lib.py
  • install.sh
  • test/test_notify_lib.py
✅ Files skipped from review due to trivial changes (1)
  • QUICK_START.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • install.sh
  • example_scripts/cd_objective_lib.py
  • test/test_notify_lib.py

Comment thread bin/notify_lib.py
Comment thread bin/notify_lib.py
Comment thread bin/notify_lib.py
Comment thread bin/notify_lib.py
Comment thread bin/notify-crash.py
@shaur-k shaur-k force-pushed the fixes-for-objective-running branch from ffedaf5 to c603c28 Compare June 22, 2026 18:53

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bin/notify-crash.py`:
- Around line 33-48: The subprocess.run() call in the get_crash_info() function
lacks a timeout, which can cause the ExecStopPost path to hang indefinitely if
systemctl becomes unresponsive. Add a timeout parameter to the subprocess.run()
call and extend the exception handler to catch subprocess.TimeoutExpired in
addition to OSError. Handle the TimeoutExpired exception the same way as OSError
by logging a descriptive message to stderr and returning None, None to skip the
notification gracefully.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4f786592-4d79-4a89-a841-34333a8a8c49

📥 Commits

Reviewing files that changed from the base of the PR and between ffedaf5 and c603c28.

📒 Files selected for processing (7)
  • QUICK_START.md
  • README.md
  • bin/notify-crash.py
  • bin/notify_lib.py
  • example_scripts/cd_objective_lib.py
  • install.sh
  • test/test_notify_lib.py
✅ Files skipped from review due to trivial changes (1)
  • QUICK_START.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • install.sh
  • test/test_notify_lib.py
  • example_scripts/cd_objective_lib.py

Comment thread bin/notify-crash.py
Both QA-deployment failure paths already post to Slack: notify-crash.py
(systemd ExecStopPost on non-zero exit) and cd_objective_lib.py._fail()
(objective-runner timeout / rosbridge failure). This adds a deduplicated
GitHub issue alongside the Slack post, so failures land in the tracker
without manual triage.

A new shared bin/notify_lib.py holds slack_post() and github_issue();
both call sites delegate to it. github_issue() dedups by exact title
within the qa-deployment-failure label: a recurrence bumps an occurrence
counter, appends a version/machine/time/reason row, and comments, instead
of opening a duplicate. The installed package version is read from dpkg.

The helper is best-effort and never raises (it runs on the service-stop
path): all GitHub REST calls and the function boundary are guarded. The
GitHub PAT is read from MOVEIT_CD_GITHUB_TOKEN; unset => issue creation is
skipped, which is how non-QA machines opt out. MOVEIT_CD_ISSUE_REPO is
validated against an owner/repo allowlist, pagination is capped, and the
per-call timeout stays well under systemd TimeoutStopSec.

Adds test/test_notify_lib.py (stdlib unittest) covering the pure helpers
and the create / bump / re-seed dedup branches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shaur-k shaur-k force-pushed the fixes-for-objective-running branch from c603c28 to 794dcac Compare June 22, 2026 18:59
@shaur-k shaur-k requested a review from JWhitleyWork June 22, 2026 19:03
@JWhitleyWork JWhitleyWork merged commit f3b4c28 into main Jun 22, 2026
2 checks passed
@JWhitleyWork JWhitleyWork deleted the fixes-for-objective-running branch June 22, 2026 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants