Add GitHub issue creation on QA deployment failure#4
Conversation
📝 WalkthroughWalkthroughA new shared Python module ChangesShared Slack + GitHub Issue Notifier
🔕 Pre-merge checks override appliedThe pre-merge checks have been overridden successfully. You can now proceed with the merge. Overridden by ❌ Failed checks (1 error)
✅ Passed checks (3 passed)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR adds durable, deduplicated GitHub issue creation (alongside existing Slack notifications) when QA deployments fail, covering both the systemd crash hook and the objective-runner failure path via a new shared notification library.
Changes:
- Added
bin/notify_lib.pywith shared Slack posting and GitHub issue dedup/update logic. - Updated
bin/notify-crash.pyandexample_scripts/cd_objective_lib.pyto use the shared helpers and file issues on failures. - Added unit tests for the pure helpers and GitHub issue dedup branches; updated install/docs to deploy and configure the new behavior.
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
bin/notify_lib.py |
New shared Slack + GitHub issue helper library with dedup/update logic. |
bin/notify-crash.py |
Uses shared helper to post Slack + create/update GitHub issue on non-zero exit. |
example_scripts/cd_objective_lib.py |
Uses shared helper to notify Slack + create/update GitHub issue on objective failures. |
install.sh |
Installs notify_lib.py alongside cd_objective_lib.py into /usr/lib/moveit-pro-scripts/. |
test/test_notify_lib.py |
Adds network-free unit tests for pagination parsing, sanitization, payload building, and dedup branches. |
README.md |
Documents new GitHub issue notifier behavior and configuration variables. |
QUICK_START.md |
Updates troubleshooting guidance to mention GitHub issue creation on failure. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
eecda49 to
ffedaf5
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@README.md`:
- Line 82: The wording at line 82 in README.md incorrectly states that
notifications are "silently skipped" when variables are unset, but the actual
runtime behavior of slack_post() and github_issue() emits skip notices to
stderr. Update the documentation on line 82 to accurately describe that skip
notices are emitted to stderr when variables are unset, so the wording reflects
the true behavior of the notifier functions and sets correct troubleshooting
expectations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 6ac40fd7-cb2b-414e-b7c6-0e87c2d7fa5c
📒 Files selected for processing (7)
QUICK_START.mdREADME.mdbin/notify-crash.pybin/notify_lib.pyexample_scripts/cd_objective_lib.pyinstall.shtest/test_notify_lib.py
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bin/notify_lib.py`:
- Line 117: The urllib.request.urlopen call on this line returns a response
object that is not being closed, which can lead to resource leaks. Refactor this
line to use a context manager (with statement) around the urlopen call to ensure
the response is automatically closed, following the same pattern that is already
correctly implemented in the _gh_request() function.
- Around line 290-298: The regex patterns in both the count_match search and the
re.sub call are not anchored, allowing them to match the text **Occurrences:**
anywhere in the body instead of just the canonical counter line. If this pattern
appears earlier in the issue body (such as in a failure reason), the wrong
occurrence will be replaced. Anchor both regex patterns (in the search with
r"\*\*Occurrences:\*\*\s*(\d+)" and in the sub pattern) to match only at the
intended location by using a line or string anchor (such as $ for end-of-string
or anchoring to a specific boundary) to ensure the canonical counter line is the
one being updated.
- Around line 98-100: The dry_run print statement in the webhook URL logging
exposes the full webhook secret to logs, which is a security risk. Instead of
printing the actual webhook value directly using the webhook variable in the
f-string, modify the print statement to only indicate whether the webhook is
configured or not, without revealing the actual secret value. Check if webhook
exists and print a generic message like "SLACK_WEBHOOK_URL is set" or
"SLACK_WEBHOOK_URL not set" rather than printing the webhook variable itself.
- Around line 306-326: The code ignores the return value of the first
_gh_request() call (the PATCH request that updates the issue body) and
unconditionally proceeds to post a comment and print "Updated", even if the
PATCH failed. Capture the return value of the first _gh_request() call, check if
it was successful (verify it is not None), and only execute the second
_gh_request() call (POST to add the comment) and the print statement if the
PATCH succeeded. This ensures the occurrence table stays in sync with GitHub and
prevents posting comments when the update fails.
In `@bin/notify-crash.py`:
- Around line 87-92: The github_issue() call on line 91 uses dry_run=not
send_test, which disables dry-run mode when send_test is True, allowing test
notifications to create or bump the real crash tracking issue. Fix this by
either modifying the title parameter to include a test-specific prefix (like
"TEST: ") when this is a test run, or by changing the github_issue() call to
always pass dry_run=True instead of dry_run=not send_test when in test mode.
This ensures test notifications do not pollute the real issue tracking stream.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 22976c87-17df-4688-afa6-d8d2d8012b4e
📒 Files selected for processing (7)
QUICK_START.mdREADME.mdbin/notify-crash.pybin/notify_lib.pyexample_scripts/cd_objective_lib.pyinstall.shtest/test_notify_lib.py
✅ Files skipped from review due to trivial changes (1)
- QUICK_START.md
🚧 Files skipped from review as they are similar to previous changes (3)
- install.sh
- example_scripts/cd_objective_lib.py
- test/test_notify_lib.py
ffedaf5 to
c603c28
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bin/notify-crash.py`:
- Around line 33-48: The subprocess.run() call in the get_crash_info() function
lacks a timeout, which can cause the ExecStopPost path to hang indefinitely if
systemctl becomes unresponsive. Add a timeout parameter to the subprocess.run()
call and extend the exception handler to catch subprocess.TimeoutExpired in
addition to OSError. Handle the TimeoutExpired exception the same way as OSError
by logging a descriptive message to stderr and returning None, None to skip the
notification gracefully.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 4f786592-4d79-4a89-a841-34333a8a8c49
📒 Files selected for processing (7)
QUICK_START.mdREADME.mdbin/notify-crash.pybin/notify_lib.pyexample_scripts/cd_objective_lib.pyinstall.shtest/test_notify_lib.py
✅ Files skipped from review due to trivial changes (1)
- QUICK_START.md
🚧 Files skipped from review as they are similar to previous changes (3)
- install.sh
- test/test_notify_lib.py
- example_scripts/cd_objective_lib.py
Both QA-deployment failure paths already post to Slack: notify-crash.py (systemd ExecStopPost on non-zero exit) and cd_objective_lib.py._fail() (objective-runner timeout / rosbridge failure). This adds a deduplicated GitHub issue alongside the Slack post, so failures land in the tracker without manual triage. A new shared bin/notify_lib.py holds slack_post() and github_issue(); both call sites delegate to it. github_issue() dedups by exact title within the qa-deployment-failure label: a recurrence bumps an occurrence counter, appends a version/machine/time/reason row, and comments, instead of opening a duplicate. The installed package version is read from dpkg. The helper is best-effort and never raises (it runs on the service-stop path): all GitHub REST calls and the function boundary are guarded. The GitHub PAT is read from MOVEIT_CD_GITHUB_TOKEN; unset => issue creation is skipped, which is how non-QA machines opt out. MOVEIT_CD_ISSUE_REPO is validated against an owner/repo allowlist, pagination is capped, and the per-call timeout stays well under systemd TimeoutStopSec. Adds test/test_notify_lib.py (stdlib unittest) covering the pure helpers and the create / bump / re-seed dedup branches. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c603c28 to
794dcac
Compare
Problem
QA-deployment failures only surfaced via the Slack webhook — no durable record, manual triage to file anything. Two paths fail:
notify-crash.py(systemdExecStopPoston non-zero exit) andcd_objective_lib.py._fail()(objective-runner timeout / rosbridge failure).Approach
Open a deduplicated GitHub issue alongside the existing Slack post, at both fire points. New shared
bin/notify_lib.pyholdsslack_post()+github_issue(); both scripts delegate to it (also de-duplicates the previously-copied Slack logic).github_issue()dedups by exact title within theqa-deployment-failurelabel — a recurrence bumps an occurrence counter, appends aversion | machine | time | reasonrow, and comments, rather than opening duplicates (same model as moveit_pro'sauto-flake-issue.yaml). Installed package version comes fromdpkg.Why machine-side direct REST (not
repository_dispatch→ workflow): a fine-grained PAT scoped toIssues: Read and writeon one repo is the narrowest credential that can file an issue.repository_dispatchwould requireContents: write— broader blast radius on a higher-exposure QA host.Safety / hardening
notify_libis best-effort and never raises (it runs on the service-stop path): every GitHub call and thegithub_issueboundary are guarded;slack_posttoo.MOVEIT_CD_ISSUE_REPOis validated against anowner/repoallowlist (blocks../../userendpoint steering), pagination is capped (MAX_ISSUE_PAGES), per-call timeout is 10s (well under systemdTimeoutStopSec), and table cells are sanitized against Markdown injection.Opt-in
PAT read from
MOVEIT_CD_GITHUB_TOKEN; unset → issue creation skipped (non-QA machines opt out automatically). Provisioned in/etc/default/moveit-pro(crash path) and the CI user's~/.bashrc(SSH-launched objective path) — see README.Tests
test/test_notify_lib.py(stdlibunittest, no network): pure helpers (_next_linkincl. comma-in-URL,_sanitize_cell,build_payload) + dedup branches (create / bump / re-seed) + invalid-repo reject + never-raise containment.python3 -m unittest discover -s test→ 12 pass.pre-commitclean.Note
This branch also carries the prior commit
d3ae417(fix: cd objective library not properly calling do_objective), which the notifier changes build on.Deferred security-review P2s (with cause): Slack webhook host-locking (root-owned config; locking to
hooks.slack.combreaks valid webhook variants) and a redundant top-leveltryinnotify-crash.main()(helpers already enforce never-raise internally).