Merged
2 changes: 1 addition & 1 deletion QUICK_START.md
@@ -43,4 +43,4 @@ To wrap your own objective into a forever-loop driver, copy `example_scripts/cd_
- `sudo -n` prompts for password → sudoers drop-in missing. Re-run `sudo ./install.sh`.
- `Invalid version format` → version string does not match `X.Y.Z` or `X.Y.Z-rcN`.
- 404 from curl → version does not exist on `download.picknik.ai`.
- Service fails to start → check `journalctl -u moveit-pro@$USER.service -e`. If `SLACK_WEBHOOK_URL` is set in `/etc/default/moveit-pro`, `notify-crash.py` will also post to Slack.
- Service fails to start → check `journalctl -u moveit-pro@$USER.service -e`. If `SLACK_WEBHOOK_URL` / `MOVEIT_CD_GITHUB_TOKEN` are set in `/etc/default/moveit-pro`, `notify-crash.py` also posts to Slack and opens a GitHub issue.
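The `Invalid version format` case above can be reproduced locally before calling the installer. The following is a minimal sketch, assuming the accepted shape is exactly `X.Y.Z` with an optional `-rcN` suffix as the error text describes. The real check lives in `bin/install-moveit-pro` and may differ; `VERSION_RE` and `is_valid_version` are hypothetical names.

```python
import re

# Assumed pattern: X.Y.Z with an optional -rcN suffix, taken from the error text.
# The installer's actual regex is not shown in this diff and may be stricter.
VERSION_RE = re.compile(r"\d+\.\d+\.\d+(?:-rc\d+)?")


def is_valid_version(version: str) -> bool:
    """Return True only if the whole string is X.Y.Z or X.Y.Z-rcN."""
    # fullmatch rejects trailing characters, including a stray newline.
    return VERSION_RE.fullmatch(version) is not None
```

Using `fullmatch` rather than `match` with `$` avoids accepting a string that ends in a newline.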
23 changes: 18 additions & 5 deletions README.md
@@ -9,9 +9,10 @@ The full setup walkthrough lives at [Set Up CI/CD](https://docs.picknik.ai/how_t
- `install.sh` — one-shot installer. Copies the wrapper, systemd unit, and sudoers drop-in into place. Run on each target machine.
- `bin/install-moveit-pro` — root-owned installer wrapper. Validates the version string against a strict regex, downloads the `.deb` to a root-owned cache, installs it, and deletes the file.
- `bin/moveit-pro@.service` — systemd template unit. Runs `moveit_pro run --no-browser` as `%i`. Restarts on failure. Reads optional environment from `/etc/default/moveit-pro`.
- `bin/notify-crash.py` — posts to Slack via `ExecStopPost` when the service exits non-zero. Reads `SLACK_WEBHOOK_URL` from the environment; if unset, the notification is skipped.
- `bin/notify-crash.py` — posts to Slack and opens/updates a GitHub issue via `ExecStopPost` when the service exits non-zero. Reads `SLACK_WEBHOOK_URL` and `MOVEIT_CD_GITHUB_TOKEN` from the environment; each notification is skipped if its variable is unset.
- `bin/notify_lib.py` — shared notification helpers (`slack_post`, `github_issue`) used by both `notify-crash.py` and `cd_objective_lib.py`. Installed to `/usr/lib/moveit-pro-scripts/`. `github_issue` deduplicates by exact title within a label: a repeated failure bumps an occurrence counter and appends a row instead of opening a new issue.
- `bin/ci-runner.sudoers.template` — sudoers drop-in. `install.sh` substitutes `__CI_USER__` with the local account and installs at `/etc/sudoers.d/<user>-ci`. Grants NOPASSWD on the installer and the user's own systemd unit only.
- `example_scripts/cd_objective_lib.py` — helper library for sending an Objective goal via rosbridge, used by the example scripts.
- `example_scripts/cd_objective_lib.py` — helper library for sending an Objective goal via rosbridge, used by the example scripts. On objective timeout or rosbridge failure it posts to Slack, opens/updates a GitHub issue, and stops the systemd unit (via `notify_lib.py`).
- `example_scripts/3-waypoint-pick-and-place.py`, `example_scripts/ml-segment-image.py`, `example_scripts/move-all-boxes.py` — example smoke-test scripts that drive an Objective on `localhost:3201` rosbridge.

## Install
@@ -27,6 +28,7 @@ sudo ./install.sh
This installs:

- The objective scripts to `/usr/bin/`.
- `cd_objective_lib.py` and `notify_lib.py` to `/usr/lib/moveit-pro-scripts/`.
- `notify-crash.py` to `/usr/bin/`.
- `install-moveit-pro` to `/usr/local/sbin/` (root-owned, `0755`).
- `/var/cache/moveit-pro/` as a root-owned download cache.
@@ -60,17 +62,28 @@ WORKSPACE_PIN_TO_RELEASE=false

`WORKSPACE_REPO` is regex-restricted to `https://github.com/<owner>/<repo>.git` or `git@github.com:<owner>/<repo>.git`. For the SSH form, the CI user needs a deploy key with read-only access.
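To illustrate what such a restriction might look like, here is a sketch that accepts only the two URL forms named above. The character class allowed for `<owner>` and `<repo>` is an assumption, as are the names `WORKSPACE_REPO_RE` and `is_allowed_workspace_repo`; the installer's actual regex is not shown in this diff.

```python
import re

# Hypothetical character class for <owner> and <repo>; the real rule may differ.
_NAME = r"[A-Za-z0-9_.-]+"

# Accept only the HTTPS and SSH GitHub forms described in the README.
WORKSPACE_REPO_RE = re.compile(
    rf"(?:https://github\.com/|git@github\.com:){_NAME}/{_NAME}\.git"
)


def is_allowed_workspace_repo(url: str) -> bool:
    """Return True if url is exactly one of the two permitted GitHub forms."""
    return WORKSPACE_REPO_RE.fullmatch(url) is not None
```

Matching the whole string keeps shell metacharacters, extra path segments, and other hosts out of the value.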

### Optional: Slack crash notifications
### Optional: failure notifications (Slack + GitHub issues)

Set `SLACK_WEBHOOK_URL` in `/etc/default/moveit-pro` (root-owned). The systemd unit reads this file via `EnvironmentFile=`, so `notify-crash.py` and `cd_objective_lib.py` will post crash and CD-failure events to the webhook:
Both notifiers read their config from `/etc/default/moveit-pro` (root-owned). The systemd unit loads this file via `EnvironmentFile=`, so `notify-crash.py` and `cd_objective_lib.py` pick it up for crash and CD-failure events. Each notifier is independent: set only the variables you want.

```bash
sudo install -m 0640 -o root -g root /dev/stdin /etc/default/moveit-pro <<'EOF'
# Slack incoming webhook. Unset -> Slack skipped.
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ

# GitHub issue on failure. Unset -> issue creation skipped.
MOVEIT_CD_GITHUB_TOKEN=github_pat_xxx
# Optional overrides (defaults shown):
# MOVEIT_CD_ISSUE_REPO=PickNikRobotics/moveit_pro
# MOVEIT_CD_ISSUE_LABEL=qa-deployment-failure
EOF
```

If the variable is unset, notifications are silently skipped.
If a variable is unset, that notification is silently skipped — this is how non-QA machines opt out of issue creation.
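The opt-in behaviour can be pictured as each notifier checking only its own variable. This is a sketch under assumptions: `enabled_notifiers` is a hypothetical helper that does not appear in the PR, and treating an empty string the same as unset follows the pre-change `notify-crash.py` (`if not WEBHOOK_URL`) rather than anything confirmed for `notify_lib.py`.

```python
import os
from typing import List, Mapping, Optional


def enabled_notifiers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List the notifiers whose variable is set to a non-empty value."""
    env = os.environ if env is None else env
    enabled = []
    # Each notifier is independent: one being unset does not affect the other.
    if env.get("SLACK_WEBHOOK_URL"):
        enabled.append("slack")
    if env.get("MOVEIT_CD_GITHUB_TOKEN"):
        enabled.append("github")
    return enabled
```

A non-QA machine that sets only `SLACK_WEBHOOK_URL` would get `["slack"]` from this helper and never attempt issue creation.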

`MOVEIT_CD_GITHUB_TOKEN` must be a **fine-grained PAT scoped to the issue repo with `Issues: Read and write` and nothing else** — the narrowest credential that can file an issue. Do not grant `Contents` or any other scope: a QA machine is a higher-exposure host, and the token only needs to open and comment on issues. The `qa-deployment-failure` label must already exist on the repo (the API does not create labels on demand).

Repeated failures of the same kind on the same machine deduplicate to a single issue (matched by title within the label) — each recurrence bumps an occurrence counter, appends a table row with the version/time/reason, and adds a comment for visibility.
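The deduplication rule can be sketched with an in-memory stand-in for the issue tracker. Nothing here calls GitHub: `Issue` and `record_failure` are hypothetical names, and the real `github_issue` in `notify_lib.py` (not shown in this diff) may structure its data differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Issue:
    title: str
    label: str
    occurrences: int = 1
    # One (version, time, reason) row per failure, oldest first.
    rows: List[Tuple[str, str, str]] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)


def record_failure(
    issues: Dict[Tuple[str, str], Issue],
    title: str,
    label: str,
    version: str,
    when: str,
    reason: str,
) -> Issue:
    """Open a new issue, or update the one with the same title and label."""
    key = (label, title)  # exact-title match, scoped to the label
    issue = issues.get(key)
    if issue is None:
        issue = Issue(title=title, label=label)
        issues[key] = issue
    else:
        # Recurrence: bump the counter and leave a comment for visibility.
        issue.occurrences += 1
        issue.comments.append(f"Recurred at {when}: {reason}")
    issue.rows.append((version, when, reason))
    return issue
```

Because the hostname is part of the title, two machines failing the same way would still produce two separate issues under this scheme.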

## Verify the install

127 changes: 67 additions & 60 deletions bin/notify-crash.py
@@ -1,27 +1,56 @@
#!/usr/bin/env python3

import json
import os
import socket
import subprocess
import sys
import urllib.request

WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")


def get_payload_from_systemd(unit):
    # Check if the service exited with a failure.
    result = subprocess.run(
        [
            "systemctl",
            "show",
            unit,
            "--property=ExecMainStatus,ActiveEnterTimestamp,ActiveExitTimestamp",
        ],
        capture_output=True,
        text=True,
    )
sys.path.insert(0, "/usr/lib/moveit-pro-scripts")

try:
    from notify_lib import build_payload, github_issue, slack_post
except ImportError as exc:
    # ExecStopPost must never fail the service stop because a helper is missing.
    print(f"notify_lib unavailable, notifications disabled: {exc}", file=sys.stderr)

    def build_payload(process_time, date=None):
        return {"process_time": process_time}

    def slack_post(payload, dry_run=False):
        pass

    def github_issue(title, reason, version=None, dry_run=False):
        pass


# Bound the systemctl query so a hung call can't stall the service-stop path.
SYSTEMCTL_TIMEOUT_S = 10


def get_crash_info(unit):
    """Return (payload, reason) for a non-zero service exit, or (None, None).

    `payload` feeds Slack; `reason` is the human summary recorded on the
    GitHub issue. A clean exit (status 0) returns (None, None) so a normal
    `systemctl stop` does not notify.
    """
    try:
        result = subprocess.run(
            [
                "systemctl",
                "show",
                unit,
                "--property=ExecMainStatus,ActiveEnterTimestamp,ActiveExitTimestamp",
            ],
            capture_output=True,
            text=True,
            check=False,
            timeout=SYSTEMCTL_TIMEOUT_S,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # systemctl missing (container/test VM) or hung. Never block or raise
        # on the ExecStopPost path; just skip notification.
        print(f"systemctl unavailable, skipping crash notify: {exc}", file=sys.stderr)
        return None, None

    props = {}
    for line in result.stdout.strip().splitlines():
@@ -30,7 +59,7 @@ def get_payload_from_systemd(unit):

    exit_code = props.get("ExecMainStatus", "0")
    if exit_code == "0":
        return None
        return None, None

    crash_time = props.get("ActiveExitTimestamp", "unknown")
    start_time = props.get("ActiveEnterTimestamp", "unknown")
@@ -48,55 +77,33 @@ def get_payload_from_systemd(unit):
else:
process_time = "unknown"

    return {
        "date": crash_time,
        "laptop_name": socket.gethostname(),
        "process_time": process_time,
    }


def get_dummy_payload():
    return {
        "date": "Sun 2026-04-13 14:19:47 MDT",
        "laptop_name": socket.gethostname(),
        "process_time": "2:34:12",
    }


def send(payload, dry_run=False):
    data = json.dumps(payload).encode()

    if dry_run:
        print(f"POST {WEBHOOK_URL or '<SLACK_WEBHOOK_URL not set>'}")
        print(json.dumps(payload, indent=2))
        return

    if not WEBHOOK_URL:
        print("SLACK_WEBHOOK_URL not set; skipping notification", file=sys.stderr)
        return

    req = urllib.request.Request(
        WEBHOOK_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
    payload = build_payload(process_time, date=crash_time)
    reason = f"Service {unit} exited with status {exit_code} (uptime {process_time})"
    return payload, reason


def main():
    dry_run = "--dry-run" in sys.argv
    send_test = "--send" in sys.argv
    args = [a for a in sys.argv[1:] if a not in ("--dry-run", "--send")]
    unit = args[0] if args else "moveit-pro@unknown"
    title = f"QA deployment crash: {socket.gethostname()}"

    if dry_run or send_test:
        payload = get_dummy_payload()
        send(payload, dry_run=not send_test)
    else:
        unit = args[0] if args else "moveit-pro@unknown"
        payload = get_payload_from_systemd(unit)
        if payload is None:
            return
        send(payload)
        payload = build_payload("2:34:12", date="Sun 2026-04-13 14:19:47 MDT")
        reason = f"Test crash notification for {unit}"
        slack_post(payload, dry_run=not send_test)
        # Distinct title so a --send test never dedupes into the real crash
        # issue stream.
        test_title = f"QA deployment crash test: {socket.gethostname()}"
        github_issue(test_title, reason, dry_run=not send_test)
        return

    payload, reason = get_crash_info(unit)
    if payload is None:
        return
    slack_post(payload)
    github_issue(title, reason)


if __name__ == "__main__":
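
The body of the property-parsing loop in `get_crash_info` is collapsed in this diff view. As a rough illustration only, here is one way the `KEY=VALUE` lines printed by `systemctl show --property=...` could be turned into the `props` dict, assuming one property per line. `parse_show_output` is a hypothetical name and is not part of the PR.

```python
from typing import Dict


def parse_show_output(stdout: str) -> Dict[str, str]:
    """Parse `systemctl show --property=...` output into a dict of strings."""
    props = {}
    for line in stdout.strip().splitlines():
        # Split on the first "=" only; timestamp values contain spaces and colons.
        key, sep, value = line.partition("=")
        if sep:  # skip any line that has no "=" at all
            props[key.strip()] = value.strip()
    return props
```

With a dict in hand, the existing `props.get("ExecMainStatus", "0")` check decides whether the exit counts as a crash.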