Commit fd73f4e

fix: Optimize workflow_run_jobs pagination to prevent 502 errors (#461)
## Problem

The `workflow_run_jobs` stream was hitting pagination failures when extracting data from large workflow runs (one `run_id` had 760+ job runs):

- `502 Server Error: Bad Gateway` responses prevented complete data extraction
- Requests were retried 4 times with increasing delays (per the Singer SDK retry and backoff logic)
- Pagination stopped once the retry limit was exhausted, because the server remained overloaded

## Testing

The root cause was **server load per request**, not rate limiting or pagination logic. I tested `WorkflowRunJobsStream` against a specific `run_id` known to have 760+ job runs, lowering `MAX_PER_PAGE` from 100 (the value hard-coded in the tap) down to 50 to find the threshold (a standalone reproduction sketch follows this summary).

| MAX_PER_PAGE | Pages Extracted | Result |
|--------------|-----------------|--------|
| 100 jobs | 1-3 pages | ❌ Fails |
| 90 jobs | 4 pages | ❌ Fails |
| 80 jobs | 10 pages | ✅ Success |
| 50 jobs | 16 pages | ✅ Success |

## Solution

### `repository_streams.py`

- Set `MAX_PER_PAGE = 80` for `WorkflowRunJobsStream`

## Impact

- **Complete data extraction** for workflow runs of any size
- **No more 502 Bad Gateway errors**, by keeping per-request load below the failure threshold
- **Improved reliability** of workflow run jobs extraction
- **Good throughput retained**: 80 is the largest tested page size that succeeded
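For reproducibility, here is a minimal standalone sketch of the threshold test described under Testing. It assumes a `GITHUB_TOKEN` environment variable with read access to the repository; `OWNER`, `REPO`, and `RUN_ID` are hypothetical placeholders, not the values from the original test. Unlike the tap, this script does not retry, so a 502 surfaces immediately as a `requests.HTTPError`.

```python
import os

import requests

# Hypothetical placeholders: point these at a run with many jobs (760+).
OWNER, REPO, RUN_ID = "my-org", "my-repo", 123456789
URL = f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs/{RUN_ID}/jobs"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def count_pages(per_page: int) -> int:
    """Walk every page of the jobs endpoint at the given page size."""
    page = 1
    while True:
        resp = requests.get(
            URL,
            headers=HEADERS,
            params={"per_page": per_page, "page": page},
            timeout=30,
        )
        resp.raise_for_status()  # a 502 is raised here as HTTPError
        jobs = resp.json()["jobs"]
        if len(jobs) < per_page:  # short page: we reached the end
            return page
        page += 1


for size in (100, 90, 80, 50):
    try:
        print(f"per_page={size}: extracted {count_pages(size)} pages")
    except requests.HTTPError as exc:
        print(f"per_page={size}: failed with {exc}")
```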
1 parent 32a5c46 commit fd73f4e

File tree

1 file changed: +1 -1 lines changed

tap_github/repository_streams.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -3042,7 +3042,7 @@ def get_child_context(self, record: dict, context: Context | None) -> dict:
 class WorkflowRunJobsStream(GitHubRestStream):
     """Defines 'workflow_run_jobs' stream."""
 
-    MAX_PER_PAGE = 100
+    MAX_PER_PAGE = 80
 
     name = "workflow_run_jobs"
     path = "/repos/{org}/{repo}/actions/runs/{run_id}/jobs"
```

0 commit comments
