Commit fd73f4e

fix: Optimize workflow_run_jobs pagination to prevent 502 errors (#461)
## Problem

The `workflow_run_jobs` stream was hitting pagination failures when extracting data from large workflow runs (one `run_id` had 760+ job runs):

- `502 Server Error: Bad Gateway` responses prevented complete data extraction
- Requests were retried 4 times with increasing delays (per the Singer SDK retry and backoff logic)
- Pagination stopped once the retry limit was exhausted, because the server remained overloaded

## Testing

The root cause was **server load per request**, not rate limiting or pagination logic. I tested `WorkflowRunJobsStream` against a specific `run_id` known to have 760+ job runs, lowering `MAX_PER_PAGE` from 100 (the value hard-coded in the tap) down to 50 to find the threshold (a standalone reproduction sketch follows this summary).

| MAX_PER_PAGE | Pages Extracted | Result |
|--------------|-----------------|--------|
| 100 jobs | 1-3 pages | ❌ Fails |
| 90 jobs | 4 pages | ❌ Fails |
| 80 jobs | 10 pages | ✅ Success |
| 50 jobs | 16 pages | ✅ Success |

## Solution

### `repository_streams.py`

- Set `MAX_PER_PAGE = 80` for `WorkflowRunJobsStream`

## Impact

- **Complete data extraction** for workflow runs of any size
- **No more 502 Bad Gateway errors**, by keeping per-request load below the failure threshold
- **Improved reliability** of workflow run jobs extraction
- **Good throughput retained**: 80 is the largest tested page size that succeeded
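For reproducibility, here is a minimal standalone sketch of the threshold test described under Testing. It assumes a `GITHUB_TOKEN` environment variable with read access to the repository; `OWNER`, `REPO`, and `RUN_ID` are hypothetical placeholders, not the values from the original test. Unlike the tap, this script does not retry, so a 502 surfaces immediately as a `requests.HTTPError`.

```python
import os

import requests

# Hypothetical placeholders: point these at a run with many jobs (760+).
OWNER, REPO, RUN_ID = "my-org", "my-repo", 123456789
URL = f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs/{RUN_ID}/jobs"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def count_pages(per_page: int) -> int:
    """Walk every page of the jobs endpoint at the given page size."""
    page = 1
    while True:
        resp = requests.get(
            URL,
            headers=HEADERS,
            params={"per_page": per_page, "page": page},
            timeout=30,
        )
        resp.raise_for_status()  # a 502 is raised here as HTTPError
        jobs = resp.json()["jobs"]
        if len(jobs) < per_page:  # short page: we reached the end
            return page
        page += 1


for size in (100, 90, 80, 50):
    try:
        print(f"per_page={size}: extracted {count_pages(size)} pages")
    except requests.HTTPError as exc:
        print(f"per_page={size}: failed with {exc}")
```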
1 parent 32a5c46 commit fd73f4e

File tree

1 file changed: +1 -1 lines changed

tap_github/repository_streams.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -3042,7 +3042,7 @@ def get_child_context(self, record: dict, context: Context | None) -> dict:
 class WorkflowRunJobsStream(GitHubRestStream):
     """Defines 'workflow_run_jobs' stream."""
 
-    MAX_PER_PAGE = 100
+    MAX_PER_PAGE = 80
 
     name = "workflow_run_jobs"
     path = "/repos/{org}/{repo}/actions/runs/{run_id}/jobs"
```

0 commit comments
