When running extract_and_build() with documents of varying sizes, the ExtractionPipeline distributes whole documents to ProcessPoolExecutor workers. The SentenceSplitter runs inside each worker, so chunking happens after distribution. As a result, each worker receives one (or a few) whole documents, and a single large document creates massive load imbalance.
Observed behavior
Batch of 16 documents with EXTRACTION_NUM_WORKERS=16:
Worker 1: 77 nodes → done in ~2 min, then IDLE
Worker 2: 90 nodes → done in ~2.5 min, then IDLE
Worker 3: 107 nodes
Worker 4: 130 nodes
Worker 5: 155 nodes
Worker 6: 320 nodes
Worker 7: 644 nodes
Worker 8: 987 nodes → done in ~25 min ← entire batch waits for this
...
The entire batch took ~25 min, dominated by one large document. Workers that finished early sat idle for 20+ minutes.
The node_batcher splits evenly by node count, but at this point each document is still a single node (pre-chunking). The SentenceSplitter runs inside each worker after distribution, producing wildly different chunk counts depending on document size (15KB → 77 chunks, 200KB → 987 chunks).
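To put a rough number on the gap, here is a back-of-the-envelope comparison using the chunk counts from the report (pure Python; illustrative only, since only 8 of the 16 workers are shown above and per-chunk cost varies):

```python
import math

# Chunk counts observed per worker in the report (8 of 16 workers shown,
# so these totals are illustrative, not the full batch).
chunk_counts = [77, 90, 107, 130, 155, 320, 644, 987]
num_workers = 16

# Current behavior: whole documents are distributed before chunking, so the
# slowest worker carries the largest document's full chunk count.
max_load_doc_level = max(chunk_counts)  # 987 chunks on one worker

# If splitting happened before distribution, chunks could be spread
# near-evenly across workers instead.
max_load_chunk_level = math.ceil(sum(chunk_counts) / num_workers)  # 157 chunks

print(max_load_doc_level, max_load_chunk_level)  # 987 157
```

Even on this partial data, distributing chunks instead of whole documents would cut the critical-path load from 987 to ~157 chunks, roughly a 6x reduction in the batch's tail latency (assuming per-chunk cost is comparable across documents).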
Environment
- graphrag-toolkit-lexical-graph==3.16.2
- ECS Fargate (16 vCPU, 32GB)
- EXTRACTION_NUM_WORKERS=16
- Documents ranging from 15KB to 200KB+
It would be great to see improvements in how work is distributed across workers so that large documents don't bottleneck the entire batch. Happy to test any changes on our end.