-
Notifications
You must be signed in to change notification settings - Fork 8
Add heuristic ingestion and community detection upgrades #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add heuristic ingestion and community detection upgrades #2
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds heuristic-based maintenance capabilities to the GraphRag indexing pipeline. It introduces intelligent text unit deduplication, token budget management, orphan entity linking, and relationship enhancement to improve graph quality and reduce redundancy during ingestion.
- Introduces a new
HeuristicMaintenanceWorkflowin the indexing pipeline - Adds semantic deduplication and token budget filtering for text units
- Implements orphan entity linking and relationship enhancement using co-occurrence heuristics
- Adds Fast Label Propagation as an alternative community detection algorithm
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| ServiceCollectionExtensions.cs | Registers the new HeuristicMaintenanceWorkflow in the DI container |
| HeuristicMaintenanceWorkflow.cs | New workflow that applies heuristics to text units between chunking and graph extraction |
| ExtractGraphWorkflow.cs | Integrates graph extraction heuristics to enhance relationships and link orphan entities |
| CreateBaseTextUnitsWorkflow.cs | Adjusts chunk overlap based on heuristic configuration |
| IndexingPipelineDefinitions.cs | Adds HeuristicMaintenanceWorkflow to the default pipeline sequence |
| TextUnitHeuristicProcessor.cs | Implements token budget filtering and semantic deduplication logic |
| GraphExtractionHeuristics.cs | Implements relationship enhancement and orphan entity linking |
| HeuristicMaintenanceConfig.cs | Configuration class defining heuristic parameters and defaults |
| GraphRagConfig.cs | Adds Heuristics property to main configuration |
| Enums.cs | Adds CommunityDetectionAlgorithm enum |
| ClusterGraphConfig.cs | Adds algorithm selection and max iterations configuration |
| FastLabelPropagationCommunityDetector.cs | Implements fast label propagation algorithm for community detection |
| CommunityBuilder.cs | Refactored to support multiple community detection algorithms |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var chunkingConfig = config.Chunks; | ||
| var heuristicConfig = config.Heuristics ?? new HeuristicMaintenanceConfig(); | ||
| if (heuristicConfig.MinimumChunkOverlap > 0 && chunkingConfig.Overlap < heuristicConfig.MinimumChunkOverlap) | ||
| { | ||
| chunkingConfig.Overlap = heuristicConfig.MinimumChunkOverlap; | ||
| } |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mutating the shared chunkingConfig object can cause side effects across multiple workflow invocations since config.Chunks may be reused. Consider creating a copy of the config before modification or using a local variable to hold the effective overlap value that gets applied in CreateEffectiveConfig.
| namespace GraphRag.Config; | ||
|
|
||
| /// <summary> | ||
| /// Represents heuristic controls that fine-tune ingestion and graph maintenance behaviour. |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'behaviour' to 'behavior'.
| /// Represents heuristic controls that fine-tune ingestion and graph maintenance behaviour. | |
| /// Represents heuristic controls that fine-tune ingestion and graph maintenance behavior. |
| public void Update(TextUnitRecord incoming) | ||
| { | ||
| var mergedDocuments = MergeLists(Record.DocumentIds, incoming.DocumentIds); | ||
| var tokenCount = Math.Min(Record.TokenCount, incoming.TokenCount); |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When merging duplicate text units, taking the minimum token count may not accurately represent the merged record. Consider documenting why the minimum is chosen rather than the average or keeping the original record's token count, as this choice affects downstream token budget calculations.
| var tokenCount = Math.Min(Record.TokenCount, incoming.TokenCount); | |
| // Use the sum of token counts to more accurately represent the merged record's token budget. | |
| var tokenCount = Record.TokenCount + incoming.TokenCount; |
| return incoming; | ||
| } | ||
|
|
||
| return incoming.Length < existing.Length ? incoming : existing; |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for selecting the shorter description when merging relationships lacks documentation explaining why shorter is preferred. This heuristic choice should be documented to explain the rationale (e.g., shorter descriptions are typically more concise or less likely to contain noise).
| foreach (var neighbor in orderedNeighbors) | ||
| { | ||
| if (visited.Add(neighbor)) | ||
| { | ||
| queue.Enqueue(neighbor); | ||
| } |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| foreach (var neighbor in orderedNeighbors) | |
| { | |
| if (visited.Add(neighbor)) | |
| { | |
| queue.Enqueue(neighbor); | |
| } | |
| foreach (var neighbor in orderedNeighbors.Where(neighbor => visited.Add(neighbor))) | |
| { | |
| queue.Enqueue(neighbor); |
| foreach (var textUnit in seed.TextUnitIds) | ||
| { | ||
| if (!string.IsNullOrWhiteSpace(textUnit)) | ||
| { | ||
| _textUnits.Add(textUnit); | ||
| } | ||
| } |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| foreach (var item in source) | ||
| { | ||
| if (seen.Add(item)) | ||
| { | ||
| result.Add(item); | ||
| } | ||
| } |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| foreach (var item in first) | ||
| { | ||
| if (seen.Add(item)) | ||
| { | ||
| merged.Add(item); | ||
| } | ||
| } | ||
|
|
||
| foreach (var item in second) | ||
| { | ||
| if (seen.Add(item)) | ||
| { | ||
| merged.Add(item); | ||
| } | ||
| } | ||
|
|
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| foreach (var item in first) | |
| { | |
| if (seen.Add(item)) | |
| { | |
| merged.Add(item); | |
| } | |
| } | |
| foreach (var item in second) | |
| { | |
| if (seen.Add(item)) | |
| { | |
| merged.Add(item); | |
| } | |
| } | |
| foreach (var item in first.Concat(second).Where(item => seen.Add(item))) | |
| { | |
| merged.Add(item); | |
| } |
| foreach (var item in second) | ||
| { | ||
| if (seen.Add(item)) | ||
| { | ||
| merged.Add(item); | ||
| } | ||
| } |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| catch (Exception ex) | ||
| { | ||
| logger?.LogWarning(ex, "Failed to execute semantic deduplication heuristics. Retaining filtered text units only."); | ||
| return filtered; | ||
| } |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generic catch clause.
Summary
Testing
https://chatgpt.com/codex/tasks/task_e_6903ed1e62bc8326827b5c86e8f6f5de