@KSemenenko

Summary

  • add heuristic maintenance configuration plus a new ingestion maintenance workflow that trims token budgets, deduplicates text units, and enhances relationships
  • enforce minimum chunk overlap and apply relationship/orphan heuristics during graph extraction
  • switch community detection to fast label propagation with configurable iterations and register the workflow in the indexing pipeline

Testing

  • /root/.dotnet/dotnet build GraphRag.slnx
  • /root/.dotnet/dotnet test GraphRag.slnx (fails: Docker endpoint unavailable in CI environment)
  • /root/.dotnet/dotnet format GraphRag.slnx

https://chatgpt.com/codex/tasks/task_e_6903ed1e62bc8326827b5c86e8f6f5de

Copilot AI review requested due to automatic review settings October 31, 2025 09:10

Copilot AI left a comment

Pull Request Overview

This PR adds heuristic-based maintenance capabilities to the GraphRag indexing pipeline. It introduces intelligent text unit deduplication, token budget management, orphan entity linking, and relationship enhancement to improve graph quality and reduce redundancy during ingestion.

  • Introduces a new HeuristicMaintenanceWorkflow in the indexing pipeline
  • Adds semantic deduplication and token budget filtering for text units
  • Implements orphan entity linking and relationship enhancement using co-occurrence heuristics
  • Adds Fast Label Propagation as an alternative community detection algorithm
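For readers unfamiliar with the technique, here is a minimal sketch of plain label propagation with an iteration cap. It is not the code in FastLabelPropagationCommunityDetector.cs; the class name `LabelPropagationSketch`, the adjacency-dictionary input, and the ordinal tie-breaking rule are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not the PR's detector.
public static class LabelPropagationSketch
{
    // adjacency maps each node to its neighbours (undirected: list both directions).
    public static Dictionary<string, string> Detect(
        IReadOnlyDictionary<string, IReadOnlyList<string>> adjacency,
        int maxIterations)
    {
        // Every node starts in its own community, labelled with its own id.
        var labels = adjacency.Keys.ToDictionary(node => node, node => node, StringComparer.Ordinal);

        // A fixed visiting order keeps the result deterministic (assumed design choice).
        var order = adjacency.Keys.OrderBy(node => node, StringComparer.Ordinal).ToList();

        for (var iteration = 0; iteration < maxIterations; iteration++)
        {
            var changed = false;

            foreach (var node in order)
            {
                // Count how often each label occurs among the node's neighbours.
                var counts = new Dictionary<string, int>(StringComparer.Ordinal);
                foreach (var neighbor in adjacency[node])
                {
                    if (!labels.TryGetValue(neighbor, out var label))
                    {
                        continue; // neighbour is not a known node
                    }

                    counts.TryGetValue(label, out var count);
                    counts[label] = count + 1;
                }

                if (counts.Count == 0)
                {
                    continue; // isolated node keeps its own label
                }

                // The most frequent label wins; ties go to the ordinally smallest label.
                var best = counts
                    .OrderByDescending(pair => pair.Value)
                    .ThenBy(pair => pair.Key, StringComparer.Ordinal)
                    .First()
                    .Key;

                if (!string.Equals(labels[node], best, StringComparison.Ordinal))
                {
                    labels[node] = best;
                    changed = true;
                }
            }

            if (!changed)
            {
                break; // converged before hitting the iteration cap
            }
        }

        return labels;
    }
}
```

On a graph made of two disconnected triangles, every node in a triangle ends up with the same label and the two triangles keep different labels. The iteration cap matters because label propagation can oscillate on some graphs instead of converging.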

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.

File | Description
ServiceCollectionExtensions.cs | Registers the new HeuristicMaintenanceWorkflow in the DI container
HeuristicMaintenanceWorkflow.cs | New workflow that applies heuristics to text units between chunking and graph extraction
ExtractGraphWorkflow.cs | Integrates graph extraction heuristics to enhance relationships and link orphan entities
CreateBaseTextUnitsWorkflow.cs | Adjusts chunk overlap based on heuristic configuration
IndexingPipelineDefinitions.cs | Adds HeuristicMaintenanceWorkflow to the default pipeline sequence
TextUnitHeuristicProcessor.cs | Implements token budget filtering and semantic deduplication logic
GraphExtractionHeuristics.cs | Implements relationship enhancement and orphan entity linking
HeuristicMaintenanceConfig.cs | Configuration class defining heuristic parameters and defaults
GraphRagConfig.cs | Adds Heuristics property to main configuration
Enums.cs | Adds CommunityDetectionAlgorithm enum
ClusterGraphConfig.cs | Adds algorithm selection and max iterations configuration
FastLabelPropagationCommunityDetector.cs | Implements fast label propagation algorithm for community detection
CommunityBuilder.cs | Refactored to support multiple community detection algorithms

Comment on lines 27 to 32
var chunkingConfig = config.Chunks;
var heuristicConfig = config.Heuristics ?? new HeuristicMaintenanceConfig();
if (heuristicConfig.MinimumChunkOverlap > 0 && chunkingConfig.Overlap < heuristicConfig.MinimumChunkOverlap)
{
    chunkingConfig.Overlap = heuristicConfig.MinimumChunkOverlap;
}

Copilot AI Oct 31, 2025

Mutating the shared chunkingConfig object can cause side effects across multiple workflow invocations since config.Chunks may be reused. Consider creating a copy of the config before modification or using a local variable to hold the effective overlap value that gets applied in CreateEffectiveConfig.
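One way to follow this advice is to compute the effective overlap into a fresh object and leave the shared instance untouched. The sketch below uses hypothetical stand-in types (`ChunkingConfigSketch`, `OverlapSketch`); the real chunking config in the PR may have more members, and how `CreateEffectiveConfig` consumes the value is not shown in this thread.

```csharp
using System;

// Hypothetical stand-in for the PR's chunking config; only the members used here are modelled.
public sealed class ChunkingConfigSketch
{
    public int Size { get; set; }
    public int Overlap { get; set; }
}

public static class OverlapSketch
{
    // Returns a copy with the overlap raised to the minimum; the shared instance is never written to.
    public static ChunkingConfigSketch WithMinimumOverlap(ChunkingConfigSketch shared, int minimumOverlap)
    {
        var effectiveOverlap = minimumOverlap > 0
            ? Math.Max(shared.Overlap, minimumOverlap)
            : shared.Overlap;

        return new ChunkingConfigSketch
        {
            Size = shared.Size,
            Overlap = effectiveOverlap,
        };
    }
}
```

Because the caller receives a new object, repeated workflow invocations always start from the overlap the user configured rather than from a value raised by an earlier run.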

Copilot uses AI. Check for mistakes.
namespace GraphRag.Config;

/// <summary>
/// Represents heuristic controls that fine-tune ingestion and graph maintenance behaviour.

Copilot AI Oct 31, 2025

Corrected spelling of 'behaviour' to 'behavior'.

Suggested change
- /// Represents heuristic controls that fine-tune ingestion and graph maintenance behaviour.
+ /// Represents heuristic controls that fine-tune ingestion and graph maintenance behavior.

public void Update(TextUnitRecord incoming)
{
    var mergedDocuments = MergeLists(Record.DocumentIds, incoming.DocumentIds);
    var tokenCount = Math.Min(Record.TokenCount, incoming.TokenCount);

Copilot AI Oct 31, 2025

When merging duplicate text units, taking the minimum token count may not accurately represent the merged record. Consider documenting why the minimum is chosen rather than the average or keeping the original record's token count, as this choice affects downstream token budget calculations.

Suggested change
- var tokenCount = Math.Min(Record.TokenCount, incoming.TokenCount);
+ // Use the sum of token counts to more accurately represent the merged record's token budget.
+ var tokenCount = Record.TokenCount + incoming.TokenCount;

return incoming;
}

return incoming.Length < existing.Length ? incoming : existing;

Copilot AI Oct 31, 2025

The logic for selecting the shorter description when merging relationships lacks documentation explaining why shorter is preferred. This heuristic choice should be documented to explain the rationale (e.g., shorter descriptions are typically more concise or less likely to contain noise).

Comment on lines 184 to 189
foreach (var neighbor in orderedNeighbors)
{
    if (visited.Add(neighbor))
    {
        queue.Enqueue(neighbor);
    }

Copilot AI Oct 31, 2025

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Suggested change
- foreach (var neighbor in orderedNeighbors)
- {
-     if (visited.Add(neighbor))
-     {
-         queue.Enqueue(neighbor);
-     }
+ foreach (var neighbor in orderedNeighbors.Where(neighbor => visited.Add(neighbor)))
+ {
+     queue.Enqueue(neighbor);
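Note that the predicate in the suggested `.Where(...)` has a side effect: `HashSet<T>.Add` mutates `visited` while the sequence is being enumerated. Both forms enqueue the same items when the sequence is consumed once by a `foreach`, as this small sketch shows. `EnqueueSketch` and its method names are hypothetical, and a plain list stands in for the queue.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper comparing the two loop shapes discussed in the review.
public static class EnqueueSketch
{
    // Explicit form: the mutation of `visited` is visible in the loop body.
    public static List<string> EnqueueExplicit(IEnumerable<string> neighbors, HashSet<string> visited)
    {
        var queued = new List<string>();
        foreach (var neighbor in neighbors)
        {
            if (visited.Add(neighbor))
            {
                queued.Add(neighbor);
            }
        }

        return queued;
    }

    // LINQ form: the same mutation happens inside the predicate, lazily, as items are pulled.
    public static List<string> EnqueueWithWhere(IEnumerable<string> neighbors, HashSet<string> visited)
    {
        return neighbors.Where(neighbor => visited.Add(neighbor)).ToList();
    }
}
```

Since `Where` is deferred, enumerating the filtered sequence a second time would yield nothing because every item is already in `visited`. That is an argument for keeping the explicit loop when the mutation is the point of the code.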

Comment on lines 232 to 238
foreach (var textUnit in seed.TextUnitIds)
{
    if (!string.IsNullOrWhiteSpace(textUnit))
    {
        _textUnits.Add(textUnit);
    }
}

Copilot AI Oct 31, 2025

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Comment on lines 203 to 209
foreach (var item in source)
{
    if (seen.Add(item))
    {
        result.Add(item);
    }
}

Copilot AI Oct 31, 2025

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Comment on lines 265 to 280
foreach (var item in first)
{
    if (seen.Add(item))
    {
        merged.Add(item);
    }
}

foreach (var item in second)
{
    if (seen.Add(item))
    {
        merged.Add(item);
    }
}

Copilot AI Oct 31, 2025

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Suggested change
- foreach (var item in first)
- {
-     if (seen.Add(item))
-     {
-         merged.Add(item);
-     }
- }
- foreach (var item in second)
- {
-     if (seen.Add(item))
-     {
-         merged.Add(item);
-     }
- }
+ foreach (var item in first.Concat(second).Where(item => seen.Add(item)))
+ {
+     merged.Add(item);
+ }
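Since the same "add if unseen" loop appears in several places in this review, another option is to factor it into one helper rather than rewriting each loop with `.Where(...)`. The sketch below is an assumption about what such a helper could look like; `MergeSketch.MergeDistinct` is a hypothetical name and is not part of the PR.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the PR's MergeLists may differ in signature and behaviour.
public static class MergeSketch
{
    // Concatenates both sequences and keeps the first occurrence of each item, in encounter order.
    public static List<T> MergeDistinct<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        var seen = new HashSet<T>();
        var merged = new List<T>();

        foreach (var item in first.Concat(second))
        {
            if (seen.Add(item))
            {
                merged.Add(item);
            }
        }

        return merged;
    }
}
```

Keeping the `HashSet<T>.Add` check inside the loop body makes the ordering guarantee explicit: items appear in the order they are first encountered across `first` and then `second`.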

Comment on lines 273 to 279
foreach (var item in second)
{
    if (seen.Add(item))
    {
        merged.Add(item);
    }
}

Copilot AI Oct 31, 2025

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Comment on lines +47 to +51
catch (Exception ex)
{
    logger?.LogWarning(ex, "Failed to execute semantic deduplication heuristics. Retaining filtered text units only.");
    return filtered;
}

Copilot AI Oct 31, 2025

Generic catch clause.
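A common way to narrow a catch-all like this while keeping the fallback behaviour is an exception filter that lets cancellation propagate. The sketch below assumes that cancellation should not be swallowed, which this thread does not confirm. `FallbackSketch.RunWithFallback` and its delegate parameters are hypothetical; they stand in for the deduplication call and the logger.

```csharp
using System;

// Hypothetical wrapper illustrating a filtered catch; not the PR's code.
public static class FallbackSketch
{
    public static T RunWithFallback<T>(Func<T> heuristic, T fallback, Action<Exception> logWarning)
    {
        try
        {
            return heuristic();
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Any non-cancellation failure degrades to the fallback value after logging.
            logWarning(ex);
            return fallback;
        }
    }
}
```

With the filter in place, an `OperationCanceledException` (including `TaskCanceledException`, which derives from it) reaches the caller, while other failures still degrade to the fallback value.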

@KSemenenko KSemenenko merged commit d481406 into main Oct 31, 2025
3 checks passed
@KSemenenko KSemenenko deleted the codex/analyze-graphrag.net-features-and-gaps branch October 31, 2025 11:18