diff --git a/AGENTS.md b/AGENTS.md
index 3ed5071350..0b315f8193 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -7,6 +7,7 @@
- Keep default prompts in static C# classes; do not rely on prompt files under `prompts/` for built-in templates.
- Register language models through Microsoft.Extensions.AI keyed services; avoid bespoke `LanguageModelConfig` providers.
- Always run `dotnet format GraphRag.slnx` before finishing work.
+- After building, always run `dotnet test GraphRag.slnx` before finishing work.
# Conversations
any resulting updates to agents.md should go under the section "## Rules to follow"
diff --git a/Directory.Build.props b/Directory.Build.props
index 58057d740e..21fceb807b 100644
--- a/Directory.Build.props
+++ b/Directory.Build.props
@@ -25,8 +25,8 @@
https://github.com/managedcode/graphrag
https://github.com/managedcode/graphrag
Managed Code GraphRag
- 0.0.3
- 0.0.3
+ 0.0.4
+ 0.0.4
diff --git a/README.md b/README.md
index e0afbcbeab..9c7a373c00 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,19 @@
# GraphRAG for .NET
-GraphRAG for .NET is a ground-up port of Microsoft’s original GraphRAG Python reference implementation to the modern .NET 9 stack.
-Our goal is API parity with the Python pipeline while embracing native .NET idioms (dependency injection, logging abstractions, async I/O, etc.).
-The upstream Python code remains available in `submodules/graphrag-python` for side-by-side reference during the migration.
+GraphRAG for .NET is a ground-up port of Microsoft’s GraphRAG reference implementation to the modern .NET 9 stack. The port keeps parity with the original Python pipelines while embracing native .NET idioms—dependency injection, logging abstractions, async I/O, and strongly-typed configuration.
+
+> ℹ️ The upstream Python code remains available under [`submodules/graphrag-python`](submodules/graphrag-python) for side-by-side reference. Treat it as read-only unless a task explicitly targets the submodule.
+
+---
+
+## Feature Highlights
+
+- **End-to-end indexing workflows.** All standard GraphRAG stages—document loading, chunking, graph extraction, community building, and summarisation—ship as discrete workflows that can be registered with a single `AddGraphRag(...)` call.
+- **Heuristic ingestion & maintenance.** Built-in overlapping chunk windows, semantic deduplication, orphan-node linking, relationship enhancement/validation, and token-budget trimming keep your graph clean without bespoke services.
+- **Fast label propagation communities.** A configurable fast label propagation detector (with connected-component fallback) mirrors the behaviour of the GraphRag.Net demo directly inside the pipeline.
+- **Pluggable graph stores.** Ready-made adapters for Azure Cosmos DB, Neo4j, and Apache AGE/PostgreSQL conform to `IGraphStore` so you can swap back-ends without touching workflows.
+- **Prompt orchestration.** Prompt templates cascade through manual, auto-tuned, and default sources using [Microsoft.Extensions.AI](https://learn.microsoft.com/dotnet/ai/overview) keyed clients for chat and embedding models.
+- **Deterministic integration tests.** Testcontainers spin up the real databases, while stub embeddings provide stable heuristics coverage so CI can validate the full pipeline.
---
@@ -10,148 +21,195 @@ The upstream Python code remains available in `submodules/graphrag-python` for s
```
graphrag/
-├── GraphRag.slnx # Single solution covering every project
+├── GraphRag.slnx # Solution spanning runtime + test projects
├── Directory.Build.props / Directory.Packages.props
├── src/
│ ├── ManagedCode.GraphRag # Core pipeline orchestration & abstractions
│ ├── ManagedCode.GraphRag.CosmosDb # Azure Cosmos DB graph adapter
-│ ├── ManagedCode.GraphRag.Neo4j # Neo4j adapter & bolt client integration
-│ └── ManagedCode.GraphRag.Postgres # Apache AGE/PostgreSQL graph store adapter
+│ ├── ManagedCode.GraphRag.Neo4j # Neo4j adapter & Bolt integration
+│ └── ManagedCode.GraphRag.Postgres # Apache AGE/PostgreSQL graph adapter
├── tests/
│ └── ManagedCode.GraphRag.Tests
-│ ├── Integration/ # Live container-backed scenarios (Testcontainers)
+│ ├── Integration/ # Live container-backed scenarios
│ └── … unit-level suites
└── submodules/
- └── graphrag-python # Original Python implementation (read-only reference)
+ └── graphrag-python # Original Python implementation (read-only)
```
-### Key Components
-
-- **ManagedCode.GraphRag**
- Hosts the pipelines, workflow execution model, and shared contracts such as `IGraphStore`, `IPipelineCache`, etc.
-
-- **ManagedCode.GraphRag.Neo4j / .Postgres / .CosmosDb**
- Concrete graph-store adapters that satisfy the core abstractions. Each hides the backend-specific SDK plumbing and exposes `.AddXGraphStore(...)` DI helpers.
-
-- **ManagedCode.GraphRag.Tests**
- Our only test project.
- Unit tests ensure helper APIs behave deterministically.
- The `Integration/` folder spins up real infrastructure (Neo4j, Apache AGE/PostgreSQL, optional Cosmos) via Testcontainers—no fakes or mocks.
-
---
## Prerequisites
| Requirement | Notes |
|-------------|-------|
-| [.NET SDK 9.0](https://dotnet.microsoft.com/en-us/download/dotnet/9.0) | The solution targets `net9.0`; install previews where necessary. |
-| Docker Desktop / compatible container runtime | Required for Testcontainers-backed integration tests (Neo4j & Apache AGE/PostgreSQL). |
-| (Optional) Azure Cosmos DB Emulator | Set `COSMOS_EMULATOR_CONNECTION_STRING` to enable Cosmos tests; they are skipped when the env var is absent. |
+| [.NET SDK 9.0](https://dotnet.microsoft.com/download/dotnet/9.0) | The solution targets `net9.0`. Use the in-repo [`dotnet-install.sh`](dotnet-install.sh) helper on CI. |
+| Docker Desktop / compatible runtime | Required for Testcontainers-backed integration tests (Neo4j & Apache AGE/PostgreSQL). |
+| (Optional) Azure Cosmos DB Emulator | Set `COSMOS_EMULATOR_CONNECTION_STRING` to enable Cosmos-specific tests. |
---
-## Getting Started
+## Quick Start
-1. **Clone the repository**
+1. **Clone & initialise submodules**
```bash
git clone https://github.com//graphrag.git
cd graphrag
git submodule update --init --recursive
```
-2. **Restore & build**
+2. **Install .NET 9 if needed**
+ ```bash
+ ./dotnet-install.sh --version 9.0.100
+ export PATH="$HOME/.dotnet:$PATH"
+ ```
+
+3. **Restore & build (always build before testing)**
```bash
dotnet build GraphRag.slnx
```
- > Repository rule: always build the solution before running tests.
-3. **Run the full test suite**
+4. **Run the full test suite**
```bash
dotnet test GraphRag.slnx --logger "console;verbosity=minimal"
```
- This command will:
- - Restore packages
- - Launch Neo4j and Apache AGE/PostgreSQL containers via Testcontainers
- - Execute unit + integration tests from `ManagedCode.GraphRag.Tests`
- - Tear down containers automatically when finished
+ This command restores packages, launches Neo4j and Apache AGE/PostgreSQL containers via Testcontainers, runs unit + integration tests, and tears everything down automatically.
-4. **Limit to a specific integration area (optional)**
+5. **Target a specific scenario (optional)**
```bash
dotnet test tests/ManagedCode.GraphRag.Tests/ManagedCode.GraphRag.Tests.csproj \
- --filter "FullyQualifiedName~PostgresGraphStoreIntegrationTests" \
+ --filter "FullyQualifiedName~HeuristicMaintenanceIntegrationTests" \
--logger "console;verbosity=normal"
```
----
-
-## Integration Testing Strategy
-
-- **No fakes.** We removed the legacy fake Postgres store. Every graph operation in tests uses real services orchestrated by Testcontainers.
-- **Security coverage.** `Integration/PostgresGraphStoreIntegrationTests.cs` includes payloads that mimic SQL/Cypher injection attempts to ensure values remain literals and labels/types are strictly validated.
-- **Cross-backend validation.** `Integration/GraphStoreIntegrationTests.cs` exercises Postgres, Neo4j, and Cosmos (when available) through the shared `IGraphStore` abstraction.
-- **Workflow smoke tests.** Pipelines (e.g., `IndexingPipelineRunnerTests`) and finalization steps run end-to-end with the fixture-provisioned infrastructure.
-- **Prompt precedence.** `Integration/CommunitySummariesIntegrationTests.cs` proves manual prompt overrides win over auto-tuned assets while still falling back to auto templates when manual text is absent.
-- **Callback and stats instrumentation.** `Runtime/PipelineExecutorTests.cs` now asserts that pipeline callbacks fire and runtime statistics are captured even when workflows fail early, so custom telemetry remains reliable.
-
----
-
-## Pipeline Cache
-
-Pipelines exchange state through the `IPipelineCache` abstraction. Every workflow step receives the same cache instance via `PipelineRunContext`, so it can reuse expensive results (LLM calls, chunk expansions, graph lookups) that were produced earlier in the run instead of recomputing them. The cache also keeps optional debug payloads per entry so you can persist trace metadata alongside the main value.
-
-To use the built-in in-memory cache, register it alongside the standard ASP.NET Core services:
-
-```csharp
-using GraphRag.Cache;
-
-builder.Services.AddMemoryCache();
-builder.Services.AddSingleton();
-```
-
-Prefer a different backend? Implement `IPipelineCache` yourself and register it through DI—the pipeline will pick up your custom cache automatically.
-
-- **Per-scope isolation.** `MemoryPipelineCache.CreateChild("stage")` scopes keys by prefix (`parent:stage:key`). Calling `ClearAsync` on the parent removes every nested key, so multi-step workflows do not leak data between stages.
-- **Debug traces.** The cache stores optional debug payloads per entry; `DeleteAsync` and `ClearAsync` always clear these traces, preventing the diagnostic dictionary from growing unbounded.
-- **Lifecycle guidance.** Create the root cache once per pipeline run (the default context factory does this for you) and spawn children inside individual workflows when you need an isolated namespace.
+6. **Format before committing**
+ ```bash
+ dotnet format GraphRag.slnx
+ ```
---
-## Language Model Registration
+## Using GraphRAG in Your Application
-GraphRAG delegates language-model configuration to [Microsoft.Extensions.AI](https://learn.microsoft.com/dotnet/ai/overview). Register keyed clients for every `ModelId` you reference in configuration—pick any string key that matches your config:
+Register GraphRAG services and provide keyed Microsoft.Extensions.AI clients for every model reference:
```csharp
using Azure;
using Azure.AI.OpenAI;
+using GraphRag;
using GraphRag.Config;
using Microsoft.Extensions.AI;
var openAi = new OpenAIClient(new Uri(endpoint), new AzureKeyCredential(key));
-const string chatModelId = "chat_model";
-const string embeddingModelId = "embedding_model";
builder.Services.AddKeyedSingleton<IChatClient>(
- chatModelId,
+ "chat_model",
_ => openAi.GetChatClient(chatDeployment));
builder.Services.AddKeyedSingleton<IEmbeddingGenerator<string, Embedding<float>>>(
- embeddingModelId,
+ "embedding_model",
_ => openAi.GetEmbeddingClient(embeddingDeployment));
+
+builder.Services.AddGraphRag();
+```
+
+---
+
+## Pipeline Cache & Extensibility
+
+Every workflow in a pipeline shares the same `IPipelineCache` instance via `PipelineRunContext`. The default DI registration wires up `MemoryPipelineCache`, letting workflows reuse expensive intermediate artefacts (LLM responses, chunk expansions, graph lookups) without recomputation. Swap in your own implementation by registering `IPipelineCache` before invoking `AddGraphRag()`—for example to persist cache entries or aggregate diagnostics.
+
+- **Child scopes.** `MemoryPipelineCache.CreateChild("stage")` prefixes keys with the stage name so multi-step workflows remain isolated.
+- **Debug payloads.** Entries can include optional debug data; clearing the cache removes both the value and associated trace metadata.
+- **Custom lifetimes.** Register a scoped cache if you want to align the cache with a single HTTP request rather than the default singleton lifetime.
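
The child-scope idea above can be sketched with nothing more than key prefixes. This is a minimal illustration, not the library's `MemoryPipelineCache`: the flat dictionary, the helper names (`ChildScope`, `Set`, `Contains`, `Clear`) and the `run:` root prefix are all assumptions made for the example, and it assumes that clearing a parent scope also removes nested keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-in only: a flat dictionary plays the role of the cache store,
// and a scope is nothing more than a key prefix. This is NOT MemoryPipelineCache.
var store = new Dictionary<string, object?>(StringComparer.Ordinal);

// Hypothetical helper: derive a child scope's prefix from its parent's prefix.
static string ChildScope(string parentScope, string name) => $"{parentScope}{name}:";

void Set(string scope, string key, object? value) => store[scope + key] = value;

bool Contains(string scope, string key) => store.ContainsKey(scope + key);

// Clearing a scope removes every key under its prefix, nested scopes included.
int Clear(string scope)
{
    var doomed = store.Keys.Where(k => k.StartsWith(scope, StringComparison.Ordinal)).ToList();
    foreach (var key in doomed)
    {
        store.Remove(key);
    }
    return doomed.Count;
}

const string root = "run:";
var extract = ChildScope(root, "extract_graph");          // "run:extract_graph:"
var summarise = ChildScope(root, "community_summaries");  // "run:community_summaries:"

// Two stages can reuse the same logical key without colliding.
Set(extract, "result", "entities");
Set(summarise, "result", "reports");

Console.WriteLine(Contains(extract, "result"));  // True
Console.WriteLine(Contains(root, "result"));     // False: nothing was written to "run:result"
Console.WriteLine(Clear(root));                  // 2: both nested entries are removed
```

Because every child key starts with its parent's prefix, a single prefix scan is enough to clear a whole subtree of scopes.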
+
+---
+
+## Heuristic Ingestion & Maintenance
+
+The .NET port incorporates the ingestion behaviours showcased in GraphRag.Net directly inside the indexing pipeline:
+
+- **Overlapping chunk windows** produce coherent context spans that survive community trimming.
+- **Semantic deduplication** drops duplicate text units by comparing embedding cosine similarity against a configurable threshold.
+- **Token-budget trimming** automatically enforces global and per-community token ceilings during summarisation.
+- **Orphan-node linking** reconnects isolated entities through high-confidence relationships before finalisation.
+- **Relationship enhancement & validation** reconciles LLM output with existing edges to avoid duplicates while strengthening weights.
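
As a rough illustration of the semantic deduplication bullet, the sketch below keeps a text unit only when its embedding is not too similar to one already kept. It is a simplified greedy pass under stated assumptions: the `Cosine` and `Deduplicate` helpers, the tuple shape, and the hand-written three-dimensional embeddings are invented for the example and are not taken from the pipeline's code.

```csharp
using System;
using System.Collections.Generic;

// Cosine similarity of two equal-length vectors; returns 0 when either has zero magnitude.
static double Cosine(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Greedy pass: keep a unit only if it is not too similar to any unit already kept.
static List<string> Deduplicate(IReadOnlyList<(string Text, float[] Embedding)> units, double threshold)
{
    var kept = new List<(string Text, float[] Embedding)>();
    foreach (var unit in units)
    {
        var isDuplicate = false;
        foreach (var other in kept)
        {
            if (Cosine(unit.Embedding, other.Embedding) >= threshold)
            {
                isDuplicate = true;
                break;
            }
        }

        if (!isDuplicate)
        {
            kept.Add(unit);
        }
    }
    return kept.ConvertAll(u => u.Text);
}

// Toy embeddings: the first two point in almost the same direction.
var units = new List<(string Text, float[] Embedding)>
{
    ("Alice founded Contoso.", new[] { 1f, 0f, 0f }),
    ("Alice is the founder of Contoso.", new[] { 0.99f, 0.05f, 0f }),
    ("Contoso ships graph tools.", new[] { 0f, 1f, 0f }),
};

var survivors = Deduplicate(units, threshold: 0.92);
Console.WriteLine(string.Join(" | ", survivors));
// Alice founded Contoso. | Contoso ships graph tools.
```

Raising the threshold towards 1 makes the pass stricter about what counts as a duplicate, so fewer units are removed.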
+
+Configure the heuristics via `GraphRagConfig.Heuristics` (for example in `appsettings.json`):
+
+```json
+{
+ "GraphRag": {
+ "Models": [ "chat_model", "embedding_model" ],
+ "EmbedText": {
+ "ModelId": "embedding_model"
+ },
+ "Heuristics": {
+ "MinimumChunkOverlap": 128,
+ "EnableSemanticDeduplication": true,
+ "SemanticDeduplicationThreshold": 0.92,
+ "MaxTokensPerTextUnit": 1200,
+ "MaxDocumentTokenBudget": 6000,
+ "MaxTextUnitsPerRelationship": 6,
+ "LinkOrphanEntities": true,
+ "OrphanLinkMinimumOverlap": 0.25,
+ "OrphanLinkWeight": 0.35,
+ "EnhanceRelationships": true,
+ "RelationshipConfidenceFloor": 0.35
+ }
+ }
+}
```
-Rate limits, retries, and other policies should be configured when you create these clients (for example by wrapping them with `Polly` handlers). `GraphRagConfig.Models` simply tracks the set of model keys that have been registered so overrides can validate references.
+See [`docs/indexing-and-query.md`](docs/indexing-and-query.md) for the full list of knobs and how they map to the original research flow.
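
To make two of the numeric settings more concrete, here is a small sketch of how an overlap ratio and a weight floor could be evaluated. It assumes the ratio is the number of shared text units divided by the size of the smaller set, and that the floor is a simple lower clamp. The helper names `OverlapRatio` and `ApplyFloor` and the `tu-*` identifiers are hypothetical, and any normalisation step the pipeline performs is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Overlap ratio: shared text-unit ids divided by the size of the smaller set.
static double OverlapRatio(IReadOnlyCollection<string> a, IReadOnlyCollection<string> b)
{
    var smaller = Math.Min(a.Count, b.Count);
    if (smaller == 0)
    {
        return 0;
    }

    var shared = a.Intersect(b, StringComparer.OrdinalIgnoreCase).Count();
    return (double)shared / smaller;
}

// Weights below the floor are raised to it; higher weights pass through unchanged.
static double ApplyFloor(double weight, double floor) => Math.Max(weight, floor);

var orphan = new[] { "tu-1", "tu-2" };
var candidate = new[] { "tu-2", "tu-3", "tu-4", "tu-5" };

var ratio = OverlapRatio(orphan, candidate);  // 1 shared / min(2, 4) = 0.5
Console.WriteLine(ratio >= 0.25);             // True: meets an OrphanLinkMinimumOverlap of 0.25
Console.WriteLine(ApplyFloor(0.1, 0.35));     // raised to the floor of 0.35
Console.WriteLine(ApplyFloor(0.8, 0.35));     // stays at 0.8
```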
---
-## Indexing, Querying, and Prompt Tuning Alignment
+## Community Detection & Graph Analytics
+
+Community creation defaults to the fast label propagation algorithm. Tweak clustering directly through configuration:
+
+```json
+{
+ "GraphRag": {
+ "Models": [ "chat_model", "embedding_model" ],
+ "ClusterGraph": {
+ "Algorithm": "FastLabelPropagation",
+ "MaxIterations": 40,
+ "MaxClusterSize": 25,
+ "UseLargestConnectedComponent": true,
+ "Seed": -559038737
+ }
+ }
+}
+```
+
+If the graph is sparse, the pipeline falls back to connected components to ensure every node participates in a community. The heuristics integration tests (`Integration/HeuristicMaintenanceIntegrationTests.cs`) cover both the label propagation path and the connected-component fallback.
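
A note on `Seed`: `ClusterGraphConfig.Seed` is declared as `int` with a default of `unchecked((int)0xDEADBEEF)`, so a configured value has to fit in a signed 32-bit integer. The sketch below only demonstrates the arithmetic behind that default; the variable names are illustrative.

```csharp
using System;

// 0xDEADBEEF is a uint literal (3735928559) and does not fit in a signed 32-bit int.
const uint raw = 0xDEADBEEF;

// The unchecked cast reinterprets the same 32 bits as a signed value.
int seed = unchecked((int)raw);

Console.WriteLine(raw);                 // 3735928559
Console.WriteLine(seed);                // -559038737
Console.WriteLine(raw > int.MaxValue);  // True

// Two generators created from the same seed produce the same first value.
Console.WriteLine(new Random(seed).Next() == new Random(seed).Next());  // True
```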
-The .NET port mirrors the [GraphRAG indexing architecture](https://microsoft.github.io/graphrag/index/overview/) and its query workflows so downstream applications retain parity with the Python reference implementation.
+---
-- **Indexing overview.** Workflows such as `extract_graph`, `create_communities`, and `community_summaries` map 1:1 to the [default data flow](https://microsoft.github.io/graphrag/index/default_dataflow/) and persist the same tables (`text_units`, `entities`, `relationships`, `communities`, `community_reports`, `covariates`). The new prompt template loader honours manual or auto-tuned prompts before falling back to the stock templates in `prompts/`.
-- **Query capabilities.** The query pipeline retains global search, local search, drift search, and question generation semantics described in the [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/). Each orchestrator continues to assemble context from the indexed tables so you can reference [global](https://microsoft.github.io/graphrag/query/global_search/) or [local](https://microsoft.github.io/graphrag/query/local_search/) narratives interchangeably.
-- **Prompt tuning.** GraphRAG’s [manual](https://microsoft.github.io/graphrag/prompt_tuning/manual_prompt_tuning/) and [auto](https://microsoft.github.io/graphrag/prompt_tuning/auto_prompt_tuning/) strategies are surfaced through `GraphRagConfig.PromptTuning`. Store custom templates under `prompts/` or point `PromptTuning.Manual.Directory`/`PromptTuning.Auto.Directory` at your tuning outputs. You can also skip files entirely by assigning inline text (multi-line or prefixed with `inline:`) to workflow prompt properties. Stage keys and placeholders are documented in `docs/indexing-and-query.md`.
+## Integration Testing Strategy
-See [`docs/indexing-and-query.md`](docs/indexing-and-query.md) for a deeper mapping between the .NET workflows and the research publications underpinning GraphRAG.
+- **Real services only.** All graph operations run against containerised Neo4j and Apache AGE/PostgreSQL instances provisioned by Testcontainers.
+- **Deterministic heuristics.** `StubEmbeddingGenerator` guarantees stable embeddings so semantic-dedup and token-budget assertions remain reliable.
+- **Cross-store validation.** Shared integration fixtures verify that workflows succeed against each adapter (Cosmos tests activate when the emulator connection string is present).
+- **Prompt precedence.** Tests validate that manual prompt overrides win over auto-tuned variants while still cascading correctly to the default templates.
+- **Telemetry coverage.** Runtime tests assert pipeline callbacks and execution statistics so custom instrumentation keeps working.
+
+To run just the container-backed suite:
+
+```bash
+dotnet test tests/ManagedCode.GraphRag.Tests/ManagedCode.GraphRag.Tests.csproj \
+ --filter "Category=Integration" \
+ --logger "console;verbosity=normal"
+```
+
+---
+
+## Additional Documentation & Diagrams
+
+- [`docs/indexing-and-query.md`](docs/indexing-and-query.md) explains how each workflow maps to the GraphRAG research diagrams (default data flow, query orchestrations, prompt tuning strategies) published at [microsoft.github.io/graphrag](https://microsoft.github.io/graphrag/).
+- [`docs/dotnet-port-plan.md`](docs/dotnet-port-plan.md) outlines the migration strategy from Python to .NET and references the canonical architecture diagrams used during the port.
+- The upstream documentation contains the latest diagrams for indexing, query, and data schema. Use those diagrams when presenting the system; they match the pipeline implemented here.
---
@@ -160,40 +218,35 @@ See [`docs/indexing-and-query.md`](docs/indexing-and-query.md) for a deeper mapp
1. Install and start the [Azure Cosmos DB Emulator](https://learn.microsoft.com/azure/cosmos-db/local-emulator).
2. Export the connection string:
```bash
- export COSMOS_EMULATOR_CONNECTION_STRING="AccountEndpoint=https://localhost:8081/;AccountKey=…;"
+ export COSMOS_EMULATOR_CONNECTION_STRING="AccountEndpoint=https://localhost:8081/;AccountKey=..."
```
-3. Rerun `dotnet test`; Cosmos scenarios will seed databases & verify relationships without additional setup.
+3. Run `dotnet test`; Cosmos-specific scenarios will seed the emulator and validate storage behaviour.
---
## Development Tips
-- **Solution layout.** Use `GraphRag.slnx` in Visual Studio/VS Code/Rider for a complete workspace view.
-- **Formatting / analyzers.** Run `dotnet format GraphRag.slnx` before committing to satisfy the repo analyzers.
-- **Coding conventions.**
- - `nullable` and implicit usings are enabled; keep annotations accurate.
- - Async methods should follow the `Async` suffix convention.
- - Prefer DI helpers in `ManagedCode.GraphRag` when wiring new services.
-- **Graph adapters.** Implement additional backends by conforming to `IGraphStore` and registering via `IServiceCollection`.
+- **Solution layout.** Open `GraphRag.slnx` in your IDE for a full workspace view.
+- **Formatting & analyzers.** Run `dotnet format GraphRag.slnx` before committing.
+- **Coding conventions.** Nullable reference types and implicit usings are enabled; keep annotations accurate and suffix async methods with `Async`.
+- **Extending graph adapters.** Implement `IGraphStore` and register your service through DI when adding new storage back-ends.
---
## Contributing
-1. Fork and clone the repo.
-2. Create a feature branch from `main`.
-3. Follow the repository rules (build before testing; integration tests must use real containers).
-4. Submit a PR referencing any related issues. Include `dotnet test GraphRag.slnx` output in the PR body.
+1. Fork the repository and create a feature branch from `main`.
+2. Make your changes, ensuring `dotnet build GraphRag.slnx` succeeds before you run tests.
+3. Execute `dotnet test GraphRag.slnx` (with Docker running) and `dotnet format GraphRag.slnx` before opening a pull request.
+4. Include the test output in your PR description and link any related issues.
-See `CONTRIBUTING.md` for coding standards and PR expectations.
+See [`CONTRIBUTING.md`](CONTRIBUTING.md) for detailed guidance.
---
## License & Credits
- Licensed under the [MIT License](LICENSE).
-- Original Python implementation © Microsoft; see the `graphrag-python` submodule for upstream documentation and examples.
-
----
+- GraphRAG is © Microsoft. This repository reimplements the pipelines for the .NET ecosystem while staying aligned with the official documentation and diagrams.
-Have questions or found a bug? Open an issue or start a discussion—we’re actively evolving the .NET port and welcome feedback. 🚀
+Have questions or feedback? Open an issue or start a discussion—we’re actively evolving the .NET port and welcome contributions! 🚀
diff --git a/src/ManagedCode.GraphRag/Community/CommunityBuilder.cs b/src/ManagedCode.GraphRag/Community/CommunityBuilder.cs
index 9cf716b611..eb4f9f7c05 100644
--- a/src/ManagedCode.GraphRag/Community/CommunityBuilder.cs
+++ b/src/ManagedCode.GraphRag/Community/CommunityBuilder.cs
@@ -23,53 +23,12 @@ public static IReadOnlyList Build(
return Array.Empty();
}
- var adjacency = BuildAdjacency(entities, relationships);
var titleLookup = entities.ToDictionary(entity => entity.Title, StringComparer.OrdinalIgnoreCase);
- var random = new Random(config.Seed);
-
- var orderedTitles = titleLookup.Keys
- .OrderBy(_ => random.Next())
- .ToList();
-
- var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
- var components = new List<List<string>>();
-
- foreach (var title in orderedTitles)
+ var components = config.Algorithm switch
{
- if (!visited.Add(title))
- {
- continue;
- }
-
- var component = new List<string>();
- var queue = new Queue<string>();
- queue.Enqueue(title);
-
- while (queue.Count > 0)
- {
- var current = queue.Dequeue();
- component.Add(current);
-
- if (!adjacency.TryGetValue(current, out var neighbors) || neighbors.Count == 0)
- {
- continue;
- }
-
- var orderedNeighbors = neighbors
- .OrderBy(_ => random.Next())
- .ToList();
-
- foreach (var neighbor in orderedNeighbors)
- {
- if (visited.Add(neighbor))
- {
- queue.Enqueue(neighbor);
- }
- }
- }
-
- components.Add(component);
- }
+ CommunityDetectionAlgorithm.FastLabelPropagation => BuildUsingLabelPropagation(entities, relationships, config),
+ _ => BuildUsingConnectedComponents(entities, relationships, config)
+ };
if (config.UseLargestConnectedComponent && components.Count > 0)
{
@@ -183,6 +142,90 @@ public static IReadOnlyList Build(
return communityRecords;
}
+ private static List<List<string>> BuildUsingConnectedComponents(
+ IReadOnlyList entities,
+ IReadOnlyList relationships,
+ ClusterGraphConfig config)
+ {
+ var adjacency = BuildAdjacency(entities, relationships);
+ var random = new Random(config.Seed);
+ var orderedTitles = adjacency.Keys
+ .OrderBy(_ => random.Next())
+ .ToList();
+
+ var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
+ var components = new List<List<string>>();
+
+ foreach (var title in orderedTitles)
+ {
+ if (!visited.Add(title))
+ {
+ continue;
+ }
+
+ var component = new List<string>();
+ var queue = new Queue<string>();
+ queue.Enqueue(title);
+
+ while (queue.Count > 0)
+ {
+ var current = queue.Dequeue();
+ component.Add(current);
+
+ if (!adjacency.TryGetValue(current, out var neighbors) || neighbors.Count == 0)
+ {
+ continue;
+ }
+
+ var orderedNeighbors = neighbors
+ .OrderBy(_ => random.Next())
+ .ToList();
+
+ foreach (var neighbor in orderedNeighbors.Where(visited.Add))
+ {
+ queue.Enqueue(neighbor);
+ }
+ }
+
+ components.Add(component);
+ }
+
+ return components;
+ }
+
+ private static List<List<string>> BuildUsingLabelPropagation(
+ IReadOnlyList entities,
+ IReadOnlyList relationships,
+ ClusterGraphConfig config)
+ {
+ var assignments = FastLabelPropagationCommunityDetector.AssignLabels(entities, relationships, config);
+ if (assignments.Count == 0)
+ {
+ return new List<List<string>>();
+ }
+
+ var groups = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
+
+ foreach (var pair in assignments)
+ {
+ if (!groups.TryGetValue(pair.Value, out var members))
+ {
+ members = new List<string>();
+ groups[pair.Value] = members;
+ }
+
+ members.Add(pair.Key);
+ }
+
+ return groups.Values
+ .Select(list => list
+ .Distinct(StringComparer.OrdinalIgnoreCase)
+ .OrderBy(title => title, StringComparer.OrdinalIgnoreCase)
+ .ToList())
+ .Where(list => list.Count > 0)
+ .ToList();
+ }
+
private static Dictionary> BuildAdjacency(
IReadOnlyList entities,
IReadOnlyList relationships)
diff --git a/src/ManagedCode.GraphRag/Community/FastLabelPropagationCommunityDetector.cs b/src/ManagedCode.GraphRag/Community/FastLabelPropagationCommunityDetector.cs
new file mode 100644
index 0000000000..8187c05607
--- /dev/null
+++ b/src/ManagedCode.GraphRag/Community/FastLabelPropagationCommunityDetector.cs
@@ -0,0 +1,111 @@
+using GraphRag.Config;
+using GraphRag.Entities;
+using GraphRag.Relationships;
+
+namespace GraphRag.Community;
+
+internal static class FastLabelPropagationCommunityDetector
+{
+ public static IReadOnlyDictionary<string, string> AssignLabels(
+ IReadOnlyList entities,
+ IReadOnlyList relationships,
+ ClusterGraphConfig config)
+ {
+ ArgumentNullException.ThrowIfNull(entities);
+ ArgumentNullException.ThrowIfNull(relationships);
+ ArgumentNullException.ThrowIfNull(config);
+
+ var adjacency = BuildAdjacency(entities, relationships);
+ if (adjacency.Count == 0)
+ {
+ return new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
+ }
+
+ var random = new Random(config.Seed);
+ var labels = adjacency.Keys.ToDictionary(node => node, node => node, StringComparer.OrdinalIgnoreCase);
+ var nodes = adjacency.Keys.ToList();
+ var maxIterations = Math.Max(1, config.MaxIterations);
+
+ for (var iteration = 0; iteration < maxIterations; iteration++)
+ {
+ var shuffled = nodes.OrderBy(_ => random.Next()).ToList();
+ var changed = false;
+
+ foreach (var node in shuffled)
+ {
+ var neighbors = adjacency[node];
+ if (neighbors.Count == 0)
+ {
+ continue;
+ }
+
+ var labelWeights = new Dictionary<string, double>(StringComparer.OrdinalIgnoreCase);
+ foreach (var (neighbor, weight) in neighbors)
+ {
+ if (!labels.TryGetValue(neighbor, out var neighborLabel))
+ {
+ continue;
+ }
+
+ labelWeights[neighborLabel] = labelWeights.GetValueOrDefault(neighborLabel) + (weight > 0 ? weight : 1);
+ }
+
+ if (labelWeights.Count == 0)
+ {
+ continue;
+ }
+
+ var maxWeight = labelWeights.Values.Max();
+ var candidates = labelWeights
+ .Where(pair => Math.Abs(pair.Value - maxWeight) < 1e-6)
+ .Select(pair => pair.Key)
+ .ToList();
+
+ var chosen = candidates.Count == 1
+ ? candidates[0]
+ : candidates[random.Next(candidates.Count)];
+
+ if (!string.Equals(labels[node], chosen, StringComparison.OrdinalIgnoreCase))
+ {
+ labels[node] = chosen;
+ changed = true;
+ }
+ }
+
+ if (!changed)
+ {
+ break;
+ }
+ }
+
+ return labels;
+ }
+
+ private static Dictionary<string, List<(string, double)>> BuildAdjacency(
+ IReadOnlyList entities,
+ IReadOnlyList relationships)
+ {
+ var adjacency = entities
+ .ToDictionary(entity => entity.Title, _ => new List<(string, double)>(), StringComparer.OrdinalIgnoreCase);
+
+ foreach (var relationship in relationships)
+ {
+ if (!adjacency.TryGetValue(relationship.Source, out var sourceNeighbors))
+ {
+ sourceNeighbors = new List<(string, double)>();
+ adjacency[relationship.Source] = sourceNeighbors;
+ }
+
+ if (!adjacency.TryGetValue(relationship.Target, out var targetNeighbors))
+ {
+ targetNeighbors = new List<(string, double)>();
+ adjacency[relationship.Target] = targetNeighbors;
+ }
+
+ sourceNeighbors.Add((relationship.Target, relationship.Weight));
+ targetNeighbors.Add((relationship.Source, relationship.Weight));
+ }
+
+ return adjacency;
+ }
+}
diff --git a/src/ManagedCode.GraphRag/Config/ClusterGraphConfig.cs b/src/ManagedCode.GraphRag/Config/ClusterGraphConfig.cs
index 16a3b43eaa..a0426217ec 100644
--- a/src/ManagedCode.GraphRag/Config/ClusterGraphConfig.cs
+++ b/src/ManagedCode.GraphRag/Config/ClusterGraphConfig.cs
@@ -22,4 +22,17 @@ public sealed class ClusterGraphConfig
/// results deterministic across runs.
///
public int Seed { get; set; } = unchecked((int)0xDEADBEEF);
+
+ /// <summary>
+ /// Gets or sets the maximum number of label propagation iterations when the
+ /// algorithm is used.
+ /// </summary>
+ public int MaxIterations { get; set; } = 20;
+
+ /// <summary>
+ /// Gets or sets the community detection algorithm. The fast label propagation
+ /// implementation mirrors the in-process routine provided by GraphRag.Net.
+ /// </summary>
+ public CommunityDetectionAlgorithm Algorithm { get; set; }
+ = CommunityDetectionAlgorithm.FastLabelPropagation;
}
diff --git a/src/ManagedCode.GraphRag/Config/Enums.cs b/src/ManagedCode.GraphRag/Config/Enums.cs
index d81d3ff080..677fe0698c 100644
--- a/src/ManagedCode.GraphRag/Config/Enums.cs
+++ b/src/ManagedCode.GraphRag/Config/Enums.cs
@@ -63,3 +63,9 @@ public enum ModularityMetric
Lcc,
WeightedComponents
}
+
+public enum CommunityDetectionAlgorithm
+{
+ FastLabelPropagation,
+ ConnectedComponents
+}
diff --git a/src/ManagedCode.GraphRag/Config/GraphRagConfig.cs b/src/ManagedCode.GraphRag/Config/GraphRagConfig.cs
index bfdd062b64..df95321f88 100644
--- a/src/ManagedCode.GraphRag/Config/GraphRagConfig.cs
+++ b/src/ManagedCode.GraphRag/Config/GraphRagConfig.cs
@@ -42,6 +42,8 @@ public sealed class GraphRagConfig
public ClusterGraphConfig ClusterGraph { get; set; } = new();
+ public HeuristicMaintenanceConfig Heuristics { get; set; } = new();
+
public CommunityReportsConfig CommunityReports { get; set; } = new();
public PromptTuningConfig PromptTuning { get; set; } = new();
diff --git a/src/ManagedCode.GraphRag/Config/HeuristicMaintenanceConfig.cs b/src/ManagedCode.GraphRag/Config/HeuristicMaintenanceConfig.cs
new file mode 100644
index 0000000000..9f6cccb680
--- /dev/null
+++ b/src/ManagedCode.GraphRag/Config/HeuristicMaintenanceConfig.cs
@@ -0,0 +1,84 @@
+namespace GraphRag.Config;
+
+/// <summary>
+/// Represents heuristic controls that fine-tune ingestion and graph maintenance behavior.
+/// The defaults mirror the semantics implemented in the GraphRag.Net demo service where
+/// ingestion aggressively deduplicates, trims by token budgets, and repairs sparse graphs.
+/// </summary>
+public sealed class HeuristicMaintenanceConfig
+{
+ /// <summary>
+ /// Gets or sets a value indicating whether semantic deduplication should be applied
+ /// to text units produced during ingestion. When enabled, text chunks that are deemed
+ /// near-duplicates are merged so downstream LLM prompts are not wasted on redundant
+ /// context.
+ /// </summary>
+ public bool EnableSemanticDeduplication { get; set; } = true;
+
+ /// <summary>
+ /// Gets or sets the cosine similarity threshold used when merging duplicate text units.
+ /// Values should fall within [0,1]; higher values keep the deduplication stricter.
+ /// </summary>
+ public double SemanticDeduplicationThreshold { get; set; } = 0.92;
+
+ /// <summary>
+ /// Gets or sets the maximum number of tokens permitted within a single text unit.
+ /// Oversized chunks are discarded to keep prompts within model context limits.
+ /// </summary>
+ public int MaxTokensPerTextUnit { get; set; } = 1400;
+
+ /// <summary>
+ /// Gets or sets the maximum cumulative token budget allocated to each document during
+ /// ingestion. Set to a value less than or equal to zero to disable document level trimming.
+ /// </summary>
+ public int MaxDocumentTokenBudget { get; set; } = 6000;
+
+ /// <summary>
+ /// Gets or sets the maximum number of text units that should remain attached to a
+ /// relationship when persisting graph edges. Excess associations are trimmed to keep
+ /// follow-up prompts compact.
+ /// </summary>
+ public int MaxTextUnitsPerRelationship { get; set; } = 6;
+
+ /// <summary>
+ /// Gets or sets the minimum amount of overlap (expressed as a ratio) required when linking
+ /// orphan entities. The ratio divides the number of shared text units by the size of the
+ /// smaller of the two entities' text unit sets.
+ /// </summary>
+ public double OrphanLinkMinimumOverlap { get; set; } = 0.2;
+
+ /// <summary>
+ /// Gets or sets the default weight assigned to synthetic orphan relationships.
+ /// </summary>
+ public double OrphanLinkWeight { get; set; } = 0.35;
+
+ /// <summary>
+ /// Gets or sets a value indicating whether relationship heuristics should normalise,
+ /// validate, and enhance extracted edges.
+ /// </summary>
+ public bool EnhanceRelationships { get; set; } = true;
+
+ /// <summary>
+ /// Gets or sets the minimum weight enforced when relationship heuristics run. Extracted
+ /// relationships whose averaged weight falls below this floor are raised to the floor so
+ /// they remain queryable.
+ /// </summary>
+ public double RelationshipConfidenceFloor { get; set; } = 0.35;
+
+ /// <summary>
+ /// Gets or sets the minimum overlap (in tokens) required when chunking source documents.
+ /// </summary>
+ public int MinimumChunkOverlap { get; set; } = 80;
+
+ /// <summary>
+ /// Gets or sets an optional keyed model id used to resolve a text embedding generator.
+ /// When not supplied, the pipeline falls back to the model id in <c>GraphRagConfig.EmbedText</c>.
+ /// </summary>
+ public string? EmbeddingModelId { get; set; }
+
+ /// <summary>
+ /// Gets or sets a value indicating whether orphan entities should be linked back into the
+ /// graph using co-occurrence heuristics.
+ /// </summary>
+ public bool LinkOrphanEntities { get; set; } = true;
+}
diff --git a/src/ManagedCode.GraphRag/Indexing/Heuristics/GraphExtractionHeuristics.cs b/src/ManagedCode.GraphRag/Indexing/Heuristics/GraphExtractionHeuristics.cs
new file mode 100644
index 0000000000..940d275f50
--- /dev/null
+++ b/src/ManagedCode.GraphRag/Indexing/Heuristics/GraphExtractionHeuristics.cs
@@ -0,0 +1,283 @@
+using GraphRag.Config;
+using GraphRag.Data;
+using GraphRag.Entities;
+using GraphRag.Relationships;
+using Microsoft.Extensions.Logging;
+
+namespace GraphRag.Indexing.Heuristics;
+
+internal static class GraphExtractionHeuristics
+{
+ public static (IReadOnlyList<EntitySeed> Entities, IReadOnlyList<RelationshipSeed> Relationships) Apply(
+ IReadOnlyList<EntitySeed> entities,
+ IReadOnlyList<RelationshipSeed> relationships,
+ IReadOnlyList<TextUnitRecord> textUnits,
+ HeuristicMaintenanceConfig heuristics,
+ ILogger? logger)
+ {
+ if (entities.Count == 0)
+ {
+ return (entities, relationships);
+ }
+
+ var enhancedRelationships = heuristics.EnhanceRelationships
+ ? EnhanceRelationships(relationships, heuristics)
+ : relationships.ToList();
+
+ if (heuristics.LinkOrphanEntities)
+ {
+ enhancedRelationships = LinkOrphanEntities(entities, enhancedRelationships, textUnits, heuristics, logger);
+ }
+
+ return (entities, enhancedRelationships);
+ }
+
+ private static List<RelationshipSeed> EnhanceRelationships(
+ IReadOnlyList<RelationshipSeed> relationships,
+ HeuristicMaintenanceConfig heuristics)
+ {
+ if (relationships.Count == 0)
+ {
+ return new List<RelationshipSeed>();
+ }
+
+ var aggregator = new Dictionary<string, RelationshipAggregation>(StringComparer.OrdinalIgnoreCase);
+
+ foreach (var relationship in relationships)
+ {
+ var key = BuildRelationshipKey(relationship.Source, relationship.Target, relationship.Type, relationship.Bidirectional);
+ if (!aggregator.TryGetValue(key, out var aggregation))
+ {
+ aggregation = new RelationshipAggregation(relationship.Source, relationship.Target, relationship.Type, relationship.Bidirectional);
+ aggregator[key] = aggregation;
+ }
+
+ aggregation.Add(relationship);
+ }
+
+ return aggregator.Values
+ .Select(aggregation => aggregation.ToSeed(heuristics))
+ .OrderBy(seed => seed.Source, StringComparer.OrdinalIgnoreCase)
+ .ThenBy(seed => seed.Target, StringComparer.OrdinalIgnoreCase)
+ .ThenBy(seed => seed.Type, StringComparer.OrdinalIgnoreCase)
+ .ToList();
+ }
+
+ private static List<RelationshipSeed> LinkOrphanEntities(
+ IReadOnlyList<EntitySeed> entities,
+ IReadOnlyList<RelationshipSeed> relationships,
+ IReadOnlyList<TextUnitRecord> textUnits,
+ HeuristicMaintenanceConfig heuristics,
+ ILogger? logger)
+ {
+ var relationshipMap = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
+ var textUnitIndex = textUnits
+ .GroupBy(unit => unit.Id, StringComparer.OrdinalIgnoreCase)
+ .ToDictionary(group => group.Key, group => group.First(), StringComparer.OrdinalIgnoreCase);
+
+ foreach (var relationship in relationships)
+ {
+ Register(relationship.Source, relationship.Target);
+ Register(relationship.Target, relationship.Source);
+ }
+
+ var textUnitLookup = entities
+ .ToDictionary(entity => entity.Title, entity => new HashSet<string>(entity.TextUnitIds, StringComparer.OrdinalIgnoreCase), StringComparer.OrdinalIgnoreCase);
+
+ var orphanEntities = entities
+ .Where(entity => !relationshipMap.TryGetValue(entity.Title, out var edges) || edges.Count == 0)
+ .ToList();
+
+ if (orphanEntities.Count == 0)
+ {
+ return relationships.ToList();
+ }
+
+ var updatedRelationships = relationships.ToList();
+ var existingKeys = new HashSet<string>(updatedRelationships
+ .Select(rel => BuildRelationshipKey(rel.Source, rel.Target, rel.Type, rel.Bidirectional)), StringComparer.OrdinalIgnoreCase);
+
+ foreach (var orphan in orphanEntities)
+ {
+ if (!textUnitLookup.TryGetValue(orphan.Title, out var orphanUnits) || orphanUnits.Count == 0)
+ {
+ continue;
+ }
+
+ EntitySeed? bestMatch = null;
+ double bestScore = 0;
+
+ foreach (var candidate in entities)
+ {
+ if (string.Equals(candidate.Title, orphan.Title, StringComparison.OrdinalIgnoreCase))
+ {
+ continue;
+ }
+
+ if (!textUnitLookup.TryGetValue(candidate.Title, out var candidateUnits) || candidateUnits.Count == 0)
+ {
+ continue;
+ }
+
+ var overlap = orphanUnits.Intersect(candidateUnits, StringComparer.OrdinalIgnoreCase).Count();
+ if (overlap == 0)
+ {
+ continue;
+ }
+
+ var overlapRatio = overlap / (double)Math.Min(orphanUnits.Count, candidateUnits.Count);
+ if (overlapRatio < heuristics.OrphanLinkMinimumOverlap)
+ {
+ continue;
+ }
+
+ if (overlapRatio > bestScore)
+ {
+ bestScore = overlapRatio;
+ bestMatch = candidate;
+ }
+ }
+
+ if (bestMatch is null)
+ {
+ continue;
+ }
+
+ var sharedUnits = orphanUnits.Intersect(textUnitLookup[bestMatch.Title], StringComparer.OrdinalIgnoreCase)
+ .Select(id => (Id: id, Tokens: textUnitIndex.TryGetValue(id, out var record) ? record.TokenCount : 0))
+ .OrderByDescending(tuple => tuple.Tokens)
+ .Select(tuple => tuple.Id)
+ .Take(heuristics.MaxTextUnitsPerRelationship > 0 ? heuristics.MaxTextUnitsPerRelationship : int.MaxValue)
+ .ToArray();
+
+ var fallbackUnits = orphanUnits
+ .Select(id => (Id: id, Tokens: textUnitIndex.TryGetValue(id, out var record) ? record.TokenCount : 0))
+ .OrderByDescending(tuple => tuple.Tokens)
+ .Select(tuple => tuple.Id)
+ .Take(heuristics.MaxTextUnitsPerRelationship > 0 ? heuristics.MaxTextUnitsPerRelationship : orphanUnits.Count)
+ .ToArray();
+
+ var synthetic = new RelationshipSeed(
+ orphan.Title,
+ bestMatch.Title,
+ $"{orphan.Title} relates to {bestMatch.Title}",
+ heuristics.OrphanLinkWeight,
+ sharedUnits.Length > 0 ? sharedUnits : fallbackUnits)
+ {
+ Bidirectional = true
+ };
+
+ var key = BuildRelationshipKey(synthetic.Source, synthetic.Target, synthetic.Type, synthetic.Bidirectional);
+ if (existingKeys.Add(key))
+ {
+ updatedRelationships.Add(synthetic);
+ Register(synthetic.Source, synthetic.Target);
+ Register(synthetic.Target, synthetic.Source);
+ logger?.LogDebug(
+ "Linked orphan entity {Orphan} with {Target} using {Overlap} shared text units.",
+ orphan.Title,
+ bestMatch.Title,
+ sharedUnits.Length);
+ }
+ }
+
+ return updatedRelationships;
+
+ void Register(string source, string target)
+ {
+ if (string.IsNullOrWhiteSpace(source) || string.IsNullOrWhiteSpace(target))
+ {
+ return;
+ }
+
+ if (!relationshipMap.TryGetValue(source, out var neighbors))
+ {
+ neighbors = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
+ relationshipMap[source] = neighbors;
+ }
+
+ neighbors.Add(target);
+ }
+ }
+
+ private static string BuildRelationshipKey(string source, string target, string? type, bool bidirectional)
+ {
+ var relationshipType = string.IsNullOrWhiteSpace(type) ? "related_to" : type;
+ if (bidirectional && string.Compare(source, target, StringComparison.OrdinalIgnoreCase) > 0)
+ {
+ (source, target) = (target, source);
+ }
+
+ return $"{source}::{target}::{relationshipType}";
+ }
+
+ private sealed class RelationshipAggregation(string source, string target, string? type, bool bidirectional)
+ {
+ private readonly string _source = source;
+ private readonly string _target = target;
+ private readonly string _type = string.IsNullOrWhiteSpace(type) ? "related_to" : type!;
+ private readonly bool _bidirectional = bidirectional;
+ private readonly HashSet<string> _textUnits = new(StringComparer.OrdinalIgnoreCase);
+
+ private double _weightSum;
+ private int _count;
+ private string? _description;
+
+ public void Add(RelationshipSeed seed)
+ {
+ _weightSum += seed.Weight;
+ _count++;
+ _description = SelectDescription(_description, seed.Description);
+
+ foreach (var textUnit in seed.TextUnitIds.Where(static id => !string.IsNullOrWhiteSpace(id)))
+ {
+ _textUnits.Add(textUnit);
+ }
+ }
+
+ public RelationshipSeed ToSeed(HeuristicMaintenanceConfig heuristics)
+ {
+ var weight = _count > 0 ? _weightSum / _count : heuristics.RelationshipConfidenceFloor;
+ if (weight < heuristics.RelationshipConfidenceFloor)
+ {
+ weight = heuristics.RelationshipConfidenceFloor;
+ }
+
+ var textUnits = _textUnits
+ .OrderBy(id => id, StringComparer.OrdinalIgnoreCase)
+ .ToList();
+
+ if (heuristics.MaxTextUnitsPerRelationship > 0 && textUnits.Count > heuristics.MaxTextUnitsPerRelationship)
+ {
+ textUnits = textUnits.Take(heuristics.MaxTextUnitsPerRelationship).ToList();
+ }
+
+ return new RelationshipSeed(
+ _source,
+ _target,
+ _description ?? $"{_source} relates to {_target}",
+ weight,
+ textUnits)
+ {
+ Type = _type,
+ Bidirectional = _bidirectional
+ };
+ }
+
+ private static string? SelectDescription(string? existing, string? incoming)
+ {
+ if (string.IsNullOrWhiteSpace(incoming))
+ {
+ return existing;
+ }
+
+ if (string.IsNullOrWhiteSpace(existing))
+ {
+ return incoming;
+ }
+
+ // Prefer shorter descriptions to keep summaries concise and token efficient.
+ return incoming.Length < existing.Length ? incoming : existing;
+ }
+ }
+}
diff --git a/src/ManagedCode.GraphRag/Indexing/Heuristics/TextUnitHeuristicProcessor.cs b/src/ManagedCode.GraphRag/Indexing/Heuristics/TextUnitHeuristicProcessor.cs
new file mode 100644
index 0000000000..2c7b3ecd7b
--- /dev/null
+++ b/src/ManagedCode.GraphRag/Indexing/Heuristics/TextUnitHeuristicProcessor.cs
@@ -0,0 +1,266 @@
+using GraphRag.Config;
+using GraphRag.Data;
+using Microsoft.Extensions.AI;
+using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Logging;
+
+namespace GraphRag.Indexing.Heuristics;
+
+internal static class TextUnitHeuristicProcessor
+{
+ public static async Task<IReadOnlyList<TextUnitRecord>> ApplyAsync(
+ GraphRagConfig config,
+ IReadOnlyList<TextUnitRecord> textUnits,
+ IServiceProvider services,
+ ILogger? logger,
+ CancellationToken cancellationToken)
+ {
+ if (textUnits.Count == 0)
+ {
+ return Array.Empty<TextUnitRecord>();
+ }
+
+ var heuristics = config.Heuristics ?? new HeuristicMaintenanceConfig();
+
+ var filtered = ApplyTokenBudgets(textUnits, heuristics);
+ if (filtered.Count == 0)
+ {
+ return filtered;
+ }
+
+ if (!heuristics.EnableSemanticDeduplication)
+ {
+ return filtered;
+ }
+
+ var generator = ResolveEmbeddingGenerator(services, heuristics, config, logger);
+ if (generator is null)
+ {
+ logger?.LogWarning("Semantic deduplication skipped because no text embedding generator is registered.");
+ return filtered;
+ }
+
+ try
+ {
+ return await DeduplicateAsync(filtered, generator, heuristics, cancellationToken).ConfigureAwait(false);
+ }
+ catch (OperationCanceledException)
+ {
+ throw;
+ }
+ catch (Exception ex)
+ {
+ logger?.LogWarning(ex, "Failed to execute semantic deduplication heuristics. Retaining filtered text units only.");
+ return filtered;
+ }
+ }
+
+ private static List<TextUnitRecord> ApplyTokenBudgets(
+ IReadOnlyList<TextUnitRecord> textUnits,
+ HeuristicMaintenanceConfig heuristics)
+ {
+ var result = new List<TextUnitRecord>(textUnits.Count);
+ Dictionary<string, int>? documentBudgets = null;
+
+ if (heuristics.MaxDocumentTokenBudget > 0)
+ {
+ documentBudgets = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
+ }
+
+ foreach (var unit in textUnits.OrderBy(static unit => unit.Id, StringComparer.Ordinal))
+ {
+ if (heuristics.MaxTokensPerTextUnit > 0 && unit.TokenCount > heuristics.MaxTokensPerTextUnit)
+ {
+ continue;
+ }
+
+ if (documentBudgets is null)
+ {
+ result.Add(unit);
+ continue;
+ }
+
+ var allowedDocs = new List<string>();
+ foreach (var documentId in unit.DocumentIds)
+ {
+ if (string.IsNullOrWhiteSpace(documentId))
+ {
+ continue;
+ }
+
+ documentBudgets.TryGetValue(documentId, out var usedTokens);
+ if (usedTokens + unit.TokenCount > heuristics.MaxDocumentTokenBudget)
+ {
+ continue;
+ }
+
+ allowedDocs.Add(documentId);
+ }
+
+ if (allowedDocs.Count == 0)
+ {
+ continue;
+ }
+
+ foreach (var documentId in allowedDocs)
+ {
+ documentBudgets[documentId] = documentBudgets.GetValueOrDefault(documentId) + unit.TokenCount;
+ }
+
+ var dedupedDocs = DeduplicatePreservingOrder(allowedDocs);
+
+ if (dedupedDocs.Count == unit.DocumentIds.Count && unit.DocumentIds.SequenceEqual(dedupedDocs, StringComparer.OrdinalIgnoreCase))
+ {
+ result.Add(unit);
+ }
+ else
+ {
+ result.Add(unit with { DocumentIds = dedupedDocs.ToArray() });
+ }
+ }
+
+ return result;
+ }
+
+ private static async Task<IReadOnlyList<TextUnitRecord>> DeduplicateAsync(
+ IReadOnlyList<TextUnitRecord> textUnits,
+ IEmbeddingGenerator<string, Embedding<float>> generator,
+ HeuristicMaintenanceConfig heuristics,
+ CancellationToken cancellationToken)
+ {
+ var clusters = new List<DeduplicationCluster>(textUnits.Count);
+
+ foreach (var unit in textUnits)
+ {
+ cancellationToken.ThrowIfCancellationRequested();
+
+ var embedding = await generator.GenerateVectorAsync(unit.Text, cancellationToken: cancellationToken).ConfigureAwait(false);
+ var vector = embedding.Length > 0 ? embedding.ToArray() : Array.Empty<float>();
+
+ DeduplicationCluster? match = null;
+ double bestSimilarity = 0;
+
+ foreach (var cluster in clusters)
+ {
+ if (cluster.Vector.Length == 0 || vector.Length == 0)
+ {
+ continue;
+ }
+
+ var similarity = CosineSimilarity(cluster.Vector, vector);
+ if (similarity >= heuristics.SemanticDeduplicationThreshold && similarity > bestSimilarity)
+ {
+ bestSimilarity = similarity;
+ match = cluster;
+ }
+ }
+
+ if (match is null)
+ {
+ clusters.Add(new DeduplicationCluster(unit, vector));
+ continue;
+ }
+
+ match.Update(unit);
+ }
+
+ return clusters
+ .Select(static cluster => cluster.Record)
+ .OrderBy(static record => record.Id, StringComparer.Ordinal)
+ .ToArray();
+ }
+
+ private static double CosineSimilarity(IReadOnlyList<float> left, IReadOnlyList<float> right)
+ {
+ var length = Math.Min(left.Count, right.Count);
+ if (length == 0)
+ {
+ return 0;
+ }
+
+ double dot = 0;
+ double leftMagnitude = 0;
+ double rightMagnitude = 0;
+
+ for (var index = 0; index < length; index++)
+ {
+ var l = left[index];
+ var r = right[index];
+ dot += l * r;
+ leftMagnitude += l * l;
+ rightMagnitude += r * r;
+ }
+
+ if (leftMagnitude <= 0 || rightMagnitude <= 0)
+ {
+ return 0;
+ }
+
+ return dot / (Math.Sqrt(leftMagnitude) * Math.Sqrt(rightMagnitude));
+ }
+
+ private static List<string> DeduplicatePreservingOrder(IEnumerable<string> source)
+ {
+ var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
+ return source
+ .Where(seen.Add)
+ .ToList();
+ }
+
+ private static IEmbeddingGenerator<string, Embedding<float>>? ResolveEmbeddingGenerator(
+ IServiceProvider services,
+ HeuristicMaintenanceConfig heuristics,
+ GraphRagConfig config,
+ ILogger? logger)
+ {
+ IEmbeddingGenerator<string, Embedding<float>>? generator = null;
+
+ var modelId = heuristics.EmbeddingModelId;
+ if (string.IsNullOrWhiteSpace(modelId))
+ {
+ modelId = config.EmbedText.ModelId;
+ }
+
+ if (!string.IsNullOrWhiteSpace(modelId))
+ {
+ generator = services.GetKeyedService<IEmbeddingGenerator<string, Embedding<float>>>(modelId);
+ if (generator is null)
+ {
+ logger?.LogWarning(
+ "GraphRAG could not resolve keyed embedding generator '{ModelId}'. Falling back to the default registration.",
+ modelId);
+ }
+ }
+
+ return generator ?? services.GetService<IEmbeddingGenerator<string, Embedding<float>>>();
+ }
+
+ private sealed class DeduplicationCluster(TextUnitRecord record, float[] vector)
+ {
+ public TextUnitRecord Record { get; private set; } = record;
+
+ public float[] Vector { get; } = vector ?? Array.Empty<float>();
+
+ public void Update(TextUnitRecord incoming)
+ {
+ var mergedDocuments = MergeLists(Record.DocumentIds, incoming.DocumentIds);
+ // Sum token counts so merged records reflect their combined budget.
+ var tokenCount = (int)Math.Min((long)Record.TokenCount + incoming.TokenCount, int.MaxValue);
+
+ Record = Record with
+ {
+ DocumentIds = mergedDocuments,
+ TokenCount = tokenCount
+ };
+ }
+
+ private static string[] MergeLists(IReadOnlyList<string> first, IReadOnlyList<string> second)
+ {
+ var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
+ return first
+ .Concat(second)
+ .Where(seen.Add)
+ .ToArray();
+ }
+ }
+}
diff --git a/src/ManagedCode.GraphRag/Indexing/Runtime/IndexingPipelineDefinitions.cs b/src/ManagedCode.GraphRag/Indexing/Runtime/IndexingPipelineDefinitions.cs
index 2bad212dc9..e75ca605b5 100644
--- a/src/ManagedCode.GraphRag/Indexing/Runtime/IndexingPipelineDefinitions.cs
+++ b/src/ManagedCode.GraphRag/Indexing/Runtime/IndexingPipelineDefinitions.cs
@@ -8,6 +8,7 @@ public static class IndexingPipelineDefinitions
{
LoadInputDocumentsWorkflow.Name,
CreateBaseTextUnitsWorkflow.Name,
+ HeuristicMaintenanceWorkflow.Name,
ExtractGraphWorkflow.Name,
CreateCommunitiesWorkflow.Name,
CommunitySummariesWorkflow.Name,
diff --git a/src/ManagedCode.GraphRag/Indexing/Workflows/CreateBaseTextUnitsWorkflow.cs b/src/ManagedCode.GraphRag/Indexing/Workflows/CreateBaseTextUnitsWorkflow.cs
index 326a4e4670..b1657a4236 100644
--- a/src/ManagedCode.GraphRag/Indexing/Workflows/CreateBaseTextUnitsWorkflow.cs
+++ b/src/ManagedCode.GraphRag/Indexing/Workflows/CreateBaseTextUnitsWorkflow.cs
@@ -24,7 +24,12 @@ public static WorkflowDelegate Create()
var textUnits = new List();
var callbacks = context.Callbacks;
- var chunkingConfig = config.Chunks;
+ var heuristicConfig = config.Heuristics ?? new HeuristicMaintenanceConfig();
+ var chunkingConfig = CloneChunkingConfig(config.Chunks);
+ if (heuristicConfig.MinimumChunkOverlap > 0 && chunkingConfig.Overlap < heuristicConfig.MinimumChunkOverlap)
+ {
+ chunkingConfig.Overlap = heuristicConfig.MinimumChunkOverlap;
+ }
var chunkerResolver = context.Services.GetRequiredService();
var chunker = chunkerResolver.Resolve(chunkingConfig.Strategy);
@@ -97,6 +102,22 @@ public static WorkflowDelegate Create()
return builder.ToString();
}
+ private static ChunkingConfig CloneChunkingConfig(ChunkingConfig source)
+ {
+ return new ChunkingConfig
+ {
+ Size = source.Size,
+ Overlap = source.Overlap,
+ EncodingModel = source.EncodingModel,
+ Strategy = source.Strategy,
+ PrependMetadata = source.PrependMetadata,
+ ChunkSizeIncludesMetadata = source.ChunkSizeIncludesMetadata,
+ GroupByColumns = source.GroupByColumns is { Count: > 0 }
+ ? new List<string>(source.GroupByColumns)
+ : new List<string>()
+ };
+ }
+
private static ChunkingConfig CreateEffectiveConfig(ChunkingConfig original, int metadataTokens)
{
if (!original.ChunkSizeIncludesMetadata || metadataTokens == 0)
diff --git a/src/ManagedCode.GraphRag/Indexing/Workflows/ExtractGraphWorkflow.cs b/src/ManagedCode.GraphRag/Indexing/Workflows/ExtractGraphWorkflow.cs
index aff1191aff..77c2d35f3b 100644
--- a/src/ManagedCode.GraphRag/Indexing/Workflows/ExtractGraphWorkflow.cs
+++ b/src/ManagedCode.GraphRag/Indexing/Workflows/ExtractGraphWorkflow.cs
@@ -4,6 +4,7 @@
using GraphRag.Data;
using GraphRag.Entities;
using GraphRag.Finalization;
+using GraphRag.Indexing.Heuristics;
using GraphRag.Indexing.Runtime;
using GraphRag.LanguageModels;
using GraphRag.Relationships;
@@ -99,7 +100,18 @@ public static WorkflowDelegate Create()
}
}
- var finalization = GraphFinalizer.Finalize(entityAggregator.ToSeeds(), relationshipAggregator.ToSeeds());
+ var entitySeeds = entityAggregator.ToSeeds().ToList();
+ var relationshipSeeds = relationshipAggregator.ToSeeds().ToList();
+
+ var heuristics = config.Heuristics ?? new HeuristicMaintenanceConfig();
+ if ((heuristics.EnhanceRelationships && relationshipSeeds.Count > 0) || heuristics.LinkOrphanEntities)
+ {
+ var adjusted = GraphExtractionHeuristics.Apply(entitySeeds, relationshipSeeds, textUnits, heuristics, logger);
+ entitySeeds = adjusted.Entities.ToList();
+ relationshipSeeds = adjusted.Relationships.ToList();
+ }
+
+ var finalization = GraphFinalizer.Finalize(entitySeeds, relationshipSeeds);
await context.OutputStorage
.WriteTableAsync(PipelineTableNames.Entities, finalization.Entities, cancellationToken)
diff --git a/src/ManagedCode.GraphRag/Indexing/Workflows/HeuristicMaintenanceWorkflow.cs b/src/ManagedCode.GraphRag/Indexing/Workflows/HeuristicMaintenanceWorkflow.cs
new file mode 100644
index 0000000000..92fb155372
--- /dev/null
+++ b/src/ManagedCode.GraphRag/Indexing/Workflows/HeuristicMaintenanceWorkflow.cs
@@ -0,0 +1,42 @@
+using GraphRag.Constants;
+using GraphRag.Data;
+using GraphRag.Indexing.Heuristics;
+using GraphRag.Indexing.Runtime;
+using GraphRag.Storage;
+using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Logging;
+
+namespace GraphRag.Indexing.Workflows;
+
+internal static class HeuristicMaintenanceWorkflow
+{
+ public const string Name = "heuristic_maintenance";
+
+ public static WorkflowDelegate Create()
+ {
+ return async (config, context, cancellationToken) =>
+ {
+ var textUnits = await context.OutputStorage
+ .LoadTableAsync<TextUnitRecord>(PipelineTableNames.TextUnits, cancellationToken)
+ .ConfigureAwait(false);
+
+ if (textUnits.Count == 0)
+ {
+ return new WorkflowResult(Array.Empty<TextUnitRecord>());
+ }
+
+ var loggerFactory = context.Services.GetService<ILoggerFactory>();
+ var logger = loggerFactory?.CreateLogger(typeof(HeuristicMaintenanceWorkflow));
+
+ var processed = await TextUnitHeuristicProcessor
+ .ApplyAsync(config, textUnits, context.Services, logger, cancellationToken)
+ .ConfigureAwait(false);
+
+ await context.OutputStorage
+ .WriteTableAsync(PipelineTableNames.TextUnits, processed, cancellationToken)
+ .ConfigureAwait(false);
+
+ return new WorkflowResult(processed);
+ };
+ }
+}
diff --git a/src/ManagedCode.GraphRag/ServiceCollectionExtensions.cs b/src/ManagedCode.GraphRag/ServiceCollectionExtensions.cs
index 33a9937331..dbaf8dfaf6 100644
--- a/src/ManagedCode.GraphRag/ServiceCollectionExtensions.cs
+++ b/src/ManagedCode.GraphRag/ServiceCollectionExtensions.cs
@@ -18,6 +18,7 @@ public static IServiceCollection AddGraphRag(this IServiceCollection services)
services.AddKeyedSingleton("noop", static (_, _) => (config, context, cancellationToken) => ValueTask.FromResult(new WorkflowResult(null)));
services.AddKeyedSingleton(Indexing.Workflows.LoadInputDocumentsWorkflow.Name, static (_, _) => Indexing.Workflows.LoadInputDocumentsWorkflow.Create());
services.AddKeyedSingleton(Indexing.Workflows.CreateBaseTextUnitsWorkflow.Name, static (_, _) => Indexing.Workflows.CreateBaseTextUnitsWorkflow.Create());
+ services.AddKeyedSingleton(Indexing.Workflows.HeuristicMaintenanceWorkflow.Name, static (_, _) => Indexing.Workflows.HeuristicMaintenanceWorkflow.Create());
services.AddKeyedSingleton(Indexing.Workflows.ExtractGraphWorkflow.Name, static (_, _) => Indexing.Workflows.ExtractGraphWorkflow.Create());
services.AddKeyedSingleton(Indexing.Workflows.CreateCommunitiesWorkflow.Name, static (_, _) => Indexing.Workflows.CreateCommunitiesWorkflow.Create());
services.AddKeyedSingleton(Indexing.Workflows.CommunitySummariesWorkflow.Name, static (_, _) => Indexing.Workflows.CommunitySummariesWorkflow.Create());
diff --git a/tests/ManagedCode.GraphRag.Tests/Infrastructure/StubEmbeddingGenerator.cs b/tests/ManagedCode.GraphRag.Tests/Infrastructure/StubEmbeddingGenerator.cs
new file mode 100644
index 0000000000..3361735853
--- /dev/null
+++ b/tests/ManagedCode.GraphRag.Tests/Infrastructure/StubEmbeddingGenerator.cs
@@ -0,0 +1,53 @@
+using Microsoft.Extensions.AI;
+
+namespace ManagedCode.GraphRag.Tests.Infrastructure;
+
+internal sealed class StubEmbeddingGenerator : IEmbeddingGenerator<string, Embedding<float>>
+{
+ private readonly Dictionary<string, float[]> _vectors;
+ private readonly float[] _fallback;
+
+ public StubEmbeddingGenerator(IDictionary<string, float[]>? vectors = null)
+ {
+ _vectors = vectors is null
+ ? new Dictionary<string, float[]>(StringComparer.OrdinalIgnoreCase)
+ : new Dictionary<string, float[]>(vectors, StringComparer.OrdinalIgnoreCase);
+
+ _fallback = _vectors.Values.FirstOrDefault() ?? new[] { 0.5f, 0.5f, 0.5f };
+ }
+
+ public Task<GeneratedEmbeddings<Embedding<float>>> GenerateAsync(
+ IEnumerable<string> values,
+ EmbeddingGenerationOptions? options = null,
+ CancellationToken cancellationToken = default)
+ {
+ ArgumentNullException.ThrowIfNull(values);
+
+ var embeddings = new List<Embedding<float>>();
+
+ foreach (var value in values)
+ {
+ cancellationToken.ThrowIfCancellationRequested();
+ var vector = ResolveVector(value);
+ embeddings.Add(new Embedding<float>(new ReadOnlyMemory<float>(vector)));
+ }
+
+ return Task.FromResult(new GeneratedEmbeddings<Embedding<float>>(embeddings));
+ }
+
+ public object? GetService(Type serviceType, object? serviceKey = null) => null;
+
+ public void Dispose()
+ {
+ }
+
+ private float[] ResolveVector(string? value)
+ {
+ if (!string.IsNullOrWhiteSpace(value) && _vectors.TryGetValue(value, out var vector))
+ {
+ return vector;
+ }
+
+ return _fallback;
+ }
+}
diff --git a/tests/ManagedCode.GraphRag.Tests/Integration/HeuristicMaintenanceIntegrationTests.cs b/tests/ManagedCode.GraphRag.Tests/Integration/HeuristicMaintenanceIntegrationTests.cs
new file mode 100644
index 0000000000..cf88d7f318
--- /dev/null
+++ b/tests/ManagedCode.GraphRag.Tests/Integration/HeuristicMaintenanceIntegrationTests.cs
@@ -0,0 +1,332 @@
+using System.Collections.Immutable;
+using GraphRag;
+using GraphRag.Callbacks;
+using GraphRag.Community;
+using GraphRag.Config;
+using GraphRag.Constants;
+using GraphRag.Data;
+using GraphRag.Entities;
+using GraphRag.Indexing.Runtime;
+using GraphRag.Indexing.Workflows;
+using GraphRag.Relationships;
+using GraphRag.Storage;
+using ManagedCode.GraphRag.Tests.Infrastructure;
+using Microsoft.Extensions.AI;
+using Microsoft.Extensions.DependencyInjection;
+
+namespace ManagedCode.GraphRag.Tests.Integration;
+
+public sealed class HeuristicMaintenanceIntegrationTests : IDisposable
+{
+ private readonly string _rootDir;
+
+ public HeuristicMaintenanceIntegrationTests()
+ {
+ _rootDir = Path.Combine(Path.GetTempPath(), "GraphRag", Guid.NewGuid().ToString("N"));
+ Directory.CreateDirectory(_rootDir);
+ }
+
+ [Fact]
+ public async Task HeuristicMaintenanceWorkflow_AppliesBudgetsAndSemanticDeduplication()
+ {
+ var outputDir = PrepareDirectory("output-maintenance");
+ var inputDir = PrepareDirectory("input-maintenance");
+ var previousDir = PrepareDirectory("previous-maintenance");
+
+ var textUnits = new[]
+ {
+ new TextUnitRecord
+ {
+ Id = "a",
+ Text = "Alpha Beta",
+ TokenCount = 40,
+ DocumentIds = new[] { "doc-1" },
+ EntityIds = Array.Empty<string>(),
+ RelationshipIds = Array.Empty<string>(),
+ CovariateIds = Array.Empty<string>()
+ },
+ new TextUnitRecord
+ {
+ Id = "b",
+ Text = "Gamma Delta",
+ TokenCount = 30,
+ DocumentIds = new[] { "doc-1" },
+ EntityIds = Array.Empty<string>(),
+ RelationshipIds = Array.Empty<string>(),
+ CovariateIds = Array.Empty<string>()
+ },
+ new TextUnitRecord
+ {
+ Id = "c",
+ Text = "Trim me",
+ TokenCount = 30,
+ DocumentIds = new[] { "doc-1" },
+ EntityIds = Array.Empty<string>(),
+ RelationshipIds = Array.Empty<string>(),
+ CovariateIds = Array.Empty<string>()
+ },
+ new TextUnitRecord
+ {
+ Id = "d",
+ Text = "Alpha Beta",
+ TokenCount = 35,
+ DocumentIds = new[] { "doc-2" },
+ EntityIds = Array.Empty<string>(),
+ RelationshipIds = Array.Empty<string>(),
+ CovariateIds = Array.Empty<string>()
+ }
+ };
+
+ var outputStorage = new FilePipelineStorage(outputDir);
+ await outputStorage.WriteTableAsync(PipelineTableNames.TextUnits, textUnits);
+
+ var embeddingVectors = new Dictionary<string, float[]>
+ {
+ ["Alpha Beta"] = new[] { 1f, 0f },
+ ["Gamma Delta"] = new[] { 0f, 1f }
+ };
+
+ using var services = new ServiceCollection()
+ .AddLogging()
+ .AddSingleton(new TestChatClientFactory().CreateClient())
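+ .AddSingleton<IEmbeddingGenerator<string, Embedding<float>>>(new StubEmbeddingGenerator(embeddingVectors))
+ .AddKeyedSingleton<IEmbeddingGenerator<string, Embedding<float>>>("dedupe-model", (sp, _) => sp.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>())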
+ .AddGraphRag()
+ .BuildServiceProvider();
+
+ var config = new GraphRagConfig
+ {
+ Heuristics = new HeuristicMaintenanceConfig
+ {
+ MaxTokensPerTextUnit = 50,
+ MaxDocumentTokenBudget = 80,
+ EnableSemanticDeduplication = true,
+ SemanticDeduplicationThreshold = 0.75,
+ EmbeddingModelId = "dedupe-model"
+ }
+ };
+
+ var context = new PipelineRunContext(
+ inputStorage: new FilePipelineStorage(inputDir),
+ outputStorage: outputStorage,
+ previousStorage: new FilePipelineStorage(previousDir),
+ cache: new StubPipelineCache(),
+ callbacks: NoopWorkflowCallbacks.Instance,
+ stats: new PipelineRunStats(),
+ state: new PipelineState(),
+ services: services);
+
+ var workflow = HeuristicMaintenanceWorkflow.Create();
+ await workflow(config, context, CancellationToken.None);
+
+ var processed = await outputStorage.LoadTableAsync<TextUnitRecord>(PipelineTableNames.TextUnits);
+ Assert.Equal(2, processed.Count);
+
+ var merged = Assert.Single(processed, unit => unit.Id == "a");
+ Assert.Equal(2, merged.DocumentIds.Count);
+ Assert.Contains("doc-1", merged.DocumentIds, StringComparer.OrdinalIgnoreCase);
+ Assert.Contains("doc-2", merged.DocumentIds, StringComparer.OrdinalIgnoreCase);
+ Assert.Equal(75, merged.TokenCount);
+
+ var survivor = Assert.Single(processed, unit => unit.Id == "b");
+ Assert.Single(survivor.DocumentIds);
+ Assert.Equal("doc-1", survivor.DocumentIds[0]);
+ Assert.DoesNotContain(processed, unit => unit.Id == "c");
+ Assert.DoesNotContain(processed, unit => unit.Id == "d" && unit.DocumentIds.Count == 1);
+ }
+
+ [Fact]
+ public async Task ExtractGraphWorkflow_LinksOrphansAndEnforcesRelationshipFloors()
+ {
+ var outputDir = PrepareDirectory("output-graph");
+ var inputDir = PrepareDirectory("input-graph");
+ var previousDir = PrepareDirectory("previous-graph");
+
+ var outputStorage = new FilePipelineStorage(outputDir);
+ await outputStorage.WriteTableAsync(PipelineTableNames.TextUnits, new[]
+ {
+ new TextUnitRecord
+ {
+ Id = "unit-1",
+ Text = "Alice collaborates with Bob on research.",
+ TokenCount = 12,
+ DocumentIds = new[] { "doc-1" },
+ EntityIds = Array.Empty<string>(),
+ RelationshipIds = Array.Empty<string>(),
+ CovariateIds = Array.Empty<string>()
+ },
+ new TextUnitRecord
+ {
+ Id = "unit-2",
+ Text = "Charlie and Alice planned a workshop.",
+ TokenCount = 18,
+ DocumentIds = new[] { "doc-1" },
+ EntityIds = Array.Empty<string>(),
+ RelationshipIds = Array.Empty<string>(),
+ CovariateIds = Array.Empty<string>()
+ }
+ });
+
+ var responses = new Queue<string>(new[]
+ {
+ "{\"entities\": [ { \"title\": \"Alice\", \"type\": \"person\", \"description\": \"Researcher\", \"confidence\": 0.9 }, { \"title\": \"Bob\", \"type\": \"person\", \"description\": \"Engineer\", \"confidence\": 0.6 } ], \"relationships\": [ { \"source\": \"Alice\", \"target\": \"Bob\", \"type\": \"collaborates\", \"description\": \"Works together\", \"weight\": 0.1, \"bidirectional\": false } ] }",
+ "{\"entities\": [ { \"title\": \"Alice\", \"type\": \"person\", \"description\": \"Researcher\", \"confidence\": 0.8 }, { \"title\": \"Charlie\", \"type\": \"person\", \"description\": \"Analyst\", \"confidence\": 0.7 } ], \"relationships\": [] }"
+ });
+
+ using var services = new ServiceCollection()
+ .AddLogging()
+ .AddSingleton(new TestChatClientFactory(_ =>
+ {
+ if (responses.Count == 0)
+ {
+ throw new InvalidOperationException("No chat responses remaining.");
+ }
+
+ var payload = responses.Dequeue();
+ return new ChatResponse(new ChatMessage(ChatRole.Assistant, payload));
+ }).CreateClient())
+ .AddGraphRag()
+ .BuildServiceProvider();
+
+ var config = new GraphRagConfig
+ {
+ Heuristics = new HeuristicMaintenanceConfig
+ {
+ LinkOrphanEntities = true,
+ OrphanLinkWeight = 0.5,
+ MaxTextUnitsPerRelationship = 1,
+ RelationshipConfidenceFloor = 0.4
+ }
+ };
+
+ var context = new PipelineRunContext(
+ inputStorage: new FilePipelineStorage(inputDir),
+ outputStorage: outputStorage,
+ previousStorage: new FilePipelineStorage(previousDir),
+ cache: new StubPipelineCache(),
+ callbacks: NoopWorkflowCallbacks.Instance,
+ stats: new PipelineRunStats(),
+ state: new PipelineState(),
+ services: services);
+
+ var workflow = ExtractGraphWorkflow.Create();
+ await workflow(config, context, CancellationToken.None);
+
+ var relationships = await outputStorage.LoadTableAsync<RelationshipRecord>(PipelineTableNames.Relationships);
+ Assert.Equal(2, relationships.Count);
+
+ var direct = Assert.Single(relationships, rel => rel.Source == "Alice" && rel.Target == "Bob");
+ Assert.Equal(0.4, direct.Weight, 3);
+ Assert.Contains("unit-1", direct.TextUnitIds);
+ Assert.False(direct.Bidirectional);
+
+ var synthetic = Assert.Single(relationships, rel => rel.Source == "Charlie" && rel.Target == "Alice");
+ Assert.True(synthetic.Bidirectional);
+ Assert.Equal(0.5, synthetic.Weight, 3);
+ var orphanUnit = Assert.Single(synthetic.TextUnitIds);
+ Assert.Equal("unit-2", orphanUnit);
+
+ var entities = await outputStorage.LoadTableAsync<EntityRecord>(PipelineTableNames.Entities);
+ Assert.Equal(3, entities.Count);
+ Assert.Contains(entities, entity => entity.Title == "Charlie");
+ }
+
+ [Fact]
+ public async Task CreateCommunitiesWorkflow_UsesFastLabelPropagationAssignments()
+ {
+ var outputDir = PrepareDirectory("output-communities");
+ var inputDir = PrepareDirectory("input-communities");
+ var previousDir = PrepareDirectory("previous-communities");
+
+ var outputStorage = new FilePipelineStorage(outputDir);
+
+ var entities = new[]
+ {
+ new EntityRecord("entity-1", 0, "Alice", "Person", "Researcher", ImmutableArray.Create("unit-1"), 2, 2, 0, 0),
+ new EntityRecord("entity-2", 1, "Bob", "Person", "Engineer", ImmutableArray.Create("unit-1"), 2, 2, 0, 0),
+ new EntityRecord("entity-3", 2, "Charlie", "Person", "Analyst", ImmutableArray.Create("unit-2"), 2, 1, 0, 0),
+ new EntityRecord("entity-4", 3, "Diana", "Person", "Strategist", ImmutableArray.Create("unit-3"), 2, 1, 0, 0),
+ new EntityRecord("entity-5", 4, "Eve", "Person", "Planner", ImmutableArray.Create("unit-3"), 2, 1, 0, 0)
+ };
+
+ await outputStorage.WriteTableAsync(PipelineTableNames.Entities, entities);
+
+ var relationships = new[]
+ {
+ new RelationshipRecord("rel-1", 0, "Alice", "Bob", "collaborates", "", 0.9, 2, ImmutableArray.Create("unit-1"), true),
+ new RelationshipRecord("rel-2", 1, "Bob", "Charlie", "supports", "", 0.85, 2, ImmutableArray.Create("unit-2"), true),
+ new RelationshipRecord("rel-3", 2, "Diana", "Eve", "partners", "", 0.95, 2, ImmutableArray.Create("unit-3"), true)
+ };
+
+ await outputStorage.WriteTableAsync(PipelineTableNames.Relationships, relationships);
+
+ using var services = new ServiceCollection()
+ .AddLogging()
+ .AddSingleton(new TestChatClientFactory().CreateClient())
+ .AddGraphRag()
+ .BuildServiceProvider();
+
+ var config = new GraphRagConfig
+ {
+ ClusterGraph = new ClusterGraphConfig
+ {
+ Algorithm = CommunityDetectionAlgorithm.FastLabelPropagation,
+ MaxIterations = 8,
+ MaxClusterSize = 10,
+ Seed = 13,
+ UseLargestConnectedComponent = false
+ }
+ };
+
+ var context = new PipelineRunContext(
+ inputStorage: new FilePipelineStorage(inputDir),
+ outputStorage: outputStorage,
+ previousStorage: new FilePipelineStorage(previousDir),
+ cache: new StubPipelineCache(),
+ callbacks: NoopWorkflowCallbacks.Instance,
+ stats: new PipelineRunStats(),
+ state: new PipelineState(),
+ services: services);
+
+ var workflow = CreateCommunitiesWorkflow.Create();
+ await workflow(config, context, CancellationToken.None);
+
+ var communities = await outputStorage.LoadTableAsync(PipelineTableNames.Communities);
+ Assert.Equal(2, communities.Count);
+ Assert.Equal(communities.Count, Assert.IsType<int>(context.Items["create_communities:count"]));
+
+ var titleLookup = entities.ToDictionary(entity => entity.Id, entity => entity.Title, StringComparer.OrdinalIgnoreCase);
+
+ var members = communities
+ .Select(community => community.EntityIds
+ .Select(id => titleLookup[id])
+ .OrderBy(title => title, StringComparer.OrdinalIgnoreCase)
+ .ToArray())
+ .ToList();
+
+ Assert.Contains(members, group => group.SequenceEqual(new[] { "Alice", "Bob", "Charlie" }));
+ Assert.Contains(members, group => group.SequenceEqual(new[] { "Diana", "Eve" }));
+ }
+
+ public void Dispose()
+ {
+ try
+ {
+ if (Directory.Exists(_rootDir))
+ {
+ Directory.Delete(_rootDir, recursive: true);
+ }
+ }
+ catch
+ {
+ // Ignore cleanup errors in tests.
+ }
+ }
+
+ private string PrepareDirectory(string name)
+ {
+ var path = Path.Combine(_rootDir, name);
+ Directory.CreateDirectory(path);
+ return path;
+ }
+}