Merged
Commits
22 commits
9bb4e63
RAG, no provenance
sroussey Jan 6, 2026
c0b2205
Refactor document handling and introduce DocumentRepository
sroussey Jan 6, 2026
7eb4ca0
Refactor vector search methods to use similaritySearch
sroussey Jan 6, 2026
80a35d0
[refactor] Update constructor signatures in AiTask and JobQueueTask t…
sroussey Jan 6, 2026
ffcc8b9
[refactor] move some type utilities around
sroussey Jan 6, 2026
4fbf12e
[chore] Update dependencies in bun.lock and package.json
sroussey Jan 8, 2026
19c79d9
[feat] Add NeuroBERT NER model sample to ONNXModelSamples
sroussey Jan 8, 2026
d83f6b2
[refactor] Enhance TextEmbeddingTask and HFT_JobRunFns for array inpu…
sroussey Jan 8, 2026
5a52603
[refactor] Enhance Task and Workflow classes for improved schema vali…
sroussey Jan 8, 2026
88e6de9
[feat] Enhance TFMP_TextEmbedding to support array input handling
sroussey Jan 9, 2026
f394f08
[refactor] Update JsonSchema and TypedArray so typed arrays are deser…
sroussey Jan 9, 2026
428d7a9
[refactor] Remove EdgeVecRepository for now
sroussey Jan 9, 2026
b73d90f
[refactor] Update tabular repository implementations to support Typed…
sroussey Jan 9, 2026
d24994d
[refactor] Revamp vector repository implementations for enhanced sche…
sroussey Jan 9, 2026
e7ac63c
[feat] Introduce new vector-related tasks for enhanced workflow tests
sroussey Jan 9, 2026
b361c57
[fix] for dimentions
sroussey Jan 9, 2026
b9216eb
[refactor] Clean up imports and simplify type definitions in Supabase…
sroussey Jan 10, 2026
8cc8617
[refactor] Update repository structure
sroussey Jan 10, 2026
759d631
[refactor] Standardize imports and update type definitions across tasks
sroussey Jan 10, 2026
f8ce8d5
[refactor] move tests
sroussey Jan 10, 2026
05cff0a
[refactor] Simplify task and repository structures for document chunk…
sroussey Jan 11, 2026
c642396
[refactor] Revise vector handling and repository structure for docume…
sroussey Jan 11, 2026
2 changes: 1 addition & 1 deletion .cursor/rules/testing.mdc
@@ -4,4 +4,4 @@ alwaysApply: true

Use bun, not jest, not npm, not node

To run tests: `bun test`. To run specific ones: `bun run <testfilename>`.
To run tests: `bun test`. To run specific ones: `bun test <testfilename>`.
62 changes: 33 additions & 29 deletions bun.lock

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/background/06_run_graph_orchestration.md
@@ -10,7 +10,7 @@ The editor DAG is defined by the end user and saved in the database (tasks and d

## Graph

The graph is a DAG. It is a list of tasks and a list of dataflows. The tasks are the nodes and the dataflows are the connections between task outputs and inputs, plus status and provenance.
The graph is a DAG. It is a list of tasks and a list of dataflows. The tasks are the nodes and the dataflows are the connections between task outputs and inputs, plus status.

We expose events for graphs, tasks, and dataflows. A suspend/resume could be added for bulk creation. This helps keep a UI in sync as the graph runs.

2 changes: 1 addition & 1 deletion docs/background/07_intrumentation.md
@@ -7,5 +7,5 @@ Instrumentation is the process of adding code to a program to collect data about
Some of these tools cost money, so we need to track and estimate costs.

- Tasks emit status/progress events (`TaskStatus`, progress percent)
- Dataflows emit start/complete/error events and carry provenance
- Dataflows emit start/complete/error events
- Task graphs emit start/progress/complete/error events
4 changes: 1 addition & 3 deletions docs/developers/02_architecture.md
@@ -242,11 +242,10 @@ classDiagram

class TaskGraphRunner{
Map<number, Task[]> layers
Map<unknown, TaskInput> provenanceInput
TaskGraph dag
TaskOutputRepository repository
assignLayers(Task[] sortedNodes)
runGraph(TaskInput parentProvenance) TaskOutput
runGraph(TaskInput input) TaskOutput
runGraphReactive() TaskOutput
}

@@ -255,7 +254,6 @@
The TaskGraphRunner is responsible for executing tasks in a task graph. Key features include:

- **Layer-based Execution**: Tasks are organized into layers based on dependencies, allowing parallel execution of independent tasks
- **Provenance Tracking**: Tracks the lineage and input data that led to each task's output
- **Caching Support**: Can use a TaskOutputRepository to cache task outputs and avoid re-running tasks
- **Reactive Mode**: Supports reactive execution where tasks can respond to input changes without full re-execution
- **Smart Task Scheduling**: Automatically determines task execution order based on dependencies
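The layer-based execution described above can be sketched as a simple topological layering: each task's layer is one past the deepest layer of its dependencies, so tasks sharing a layer are independent and may run in parallel. This is an illustrative sketch under that assumption, not the actual `TaskGraphRunner.assignLayers` implementation (the helper and its types are hypothetical):

```typescript
// Illustrative layer assignment for a DAG (not the real TaskGraphRunner code).
// deps maps each task id to the ids it depends on.
type TaskId = string;

function assignLayers(deps: Map<TaskId, TaskId[]>): Map<TaskId, number> {
  const layers = new Map<TaskId, number>();
  const visit = (id: TaskId): number => {
    const cached = layers.get(id);
    if (cached !== undefined) return cached;
    const parents = deps.get(id) ?? [];
    // A task's layer is one past the deepest of its dependencies;
    // roots land in layer 0. (Assumes the graph is acyclic.)
    const layer =
      parents.length === 0 ? 0 : 1 + Math.max(...parents.map(visit));
    layers.set(id, layer);
    return layer;
  };
  for (const id of deps.keys()) visit(id);
  return layers;
}

// Example DAG: a → b, a → c, then (b, c) → d.
const layers = assignLayers(
  new Map([
    ["a", []],
    ["b", ["a"]],
    ["c", ["a"]],
    ["d", ["b", "c"]],
  ])
);
// b and c share layer 1, so they can run in parallel; d waits in layer 2.
```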
241 changes: 240 additions & 1 deletion docs/developers/03_extending.md
@@ -6,6 +6,7 @@ This document covers how to write your own tasks. For a more practical guide to
- [Tasks must have a `run()` method](#tasks-must-have-a-run-method)
- [Define Inputs and Outputs](#define-inputs-and-outputs)
- [Register the Task](#register-the-task)
- [Schema Format Annotations](#schema-format-annotations)
- [Job Queues and LLM tasks](#job-queues-and-llm-tasks)
- [Write a new Compound Task](#write-a-new-compound-task)
- [Reactive Task UIs](#reactive-task-uis)
@@ -117,7 +118,7 @@ To use the Task in Workflow, there are a few steps:

```ts
export const simpleDebug = (input: DebugLogTaskInput) => {
return new SimpleDebugTask(input).run();
return new SimpleDebugTask({} as DebugLogTaskInput, {}).run(input);
};

declare module "@workglow/task-graph" {
@@ -129,6 +130,103 @@ declare module "@workglow/task-graph" {
Workflow.prototype.simpleDebug = CreateWorkflow(SimpleDebugTask);
```

## Schema Format Annotations

When defining task input schemas, you can use `format` annotations to enable automatic resolution of string identifiers to object instances. The TaskRunner inspects input schemas and resolves annotated string values before task execution.

### Built-in Format Annotations

The system supports several format annotations out of the box:

| Format | Description | Helper Function |
| --------------------- | ----------------------------------- | -------------------------- |
| `model` | Any AI model configuration | — |
| `model:TaskName` | Model compatible with specific task | — |
| `repository:tabular` | Tabular data repository | `TypeTabularRepository()` |
| `repository:vector` | Vector storage repository | `TypeVectorRepository()` |
| `repository:document` | Document repository | `TypeDocumentRepository()` |

### Example: Using Format Annotations

```typescript
import { Task, type DataPortSchema } from "@workglow/task-graph";
import { TypeTabularRepository } from "@workglow/storage";
import { FromSchema } from "@workglow/util";

const MyTaskInputSchema = {
type: "object",
properties: {
// Model input - accepts string ID or ModelConfig object
model: {
title: "AI Model",
description: "Model for text generation",
format: "model:TextGenerationTask",
oneOf: [
{ type: "string", title: "Model ID" },
{ type: "object", title: "Model Config" },
],
},
// Repository input - uses helper function
dataSource: TypeTabularRepository({
title: "Data Source",
description: "Repository containing source data",
}),
// Regular string input (no resolution)
prompt: { type: "string", title: "Prompt" },
},
required: ["model", "dataSource", "prompt"],
} as const satisfies DataPortSchema;

type MyTaskInput = FromSchema<typeof MyTaskInputSchema>;

export class MyTask extends Task<MyTaskInput> {
static readonly type = "MyTask";
static inputSchema = () => MyTaskInputSchema;

async executeReactive(input: MyTaskInput) {
// By the time execute runs, model is a ModelConfig object
// and dataSource is an ITabularRepository instance
const { model, dataSource, prompt } = input;
// ...
}
}
```

### Creating Custom Format Resolvers

You can extend the resolution system by registering custom resolvers:

```typescript
import { registerInputResolver } from "@workglow/util";

// Register a resolver for "template:*" formats
registerInputResolver("template", async (id, format, registry) => {
const templateRepo = registry.get(TEMPLATE_REPOSITORY);
const template = await templateRepo.findById(id);
if (!template) {
throw new Error(`Template "${id}" not found`);
}
return template;
});
```

Then use it in your schemas:

```typescript
const inputSchema = {
type: "object",
properties: {
emailTemplate: {
type: "string",
format: "template:email",
title: "Email Template",
},
},
};
```

When a task runs with `{ emailTemplate: "welcome-email" }`, the resolver automatically converts it to the template object before execution.

## Job Queues and LLM tasks

We separate any long running tasks as Jobs. Jobs could potentially be run anywhere, either locally in the same thread, in separate threads, or on a remote server. A job queue will manage these for a single provider (like OpenAI, or a local Transformers.js ONNX runtime), and handle backoff, retries, etc.
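The queue-per-provider idea above can be sketched with a minimal retry-with-backoff loop. This is a hypothetical illustration (the class name, constructor parameters, and `runJob` method are invented for this sketch and are not the real job-queue API):

```typescript
// Minimal sketch of a per-provider job queue with exponential backoff.
// Not the actual @workglow job-queue implementation.
type Job<T> = () => Promise<T>;

class ProviderJobQueue {
  constructor(
    private maxRetries = 3,
    private baseDelayMs = 100
  ) {}

  async runJob<T>(job: Job<T>): Promise<T> {
    let attempt = 0;
    for (;;) {
      try {
        return await job();
      } catch (err) {
        attempt++;
        if (attempt > this.maxRetries) throw err;
        // Exponential backoff: baseDelayMs, 2x, 4x, ...
        const delay = this.baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
}

// Usage: a flaky job that fails twice, then succeeds on the third attempt.
let attempts = 0;
const result = await new ProviderJobQueue(3, 1).runJob(async () => {
  attempts++;
  if (attempts < 3) throw new Error("transient failure");
  return "ok";
});
```

A real queue would additionally serialize or rate-limit jobs per provider and distinguish retryable errors from permanent ones; this sketch only shows the backoff/retry core.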
@@ -148,3 +246,144 @@
## Reactive Task UIs

Tasks can be reactive at a certain level. This means that they can be triggered by changes in the data they depend on, without "running" the expensive job based task runs. This is useful for a UI node editor. For example, you change a color in one task and it is propagated downstream without incurring costs for re-running the entire graph. It is like a spreadsheet where changing a cell can trigger a recalculation of other cells. This is implemented via a `runReactive()` method that is called when the data changes. Typically, the `run()` will call `runReactive()` on itself at the end of the method.
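The `run()`/`runReactive()` split described above can be sketched as follows. The task class and its fields here are invented for illustration (the real `Task` base class has a different shape); the point is only the pattern: the reactive method is cheap and safe to call on every input change, while `run()` does the expensive work and finishes by invoking the reactive part:

```typescript
// Illustrative sketch of the run()/runReactive() pattern, not the real
// Task base class.
class ColorLabelTask {
  output: { label?: string; expensive?: string } = {};

  constructor(public input: { color: string }) {}

  // Cheap and synchronous: recomputes derived output on input changes,
  // keeping a node-editor UI in sync without re-running expensive jobs.
  runReactive() {
    this.output.label = `Color: ${this.input.color}`;
    return this.output;
  }

  // Expensive (stands in for a model call or job-queue run); ends by
  // calling runReactive() so the cheap derivations stay consistent.
  async run() {
    this.output.expensive = await Promise.resolve(
      this.input.color.toUpperCase()
    );
    return this.runReactive();
  }
}

const task = new ColorLabelTask({ color: "red" });
const out = await task.run();
```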

## AI and RAG Tasks

The `@workglow/ai` package provides a comprehensive set of tasks for building RAG (Retrieval-Augmented Generation) pipelines. These tasks are designed to chain together in workflows without requiring external loops.

### Document Processing Tasks

| Task | Description |
| ------------------------- | ----------------------------------------------------- |
| `StructuralParserTask` | Parses markdown/text into hierarchical document trees |
| `TextChunkerTask` | Splits text into chunks with configurable strategies |
| `HierarchicalChunkerTask` | Token-aware chunking that respects document structure |
| `TopicSegmenterTask` | Segments text by topic using heuristics or embeddings |
| `DocumentEnricherTask` | Adds summaries and entities to document nodes |

### Vector and Embedding Tasks

| Task | Description |
| ------------------------------ | ---------------------------------------------- |
| `TextEmbeddingTask` | Generates embeddings using configurable models |
| `ChunkToVectorTask` | Transforms chunks to vector store format |
| `DocumentNodeVectorUpsertTask` | Stores vectors in a repository |
| `DocumentNodeVectorSearchTask` | Searches vectors by similarity |
| `VectorQuantizeTask` | Quantizes vectors for storage efficiency |

### Retrieval and Generation Tasks

| Task | Description |
| ------------------------------------ | --------------------------------------------- |
| `QueryExpanderTask` | Expands queries for better retrieval coverage |
| `DocumentNodeVectorHybridSearchTask` | Combines vector and full-text search |
| `RerankerTask` | Reranks search results for relevance |
| `HierarchyJoinTask` | Enriches results with parent context |
| `ContextBuilderTask` | Builds context for LLM prompts |
| `DocumentNodeRetrievalTask` | Orchestrates end-to-end retrieval |
| `TextQuestionAnswerTask` | Generates answers from context |
| `TextGenerationTask` | General text generation |

### Chainable RAG Pipeline Example

Tasks chain together through compatible input/output schemas:

```typescript
import { Workflow } from "@workglow/task-graph";
import { InMemoryVectorRepository } from "@workglow/storage";

const vectorRepo = new InMemoryVectorRepository();
await vectorRepo.setupDatabase();

// Document ingestion pipeline
await new Workflow()
.structuralParser({
text: markdownContent,
title: "My Document",
format: "markdown",
})
.documentEnricher({
generateSummaries: true,
extractEntities: true,
})
.hierarchicalChunker({
maxTokens: 512,
overlap: 50,
strategy: "hierarchical",
})
.textEmbedding({
model: "Xenova/all-MiniLM-L6-v2",
})
.chunkToVector()
.vectorStoreUpsert({
repository: vectorRepo,
})
.run();
```

### Retrieval Pipeline Example

```typescript
const answer = await new Workflow()
.textEmbedding({
text: query,
model: "Xenova/all-MiniLM-L6-v2",
})
.vectorStoreSearch({
repository: vectorRepo,
topK: 10,
})
.reranker({
query,
topK: 5,
})
.contextBuilder({
format: "markdown",
maxLength: 2000,
})
.textQuestionAnswer({
question: query,
model: "Xenova/LaMini-Flan-T5-783M",
})
.run();
```

### Hierarchical Document Structure

Documents are represented as trees with typed nodes:

```typescript
type DocumentNode =
| DocumentRootNode // Root of document
| SectionNode // Headers, structural sections
| ParagraphNode // Text blocks
| SentenceNode // Fine-grained (optional)
| TopicNode; // Detected topic segments

// Each node contains:
interface BaseNode {
nodeId: string; // Deterministic content-based ID
range: { start: number; end: number };
text: string;
enrichment?: {
summary?: string;
entities?: Entity[];
keywords?: string[];
};
}
```

### Task Data Flow

Each task passes through what the next task needs:

| Task | Passes Through | Adds |
| --------------------- | ------------------------ | ------------------------------------- |
| `structuralParser` | - | `doc_id`, `documentTree`, `nodeCount` |
| `documentEnricher` | `doc_id`, `documentTree` | `summaryCount`, `entityCount` |
| `hierarchicalChunker` | `doc_id` | `chunks`, `text[]`, `count` |
| `textEmbedding` | (implicit) | `vector[]` |
| `chunkToVector` | - | `ids[]`, `vectors[]`, `metadata[]` |
| `vectorStoreUpsert` | - | `count`, `ids` |

This design eliminates the need for external loops - the entire pipeline chains together naturally.
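The pass-through behavior in the table above can be sketched as a merge: each step's new outputs are combined with the fields from the previous step that downstream tasks still need, so no external loop has to carry state between steps. The `chain` helper here is hypothetical, written only to illustrate the idea:

```typescript
// Hypothetical helper illustrating pass-through between pipeline steps.
type StepOutput = Record<string, unknown>;

function chain(
  prev: StepOutput,
  passThroughKeys: string[],
  added: StepOutput
): StepOutput {
  const out: StepOutput = { ...added };
  // Carry forward only the declared pass-through fields.
  for (const key of passThroughKeys) {
    if (key in prev) out[key] = prev[key];
  }
  return out;
}

// structuralParser → documentEnricher, per the data-flow table:
// the enricher passes through doc_id and documentTree, adds counts,
// and drops nodeCount.
const parsed = { doc_id: "d1", documentTree: {}, nodeCount: 3 };
const enriched = chain(parsed, ["doc_id", "documentTree"], {
  summaryCount: 2,
  entityCount: 5,
});
```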
10 changes: 5 additions & 5 deletions package.json
@@ -32,7 +32,7 @@
"publish": "bun ./scripts/publish-workspaces.ts"
},
"dependencies": {
"caniuse-lite": "^1.0.30001761"
"caniuse-lite": "^1.0.30001763"
},
"catalog": {
"@sroussey/transformers": "3.8.2",
@@ -44,17 +44,17 @@
"devDependencies": {
"@sroussey/changesets-cli": "^2.29.7",
"@types/bun": "^1.3.5",
"@typescript-eslint/eslint-plugin": "^8.50.1",
"@typescript-eslint/parser": "^8.50.1",
"@typescript-eslint/eslint-plugin": "^8.52.0",
"@typescript-eslint/parser": "^8.52.0",
"concurrently": "^9.2.1",
"eslint": "^9.39.2",
"eslint-plugin-jsx-a11y": "^6.10.2",
"eslint-plugin-react": "^7.37.5",
"eslint-plugin-react-hooks": "^7.0.1",
"eslint-plugin-regexp": "^2.10.0",
"globals": "^16.5.0",
"globals": "^17.0.0",
"prettier": "^3.7.4",
"turbo": "^2.7.2",
"turbo": "^2.7.3",
"typescript": "5.9.3",
"vitest": "^4.0.16"
},
2 changes: 1 addition & 1 deletion packages/ai-provider/README.md
@@ -138,7 +138,7 @@ const task = new TextEmbeddingTask({
});

const result = await task.execute();
// result.vector: TypedArray - Vector embedding
// result.vector: Vector - Vector embedding
```

**Text Translation:**