Skip to content

Commit 62be244

Browse files
authored
Rag v8 (#176)
* RAG * quick versions of tasks as functions now pass input to run not the constructor which means no defaults and cloning * [refactor] Removed unnecessary checks for undefined values when copying additional input properties. * [refactor] Enhance tasks with service registry integration * [feat] Introduce input resolver system for enhanced schema handling * [feat] Introduce new AI tasks * [refactor] Remove ArrayTask from between JobQueueTask and Task. Refactor AI task schemas to simplify model handling * [refactor] Remove Provenance from task and task graph * Refactor document handling and introduce DocumentRepository * Refactor vector search methods to use similaritySearch * [refactor] Update constructor signatures in AiTask and JobQueueTask to use Partial<Input> like their parents. * [refactor] Enhance TextEmbeddingTask and HFT_JobRunFns and TFMP_JobRunFns for array input text handling * [refactor] Update JsonSchema and TypedArray so typed arrays are deserialized * [refactor] Update tabular repository implementations to support TypedArraySchemaOptions, simplify overall
1 parent 37a8753 commit 62be244

173 files changed

Lines changed: 15263 additions & 1924 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.cursor/rules/testing.mdc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,4 +4,4 @@ alwaysApply: true
44

55
Use bun, not jest, not npm, not node
66

7-
To run tests: `bun test`. To run specific ones: `bun run <testfilename>`.
7+
To run tests: `bun test`. To run specific ones: `bun test <testfilename>`.

bun.lock

Lines changed: 33 additions & 29 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

docs/background/06_run_graph_orchestration.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ The editor DAG is defined by the end user and saved in the database (tasks and d
1010

1111
## Graph
1212

13-
The graph is a DAG. It is a list of tasks and a list of dataflows. The tasks are the nodes and the dataflows are the connections between task outputs and inputs, plus status and provenance.
13+
The graph is a DAG. It is a list of tasks and a list of dataflows. The tasks are the nodes and the dataflows are the connections between task outputs and inputs, plus status.
1414

1515
We expose events for graphs, tasks, and dataflows. A suspend/resume could be added for bulk creation. This helps keep a UI in sync as the graph runs.
1616

docs/background/07_intrumentation.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,5 +7,5 @@ Instrumentation is the process of adding code to a program to collect data about
77
Some of these tools cost money, so we need to track and estimate costs.
88

99
- Tasks emit status/progress events (`TaskStatus`, progress percent)
10-
- Dataflows emit start/complete/error events and carry provenance
10+
- Dataflows emit start/complete/error events
1111
- Task graphs emit start/progress/complete/error events

docs/developers/02_architecture.md

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -242,11 +242,10 @@ classDiagram
242242
243243
class TaskGraphRunner{
244244
Map<number, Task[]> layers
245-
Map<unknown, TaskInput> provenanceInput
246245
TaskGraph dag
247246
TaskOutputRepository repository
248247
assignLayers(Task[] sortedNodes)
249-
runGraph(TaskInput parentProvenance) TaskOutput
248+
runGraph(TaskInput input) TaskOutput
250249
runGraphReactive() TaskOutput
251250
}
252251
@@ -255,7 +254,6 @@ classDiagram
255254
The TaskGraphRunner is responsible for executing tasks in a task graph. Key features include:
256255

257256
- **Layer-based Execution**: Tasks are organized into layers based on dependencies, allowing parallel execution of independent tasks
258-
- **Provenance Tracking**: Tracks the lineage and input data that led to each task's output
259257
- **Caching Support**: Can use a TaskOutputRepository to cache task outputs and avoid re-running tasks
260258
- **Reactive Mode**: Supports reactive execution where tasks can respond to input changes without full re-execution
261259
- **Smart Task Scheduling**: Automatically determines task execution order based on dependencies

docs/developers/03_extending.md

Lines changed: 240 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ This document covers how to write your own tasks. For a more practical guide to
66
- [Tasks must have a `run()` method](#tasks-must-have-a-run-method)
77
- [Define Inputs and Outputs](#define-inputs-and-outputs)
88
- [Register the Task](#register-the-task)
9+
- [Schema Format Annotations](#schema-format-annotations)
910
- [Job Queues and LLM tasks](#job-queues-and-llm-tasks)
1011
- [Write a new Compound Task](#write-a-new-compound-task)
1112
- [Reactive Task UIs](#reactive-task-uis)
@@ -117,7 +118,7 @@ To use the Task in Workflow, there are a few steps:
117118

118119
```ts
119120
export const simpleDebug = (input: DebugLogTaskInput) => {
120-
return new SimpleDebugTask(input).run();
121+
return new SimpleDebugTask({} as DebugLogTaskInput, {}).run(input);
121122
};
122123

123124
declare module "@workglow/task-graph" {
@@ -129,6 +130,103 @@ declare module "@workglow/task-graph" {
129130
Workflow.prototype.simpleDebug = CreateWorkflow(SimpleDebugTask);
130131
```
131132

133+
## Schema Format Annotations
134+
135+
When defining task input schemas, you can use `format` annotations to enable automatic resolution of string identifiers to object instances. The TaskRunner inspects input schemas and resolves annotated string values before task execution.
136+
137+
### Built-in Format Annotations
138+
139+
The system supports several format annotations out of the box:
140+
141+
| Format | Description | Helper Function |
142+
| --------------------- | ----------------------------------- | -------------------------- |
143+
| `model` | Any AI model configuration ||
144+
| `model:TaskName` | Model compatible with specific task ||
145+
| `repository:tabular` | Tabular data repository | `TypeTabularRepository()` |
146+
| `repository:vector` | Vector storage repository | `TypeVectorRepository()` |
147+
| `repository:document` | Document repository | `TypeDocumentRepository()` |
148+
149+
### Example: Using Format Annotations
150+
151+
```typescript
152+
import { Task, type DataPortSchema } from "@workglow/task-graph";
153+
import { TypeTabularRepository } from "@workglow/storage";
154+
import { FromSchema } from "@workglow/util";
155+
156+
const MyTaskInputSchema = {
157+
type: "object",
158+
properties: {
159+
// Model input - accepts string ID or ModelConfig object
160+
model: {
161+
title: "AI Model",
162+
description: "Model for text generation",
163+
format: "model:TextGenerationTask",
164+
oneOf: [
165+
{ type: "string", title: "Model ID" },
166+
{ type: "object", title: "Model Config" },
167+
],
168+
},
169+
// Repository input - uses helper function
170+
dataSource: TypeTabularRepository({
171+
title: "Data Source",
172+
description: "Repository containing source data",
173+
}),
174+
// Regular string input (no resolution)
175+
prompt: { type: "string", title: "Prompt" },
176+
},
177+
required: ["model", "dataSource", "prompt"],
178+
} as const satisfies DataPortSchema;
179+
180+
type MyTaskInput = FromSchema<typeof MyTaskInputSchema>;
181+
182+
export class MyTask extends Task<MyTaskInput> {
183+
static readonly type = "MyTask";
184+
static inputSchema = () => MyTaskInputSchema;
185+
186+
async executeReactive(input: MyTaskInput) {
187+
// By the time execute runs, model is a ModelConfig object
188+
// and dataSource is an ITabularRepository instance
189+
const { model, dataSource, prompt } = input;
190+
// ...
191+
}
192+
}
193+
```
194+
195+
### Creating Custom Format Resolvers
196+
197+
You can extend the resolution system by registering custom resolvers:
198+
199+
```typescript
200+
import { registerInputResolver } from "@workglow/util";
201+
202+
// Register a resolver for "template:*" formats
203+
registerInputResolver("template", async (id, format, registry) => {
204+
const templateRepo = registry.get(TEMPLATE_REPOSITORY);
205+
const template = await templateRepo.findById(id);
206+
if (!template) {
207+
throw new Error(`Template "${id}" not found`);
208+
}
209+
return template;
210+
});
211+
```
212+
213+
Then use it in your schemas:
214+
215+
```typescript
216+
const inputSchema = {
217+
type: "object",
218+
properties: {
219+
emailTemplate: {
220+
type: "string",
221+
format: "template:email",
222+
title: "Email Template",
223+
},
224+
},
225+
};
226+
```
227+
228+
When a task runs with `{ emailTemplate: "welcome-email" }`, the resolver automatically converts it to the template object before execution.
229+
132230
## Job Queues and LLM tasks
133231

134232
We separate any long running tasks as Jobs. Jobs could potentially be run anywhere, either locally in the same thread, in separate threads, or on a remote server. A job queue will manage these for a single provider (like OpenAI, or a local Transformers.js ONNX runtime), and handle backoff, retries, etc.
@@ -148,3 +246,144 @@ Compound Tasks are not cached (though any or all of their children may be).
148246
## Reactive Task UIs
149247

150248
Tasks can be reactive at a certain level. This means that they can be triggered by changes in the data they depend on, without "running" the expensive job based task runs. This is useful for a UI node editor. For example, you change a color in one task and it is propagated downstream without incurring costs for re-running the entire graph. It is like a spreadsheet where changing a cell can trigger a recalculation of other cells. This is implemented via a `runReactive()` method that is called when the data changes. Typically, the `run()` will call `runReactive()` on itself at the end of the method.
249+
250+
## AI and RAG Tasks
251+
252+
The `@workglow/ai` package provides a comprehensive set of tasks for building RAG (Retrieval-Augmented Generation) pipelines. These tasks are designed to chain together in workflows without requiring external loops.
253+
254+
### Document Processing Tasks
255+
256+
| Task | Description |
257+
| ------------------------- | ----------------------------------------------------- |
258+
| `StructuralParserTask` | Parses markdown/text into hierarchical document trees |
259+
| `TextChunkerTask` | Splits text into chunks with configurable strategies |
260+
| `HierarchicalChunkerTask` | Token-aware chunking that respects document structure |
261+
| `TopicSegmenterTask` | Segments text by topic using heuristics or embeddings |
262+
| `DocumentEnricherTask` | Adds summaries and entities to document nodes |
263+
264+
### Vector and Embedding Tasks
265+
266+
| Task | Description |
267+
| ------------------------------ | ---------------------------------------------- |
268+
| `TextEmbeddingTask` | Generates embeddings using configurable models |
269+
| `ChunkToVectorTask` | Transforms chunks to vector store format |
270+
| `DocumentNodeVectorUpsertTask` | Stores vectors in a repository |
271+
| `DocumentNodeVectorSearchTask` | Searches vectors by similarity |
272+
| `VectorQuantizeTask` | Quantizes vectors for storage efficiency |
273+
274+
### Retrieval and Generation Tasks
275+
276+
| Task | Description |
277+
| ------------------------------------ | --------------------------------------------- |
278+
| `QueryExpanderTask` | Expands queries for better retrieval coverage |
279+
| `DocumentNodeVectorHybridSearchTask` | Combines vector and full-text search |
280+
| `RerankerTask` | Reranks search results for relevance |
281+
| `HierarchyJoinTask` | Enriches results with parent context |
282+
| `ContextBuilderTask` | Builds context for LLM prompts |
283+
| `DocumentNodeRetrievalTask` | Orchestrates end-to-end retrieval |
284+
| `TextQuestionAnswerTask` | Generates answers from context |
285+
| `TextGenerationTask` | General text generation |
286+
287+
### Chainable RAG Pipeline Example
288+
289+
Tasks chain together through compatible input/output schemas:
290+
291+
```typescript
292+
import { Workflow } from "@workglow/task-graph";
293+
import { InMemoryVectorRepository } from "@workglow/storage";
294+
295+
const vectorRepo = new InMemoryVectorRepository();
296+
await vectorRepo.setupDatabase();
297+
298+
// Document ingestion pipeline
299+
await new Workflow()
300+
.structuralParser({
301+
text: markdownContent,
302+
title: "My Document",
303+
format: "markdown",
304+
})
305+
.documentEnricher({
306+
generateSummaries: true,
307+
extractEntities: true,
308+
})
309+
.hierarchicalChunker({
310+
maxTokens: 512,
311+
overlap: 50,
312+
strategy: "hierarchical",
313+
})
314+
.textEmbedding({
315+
model: "Xenova/all-MiniLM-L6-v2",
316+
})
317+
.chunkToVector()
318+
.vectorStoreUpsert({
319+
repository: vectorRepo,
320+
})
321+
.run();
322+
```
323+
324+
### Retrieval Pipeline Example
325+
326+
```typescript
327+
const answer = await new Workflow()
328+
.textEmbedding({
329+
text: query,
330+
model: "Xenova/all-MiniLM-L6-v2",
331+
})
332+
.vectorStoreSearch({
333+
repository: vectorRepo,
334+
topK: 10,
335+
})
336+
.reranker({
337+
query,
338+
topK: 5,
339+
})
340+
.contextBuilder({
341+
format: "markdown",
342+
maxLength: 2000,
343+
})
344+
.textQuestionAnswer({
345+
question: query,
346+
model: "Xenova/LaMini-Flan-T5-783M",
347+
})
348+
.run();
349+
```
350+
351+
### Hierarchical Document Structure
352+
353+
Documents are represented as trees with typed nodes:
354+
355+
```typescript
356+
type DocumentNode =
357+
| DocumentRootNode // Root of document
358+
| SectionNode // Headers, structural sections
359+
| ParagraphNode // Text blocks
360+
| SentenceNode // Fine-grained (optional)
361+
| TopicNode; // Detected topic segments
362+
363+
// Each node contains:
364+
interface BaseNode {
365+
nodeId: string; // Deterministic content-based ID
366+
range: { start: number; end: number };
367+
text: string;
368+
enrichment?: {
369+
summary?: string;
370+
entities?: Entity[];
371+
keywords?: string[];
372+
};
373+
}
374+
```
375+
376+
### Task Data Flow
377+
378+
Each task passes through what the next task needs:
379+
380+
| Task | Passes Through | Adds |
381+
| --------------------- | ------------------------ | ------------------------------------- |
382+
| `structuralParser` | - | `doc_id`, `documentTree`, `nodeCount` |
383+
| `documentEnricher` | `doc_id`, `documentTree` | `summaryCount`, `entityCount` |
384+
| `hierarchicalChunker` | `doc_id` | `chunks`, `text[]`, `count` |
385+
| `textEmbedding` | (implicit) | `vector[]` |
386+
| `chunkToVector` | - | `ids[]`, `vectors[]`, `metadata[]` |
387+
| `vectorStoreUpsert` | - | `count`, `ids` |
388+
389+
This design eliminates the need for external loops - the entire pipeline chains together naturally.

package.json

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@
3232
"publish": "bun ./scripts/publish-workspaces.ts"
3333
},
3434
"dependencies": {
35-
"caniuse-lite": "^1.0.30001761"
35+
"caniuse-lite": "^1.0.30001763"
3636
},
3737
"catalog": {
3838
"@sroussey/transformers": "3.8.2",
@@ -44,17 +44,17 @@
4444
"devDependencies": {
4545
"@sroussey/changesets-cli": "^2.29.7",
4646
"@types/bun": "^1.3.5",
47-
"@typescript-eslint/eslint-plugin": "^8.50.1",
48-
"@typescript-eslint/parser": "^8.50.1",
47+
"@typescript-eslint/eslint-plugin": "^8.52.0",
48+
"@typescript-eslint/parser": "^8.52.0",
4949
"concurrently": "^9.2.1",
5050
"eslint": "^9.39.2",
5151
"eslint-plugin-jsx-a11y": "^6.10.2",
5252
"eslint-plugin-react": "^7.37.5",
5353
"eslint-plugin-react-hooks": "^7.0.1",
5454
"eslint-plugin-regexp": "^2.10.0",
55-
"globals": "^16.5.0",
55+
"globals": "^17.0.0",
5656
"prettier": "^3.7.4",
57-
"turbo": "^2.7.2",
57+
"turbo": "^2.7.3",
5858
"typescript": "5.9.3",
5959
"vitest": "^4.0.16"
6060
},

packages/ai-provider/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -138,7 +138,7 @@ const task = new TextEmbeddingTask({
138138
});
139139

140140
const result = await task.execute();
141-
// result.vector: TypedArray - Vector embedding
141+
// result.vector: Vector - Vector embedding
142142
```
143143

144144
**Text Translation:**

0 commit comments

Comments
 (0)