[feat] Repo registries and RAG workflows #154
base: main
…ved input data handling for tests
- Replaced structuredClone and JSON methods with a new smartClone function that deep-clones plain objects and arrays while preserving class instances by reference.
- Quick versions of tasks as functions now pass input to run, not the constructor, which means no defaults and cloning.
…ng additional input properties.
Pull request overview
This PR introduces a new VectorQuantizeTask for efficient vector quantization and refactors vector utilities into reusable modules. The changes improve code organization by extracting common vector operations from VectorSimilarityTask into dedicated utility files.
- New VectorQuantizeTask supporting multiple quantization types (INT8, UINT8, INT16, UINT16, FLOAT16, FLOAT32, FLOAT64)
- Refactored vector utilities into VectorUtils and VectorSimilarityUtils modules for reusability
- Updated VectorSimilarityTask to use the new utility functions and renamed the similarity parameter to method
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/util/src/vector/VectorUtils.ts | New utility module providing magnitude, inner product, and normalize functions for vector operations |
| packages/util/src/vector/VectorSimilarityUtils.ts | New utility module with cosine, Jaccard, and Hamming similarity/distance calculations |
| packages/util/src/vector/TypedArray.ts | Type definitions and JSON schemas for supported typed array types (Float16/32/64, Int8/16, Uint8/16) |
| packages/util/src/vector/Tensor.ts | Schema definitions for tensor/vector data structures with type, data, shape, and normalization properties |
| packages/util/src/json-schema/SchemaValidation.ts | Updated import to use @sroussey/json-schema-library package |
| packages/util/src/common.ts | Added exports for new vector utility modules |
| packages/util/package.json | Updated dependency from json-schema-library to @sroussey/json-schema-library |
| packages/test/src/test/task/VectorQuantizeTask.test.ts | Comprehensive test suite for VectorQuantizeTask covering all quantization types and edge cases |
| packages/task-graph/src/task/Task.ts | Updated stripSymbols to preserve TypedArrays by detecting ArrayBuffer views |
| packages/ai/src/task/index.ts | Added export for VectorQuantizeTask |
| packages/ai/src/task/base/AiTaskSchemas.ts | Refactored to import TypedArray and related types from @workglow/util, removed duplicate definitions |
| packages/ai/src/task/VectorSimilarityTask.ts | Refactored to use imported similarity functions from @workglow/util, removed local implementations, renamed similarity parameter to method |
| packages/ai/src/task/VectorQuantizeTask.ts | New task implementing vector quantization with normalization and multiple target type support |
| packages/ai/src/task/TextEmbeddingTask.ts | Updated imports to use TypedArraySchema from @workglow/util |
| packages/ai/src/task/ImageEmbeddingTask.ts | Updated imports to use TypedArraySchema from @workglow/util |
| packages/ai-provider/src/hf-transformers/common/HFT_JobRunFns.ts | Updated to import TypedArray from @workglow/util instead of @workglow/ai |
| packages/ai-provider/README.md | Updated comment to use "Vector" instead of "TypedArray" in code example |
| bun.lock | Updated lockfile with new @sroussey/json-schema-library dependency |
| private quantizeToUint8(values: number[]): Uint8Array { | ||
| // Find min/max for scaling | ||
| const min = Math.min(...values); | ||
| const max = Math.max(...values); | ||
| const range = max - min || 1; | ||
|
|
||
| // Scale to [0, 255] | ||
| return new Uint8Array(values.map((v) => Math.round(((v - min) / range) * 255))); | ||
| } |
Copilot
AI
Jan 3, 2026
The quantizeToUint8 and quantizeToUint16 methods use spread operator with Math.min/Math.max on the entire values array. For large vectors, this is inefficient as it creates multiple intermediate arrays. Consider using a single loop to find both min and max values simultaneously, which would be more performant and memory-efficient.
@copilot open a new pull request to apply changes based on this feedback
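For reference, a minimal sketch of the single-pass approach suggested above. It is written as a standalone function rather than the private method in VectorQuantizeTask, and the empty-input guard is an assumption about the desired behaviour. Besides avoiding the extra passes, a plain loop sidesteps the argument-count limit that `Math.min(...values)` can hit on very large arrays.

```typescript
// Standalone sketch (hypothetical name and placement), not the method from the PR.
function quantizeToUint8(values: ArrayLike<number>): Uint8Array {
  const out = new Uint8Array(values.length);
  if (values.length === 0) return out; // assumed guard: empty in, empty out

  // One pass finds both extremes without spreading the array into arguments.
  let min = values[0];
  let max = values[0];
  for (let i = 1; i < values.length; i++) {
    const v = values[i];
    if (v < min) min = v;
    if (v > max) max = v;
  }

  // Same fallback as the quoted code: a constant vector gets range 1 and maps to 0.
  const range = max - min || 1;

  // Scale to [0, 255] and write straight into the output buffer.
  for (let i = 0; i < values.length; i++) {
    out[i] = Math.round(((values[i] - min) / range) * 255);
  }
  return out;
}
```

The Uint16 variant would differ only in the output type and the 65535 scale factor.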
| export function magnitude(arr: TypedArray | number[]): number { | ||
| // @ts-ignore - Vector reduce works but TS doesn't recognize it | ||
| return Math.sqrt(arr.reduce((acc, val) => acc + val * val, 0)); | ||
| } | ||
|
|
||
| /** | ||
| * Calculates the inner (dot) product of two vectors | ||
| */ | ||
| export function inner(arr1: TypedArray, arr2: TypedArray): number { | ||
| // @ts-ignore - Vector reduce works but TS doesn't recognize it | ||
| return arr1.reduce((acc, val, i) => acc + val * arr2[i], 0); | ||
| } | ||
|
|
||
| /** | ||
| * Normalizes a vector to unit length (L2 normalization) | ||
| * | ||
| * @param vector - The vector to normalize | ||
| * @param throwOnZero - If true, throws an error for zero vectors. If false, returns the original vector. | ||
| * @returns Normalized vector with the same type as input | ||
| */ | ||
| export function normalize(vector: TypedArray, throwOnZero = true): TypedArray { | ||
| const mag = magnitude(vector); | ||
|
|
||
| if (mag === 0) { | ||
| if (throwOnZero) { | ||
| throw new Error("Cannot normalize a zero vector."); | ||
| } | ||
| return vector; | ||
| } | ||
|
|
||
| const normalized = Array.from(vector).map((val) => Number(val) / mag); | ||
|
|
||
| // Preserve the original Vector type | ||
| if (vector instanceof Float64Array) { | ||
| return new Float64Array(normalized); | ||
| } | ||
| if (vector instanceof Float32Array) { | ||
| return new Float32Array(normalized); | ||
| } | ||
| if (vector instanceof Int8Array) { | ||
| return new Int8Array(normalized); | ||
| } | ||
| if (vector instanceof Uint8Array) { | ||
| return new Uint8Array(normalized); | ||
| } | ||
| if (vector instanceof Int16Array) { | ||
| return new Int16Array(normalized); | ||
| } | ||
| if (vector instanceof Uint16Array) { | ||
| return new Uint16Array(normalized); | ||
| } | ||
| // For other integer arrays, use Float32Array since normalization produces floats | ||
| return new Float32Array(normalized); | ||
| } | ||
|
|
||
| /** | ||
| * Normalizes an array of numbers to unit length (L2 normalization) | ||
| * | ||
| * @param values - The array of numbers to normalize | ||
| * @param throwOnZero - If true, throws an error for zero vectors. If false, returns the original array. | ||
| * @returns Normalized array of numbers | ||
| */ | ||
| export function normalizeNumberArray(values: number[], throwOnZero = false): number[] { | ||
| const norm = magnitude(values); | ||
|
|
||
| if (norm === 0) { | ||
| if (throwOnZero) { | ||
| throw new Error("Cannot normalize a zero vector."); | ||
| } | ||
| return values; | ||
| } | ||
|
|
||
| return values.map((v) => v / norm); | ||
| } |
Copilot
AI
Jan 3, 2026
The newly introduced VectorUtils module (magnitude, inner, normalize, normalizeNumberArray functions) lacks test coverage. Given that the repository has comprehensive testing for other utility functions, these vector utility functions should also have tests to ensure correctness, especially for edge cases like zero vectors, different typed array types, and Float16Array handling.
@copilot open a new pull request to apply changes based on this feedback
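A rough sketch of the edge cases such tests could cover. The two functions are re-declared locally (following the quoted diff) so the block runs on its own; in the repo they would be imported from the util package instead. The `check` helper is a hypothetical stand-in for the project's test runner. The last check is an observation rather than a fix: fractional values stored into an integer typed array are truncated, so normalizing an Int8Array in place of a float array loses the result.

```typescript
// Local copies for illustration only; the real ones live in VectorUtils.ts.
function magnitude(arr: ArrayLike<number>): number {
  let sum = 0;
  for (let i = 0; i < arr.length; i++) sum += arr[i] * arr[i];
  return Math.sqrt(sum);
}

function normalizeNumberArray(values: number[], throwOnZero = false): number[] {
  const norm = magnitude(values);
  if (norm === 0) {
    if (throwOnZero) throw new Error("Cannot normalize a zero vector.");
    return values;
  }
  return values.map((v) => v / norm);
}

// Hypothetical assertion helper standing in for a test framework.
function check(cond: boolean, label: string): void {
  if (!cond) throw new Error(`failed: ${label}`);
}

// Magnitude agrees across plain arrays and typed arrays.
check(magnitude([3, 4]) === 5, "magnitude of [3, 4]");
check(magnitude(new Float32Array([3, 4])) === 5, "magnitude of Float32Array");

// Normalized output has unit length.
const unit = normalizeNumberArray([3, 4]);
check(Math.abs(magnitude(unit) - 1) < 1e-12, "unit length after normalize");

// Zero vector: returned unchanged by default, throws when asked to.
const zero = [0, 0];
check(normalizeNumberArray(zero) === zero, "zero vector returned as-is");
let threw = false;
try {
  normalizeNumberArray(zero, true);
} catch {
  threw = true;
}
check(threw, "zero vector throws with throwOnZero");

// Integer typed arrays truncate fractional values toward zero.
const truncated = new Int8Array([0.6, 0.8]);
check(truncated[0] === 0 && truncated[1] === 0, "Int8Array truncates normalized values");
```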
| export function cosineSimilarity(a: TypedArray, b: TypedArray): number { | ||
| if (a.length !== b.length) { | ||
| throw new Error("Vectors must have the same length"); | ||
| } | ||
| let dotProduct = 0; | ||
| let normA = 0; | ||
| let normB = 0; | ||
| for (let i = 0; i < a.length; i++) { | ||
| dotProduct += a[i] * b[i]; | ||
| normA += a[i] * a[i]; | ||
| normB += b[i] * b[i]; | ||
| } | ||
| const denominator = Math.sqrt(normA) * Math.sqrt(normB); | ||
| if (denominator === 0) { | ||
| return 0; | ||
| } | ||
| return dotProduct / denominator; | ||
| } | ||
|
|
||
| /** | ||
| * Calculates Jaccard similarity between two vectors | ||
| * Uses the formula: sum(min(a[i], b[i])) / sum(max(a[i], b[i])) | ||
| * Returns a value between 0 and 1 | ||
| */ | ||
| export function jaccardSimilarity(a: TypedArray, b: TypedArray): number { | ||
| if (a.length !== b.length) { | ||
| throw new Error("Vectors must have the same length"); | ||
| } | ||
|
|
||
| let minSum = 0; | ||
| let maxSum = 0; | ||
|
|
||
| for (let i = 0; i < a.length; i++) { | ||
| minSum += Math.min(a[i], b[i]); | ||
| maxSum += Math.max(a[i], b[i]); | ||
| } | ||
|
|
||
| return maxSum === 0 ? 0 : minSum / maxSum; | ||
| } | ||
|
|
||
| /** | ||
| * Calculates Hamming distance between two vectors (normalized) | ||
| * Counts the number of positions where vectors differ | ||
| * Returns a value between 0 and 1 (0 = identical, 1 = completely different) | ||
| */ | ||
| export function hammingDistance(a: TypedArray, b: TypedArray): number { | ||
| if (a.length !== b.length) { | ||
| throw new Error("Vectors must have the same length"); | ||
| } | ||
|
|
||
| let differences = 0; | ||
|
|
||
| for (let i = 0; i < a.length; i++) { | ||
| if (a[i] !== b[i]) { | ||
| differences++; | ||
| } | ||
| } | ||
|
|
||
| return differences / a.length; | ||
| } | ||
|
|
||
| /** | ||
| * Calculates Hamming similarity (inverse of distance) | ||
| * Returns a value between 0 and 1 (1 = identical, 0 = completely different) | ||
| */ | ||
| export function hammingSimilarity(a: TypedArray, b: TypedArray): number { | ||
| return 1 - hammingDistance(a, b); | ||
| } |
Copilot
AI
Jan 3, 2026
The newly introduced VectorSimilarityUtils module (cosineSimilarity, jaccardSimilarity, hammingDistance, hammingSimilarity functions) lacks test coverage. Given that the repository has comprehensive testing for other utility functions and these functions are now factored out from VectorSimilarityTask, they should have dedicated tests to ensure correctness across different typed array types and edge cases like zero vectors and mismatched lengths.
@copilot open a new pull request to apply changes based on this feedback
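As a starting point for those tests, a sketch of the edge cases named in the comment, plus one the quoted code does not guard: `hammingDistance` on two empty vectors divides 0 by 0 and yields NaN. The functions are copied locally from the diff so the block is self-contained, and `expectThrows` is a hypothetical helper, not part of the repo.

```typescript
type Vec = ArrayLike<number>;

// Local copy of the quoted implementation, for illustration only.
function cosineSimilarity(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Local copy of the quoted implementation, for illustration only.
function hammingDistance(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let differences = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) differences++;
  }
  return differences / a.length;
}

// Hypothetical helper: returns true when fn throws.
function expectThrows(fn: () => unknown): boolean {
  try {
    fn();
    return false;
  } catch {
    return true;
  }
}

const v = new Float32Array([1, 2, 3]);
const results = {
  identical: cosineSimilarity(v, v),                       // close to 1
  orthogonal: cosineSimilarity([1, 0], [0, 1]),            // exactly 0
  zeroVector: cosineSimilarity([0, 0], [1, 2]),            // 0 by the denominator guard
  mismatch: expectThrows(() => cosineSimilarity([1], [1, 2])),
  halfDifferent: hammingDistance(new Uint8Array([1, 2, 3, 4]), new Uint8Array([1, 2, 0, 0])),
  emptyIsNaN: Number.isNaN(hammingDistance([], [])),       // 0 / 0
};
console.log(results);
```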
| // Preserve the original Vector type | ||
| if (vector instanceof Float64Array) { | ||
| return new Float64Array(normalized); | ||
| } | ||
| if (vector instanceof Float32Array) { | ||
| return new Float32Array(normalized); | ||
| } | ||
| if (vector instanceof Int8Array) { | ||
| return new Int8Array(normalized); | ||
| } | ||
| if (vector instanceof Uint8Array) { | ||
| return new Uint8Array(normalized); | ||
| } | ||
| if (vector instanceof Int16Array) { | ||
| return new Int16Array(normalized); | ||
| } | ||
| if (vector instanceof Uint16Array) { | ||
| return new Uint16Array(normalized); | ||
| } | ||
| // For other integer arrays, use Float32Array since normalization produces floats | ||
| return new Float32Array(normalized); | ||
| } |
Copilot
AI
Jan 3, 2026
The normalize function doesn't handle Float16Array type preservation. When a Float16Array is passed to normalize, it will fall through all the instanceof checks and default to returning a Float32Array, losing type information. This is inconsistent with the TypedArray type definition which includes Float16Array as a supported type.
@copilot open a new pull request to apply changes based on this feedback
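One possible way to avoid the missing branch, sketched under assumptions: rebuild the result through the input's own constructor instead of an instanceof chain, so any typed array class is preserved without being listed. The function name and the `NumericArray` union are hypothetical; the union omits Float16Array only so the sketch compiles on toolchains that lack it, and would be widened where the runtime and TypeScript lib provide that type. This is not necessarily how the PR resolved the comment.

```typescript
// Hypothetical union for the sketch; the PR's TypedArray type also covers Float16Array.
type NumericArray =
  | Float32Array
  | Float64Array
  | Int8Array
  | Uint8Array
  | Int16Array
  | Uint16Array;

function normalizePreservingType<T extends NumericArray>(vector: T): T {
  let sumSq = 0;
  for (let i = 0; i < vector.length; i++) sumSq += vector[i] * vector[i];
  const mag = Math.sqrt(sumSq);
  if (mag === 0) throw new Error("Cannot normalize a zero vector.");

  // Every typed array constructor accepts a length, so the input's own
  // constructor can allocate the output without naming the concrete class.
  const Ctor = vector.constructor as new (length: number) => NumericArray;
  const out = new Ctor(vector.length);
  for (let i = 0; i < vector.length; i++) out[i] = vector[i] / mag;
  return out as T;
}

const f64 = normalizePreservingType(new Float64Array([3, 4]));
const f32 = normalizePreservingType(new Float32Array([3, 4]));
console.log(f64 instanceof Float64Array, f32 instanceof Float32Array);
```

Integer inputs still truncate the normalized values when stored, so preserving the type is only meaningful for the floating-point arrays.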
| /** | ||
| * Vector schema for representing vectors as arrays of numbers | ||
| * @param annotations - Additional annotations for the schema | ||
| * @returns The vector schema | ||
| */ |
Copilot
AI
Jan 3, 2026
The JSDoc comment describes this as "Vector schema" but the function and type are named "TensorSchema" and "Tensor". The documentation should be updated to use "Tensor" consistently or the naming should be clarified to explain the relationship between vectors and tensors in this context.
@copilot open a new pull request to apply changes based on this feedback
| private quantizeToUint16(values: number[]): Uint16Array { | ||
| // Find min/max for scaling | ||
| const min = Math.min(...values); | ||
| const max = Math.max(...values); | ||
| const range = max - min || 1; | ||
|
|
||
| // Scale to [0, 65535] | ||
| return new Uint16Array(values.map((v) => Math.round(((v - min) / range) * 65535))); | ||
| } |
Copilot
AI
Jan 3, 2026
The quantizeToUint16 method uses spread operator with Math.min/Math.max on the entire values array. For large vectors, this is inefficient as it creates multiple intermediate arrays. Consider using a single loop to find both min and max values simultaneously, which would be more performant and memory-efficient.
@copilot open a new pull request to apply changes based on this feedback
packages/util/src/vector/Tensor.ts
Outdated
| ...annotations, | ||
| }) as const satisfies JsonSchema; | ||
|
|
||
| export type Vector = FromSchema<ReturnType<typeof TensorSchema>, TypedArraySchemaOptions>; |
Copilot
AI
Jan 3, 2026
The exported type is named "Vector" but the schema function is named "TensorSchema". This naming inconsistency is confusing. Either the type should be named "Tensor" to match the schema, or the schema should be named "VectorSchema" to match the type. The comments in the file also refer to "vector" rather than "tensor".
@copilot open a new pull request to apply changes based on this feedback
- Updated IExecuteContext and IRunConfig to include registry support.
- Refactored TaskRunner and TaskGraphRunner to utilize the service registry for improved task execution and model retrieval.
- Ensured backward compatibility while enhancing the overall architecture for better service management.
- Introduced a service registry to manage model repositories and execution contexts in AiTask.
- Added a new InputResolver to manage schema-annotated inputs, allowing for automatic resolution of string IDs to their corresponding instances.
- Implemented repository and model resolution capabilities, improving task input handling and validation.
- Created new schemas for tabular, vector, and document repositories to facilitate input resolution.
- Enhanced AiTask and TaskRunner to utilize the input resolver for better integration with service registries.
- Added comprehensive tests to ensure the functionality of the input resolver system and its integration with tasks.
…ities
- Added several new tasks including ChunkToVectorTask, ContextBuilderTask, DocumentEnricherTask, HierarchicalChunkerTask, and others to support advanced document processing workflows.
- Enhanced the input handling for tasks to streamline the integration with the service registry and improve task execution.
- Updated the documentation to reflect the new tasks and their functionalities, ensuring clarity for developers.
- Implemented comprehensive tests for the new tasks to validate their behavior and integration within the workflow system.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: sroussey <[email protected]>
* Initial plan * Remove unused query variable from InputResolver test Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
Co-authored-by: sroussey <[email protected]>
* Initial plan * Improve markdown auto-detection with robust pattern matching Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
* Initial plan * Remove unused imports ChunkToVectorTask and HierarchicalChunkerTask Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
* Initial plan * Update Tensor.ts to use consistent "tensor" terminology throughout Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
* Initial plan * Optimize quantizeToUint8 and quantizeToUint16 to use single loop for min/max Co-authored-by: sroussey <[email protected]> * Add empty array guard to quantization methods Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
…ument.addVariant (#158) * Initial plan * Use extractConfigFields for type-safe provenance handling Co-authored-by: sroussey <[email protected]> * Add comprehensive tests for type-safe provenance handling Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
* Initial plan * Add comprehensive tests for VectorSimilarityUtils Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
* Initial plan * Extract magic number 512 to DEFAULT_MAX_TOKENS constant Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
Co-authored-by: Copilot <[email protected]>
* Initial plan * Fix naming inconsistency: rename Vector to Tensor in Tensor.ts Co-authored-by: sroussey <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]>
…lizing inputs to a non-negative range. This includes calculating the global minimum across both vectors and adjusting values accordingly.
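The commit message above is truncated, so the following is only a guess at the idea it describes, presumably for jaccardSimilarity: the sum-of-min over sum-of-max formula assumes non-negative values, so both vectors are shifted by the global minimum before the sums are taken. The function name is hypothetical, and shifting only when the minimum is negative is an assumption.

```typescript
// Sketch of a weighted Jaccard similarity that tolerates negative inputs.
function jaccardNonNegative(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }

  // Global minimum across both vectors, capped at 0 so that inputs which are
  // already non-negative are left untouched (assumed behaviour).
  let globalMin = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] < globalMin) globalMin = a[i];
    if (b[i] < globalMin) globalMin = b[i];
  }
  const shift = globalMin < 0 ? -globalMin : 0;

  let minSum = 0;
  let maxSum = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] + shift;
    const y = b[i] + shift;
    minSum += Math.min(x, y);
    maxSum += Math.max(x, y);
  }
  return maxSum === 0 ? 0 : minSum / maxSum;
}

// Without the shift, [-1, 1] vs [1, -1] would give -2 / 2 = -1, outside [0, 1].
console.log(jaccardNonNegative([-1, 1], [1, -1]));
```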
…normalization, and handling of various TypedArray types. Update normalize function to support an additional parameter for Float32Array conversion.
* Initial plan * Add circular reference detection to smartClone method Co-authored-by: sroussey <[email protected]> * Fix circular reference detection to handle shared references correctly Co-authored-by: sroussey <[email protected]> * Refactor TaskEvents to import TaskStatus from TaskTypes and add unit tests for smartClone method - Updated TaskEvents to import TaskStatus from the correct module. - Added comprehensive unit tests for the smartClone method, including cases for circular reference detection and handling various data structures. --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: sroussey <[email protected]> Co-authored-by: Steven Roussey <[email protected]>
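To make the smartClone behaviour described in these commits concrete, here is a minimal sketch under stated assumptions, not the implementation from the PR. It deep-clones plain objects and arrays, returns everything else (class instances, typed arrays) by reference, and tracks only the current ancestor path, so a value referenced twice in sibling positions is allowed while a true cycle throws. Whether the real method throws on cycles, and all names used here, are assumptions.

```typescript
function isPlainObject(value: unknown): value is Record<string, unknown> {
  if (value === null || typeof value !== "object") return false;
  const proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

// `path` holds only the ancestors of the value currently being cloned.
function cloneInner(value: unknown, path: Set<object>): unknown {
  if (!Array.isArray(value) && !isPlainObject(value)) {
    return value; // primitives, class instances, typed arrays: kept by reference
  }
  if (path.has(value)) {
    throw new Error("Circular reference detected");
  }
  path.add(value);
  let copy: unknown;
  if (Array.isArray(value)) {
    copy = value.map((item) => cloneInner(item, path));
  } else {
    const out: Record<string, unknown> = {};
    for (const [key, item] of Object.entries(value)) {
      out[key] = cloneInner(item, path);
    }
    copy = out;
  }
  // Remove on the way out so shared, non-circular references stay legal.
  path.delete(value);
  return copy;
}

function smartCloneSketch<T>(value: T): T {
  return cloneInner(value, new Set<object>()) as T;
}

class Model {
  constructor(public id: string) {}
}

const model = new Model("m1");
const shared = { n: 1 };
const input = { model, list: [shared, shared], vec: new Float32Array([1, 2]) };
const copy = smartCloneSketch(input);
console.log(copy.model === model, copy.list[0] === shared, copy.vec === input.vec);
```

Note that in this sketch each occurrence of a shared object is cloned separately, so the copies no longer share identity.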
Pull request overview
Copilot reviewed 126 out of 127 changed files in this pull request and generated no new comments.
…f model arrays
- Updated setGlobalModelRepository parameter name for clarity.
- Enhanced resolveModelFromRegistry to support both single and array of model IDs.
- Modified resolveSchemaInputs to handle string values and arrays of strings more effectively, ensuring proper resolution of inputs.
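The single-or-array handling described in this commit could look roughly like the sketch below. Everything in it is hypothetical: the `ModelRecord` shape, the Map standing in for a model repository, the function names, and the synchronous signature are illustrative assumptions and are not taken from the PR's resolveModelFromRegistry or resolveSchemaInputs.

```typescript
// Hypothetical record shape and lookup; the real repository API is not shown in this PR page.
interface ModelRecord {
  model_id: string;
  provider: string;
}
type ModelLookup = ReadonlyMap<string, ModelRecord>;
type ModelInput = string | ModelRecord;

// Resolve one value: string IDs are looked up, already-resolved records pass through.
function resolveOne(value: ModelInput, lookup: ModelLookup): ModelRecord {
  if (typeof value !== "string") return value;
  const found = lookup.get(value);
  if (!found) throw new Error(`Model "${value}" not found`);
  return found;
}

// Accept either a single value or an array and keep the same shape in the result.
function resolveModelInputSketch(
  value: ModelInput | ModelInput[],
  lookup: ModelLookup
): ModelRecord | ModelRecord[] {
  return Array.isArray(value)
    ? value.map((item) => resolveOne(item, lookup))
    : resolveOne(value, lookup);
}

const lookup: ModelLookup = new Map([
  ["embed-small", { model_id: "embed-small", provider: "example" }],
  ["embed-large", { model_id: "embed-large", provider: "example" }],
]);

const single = resolveModelInputSketch("embed-small", lookup);
const many = resolveModelInputSketch(["embed-small", "embed-large"], lookup);
console.log(single, many);
```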
No description provided.