Summary
Speculators is a unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference. A smaller, faster speculator drafts tokens, while the larger verifier accepts or rejects them in a single forward pass—delivering lower latency with lossless quality.
Today, Speculators ships configs, model definitions, and converters that standardize outputs from external training flows and in‑repo research prototypes into a format consumable by vLLM. But those flows are brittle, duplicated, and hard to use. This document kicks off the first productization step: data generation. Because it already runs separately and interfaces through datasets, it’s a natural entry point to validate correctness while adding robustness, APIs, CLIs, and features. The goal is to harden the current offline (disk‑based) pathway and expand to online (queue‑based) pipelines for consistent, performant, and easy‑to‑adapt workflows.
Goals
Productize data generation into robust, reusable APIs + CLI with clear contracts.
Support offline (disk-based) and online (queue-based) pipelines with a shared abstraction.
Provide a Pydantic-based config for data-specific attributes (datasets, splits, columns, chat template, processors, verifier, states to persist).
Ship a Transformers inference engine now; design for vLLM (and others) later without churn.
Deliver a dataset reader compatible with PyTorch DataLoader for training pipelines (Eagle3, HASS today).
Requirements
Baseline performance and correctness:
Performance/accuracy parity with current research scripts for offline pathways.
Throughput uplift for online mode vs. offline: TBD on target.
Security & Safety: safe serialization (no arbitrary code exec), path whitelisting, schema validation at boundaries.
Cost efficiency: avoid unnecessary copies; optional compression for disk and wire.
Maintainability: small, composable interfaces; Pydantic models as contracts; tests at unit + integration levels.
Design
High-Level Architecture & Main Components
```mermaid
flowchart LR
    A[Source Datasets\nHF IDs, local paths, text files] --> B[Loader + Preprocessor\ncolumns, chat template, processors]
    B --> C[Inference Engine\nTransformers now, vLLM later]
    C --> D[Standardized Example Dict\nprompt, tokens, logits, hidden states]
    D --> E{Storage/Sync}
    E -->|Offline| F[Disk Shards\nmsgpack/zstd, parquet, or arrow]
    E -->|Online\nsingle-node| G[mp.Queue]
    E -->|Online\nmulti-proc/node| H[ZeroMQ Broker]
    F & G & H --> I[Generated Dataset\nPyTorch DataLoader]
    I --> J[Training Pipelines\nEagle3/HASS]
```
Key Interfaces (APIs / Contracts)
Config (DataGenConfig) — data-only knobs:
datasets: list of sources (hf_id / local_path / text_file), with split + column mapping.
chat_template: hf_id or inline template config (default: verifier’s).
processor: overrides for tokenizer/processor settings (default: verifier’s).
verifier: model ref (hf_id, local path, pretrained).
states: which outputs to persist (prompt, output_tokens, logits, hidden_states={layers:[...], reduce?:mean/none}).
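For illustration, constructing such a config from Python might look like the sketch below. Only the top-level knob names come from the list above; the import path, the nested field shapes, and the dataset/model IDs are assumptions, not part of the current API.

```python
# Illustrative sketch only: import path and nested shapes are assumed.
from speculators.data_generation import DataGenConfig  # hypothetical module path

config = DataGenConfig(
    datasets=[
        {
            "hf_id": "HuggingFaceH4/ultrachat_200k",  # or local_path / text_file
            "split": "train_sft",
            "columns": {"messages": "messages"},
        }
    ],
    chat_template=None,  # default: the verifier's chat template
    processor=None,      # default: the verifier's tokenizer/processor settings
    verifier="meta-llama/Llama-3.1-8B-Instruct",
    states={
        "prompt": True,
        "output_tokens": True,
        "logits": False,
        "hidden_states": {"layers": [-3, -2, -1], "reduce": "none"},
    },
)
```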
InferencePipeline:
from_config(config: DataGenConfig, **engine_kwargs) -> InferencePipeline
forward(batch: list[InputItem]) -> list[ExampleDict]
DataSink & DataSource:
sink.write(example: ExampleDict) -> None (or write_batch(list[ExampleDict]))
source.read() -> Iterator[ExampleDict]
Implementations: DiskStore, MPQueueStore, ZmqStore.
GeneratedDataset:
Wraps a DataSource and yields tensors/arrays in training-ready format.
Data Model (Example Schema)
ExampleDict (persisted/generated): prompt, output_tokens, logits, hidden_states (which keys are present depends on the states config; see the sketch below).
Storage Formats
Disk shards may use msgpack/zstd, parquet, or arrow (per the architecture diagram above).
Batching Behavior
The engine may run batched inference but emits 1 example per input (optionally: store a batch as N single records to the sink). — Targeted for follow-up.
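A minimal sketch of these contracts, assuming Python typing.Protocol classes and a TypedDict for the example schema; the exact field types (tensors vs. lists, layer keying) are assumptions rather than the final data model.

```python
from typing import Iterator, Protocol, TypedDict

import torch


class ExampleDict(TypedDict, total=False):
    """One generated example; which keys are present is driven by the states config."""
    prompt: str                              # rendered/templated prompt text
    output_tokens: list[int]                 # token ids produced by the verifier
    logits: torch.Tensor                     # per-step logits, if persisted
    hidden_states: dict[int, torch.Tensor]   # selected layer index -> hidden states


class DataSink(Protocol):
    def write(self, example: ExampleDict) -> None: ...
    def write_batch(self, examples: list[ExampleDict]) -> None: ...


class DataSource(Protocol):
    def read(self) -> Iterator[ExampleDict]: ...
```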
Offline vs Online Flows
```mermaid
sequenceDiagram
    participant CLI
    participant Worker as Gen Worker(s)
    participant Sink as Disk/Queue/ZMQ
    participant Train as Training Proc
    CLI->>Worker: generate_data config, engine_kwargs, num_workers
    Worker->>Sink: write example / write_batch
    par Offline
        Train->>Sink: source.read from Disk shards
    and Online - mp.Queue
        Train->>Sink: source.read bounded queue with backpressure
    and Online - ZeroMQ
        Train->>Sink: connect url, source.read brokered
    end
```
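For the online single-node path, the backpressure shown above falls out of using a bounded multiprocessing queue. The sketch below is illustrative only and is not the MPQueueStore implementation.

```python
import multiprocessing as mp
from queue import Empty


def bounded_queue(maxsize: int = 256):
    # Bounded queue: generation workers block when training falls behind,
    # giving the backpressure shown in the diagram above.
    return mp.Queue(maxsize=maxsize)


def producer_loop(examples, queue) -> None:
    """Generation-worker side: write examples, blocking while the queue is full."""
    for example in examples:
        queue.put(example, block=True)


def consume(queue, poll_s: float = 1.0):
    """Training side: yield examples as they arrive."""
    while True:
        try:
            yield queue.get(timeout=poll_s)
        except Empty:
            continue  # nothing yet; a real reader would also check a shutdown signal
```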
Implementation
Features to Build
DataGenConfig (data sources, splits, columns, chat template, processor, verifier, states).
EngineConfig (engine-local: batch size, dtype, device map, max seq len).
SyncConfig (disk / mp.Queue / ZeroMQ, common args: queue size, URLs, shard sizes, compression).
InferencePipeline interface + TransformersPipeline implementation (initial).
DiskStore, MPQueueStore, ZmqStore.
GeneratedDataset (iterable; worker-safe init; shared args propagation; transforms to tensors).
CLI: speculators generate-data --config path.yaml [--engine.* --sync.* --run.*]
API: generate_data(config: DataGenConfig, **kwargs), generate_data_process(...), generate_data_worker(...).
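End to end, the intended Python usage might look like the following sketch. Only generate_data(config, **kwargs), DataGenConfig, and GeneratedDataset come from the list above; the import path, the from_yaml helper, the output_dir kwarg, and the GeneratedDataset constructor argument are assumptions.

```python
from torch.utils.data import DataLoader

# Hypothetical import path; the package layout is not finalized.
from speculators.data_generation import DataGenConfig, GeneratedDataset, generate_data

config = DataGenConfig.from_yaml("path.yaml")  # assumed helper; any YAML loader works

# Offline path: run generation first, writing shards to disk (kwarg name illustrative).
generate_data(config, output_dir="./shards")

# Then stream generated examples into a training pipeline (Eagle3/HASS today).
dataset = GeneratedDataset(source="./shards")  # assumed constructor argument
loader = DataLoader(dataset, batch_size=8, num_workers=2)
for batch in loader:
    ...  # training step consumes standardized examples
```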
Tasks / Milestones
M0 — Architecture & Contracts (blocking): InferenceEngine, DataSink/DataSource, GeneratedDataset.
M1 — Offline Path (Disk)
M2 — Online Single-Process/Node (mp.Queue)
M3 — Online Multi-Proc/Node (ZeroMQ)
M4 — Hardening & Extensibility
Appendix: Concrete Interfaces (First Pass)
Packages & Layout
Pydantic Config Sketch
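A minimal sketch, assuming field names taken from the knobs under Key Interfaces; types, defaults, and nesting are illustrative rather than the final schema.

```python
from pydantic import BaseModel, Field


class DatasetSource(BaseModel):
    # Exactly one of hf_id / local_path / text_file identifies the source.
    hf_id: str | None = None
    local_path: str | None = None
    text_file: str | None = None
    split: str = "train"
    columns: dict[str, str] = Field(default_factory=dict)  # logical name -> dataset column


class HiddenStatesSpec(BaseModel):
    layers: list[int] = Field(default_factory=list)
    reduce: str | None = None  # e.g. "mean" or None


class StatesSpec(BaseModel):
    prompt: bool = True
    output_tokens: bool = True
    logits: bool = False
    hidden_states: HiddenStatesSpec | None = None


class DataGenConfig(BaseModel):
    datasets: list[DatasetSource]
    chat_template: str | None = None  # default: the verifier's chat template
    processor: str | None = None      # default: the verifier's tokenizer/processor settings
    verifier: str                     # hf_id, local path, or pretrained reference
    states: StatesSpec = Field(default_factory=StatesSpec)
```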
CLI Examples
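Illustrative invocations only; apart from --config and the --engine.*, --sync.*, --run.* namespaces named in the features list, the specific dotted flag names are assumptions.

```bash
# Offline (disk) generation driven entirely by a YAML config
speculators generate-data --config data_gen.yaml

# Same run with assumed engine- and sync-level overrides
speculators generate-data --config data_gen.yaml \
    --engine.batch_size 32 \
    --sync.shard_size 10000
```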