Some options common to most readers:

- `limit`: read only a certain number of samples. Useful for testing/debugging
### Synthetic data generation

Install the inference extras with `pip install datatrove[inference]` to pull in the lightweight HTTP client, checkpointing dependencies and the async sqlite cache.

We support [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), OpenAI-compatible HTTPS endpoints and a local `dummy` server through the [InferenceRunner block](src/datatrove/pipeline/inference/run_inference.py). Each datatrove task can spin up its own server replica (for `vllm`, `sglang` or `dummy`) or talk directly to an external endpoint, while asynchronous batching keeps GPU utilization high.

Rollouts are plain async callables that receive a `Document`, a `generate(payload)` callback and any extra kwargs coming from `shared_context`. You can freely orchestrate multiple sequential or parallel `generate` calls inside a rollout. Set `rollouts_per_document` to automatically run the same rollout multiple times per sample; the runner collects successful outputs under `document.metadata["rollout_results"]`.
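A minimal, self-contained sketch of what such a rollout can look like. The `Document` dataclass below is a stand-in so the snippet runs without datatrove installed, and the OpenAI-style `"messages"` payload shape is an assumption; in a real pipeline the runner supplies the actual `Document` objects and `generate` callback.

```python
import asyncio
from dataclasses import dataclass, field


# Minimal stand-in for datatrove.data.Document so this sketch runs standalone.
@dataclass
class Document:
    text: str
    id: str
    metadata: dict = field(default_factory=dict)


# A rollout is a plain async callable: it receives the document, a
# generate(payload) callback and any kwargs injected via shared_context.
async def summarize_rollout(document, generate, **context):
    summary = await generate(
        {"messages": [{"role": "user", "content": f"Summarize:\n{document.text}"}]}
    )
    # Calls can be chained sequentially (or run in parallel with asyncio.gather).
    title = await generate(
        {"messages": [{"role": "user", "content": f"Title this summary:\n{summary}"}]}
    )
    return {"summary": summary, "title": title}


# Toy generate() so the rollout can be exercised without a model server:
# it just echoes the first line of the prompt.
async def fake_generate(payload):
    return "echo: " + payload["messages"][0]["content"].splitlines()[0]


doc = Document(text="datatrove processes text at scale.", id="0")
result = asyncio.run(summarize_rollout(doc, fake_generate))
```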

`shared_context` lets you inject shared state into every rollout invocation. It accepts:

- a dict (passed through as keyword arguments),
- a callable returning a dict (handy for lazily creating resources),
- a context manager or a callable returning one (great for pools, GPU allocators, temp dirs, etc.). Context managers are properly entered/exited once per task.
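A hedged sketch of the context-manager flavour (the factory name and the `workdir` key are illustrative): the runner enters the context manager once per task and passes the yielded dict as keyword arguments into every rollout call, then exits it on task shutdown.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def make_shared_context():
    # Create a per-task resource once...
    workdir = tempfile.mkdtemp(prefix="rollouts-")
    try:
        # ...every key in the yielded dict becomes a rollout kwarg.
        yield {"workdir": workdir}
    finally:
        # ...and tear it down exactly once when the task finishes.
        shutil.rmtree(workdir, ignore_errors=True)


# What the runner effectively does, simplified:
with make_shared_context() as ctx:
    exists_inside = os.path.isdir(ctx["workdir"])  # resource is live
exists_after = os.path.isdir(ctx["workdir"])       # cleaned up on exit
```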

Recoverable generation:

- Setting `checkpoints_local_dir` together with `records_per_chunk` writes every `Document` to local chunk files (remember to include `${chunk_index}` in the output filename template), then uploads them via the configured writer. Failed tasks automatically resume from the last finished chunk.
- When checkpointing is enabled, a sqlite-backed `RequestCache` deduplicates individual rollouts via payload hashes (requires `xxhash` and `aiosqlite`), so completed generations are never re-sent during retries.
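The idea behind payload-hash deduplication can be sketched in a few lines. This is not datatrove's `RequestCache` (which uses `xxhash` and `aiosqlite`); `sha256` and a plain dict stand in here to keep the sketch dependency-free:

```python
import hashlib
import json


def payload_key(payload: dict) -> str:
    # Canonical JSON (sorted keys) so logically identical payloads hash equal.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


cache: dict[str, str] = {}  # stand-in for the sqlite-backed cache
calls = 0


def cached_generate(payload: dict) -> str:
    global calls
    key = payload_key(payload)
    if key not in cache:  # only unseen payloads would reach the model
        calls += 1
        cache[key] = f"generation #{calls}"
    return cache[key]


first = cached_generate({"prompt": "hi", "max_tokens": 8})
retry = cached_generate({"max_tokens": 8, "prompt": "hi"})  # same payload, reordered keys
```

On a retried task, the cache lookup short-circuits the request, so completed generations are served from disk instead of being re-sent.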

Tune batching with `max_concurrent_generations` and, when pre/post-processing is heavy, raise `max_concurrent_documents` to allow more rollout coroutines to build payloads while requests are in flight.
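Conceptually (this is an illustrative sketch, not datatrove internals), `max_concurrent_generations` acts like a semaphore around in-flight requests, while additional rollout coroutines can keep preparing payloads:

```python
import asyncio


async def bounded_generate(semaphore: asyncio.Semaphore, payload: str) -> str:
    async with semaphore:          # caps the number of concurrent in-flight requests
        await asyncio.sleep(0.01)  # stand-in for the actual HTTP call
        return f"result for {payload}"


async def main() -> list[str]:
    semaphore = asyncio.Semaphore(4)  # ~ max_concurrent_generations=4
    payloads = [f"doc-{i}" for i in range(16)]
    # All 16 coroutines start (payload building is unbounded), but at most
    # 4 are inside the semaphore at any moment.
    return await asyncio.gather(*(bounded_generate(semaphore, p) for p in payloads))


results = asyncio.run(main())
```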

<details>
<summary>Minimal end-to-end example</summary>

```python
from datatrove.data import Document
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.inference.run_inference import InferenceConfig, InferenceRunner
from datatrove.pipeline.writers import JsonlWriter

# ... (the rest of this example is truncated here; see
# examples/inference_example_chunked.py for the full pipeline setup)
```

</details>

The extended [inference_example_chunked.py](examples/inference_example_chunked.py) script demonstrates single- and multi-rollout flows, resumable checkpoints and sharing a process pool across rollouts.
### Extracting text
You can use [extractors](src/datatrove/pipeline/extractors) to extract text content from raw html. The most commonly used extractor in datatrove is [Trafilatura](src/datatrove/pipeline/extractors/trafilatura.py), which uses the [trafilatura](https://trafilatura.readthedocs.io/en/latest/) library.