Set config="default" in Distiset when only one leaf Step (#540)
* Fix return type hint in `BasePipeline.run`

* Fix casing in `Distilabel`->`distilabel`

* Set `default` as the config if there's only one leaf node

* Add alternative way to `load_dataset` when `config='default'`

* Update `docs/sections/learn/distiset.md`

* Remove outdated `docs/tutorials/*.ipynb`

* Remove `docs/snippets` from `.pre-commit-config.yaml`

* Remove unused `docs/snippets/*.py`

* Rename `HuggingFace`, `huggingface`, etc. to `Hugging Face`

* Fixed some typos with `codespell`

* Fix `TestWriteBuffer` in `create_distiset`

* Fix `test_pipeline_cached` in `test_pipe_simple.py`

* Revert `BasePipeline.run` return type-hint
alvarobartt committed Apr 16, 2024
1 parent af81460 commit 4ef7290
Showing 21 changed files with 41 additions and 803 deletions.
1 change: 0 additions & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,6 @@ repos:
- id: insert-license
name: "Insert license header in Python source files"
files: \.py$
exclude: ^docs/snippets/
args:
- --license-filepath
- LICENSE_HEADER
10 changes: 5 additions & 5 deletions docs/sections/learn/caching.md
@@ -1,6 +1,6 @@
# Caching

Distilabel `Pipelines` automatically save all the intermediate steps to to avoid loosing any data in case of error.
Distilabel `Pipelines` automatically save all the intermediate steps to avoid losing any data in case of error.

## Cache directory

@@ -13,7 +13,7 @@ with Pipeline("cache_testing") as pipeline:
...
```

This directory can be modified by setting the `DISTILABEL_CACHE_DIR` environment variable (`export DISTILABEL_CACHE_DIR=my_cache_dir`) or by explicitely passing the `cache_dir` variable to the `Pipeline` constructor like so:
This directory can be modified by setting the `DISTILABEL_CACHE_DIR` environment variable (`export DISTILABEL_CACHE_DIR=my_cache_dir`) or by explicitly passing the `cache_dir` variable to the `Pipeline` constructor like so:

```python
with Pipeline("cache_testing", cache_dir="~/my_cache_dir") as pipeline:
@@ -42,7 +42,7 @@ Finally, if we decide to run the same `Pipeline` after it has finished completel

### Serialization

Let's see what get's serialized by looking at a sample `Pipeline`'s cached folder:
Let's see what gets serialized by looking at a sample `Pipeline`'s cached folder:

```bash
$ tree ~/.cache/distilabel/pipelines/73ca3f6b7a613fb9694db7631cc038d379f1f533
@@ -65,7 +65,7 @@ The `Pipeline` will have a signature created from the arguments that define it s

- `pipeline.yaml`

This file contains a representation of the `Pipeline` in *YAML* format. If we push a `Distiset` to the hub as obtained from calling `Pipeline.run`, this file will be stored at our datasets' repository, allowing to reproduce the `Pipeline` using the `CLI`:
This file contains a representation of the `Pipeline` in *YAML* format. If we push a `Distiset` obtained from calling `Pipeline.run` to the Hugging Face Hub, this file will be stored in our dataset's repository, allowing us to reproduce the `Pipeline` using the `CLI`:

```bash
distilabel pipeline run --config "path/to/pipeline.yaml"
@@ -100,5 +100,5 @@ ds

Internally, the function will try to inject the `pipeline_path` variable if it's not passed via argument, assuming
it's in the parent directory of the current one, called `pipeline.yaml`. If the file doesn't exist, it won't
raise any error, but take into account that if the `Distiset` is pushed to the hub, the `pipeline.yaml` won't be
raise any error, but take into account that if the `Distiset` is pushed to the Hugging Face Hub, the `pipeline.yaml` won't be
generated.
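That fallback can be sketched in a few lines of Python; note that `infer_pipeline_path` is a hypothetical helper written for illustration, not part of the distilabel API:

```python
from pathlib import Path
from typing import Optional


def infer_pipeline_path(dataset_path: str) -> Optional[Path]:
    # Sketch of the fallback described above: look for `pipeline.yaml` in the
    # parent directory of the dataset folder, and return None instead of
    # raising when the file is missing.
    candidate = Path(dataset_path).parent / "pipeline.yaml"
    return candidate if candidate.exists() else None


# Quick demonstration with a temporary layout mimicking the cache folder.
import tempfile

with tempfile.TemporaryDirectory() as root:
    data_dir = Path(root) / "data"
    data_dir.mkdir()
    (Path(root) / "pipeline.yaml").write_text("pipeline: {}")
    print(infer_pipeline_path(str(data_dir)).name)  # pipeline.yaml
```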
2 changes: 1 addition & 1 deletion docs/sections/learn/cli.md
@@ -43,7 +43,7 @@ $ distilabel pipeline info --help
╰─────────────────────────────────────────────────────────────────────────────────────╯
```

As we can see from the help message, we need to pass either a `Path` or a `URL`. This second option comes handy for datasets stored in HuggingFace hub, for example:
As we can see from the help message, we need to pass either a `Path` or a `URL`. This second option comes in handy for datasets stored on the Hugging Face Hub, for example:

```bash
distilabel pipeline info --config "https://huggingface.co/datasets/distilabel-internal-testing/ultrafeedback-mini/raw/main/pipeline.yaml"
7 changes: 5 additions & 2 deletions docs/sections/learn/distiset.md
@@ -20,6 +20,9 @@ ds = Distiset(

This object works like a python dictionary (the same approach followed by [`datasets.DatasetDict`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)), where each key corresponds to one of the `leaf_steps` from a `Pipeline`.

!!! NOTE
    If there's only one leaf node, i.e. only one step at the end of the `Pipeline`, then the configuration name won't be the name of the last step; it will be set to `default` instead, as that's more aligned with standard datasets within the Hugging Face Hub.
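That naming rule can be sketched in plain Python; here dicts stand in for the underlying `datasets.Dataset` objects and the step name is made up:

```python
# Outputs of the leaf steps; a plain dict stands in for `datasets.Dataset`.
leaf_outputs = {
    "generate_with_llm": [{"instruction": "hi", "generation": "hello"}],
}

# With a single leaf step, the subset is exposed as "default" so the pushed
# dataset matches standard single-configuration repos on the Hugging Face Hub.
if len(leaf_outputs) == 1:
    distiset = {"default": next(iter(leaf_outputs.values()))}
else:
    distiset = dict(leaf_outputs)

print(list(distiset))  # ['default']
```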

## Distiset methods

We can interact with the different pieces generated by the `Pipeline` and treat them as different [`configurations`](https://huggingface.co/docs/datasets-server/configs_and_splits#configurations). The `Distiset` contains just two methods:
@@ -54,9 +57,9 @@ Distiset({
})
```

### Push to HuggingFace hub
### Push to Hugging Face Hub

Pushes the internal subsets to a huggingface repo, where each one of the subsets will be a different configuration, so it's easy to download them and continue working with any of the pieces.
Pushes the internal subsets to a Hugging Face repo, where each one of the subsets becomes a different configuration, making it easy to download them and continue working with any of the pieces.

```python
ds.push_to_hub(
4 changes: 2 additions & 2 deletions docs/sections/learn/steps/generator_steps.md
@@ -40,7 +40,7 @@ It will yield `GeneratorStepOutput` objects, an iterator of tuples where the fir

Unless we are doing some testing, we are more likely going to work with a proper dataset:

### Load a dataset from HuggingFace hub
### Load a dataset from Hugging Face Hub

The easiest way to ingest data from a dataset is using the [`LoadHubDataset`][distilabel.steps.generators.huggingface] step, let's see an example:

Expand All @@ -57,7 +57,7 @@ load_hub_dataset = LoadHubDataset(
load_hub_dataset.load()
```

We see that creating a step to load a dataset from the hub is almost the same as loading it directly using `datasets.load_dataset`, with one remark, we have to call `.load()` on our step. The reason for this extra step is because internally we want to do the actual processing at the correct moment in the whole pipeline, we don't just need to take care of this call because we are working with it outside of a `Pipeline`.
We see that creating a step to load a dataset from the Hugging Face Hub is almost the same as loading it directly using `datasets.load_dataset`, with one remark: we have to call `.load()` on our step. This extra call is needed because, internally, the actual processing should happen at the right moment within the whole pipeline; we only need to take care of the call ourselves because we are working with the step outside of a `Pipeline`.
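A minimal sketch of why `.load()` is a separate call; the class below is illustrative only, not distilabel's actual implementation:

```python
class LazyStep:
    """Illustrative step that defers expensive set-up to an explicit load()."""

    def __init__(self, repo_id: str) -> None:
        # Construction stays cheap: nothing is fetched yet, so a Pipeline can
        # build its whole graph of steps before doing any real work.
        self.repo_id = repo_id
        self.dataset = None

    def load(self) -> None:
        # In the real step this is where `datasets.load_dataset(...)` would
        # run; here we just simulate the deferred work.
        self.dataset = [{"instruction": f"row from {self.repo_id}"}]


step = LazyStep("distilabel-internal-testing/instruction-dataset-mini")
assert step.dataset is None  # construction did no work
step.load()                  # a Pipeline triggers this at the right moment
assert step.dataset is not None
```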

And let's request the following batch:

6 changes: 3 additions & 3 deletions docs/sections/learn/steps/global_steps.md
@@ -1,8 +1,8 @@
# Global Steps

The global steps are the ones that in order to do it's processing, they will need access to all the data at once. Some examples include creating a dataset to be pushed to the hub, or a filtering step in a `Pipeline`.
The global steps are the ones that need access to all the data at once in order to do their processing. Some examples include creating a dataset to be pushed to the Hugging Face Hub, or a filtering step in a `Pipeline`.

## Push data to HuggingFace Hub in batches
## Push data to Hugging Face Hub in batches

The first example of a `global` step corresponds to [`PushToHub`][distilabel.steps.globals.huggingface]:

@@ -22,7 +22,7 @@ push_to_hub = PushToHub(
)
```

This step can be used to push batches of the dataset to the hub as the process advances, enabling a checkpoint strategy in your pipeline.
This step can be used to push batches of the dataset to the Hugging Face Hub as the process advances, enabling a checkpoint strategy in your pipeline.
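The checkpoint idea can be sketched as follows, with a hypothetical `push` callback standing in for the actual upload performed by `PushToHub`:

```python
def checkpointed_push(rows, batch_size, push):
    # Flush accumulated rows every `batch_size` items so progress is
    # persisted as the pipeline advances, instead of only at the very end.
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == batch_size:
            push(buffer)  # one "commit" to the Hub per full batch
            buffer = []
    if buffer:
        push(buffer)      # flush the remainder


pushed = []
checkpointed_push(range(7), batch_size=3, push=lambda b: pushed.append(list(b)))
print(pushed)  # [[0, 1, 2], [3, 4, 5], [6]]
```

If the process dies mid-run, everything already pushed survives, which is the essence of the checkpoint strategy mentioned above.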

## Data Filtering

2 changes: 1 addition & 1 deletion docs/sections/learn/steps/index.md
@@ -95,7 +95,7 @@ This is a small type step that shows what to expect when we are creating our `St

## Runtime Parameters

Let's take a look at a special argument implementation that we will find when dealing with the `Steps`, the [Runtime paramaters][distilabel.mixins.runtime_parameters.RuntimeParameter]. Let's inspect them using the previous example class:
Let's take a look at a special argument implementation that we will find when dealing with the `Steps`, the [Runtime parameters][distilabel.mixins.runtime_parameters.RuntimeParameter]. Let's inspect them using the previous example class:

```python
print(conversation_template.runtime_parameters_names)
4 changes: 2 additions & 2 deletions docs/sections/learn/tasks/feedback_tasks.md
@@ -5,7 +5,7 @@ This section presents tasks that work on the `LLM` output to return some feedbac

## UltraFeedback

[`UltraFeedback`][distilabel.steps.tasks.ultrafeedback] is a `Task` inspired from [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377), where the authors present the methodology that leaded to the creation of their famous dataset:
[`UltraFeedback`][distilabel.steps.tasks.ultrafeedback] is a `Task` inspired by [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377), where the authors present the methodology that led to the creation of their famous dataset:

```python
from distilabel.steps.tasks import UltraFeedback
@@ -65,7 +65,7 @@ Let's see what this different aspects mean.

### Different aspects of UltraFeedback

The `UltraFeedback` paper proposes different types of aspect to rate the answers: `helpfulness`, `honesty`, `instruction-following`, `truthfulness`. If one want's to rate the responses according to the 4 aspects, it would imply running the `Pipeline` 4 times, incurring in more costs and time of processing. For that reason, we decided to include an extra aspect, which tries to sum up the other ones to return a special type of summary: `overall-rating`.
The `UltraFeedback` paper proposes different aspects to rate the answers: `helpfulness`, `honesty`, `instruction-following`, `truthfulness`. If one wants to rate the responses according to the 4 aspects, it would imply running the `Pipeline` 4 times, incurring more costs and processing time. For that reason, we decided to include an extra aspect that tries to sum up the other ones to return a special type of summary: `overall-rating`.

!!! Note
Take a look at this task in a complete `Pipeline` at [`UltraFeedback`](../../papers/ultrafeedback.md), where you can follow the paper implementation.
2 changes: 1 addition & 1 deletion docs/sections/learn/tasks/text_generation.md
@@ -75,7 +75,7 @@ from distilabel.steps.tasks.text_generation import TextGeneration
system_prompt = "You are an AI judge in charge of determining the equality of two instructions. "

wizardllm_equal_prompt = """Here are two Instructions, do you think they are equal to each other and meet the following requirements?:
1. They have the same constraints and requirments.
1. They have the same constraints and requirements.
2. They have the same depth and breadth of the inquiry.
The First Prompt: {instruction_1}
The Second Prompt: {instruction_2}
2 changes: 1 addition & 1 deletion docs/sections/papers/deita.md
@@ -346,7 +346,7 @@ distiset = pipeline.run(
)
```

We can push the results to the hub:
We can push the results to the Hugging Face Hub:

```python
distiset.push_to_hub("distilabel-internal-testing/deita-colab")

This file was deleted.

This file was deleted.

