Set config="default" in Distiset when only one leaf Step (#540)
* Fix return type hint in `BasePipeline.run`

* Fix casing in `Distilabel`->`distilabel`

* Set `default` as the config if there's only one leaf node

* Add alternative way to `load_dataset` when `config='default'`

* Update `docs/sections/learn/distiset.md`

* Remove outdated `docs/tutorials/*.ipynb`

* Remove `docs/snippets` from `.pre-commit-config.yaml`

* Remove unused `docs/snippets/*.py`

* Rename `HuggingFace`, `huggingface`, etc. to `Hugging Face`

* Fixed some typos with `codespell`

* Fix `TestWriteBuffer` in `create_distiset`

* Fix `test_pipeline_cached` in `test_pipe_simple.py`

* Revert `BasePipeline.run` return type-hint
alvarobartt committed Apr 16, 2024
1 parent af81460 commit 4ef7290
Showing 21 changed files with 41 additions and 803 deletions.
1 change: 0 additions & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,6 @@ repos:
- id: insert-license
name: "Insert license header in Python source files"
files: \.py$
exclude: ^docs/snippets/
args:
- --license-filepath
- LICENSE_HEADER
10 changes: 5 additions & 5 deletions docs/sections/learn/caching.md
@@ -1,6 +1,6 @@
# Caching

Distilabel `Pipelines` automatically save all the intermediate steps to to avoid loosing any data in case of error.
Distilabel `Pipelines` automatically save all the intermediate steps to avoid losing any data in case of error.

## Cache directory

@@ -13,7 +13,7 @@ with Pipeline("cache_testing") as pipeline:
...
```

This directory can be modified by setting the `DISTILABEL_CACHE_DIR` environment variable (`export DISTILABEL_CACHE_DIR=my_cache_dir`) or by explicitely passing the `cache_dir` variable to the `Pipeline` constructor like so:
This directory can be modified by setting the `DISTILABEL_CACHE_DIR` environment variable (`export DISTILABEL_CACHE_DIR=my_cache_dir`) or by explicitly passing the `cache_dir` variable to the `Pipeline` constructor like so:

```python
with Pipeline("cache_testing", cache_dir="~/my_cache_dir") as pipeline:
@@ -42,7 +42,7 @@ Finally, if we decide to run the same `Pipeline` after it has finished completel

### Serialization

Let's see what get's serialized by looking at a sample `Pipeline`'s cached folder:
Let's see what gets serialized by looking at a sample `Pipeline`'s cached folder:

```bash
$ tree ~/.cache/distilabel/pipelines/73ca3f6b7a613fb9694db7631cc038d379f1f533
@@ -65,7 +65,7 @@ The `Pipeline` will have a signature created from the arguments that define it s

- `pipeline.yaml`

This file contains a representation of the `Pipeline` in *YAML* format. If we push a `Distiset` to the hub as obtained from calling `Pipeline.run`, this file will be stored at our datasets' repository, allowing to reproduce the `Pipeline` using the `CLI`:
This file contains a representation of the `Pipeline` in *YAML* format. If we push a `Distiset` obtained from calling `Pipeline.run` to the Hugging Face Hub, this file will be stored in our dataset's repository, allowing us to reproduce the `Pipeline` using the `CLI`:

```bash
distilabel pipeline run --config "path/to/pipeline.yaml"
@@ -100,5 +100,5 @@ ds

Internally, the function will try to inject the `pipeline_path` variable if it's not passed via argument, assuming
it's in the parent directory of the current one, called `pipeline.yaml`. If the file doesn't exist, it won't
raise any error, but take into account that if the `Distiset` is pushed to the hub, the `pipeline.yaml` won't be
raise any error, but take into account that if the `Distiset` is pushed to the Hugging Face Hub, the `pipeline.yaml` won't be
generated.
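That fallback can be sketched in a few lines of Python; note that `infer_pipeline_path` is a hypothetical helper written for illustration, not part of the distilabel API:

```python
from pathlib import Path
from typing import Optional


def infer_pipeline_path(dataset_path: str) -> Optional[Path]:
    # Sketch of the fallback described above: look for `pipeline.yaml` in the
    # parent directory of the dataset folder, and return None instead of
    # raising when the file is missing.
    candidate = Path(dataset_path).parent / "pipeline.yaml"
    return candidate if candidate.exists() else None


# Quick demonstration with a temporary layout mimicking the cache folder.
import tempfile

with tempfile.TemporaryDirectory() as root:
    data_dir = Path(root) / "data"
    data_dir.mkdir()
    (Path(root) / "pipeline.yaml").write_text("pipeline: {}")
    print(infer_pipeline_path(str(data_dir)).name)  # pipeline.yaml
```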
2 changes: 1 addition & 1 deletion docs/sections/learn/cli.md
@@ -43,7 +43,7 @@ $ distilabel pipeline info --help
╰─────────────────────────────────────────────────────────────────────────────────────╯
```

As we can see from the help message, we need to pass either a `Path` or a `URL`. This second option comes handy for datasets stored in HuggingFace hub, for example:
As we can see from the help message, we need to pass either a `Path` or a `URL`. This second option comes in handy for datasets stored on the Hugging Face Hub, for example:

```bash
distilabel pipeline info --config "https://huggingface.co/datasets/distilabel-internal-testing/ultrafeedback-mini/raw/main/pipeline.yaml"
7 changes: 5 additions & 2 deletions docs/sections/learn/distiset.md
@@ -20,6 +20,9 @@ ds = Distiset(

This object works like a python dictionary (the same approach followed by [`datasets.DatasetDict`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)), where each key corresponds to one of the `leaf_steps` from a `Pipeline`.

!!! NOTE
    If there's only one leaf node, i.e. only one step at the end of the `Pipeline`, then the configuration name won't be the name of the last step; it will be set to `default` instead, as that's more aligned with standard datasets within the Hugging Face Hub.
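That naming rule can be sketched in plain Python; here dicts stand in for the underlying `datasets.Dataset` objects and the step name is made up:

```python
# Outputs of the leaf steps; a plain dict stands in for `datasets.Dataset`.
leaf_outputs = {
    "generate_with_llm": [{"instruction": "hi", "generation": "hello"}],
}

# With a single leaf step, the subset is exposed as "default" so the pushed
# dataset matches standard single-configuration repos on the Hugging Face Hub.
if len(leaf_outputs) == 1:
    distiset = {"default": next(iter(leaf_outputs.values()))}
else:
    distiset = dict(leaf_outputs)

print(list(distiset))  # ['default']
```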

## Distiset methods

We can interact with the different pieces generated by the `Pipeline` and treat them as different [`configurations`](https://huggingface.co/docs/datasets-server/configs_and_splits#configurations). The `Distiset` contains just two methods:
@@ -54,9 +57,9 @@ Distiset({
})
```

### Push to HuggingFace hub
### Push to Hugging Face Hub

Pushes the internal subsets to a huggingface repo, where each one of the subsets will be a different configuration, so it's easy to download them and continue working with any of the pieces.
Pushes the internal subsets to a Hugging Face repo, where each one of the subsets becomes a different configuration, making it easy to download them and continue working with any of the pieces.

```python
ds.push_to_hub(
4 changes: 2 additions & 2 deletions docs/sections/learn/steps/generator_steps.md
@@ -40,7 +40,7 @@ It will yield `GeneratorStepOutput` objects, an iterator of tuples where the fir

Unless we are doing some testing, we are more likely going to work with a proper dataset:

### Load a dataset from HuggingFace hub
### Load a dataset from Hugging Face Hub

The easiest way to ingest data from a dataset is using the [`LoadHubDataset`][distilabel.steps.generators.huggingface] step, let's see an example:

Expand All @@ -57,7 +57,7 @@ load_hub_dataset = LoadHubDataset(
load_hub_dataset.load()
```

We see that creating a step to load a dataset from the hub is almost the same as loading it directly using `datasets.load_dataset`, with one remark, we have to call `.load()` on our step. The reason for this extra step is because internally we want to do the actual processing at the correct moment in the whole pipeline, we don't just need to take care of this call because we are working with it outside of a `Pipeline`.
We see that creating a step to load a dataset from the Hugging Face Hub is almost the same as loading it directly using `datasets.load_dataset`, with one remark: we have to call `.load()` on our step. This extra call is needed because, internally, the actual processing should happen at the right moment within the whole pipeline; we only need to take care of the call ourselves because we are working with the step outside of a `Pipeline`.
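A minimal sketch of why `.load()` is a separate call; the class below is illustrative only, not distilabel's actual implementation:

```python
class LazyStep:
    """Illustrative step that defers expensive set-up to an explicit load()."""

    def __init__(self, repo_id: str) -> None:
        # Construction stays cheap: nothing is fetched yet, so a Pipeline can
        # build its whole graph of steps before doing any real work.
        self.repo_id = repo_id
        self.dataset = None

    def load(self) -> None:
        # In the real step this is where `datasets.load_dataset(...)` would
        # run; here we just simulate the deferred work.
        self.dataset = [{"instruction": f"row from {self.repo_id}"}]


step = LazyStep("distilabel-internal-testing/instruction-dataset-mini")
assert step.dataset is None  # construction did no work
step.load()                  # a Pipeline triggers this at the right moment
assert step.dataset is not None
```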

And let's request the following batch:

6 changes: 3 additions & 3 deletions docs/sections/learn/steps/global_steps.md
@@ -1,8 +1,8 @@
# Global Steps

The global steps are the ones that in order to do it's processing, they will need access to all the data at once. Some examples include creating a dataset to be pushed to the hub, or a filtering step in a `Pipeline`.
The global steps are the ones that need access to all the data at once in order to do their processing. Some examples include creating a dataset to be pushed to the Hugging Face Hub, or a filtering step in a `Pipeline`.

## Push data to HuggingFace Hub in batches
## Push data to Hugging Face Hub in batches

The first example of a `global` step corresponds to [`PushToHub`][distilabel.steps.globals.huggingface]:

@@ -22,7 +22,7 @@ push_to_hub = PushToHub(
)
```

This step can be used to push batches of the dataset to the hub as the process advances, enabling a checkpoint strategy in your pipeline.
This step can be used to push batches of the dataset to the Hugging Face Hub as the process advances, enabling a checkpoint strategy in your pipeline.
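The checkpoint idea can be sketched as follows, with a hypothetical `push` callback standing in for the actual upload performed by `PushToHub`:

```python
def checkpointed_push(rows, batch_size, push):
    # Flush accumulated rows every `batch_size` items so progress is
    # persisted as the pipeline advances, instead of only at the very end.
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == batch_size:
            push(buffer)  # one "commit" to the Hub per full batch
            buffer = []
    if buffer:
        push(buffer)      # flush the remainder


pushed = []
checkpointed_push(range(7), batch_size=3, push=lambda b: pushed.append(list(b)))
print(pushed)  # [[0, 1, 2], [3, 4, 5], [6]]
```

If the process dies mid-run, everything already pushed survives, which is the essence of the checkpoint strategy mentioned above.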

## Data Filtering

2 changes: 1 addition & 1 deletion docs/sections/learn/steps/index.md
@@ -95,7 +95,7 @@ This is a small type step that shows what to expect when we are creating our `St

## Runtime Parameters

Let's take a look at a special argument implementation that we will find when dealing with the `Steps`, the [Runtime paramaters][distilabel.mixins.runtime_parameters.RuntimeParameter]. Let's inspect them using the previous example class:
Let's take a look at a special argument implementation that we will find when dealing with the `Steps`, the [Runtime parameters][distilabel.mixins.runtime_parameters.RuntimeParameter]. Let's inspect them using the previous example class:

```python
print(conversation_template.runtime_parameters_names)
4 changes: 2 additions & 2 deletions docs/sections/learn/tasks/feedback_tasks.md
@@ -5,7 +5,7 @@ This section presents tasks that work on the `LLM` output to return some feedbac

## UltraFeedback

[`UltraFeedback`][distilabel.steps.tasks.ultrafeedback] is a `Task` inspired from [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377), where the authors present the methodology that leaded to the creation of their famous dataset:
[`UltraFeedback`][distilabel.steps.tasks.ultrafeedback] is a `Task` inspired by [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377), where the authors present the methodology that led to the creation of their famous dataset:

```python
from distilabel.steps.tasks import UltraFeedback
@@ -65,7 +65,7 @@ Let's see what this different aspects mean.

### Different aspects of UltraFeedback

The `UltraFeedback` paper proposes different types of aspect to rate the answers: `helpfulness`, `honesty`, `instruction-following`, `truthfulness`. If one want's to rate the responses according to the 4 aspects, it would imply running the `Pipeline` 4 times, incurring in more costs and time of processing. For that reason, we decided to include an extra aspect, which tries to sum up the other ones to return a special type of summary: `overall-rating`.
The `UltraFeedback` paper proposes different aspects to rate the answers: `helpfulness`, `honesty`, `instruction-following`, `truthfulness`. If one wants to rate the responses according to the 4 aspects, it would imply running the `Pipeline` 4 times, incurring more costs and processing time. For that reason, we decided to include an extra aspect that tries to sum up the other ones to return a special type of summary: `overall-rating`.

!!! Note
Take a look at this task in a complete `Pipeline` at [`UltraFeedback`](../../papers/ultrafeedback.md), where you can follow the paper implementation.
2 changes: 1 addition & 1 deletion docs/sections/learn/tasks/text_generation.md
@@ -75,7 +75,7 @@ from distilabel.steps.tasks.text_generation import TextGeneration
system_prompt = "You are an AI judge in charge of determining the equality of two instructions. "

wizardllm_equal_prompt = """Here are two Instructions, do you think they are equal to each other and meet the following requirements?:
1. They have the same constraints and requirments.
1. They have the same constraints and requirements.
2. They have the same depth and breadth of the inquiry.
The First Prompt: {instruction_1}
The Second Prompt: {instruction_2}
2 changes: 1 addition & 1 deletion docs/sections/papers/deita.md
@@ -346,7 +346,7 @@ distiset = pipeline.run(
)
```

We can push the results to the hub:
We can push the results to the Hugging Face Hub:

```python
distiset.push_to_hub("distilabel-internal-testing/deita-colab")

This file was deleted.

This file was deleted.

