Commit
Add documentation for the pipeline script uploaded
plaguss committed Jul 1, 2024
1 parent e5b28ad commit 7fbdd5a
Showing 2 changed files with 22 additions and 2 deletions.
17 changes: 17 additions & 0 deletions docs/sections/how_to_guides/advanced/distiset.md
@@ -67,9 +67,26 @@ distiset.push_to_hub(
     commit_message="Initial commit",
     private=False,
     token=os.getenv("HF_TOKEN"),
+    generate_card=True,
+    include_script=False
 )
 ```

+!!! info "New since version 1.3.0"
+    Since version `1.3.0` you can automatically push the script that created your pipeline to the same repository. For example, assuming you have a file like the following:
+
+    ``` py title="sample_pipe.py"
+    with Pipeline() as pipe:
+        ...
+    distiset = pipe.run()
+    distiset.push_to_hub(
+        "my-org/my-dataset",
+        include_script=True
+    )
+    ```
+
+    After running the script, the repository will contain the file `sample_pipe.py`, which simplifies sharing your pipeline with the community.
+
 ### Save and load from disk

 Take into account that these methods work as `datasets.load_from_disk` and `datasets.Dataset.save_to_disk` so the arguments are directly passed to those methods. This means you can also make use of `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
7 changes: 5 additions & 2 deletions src/distilabel/distiset.py
@@ -62,7 +62,7 @@ def push_to_hub(
         private: bool = False,
         token: Optional[str] = None,
         generate_card: bool = True,
-        include_script: bool = True,
+        include_script: bool = False,
         **kwargs: Any,
     ) -> None:
         """Pushes the `Distiset` to the Hugging Face Hub, each dataset will be pushed as a different configuration
@@ -84,7 +84,10 @@ def push_to_hub(
                 Whether to generate a dataset card or not. Defaults to True.
             include_script:
                 Whether you want to push the pipeline script to the hugging face hub to share it.
-                Defaults to True
+                If set to True, the name of the script that was run to create the distiset will be
+                automatically determined, and that will be the name of the file uploaded to your
+                repository. Note that this only makes sense for a distiset obtained by calling
+                the `Pipeline.run()` method. Defaults to False.
             **kwargs:
                 Additional keyword arguments to pass to the `push_to_hub` method of the `datasets.Dataset` object.
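
The updated docstring says the script name is "automatically determined" but this hunk does not show how. The following is a minimal standard-library sketch of one way that could work, not distilabel's actual implementation: `detect_script_path` and `script_upload_name` are hypothetical helpers, and inspecting the `__main__` module is an assumption made here for illustration.

```python
import sys
from pathlib import Path
from typing import Optional


def detect_script_path() -> Optional[Path]:
    """Return the path of the script currently being run, or None.

    Hypothetical helper (not part of distilabel). It inspects the
    `__main__` module, which has no `__file__` in an interactive session.
    """
    main_module = sys.modules.get("__main__")
    script = getattr(main_module, "__file__", None)
    if not script:
        return None
    path = Path(script).resolve()
    # Only report a path that points at a real file on disk.
    return path if path.is_file() else None


def script_upload_name(script_path: Optional[Path]) -> Optional[str]:
    """Name the uploaded file after the script itself, e.g. `sample_pipe.py`."""
    return script_path.name if script_path is not None else None


if __name__ == "__main__":
    # Prints the running script's file name, or None when there is no script.
    print(script_upload_name(detect_script_path()))
```

Returning `None` when no script can be found mirrors the docstring's caveat: a distiset that was not produced by running a pipeline script has no obvious file to upload, which is also a plausible reason for defaulting `include_script` to `False`.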