Commit

Add small check to add pipeline yaml and add hint in docs (#534)
plaguss committed Apr 15, 2024
1 parent 4d60415 commit 1b3c1c1
Showing 2 changed files with 18 additions and 1 deletion.
11 changes: 10 additions & 1 deletion docs/sections/learn/caching.md
@@ -75,7 +75,9 @@ The `Pipeline` will have a signature created from the arguments that define it s

Folder that stores the data generated, with a special folder to keep track of each `leaf_step` separately. We can recreate a `Distiset` from the contents of this folder (*Parquet* files), as we will see next.

In case we wanted to regenerate the dataset from the `cache` folder for whatever reason, we can do it using the `create_distiset` and passing the path to the `/data` folder inside our `Pipeline`:
## create_distiset

In case we wanted to regenerate the dataset from the `cache`, we can do it using the [`create_distiset`][distilabel.distiset.create_distiset] and passing the path to the `/data` folder inside our `Pipeline`:

```python
from pathlib import Path
@@ -93,3 +95,10 @@ ds
# })
# })
```

!!! Note

    Internally, the function will try to inject the `pipeline_path` variable if it's not passed via argument, assuming
    it's in the parent directory of the current one, called `pipeline.yaml`. If the file doesn't exist, it won't
    raise any error, but take into account that if the `Distiset` is pushed to the hub, the `pipeline.yaml` won't be
    generated.
8 changes: 8 additions & 0 deletions src/distilabel/distiset.py
@@ -199,6 +199,8 @@ def create_distiset(data_dir: Path, pipeline_path: Optional[Path] = None) -> Dis
    """
    logger = logging.getLogger("distilabel.distiset")

    data_dir = Path(data_dir)

    distiset = Distiset()
    for file in data_dir.iterdir():
        if file.is_file():
@@ -221,5 +223,11 @@ def create_distiset(data_dir: Path, pipeline_path: Optional[Path] = None) -> Dis

    if pipeline_path:
        distiset.pipeline_path = pipeline_path
    else:
        # If the pipeline path is not provided, try to find it in the parent directory
        # and assume that's the wanted file.
        pipeline_path = data_dir.parent / "pipeline.yaml"
        if pipeline_path.exists():
            distiset.pipeline_path = pipeline_path

    return distiset
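
The fallback added in this hunk can be tried in isolation. The following is a minimal sketch, not distilabel code: `resolve_pipeline_path` is a hypothetical helper that mirrors only the path logic shown above (the `Path(data_dir)` coercion and the lookup of `pipeline.yaml` in the parent directory), and the temporary folder layout is an assumption made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union


def resolve_pipeline_path(
    data_dir: Union[str, Path], pipeline_path: Optional[Path] = None
) -> Optional[Path]:
    """Hypothetical helper mirroring the fallback in this commit (not part of distilabel)."""
    # Coerce the argument so plain strings work too, as `data_dir = Path(data_dir)` does.
    data_dir = Path(data_dir)
    if pipeline_path:
        # An explicitly passed path always wins.
        return pipeline_path
    # Otherwise look for `pipeline.yaml` next to the data folder.
    candidate = data_dir.parent / "pipeline.yaml"
    return candidate if candidate.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    data_dir = cache / "data"
    data_dir.mkdir()

    # No pipeline.yaml yet: nothing is resolved and no error is raised.
    missing = resolve_pipeline_path(str(data_dir))

    # Once the file exists in the parent directory, it is picked up.
    (cache / "pipeline.yaml").write_text("pipeline: {}\n")
    found = resolve_pipeline_path(str(data_dir))
    found_name = found.name
```

Returning `None` rather than raising matches the behaviour described in the docs note: a missing `pipeline.yaml` is tolerated, at the cost of the file not being available later.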
