Merge pull request #22 from facebookresearch/tuan/update_slurm_docs
Update documentation
antoine-tran authored Jan 29, 2025
2 parents d640223 + 69ac5ac commit fd7db80
Showing 6 changed files with 44 additions and 26 deletions.
49 changes: 31 additions & 18 deletions examples/evaluation/README.md
@@ -7,19 +7,15 @@ After you have trained an LCM, the checkpoint will be saved in a folder under th
Since an LCM expects input data at the sentence level, we need to preprocess the evaluation datasets accordingly. This includes parsing the raw content,
splitting texts into sentences, and then embedding them into vectors using a SONAR encoder.

The example below shows how we prepare the data for CNN Dailymail. We load the dataset from Huggingface using [`datasets` API](https://huggingface.co/docs/datasets/en/index). The sentence splitting is done using [wtpsplit](https://github.com/segment-any-text/wtpsplit). First, we install necessary libraries:

```shell
python -m pip install datasets wtpsplit
```
The example below shows how we prepare the data for CNN DailyMail. We load the dataset from Hugging Face using the [`datasets` API](https://huggingface.co/docs/datasets/en/index). The sentence splitting is done using [wtpsplit](https://github.com/segment-any-text/wtpsplit). Make sure to specify `--extra data` when installing the project to include these libraries.
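
For orientation, here is a minimal sketch of what this preprocessing boils down to when done by hand with the `datasets`, wtpsplit and SONAR APIs (the checkpoint names `sat-3l-sm` and `text_sonar_basic_encoder` are illustrative assumptions; the actual script below wires this up through stopes pipelines instead):

```python
from datasets import load_dataset
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from wtpsplit import SaT

# Load one split of CNN DailyMail from the Hugging Face Hub.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")

# Split a single article into sentences.
splitter = SaT("sat-3l-sm")  # illustrative wtpsplit checkpoint
sentences = splitter.split(dataset[0]["article"])

# Embed each sentence into a SONAR vector.
encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
embeddings = encoder.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)  # roughly (num_sentences, 1024)
```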

All processing logic is implemented in the file `prepare_evaluation_data.py`, as described below.

### Step 1.1: Process the split:
Next, we download and parse the content (source text and summaries), saving different splits into JSON format

```shell
python prepare_evaluation_data.py prepare_data \
uv run --extra data prepare_evaluation_data.py prepare_data \
--dataset_name=cnn_dailymail \
--output_dir=jsonl_dataset \
--source_text_column=article \
@@ -41,16 +37,30 @@ The output will be stored in different files `[split].jsonl` under the directory
To perform sentence splitting and sonar embedding for each split, run the following command:

```shell
python prepare_evaluation_data.py embed \
uv run --extra data prepare_evaluation_data.py embed \
--input_path=jsonl_dataset/cnn_dailymail/test.jsonl \
--input_column=article \
--output_column=highlights \
--source_text_column=prompt \
--target_text_column=answer \
--output_dir=parquet_dataset/cnn_dailymail \
--lang=eng_Latn \
--mode=slurm \
--mode=local \
--log_dir=/tmp/logs/embed_cnndm
```

Depending on your machine, this might take some time. Alternatively, you can run it on your SLURM cluster with the arguments `--mode=slurm --shards=NO_OF_PARALLEL_JOBS`. This requires adjusting your SLURM config accordingly. We use [submitit](https://github.com/facebookincubator/submitit) to configure the job launcher. Here is the relevant excerpt from the script:

```python
launcher = Launcher(
cache=None,
config_dump_dir=Path(log_dir) / "conf",
log_folder=Path(log_dir) / "logs",
cluster=mode,
update_parameters={"partition": "your_slurm_partition"},
)
_ = await launcher.schedule(inst_stopes_module)
```
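
Once the embedding step finishes (locally or on SLURM), a quick way to sanity-check the produced Parquet dataset is to inspect its schema; a minimal sketch, assuming the default column names shown later in this README:

```python
import pyarrow.parquet as pq

# Read every Parquet fragment written under the output directory.
table = pq.read_table("parquet_dataset/cnn_dailymail")
print(table.num_rows)
print(table.schema)  # expect columns such as prompt_sentences and prompt_sentences_sonar_emb
```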



## Step 2: Choose the predictor for evaluation

@@ -121,7 +131,7 @@ uv run torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
--dump_dir output_results
```

Note the missing parameters `source_text_column` and `target_text_column` and the new parameters `source_prefix_text`, `target_prefix_text`, since in this case, we do not modify the column schema, therefore the original text columns ("article", "highlights") are kept and not specified in the CLI.
> **_NOTE:_** The parameters `source_text_column` and `target_text_column` are omitted, and the new parameters `source_prefix_text`, `target_prefix_text` appear instead, because we do not modify the column schema: the original text columns ("article", "highlights") are kept and do not need to be specified in the CLI.

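For intuition, the prefix text is simply prepended to the unchanged text column when the prompt is built; a hypothetical illustration (the strings below are made up):

```python
# Hypothetical illustration of how a prefix composes the prompt.
article = "The quick brown fox jumped over the lazy dog."  # value of the "article" column
source_prefix_text = "Summarize the following article:\n"  # made-up prefix
prompt = source_prefix_text + article  # roughly what the model receives as input
print(prompt)
```
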
It is also possible to provide the prompt from a YAML file. This is handy when you have to engineer the prompts carefully and have a very long detailed text. We provide one example prompt in the file [instruction.yaml](./instruction.yaml). The example command is:

@@ -151,6 +161,10 @@ uv run torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
--tasks lcm_generation \
--task_args '{"max_gen_len": 200}' \
--dataset.parquet_path parquet_dataset/cnn_dailymail \
--dataset.source_column prompt_sentences_sonar_emb \
--dataset.source_text_column prompt_sentences \
--dataset.target_column answer_sentences_sonar_emb \
--dataset.target_text_column answer_sentences \
--data_loading.batch_size 16 \
--dump_dir output_results
```
@@ -168,13 +182,12 @@ Similar to LLM evaluation, it is possible to specify the prompt prefix and suffi
| `data_loading.batch_size` | Load and evaluate data in batches. By default `batch_size=10` |
| `dataset_dir` | The directory containing the JSONL files processed in Step 1. Only used in LLM evaluation
| `dataset.parquet_path` | The Parquet path containing the Parquet files processed in Step 1. Only used in LCM evaluation
| `dataset.source_column` | The column in the data that refers to the input embedding. Not applicable when evaluating LLMs
| `dataset.source_text_column` | The column in the data that refers to the input text. Not applicable when evaluating LCMs
| `dataset.target_column` | The column in the data that refers to the ground-truth embedding. Not applicable when evaluating LLMs
| `dataset.target_text_column` | The column in the data that refers to the ground-truth text. Not applicable when evaluating LCMs
| `dataset.source_column` | The column in the data that refers to the input embedding. Not applicable when evaluating LLMs.
| `dataset.source_text_column` | The column in the data that refers to the input text.
| `dataset.target_column` | The column in the data that refers to the ground-truth embedding. Not applicable when evaluating LLMs.
| `dataset.target_text_column` | The column in the data that refers to the ground-truth text.
| `dataset.source_text_prefix` | The text that will be prepended to each input text to make the prompt for the model.
| `dataset.source_text_prefix` | The text that will appended after each input text to make the prompt for the model.
| `dataset.source_text_suffix` | The text that will be appended after each input text to make the prompt for the model.
| `task_args` | The JSON-formatted string that represents the task arguments. See [task param list](#task_param_list) below.
| `dump_dir` | The directory containing the output of the eval run. If successful, there should be a file `metrics.eval.jsonl` with the metric results, a directory `results` capturing the verbose command used and the detailed output scores, and a directory `raw_results` with the model output for each individual sample, together with the per-sample metric results.
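
To inspect the results programmatically after a run, here is a small sketch assuming the default layout described above:

```python
import json
from pathlib import Path

dump_dir = Path("output_results")  # the --dump_dir passed to the eval run

# metrics.eval.jsonl holds the aggregated metric results, one JSON record per line.
with open(dump_dir / "metrics.eval.jsonl") as fp:
    for line in fp:
        print(json.loads(line))
```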
@@ -223,7 +236,7 @@ shards=NUMBER_OF_SLURM_NODES
timeout_min=JOB_TIMEOUT_IN_MINUTES


python -m lcm.evaluation \
uv run -m lcm.evaluation \
--predictor two_tower_diffusion_lcm \
--model_card path/to/the/model_card.yaml \
--generator_batch_size 16 \
4 changes: 3 additions & 1 deletion examples/evaluation/prepare_evaluation_data.py
@@ -195,6 +195,7 @@ async def embed(
target_text_column: Optional[str] = OUTPUT_KEY,
lang: str = "eng_Latn",
mode: Literal["local", "slurm"] = "local",
shards: int = 1,
log_dir: Optional[str] = None,
):
inst_sonar_config = SonarColumnRenameAndEmbedConfig(
@@ -212,6 +213,7 @@
Path(input_path),
batch_size=10, # iterating by small number of documents
batch_format=BatchFormat.ARROW,
num_shards=shards,
)

output_config = ParquetOutputConfig(output_dir)
@@ -230,7 +232,7 @@
config_dump_dir=Path(log_dir) / "conf",
log_folder=Path(log_dir) / "logs",
cluster=mode,
update_parameters={"slurm_qos": "lcm_pretrain"},
update_parameters={"partition": "learn"},
)
_ = await launcher.schedule(inst_stopes_module)

6 changes: 4 additions & 2 deletions lcm/train/two_tower_diffusion_lcm/criterion.py
@@ -3,7 +3,7 @@
#
#

from dataclasses import dataclass
from dataclasses import dataclass, field
from typing import List, Tuple

import torch
@@ -33,7 +33,9 @@ class TowerDiffusionLCMCriterionConfig(LCMCriterionConfig):
Note that this requires the model to be set with
`trained_with_cf_guidance = True`!
"""
step_sampling: StepsSamplerConfig = StepsSamplerConfig()
step_sampling: StepsSamplerConfig = field(
default_factory=lambda: StepsSamplerConfig()
)

log_losses_per_timestep_bucket: bool = False
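
The switch to `default_factory` above (mirrored in `tests/units/training/test_get_trainer.py` further down) is the standard fix for dataclass defaults that newer Python versions (3.11+) reject as mutable, since a plain dataclass instance is unhashable. A self-contained sketch with a made-up stand-in config:

```python
from dataclasses import dataclass, field


@dataclass
class SamplerConfig:  # made-up stand-in for StepsSamplerConfig
    nb_steps: int = 100


try:

    @dataclass
    class BadConfig:
        # On Python 3.11+ this raises ValueError: SamplerConfig() is unhashable
        # (plain dataclasses set __hash__ = None), so it is rejected as a
        # mutable default value.
        step_sampling: SamplerConfig = SamplerConfig()

except ValueError as err:
    print(f"rejected: {err}")


@dataclass
class GoodConfig:
    # default_factory builds a fresh SamplerConfig for every GoodConfig()
    # instead of sharing one default object across all instances.
    step_sampling: SamplerConfig = field(default_factory=lambda: SamplerConfig())


print(GoodConfig().step_sampling)
```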

3 changes: 2 additions & 1 deletion scripts/prepare_wikipedia.py
@@ -87,8 +87,9 @@ def run(output_dir: Path):
cache=None,
cluster="local",
# for SLURM you can set some parameters of the launcher here
# cluster="slurm",
# update_parameters={
# "slurm_partition": "YOURPARTITION",
# "partition": "learn",
# },
)

2 changes: 1 addition & 1 deletion tests/units/training/test_get_trainer.py
@@ -23,7 +23,7 @@ def __post_init__(self):
@dataclass
class Config:
foobar: str = "test"
cfg: Foo = Foo()
cfg: Foo = field(default_factory=lambda: Foo())
c: float = field(init=False)

def __post_init__(self):
6 changes: 3 additions & 3 deletions uv.lock

