…IDIA#332)

* save progress
* add remaining docs
* add titles and table
* remove trailing whitespace
* add --help instructions

---------

Signed-off-by: Sarah Yurick <[email protected]>
1 parent bc724ec · commit d1f52f6 · Showing 18 changed files with 208 additions and 22 deletions.
@@ -0,0 +1,25 @@
# NeMo Curator Python API examples

This directory contains multiple Python scripts with examples of how to use various NeMo Curator classes and functions.
The goal of these examples is to give you an overview of the many ways your text data can be curated.
These include:

| Python Script | Description |
|---------------------------------------|----------------------------------------------------------------------------------------------------------------|
| blend_and_shuffle.py | Combine multiple datasets into one, with different amounts of each dataset, then randomly permute the dataset. |
| classifier_filtering.py | Train a fastText classifier, then use it to filter high- and low-quality data. |
| download_arxiv.py | Download arXiv tar files and extract them. |
| download_common_crawl.py | Download Common Crawl WARC snapshots and extract them. |
| download_wikipedia.py | Download the latest Wikipedia dumps and extract them. |
| exact_deduplication.py | Use the `ExactDuplicates` class to perform exact deduplication on text data. |
| find_pii_and_deidentify.py | Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. |
| fuzzy_deduplication.py | Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. |
| identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix the Unicode in it. |
| raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. |
| semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. |
| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. |
| translation_example.py | Create and use an `IndicTranslation` model for language translation. |
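
As a taste of the API style these scripts share, exact deduplication can be sketched as below. This is a minimal sketch rather than a copy of `exact_deduplication.py`: the input path is a placeholder, and the `get_client` keyword form follows the current documentation (older releases pass parsed command-line arguments instead), so defer to the script itself for the exact interface.

```python
from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client

# Start a Dask client; deduplication can run on CPU or GPU clusters.
client = get_client(cluster_type="cpu")

# Read JSONL documents into a DocumentDataset ("input_data/" is a placeholder).
dataset = DocumentDataset.read_json("input_data/", backend="pandas")

# Hash each document's text field and collect documents whose hashes collide.
exact_duplicates = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
duplicates = exact_duplicates(dataset)
```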

Before running any of these scripts, we strongly recommend running `python <script name>.py --help` first to ensure that all needed or relevant arguments are specified.

The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain even more examples of NeMo Curator's capabilities.
@@ -0,0 +1,21 @@
## Text Classification

The Python scripts in this directory demonstrate how to run classification on your text data with each of these four classifiers:

- Domain Classifier
- Quality Classifier
- AEGIS Safety Models
- FineWeb Educational Content Classifier

For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).

Each of these scripts provides a simple example of what your own Python scripts might look like.

At a high level, you will:

1. Create a Dask client by using the `get_client` function
2. Use `DocumentDataset.read_json` (or `DocumentDataset.read_parquet`) to read your data
3. Initialize and call the classifier on your data
4. Write your results to the desired output type with `to_json` or `to_parquet`
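
Put together, those four steps look roughly like the sketch below, using the Domain Classifier as an example. The paths are placeholders and the `get_client` keyword form follows the current documentation, so treat this as an outline rather than a drop-in script.

```python
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client

# 1. Create a Dask client; these classifiers run on GPU.
client = get_client(cluster_type="gpu")

# 2. Read JSONL data, keeping filenames so outputs mirror the input layout.
dataset = DocumentDataset.read_json("input_data/", backend="cudf", add_filename=True)

# 3. Initialize the classifier and call it on the data.
classifier = DomainClassifier(batch_size=64)
result = classifier(dataset)

# 4. Write the labeled documents to the desired output format.
result.to_json("output_data/", write_to_filename=True)
```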

Before running any of these scripts, we strongly recommend running `python <script name>.py --help` first to ensure that all needed or relevant arguments are specified.
@@ -0,0 +1,5 @@
## Kubernetes

The `create_dask_cluster.py` script can be used to create a CPU or GPU Dask cluster, as in the example below.
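
This invocation is a sketch: the flag names are assumptions drawn from the Kubernetes guide linked below, and every value is a placeholder, so confirm the exact arguments with `python create_dask_cluster.py --help`.

```bash
# Create a two-worker GPU Dask cluster; all values are placeholders.
python create_dask_cluster.py \
    --name dask-gpu-cluster \
    --n_workers 2 \
    --n_gpus_per_worker 1 \
    --image nvcr.io/nvidia/nemo:24.05 \
    --image_pull_secret my-registry-secret \
    --pvcs nemo-workspace:/nemo-workspace
```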

See [Running NeMo Curator on Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/kubernetescurator.html) for more information.
@@ -0,0 +1,5 @@
## NeMo-Run

The `launch_slurm.py` script shows an example of how to run a Slurm job via Python APIs.
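
At its core, such a script builds a NeMo-Run executor that describes the Slurm cluster and then submits work to it. The sketch below is an assumption-laden outline rather than a copy of `launch_slurm.py`: the account, partition, host, and paths are placeholders, and the exact `SlurmExecutor` arguments can vary between NeMo-Run versions.

```python
import nemo_run as run

# Describe how to reach and schedule on the Slurm cluster; all values are placeholders.
executor = run.SlurmExecutor(
    account="my-account",
    partition="gpu",
    nodes=1,
    ntasks_per_node=1,
    time="00:30:00",
    tunnel=run.SSHTunnel(host="login-node", user="me", job_dir="/home/me/nemo-run"),
)

# Submit a trivial inline script as a Slurm job.
run.run(run.Script(inline="echo 'hello from Slurm'"), executor=executor)
```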

See the [Dask with Slurm](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html?highlight=slurm#dask-with-slurm) and [NeMo-Run Quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html?highlight=slurm#execute-on-a-slurm-cluster) pages for more information.
@@ -0,0 +1,9 @@
# Dask with Slurm

This directory provides an example Slurm script pipeline.
The pipeline's `start-slurm.sh` script provides configuration options similar to what `get_client` provides.
Every Slurm cluster is different, so make sure you understand how your Slurm cluster works before adapting the scripts.
`start-slurm.sh` calls `container-entrypoint.sh`, which sets up a Dask scheduler and workers across the cluster.

Our Python examples are designed so that they can be run locally on their own or easily substituted into the `start-slurm.sh` script to run on multiple nodes.
You can adapt your own scripts the same way by following the pattern of adding `get_client` with `add_distributed_args`, as in the sketch below.
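
Concretely, the pattern looks something like this sketch. It assumes the argparse-based style used throughout `examples/`, where `add_distributed_args` attaches the shared cluster flags to the parser and `get_client` consumes the parsed arguments; check any bundled example for the exact form in your NeMo Curator version.

```python
import argparse

from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.utils.script_utils import add_distributed_args


def main(args):
    # Locally this starts a fresh Dask cluster; under start-slurm.sh it
    # attaches to the scheduler that container-entrypoint.sh launched.
    client = get_client(args, args.device)
    # ... your curation logic goes here ...


def attach_args(parser=argparse.ArgumentParser()):
    # Adds the shared Dask/cluster arguments (scheduler address, device, etc.).
    return add_distributed_args(parser)


if __name__ == "__main__":
    main(attach_args().parse_args())
```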
@@ -0,0 +1,29 @@
# NeMo Curator CLI Scripts

The following Python scripts are designed to be executed from the command line (terminal) only.

Here, we list all of the Python scripts and their terminal commands:

| Python Command | CLI Command |
|------------------------------------------|--------------------------------|
| python add_id.py | add_id |
| python blend_datasets.py | blend_datasets |
| python download_and_extract.py | download_and_extract |
| python filter_documents.py | filter_documents |
| python find_exact_duplicates.py | gpu_exact_dups |
| python find_matching_ngrams.py | find_matching_ngrams |
| python find_pii_and_deidentify.py | deidentify |
| python get_common_crawl_urls.py | get_common_crawl_urls |
| python get_wikipedia_urls.py | get_wikipedia_urls |
| python make_data_shards.py | make_data_shards |
| python prepare_fasttext_training_data.py | prepare_fasttext_training_data |
| python prepare_task_data.py | prepare_task_data |
| python remove_matching_ngrams.py | remove_matching_ngrams |
| python separate_by_metadata.py | separate_by_metadata |
| python text_cleaning.py | text_cleaning |
| python train_fasttext.py | train_fasttext |
| python verify_classification_results.py | verify_classification_results |

For more information about the arguments needed for each script, run any of the CLI commands with `--help`, e.g. `add_id --help`.
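
For example, these two invocations print the same usage information:

```bash
# Equivalent ways to inspect the arguments of the add_id script
python add_id.py --help
add_id --help
```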

More scripts can be found in the `classifiers`, `fuzzy_deduplication`, and `semdedup` subdirectories.
@@ -0,0 +1,92 @@
## Text Classification

The Python scripts in this directory demonstrate how to run classification on your text data with each of these four classifiers:

- Domain Classifier
- Quality Classifier
- AEGIS Safety Models
- FineWeb Educational Content Classifier

For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).

### Usage

#### Domain classifier inference

```bash
# same as `python domain_classifier_inference.py`
domain_classifier_inference \
    --input-data-dir /path/to/data/directory \
    --output-data-dir /path/to/output/directory \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --output-file-type "jsonl" \
    --input-text-field "text" \
    --batch-size 64 \
    --autocast \
    --max-chars 2000 \
    --device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `domain_classifier_inference --help` for more information.

#### Quality classifier inference

```bash
# same as `python quality_classifier_inference.py`
quality_classifier_inference \
    --input-data-dir /path/to/data/directory \
    --output-data-dir /path/to/output/directory \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --output-file-type "jsonl" \
    --input-text-field "text" \
    --batch-size 64 \
    --autocast \
    --max-chars 2000 \
    --device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `quality_classifier_inference --help` for more information.

#### AEGIS classifier inference

```bash
# same as `python aegis_classifier_inference.py`
aegis_classifier_inference \
    --input-data-dir /path/to/data/directory \
    --output-data-dir /path/to/output/directory \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --output-file-type "jsonl" \
    --input-text-field "text" \
    --batch-size 64 \
    --max-chars 6000 \
    --device "gpu" \
    --aegis-variant "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0" \
    --token "hf_1234"
```

- `--aegis-variant` can be `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0`, `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0`, or a path to your own PEFT (parameter-efficient fine-tuned) variant of LlamaGuard 2.
- `--token` is your Hugging Face token, which is used when downloading the base Llama Guard model.

Additional arguments may be added for customizing a Dask cluster and client. Run `aegis_classifier_inference --help` for more information.

#### FineWeb-Edu classifier inference

```bash
# same as `python fineweb_edu_classifier_inference.py`
fineweb_edu_classifier_inference \
    --input-data-dir /path/to/data/directory \
    --output-data-dir /path/to/output/directory \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --output-file-type "jsonl" \
    --input-text-field "text" \
    --batch-size 64 \
    --autocast \
    --max-chars 2000 \
    --device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `fineweb_edu_classifier_inference --help` for more information.