Commit

Add citation
Signed-off-by: Ryan Wolf <[email protected]>
ryantwolf committed Mar 26, 2024
1 parent ea680b0 commit 72230e2
Showing 150 changed files with 7,353 additions and 5,909 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
name: Test Python package
on:
push:
branches:
- main
pull_request:
workflow_dispatch:

# When this workflow is queued, automatically cancel any previous running
# or pending jobs from the same branch
concurrency:
group: test-${{ github.ref }}
cancel-in-progress: true

jobs:
build_and_test:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest]
python-version: ["3.10"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install NeMo-Curator and pytest
# TODO: Remove pytest when optional test dependencies are added to setup.py

# Installing wheel beforehand due to fasttext issue:
# https://github.com/facebookresearch/fastText/issues/512#issuecomment-1837367666
# Explicitly install cython: https://github.com/VKCOM/YouTokenToMe/issues/94
run: |
pip install wheel cython
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com .
pip install pytest
- name: Run tests
# TODO: Remove env variable when gpu dependencies are optional
run: |
RAPIDS_NO_INITIALIZE=1 python -m pytest -v --cpu
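The test step above can be reproduced locally with a small sketch like the following (assuming NeMo Curator and pytest are already installed, as in the workflow):

```python
import os

# Mirror the CI step: RAPIDS_NO_INITIALIZE=1 keeps the GPU-capable
# dependencies from initializing eagerly during a CPU-only test run.
os.environ["RAPIDS_NO_INITIALIZE"] = "1"

# Equivalent to `python -m pytest -v --cpu` in the workflow step above.
PYTEST_ARGS = ["-v", "--cpu"]

if __name__ == "__main__":
    import pytest  # assumed installed, per the "Install NeMo-Curator and pytest" step

    raise SystemExit(pytest.main(PYTEST_ARGS))
```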
47 changes: 47 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,47 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

default_language_version:
python: python3

ci:
autofix_prs: true
autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions'
autoupdate_schedule: quarterly

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: check-added-large-files
args: ['--maxkb=1000']
- id: check-case-conflict
- id: check-yaml
- id: detect-private-key
- id: end-of-file-fixer
- id: requirements-txt-fixer
- id: trailing-whitespace

- repo: https://github.com/psf/black
rev: 24.3.0
hooks:
- id: black
name: Format code

- repo: https://github.com/PyCQA/isort
rev: 5.13.2
hooks:
- id: isort
name: Format imports
exclude: docs/
3 changes: 0 additions & 3 deletions .style.yapf

This file was deleted.

25 changes: 25 additions & 0 deletions CITATION.cff
@@ -0,0 +1,25 @@
cff-version: 1.0.0
message: "If you use this software, please cite it as below."
title: "NeMo-Curator: a toolkit for data curation"
repository-code: https://github.com/NVIDIA/NeMo-Curator
authors:
- family-names: Jennings
given-names: Joseph
- family-names: Patwary
given-names: Mostofa
- family-names: Subramanian
given-names: Sandeep
- family-names: Prabhumoye
given-names: Shrimai
- family-names: Dattagupta
given-names: Ayush
- family-names: Jawa
given-names: Vibhu
- family-names: Liu
given-names: Jiwei
- family-names: Wolf
given-names: Ryan
- family-names: Yurick
given-names: Sarah
- family-names: Singh
given-names: Varun
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -52,7 +52,7 @@ We use ``black`` as our style guide. To fix your format run `pip install pre-com
1. Minimize the use of ``**kwargs``.
1. ``RaiseError`` is preferred to ``assert``. Write: ```if X: raise Error``` instead of ```assert X```.
1. Classes are preferred to standalone methods.
1. Methods should be atomic. A method shouldn't be longer than 75 lines, e.g. can be fit into the computer screen without scrolling.
1. Methods should be atomic. A method shouldn't be longer than 88 lines, e.g. can be fit into the computer screen without scrolling.
1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability.
1. Add ``__init__.py`` for every folder.
1. F-strings are preferred to formatted strings.
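A minimal, hypothetical function illustrating several of the rules above (explicit `raise` instead of `assert`, one argument per line, f-strings):

```python
def words_per_line(
    num_words: int,
    num_lines: int,
) -> float:
    # Explicit raise rather than `assert`, per the guideline above.
    if num_lines <= 0:
        raise ValueError(f"num_lines must be positive, got {num_lines}")
    # An f-string is preferred for any formatted output.
    return num_words / num_lines
```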
12 changes: 7 additions & 5 deletions README.md
@@ -14,7 +14,7 @@ We currently support the following data-curation modules. For more details on ea
- [Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
- Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
- [Quality filtering](docs/user-guide/QualityFiltering.rst)
- Multilingual heuristic-based filtering
- Multilingual heuristic-based filtering
- Classifier-based filtering via [fastText](https://fasttext.cc/)
- [Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
- Both exact and fuzzy deduplication are accelerated using cuDF and Dask.
@@ -43,7 +43,9 @@ NeMo Curator can be installed manually by cloning the repository and installing
```
pip install --extra-index-url https://pypi.nvidia.com .
```
NeMo Curator is available in the [NeMo Framework Container](https://registry.ngc.nvidia.com/orgs/ea-bignlp/teams/ga-participants/containers/nemofw-training) which can be applied for [here](https://developer.nvidia.com/nemo-framework). It comes preinstalled in the container.
### NeMo Framework Container

NeMo Curator is available in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo). The NeMo Framework Container provides an end-to-end platform for development of custom generative AI models anywhere. The latest release of NeMo Curator comes preinstalled in the container.

## Usage

@@ -77,7 +79,7 @@ Note: This is not the only way to run NeMo Curator on SLURM. There are example s

## Module Ablation and Compute Performance

The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series
of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation
pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator
@@ -87,7 +89,7 @@ lead to improved model zero-shot downstream task performance.
<img src="./docs/user-guide/images/zeroshot_ablations.png" alt="drawing" width="700"/>
</p>

In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
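This throughput claim can be sanity-checked with quick arithmetic on the quoted figures (1.1 trillion tokens, 1.8 hours, 64 GPUs):

```python
# Back-of-envelope throughput implied by the fuzzy-deduplication figure above.
tokens = 1.1e12          # 1.1 trillion tokens (Red Pajama)
seconds = 1.8 * 3600     # 1.8 hours
gpus = 64                # A100 GPUs

overall_rate = tokens / seconds      # roughly 1.7e8 tokens/s across the cluster
per_gpu_rate = overall_rate / gpus   # roughly 2.7e6 tokens/s per GPU
```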

Additionally, using the CPU-based modules the table below shows the time required and resulting data size reduction of each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):

@@ -126,4 +128,4 @@ Additionally, using the CPU-based modules the table below shows the time require

As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).
At the core of NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-accelerated exact and fuzzy deduplication.
At the core of NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-accelerated exact and fuzzy deduplication.
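An illustrative sketch of the wrapper idea (not the real class): `DocumentDataset` mostly delegates to the underlying dataframe, stood in for here by a plain list of records so the example is dependency-free:

```python
class DocumentDatasetSketch:
    """Toy stand-in for DocumentDataset: a thin wrapper delegating to an
    underlying 'dataframe' (a list of dicts here; Dask/cuDF in NeMo Curator)."""

    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def filter(self, predicate):
        # A real Dask dataframe would distribute this across partitions/nodes.
        return DocumentDatasetSketch([row for row in self.df if predicate(row)])


docs = DocumentDatasetSketch([{"text": "hello"}, {"text": ""}])
nonempty = docs.filter(lambda row: bool(row["text"]))
```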
2 changes: 1 addition & 1 deletion SECURITY.md
@@ -21,4 +21,4 @@ While NVIDIA currently does not have a bug bounty program, we do offer acknowled

## NVIDIA Product Security

For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
4 changes: 2 additions & 2 deletions config/arxiv_builder.yaml
@@ -1,11 +1,11 @@
download_module: nemo_curator.download.arxiv.ArxivDownloader
download_params: {}
iterator_module: nemo_curator.download.arxiv.ArxivIterator
iterator_params:
iterator_params:
log_frequency: 1000
extract_module: nemo_curator.download.arxiv.ArxivExtractor
extract_params: {}
format:
text: str
id: str
source_id: str
source_id: str
2 changes: 1 addition & 1 deletion config/cc_warc_builder.yaml
@@ -9,4 +9,4 @@ format:
language: str
url: str
warc_id: str
source_id: str
source_id: str
2 changes: 1 addition & 1 deletion config/heuristic_filter_code.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter Python code data.
# This particular cascade of filters is intended to filter Python code data.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
# Change this based on the language of the data
18 changes: 9 additions & 9 deletions config/heuristic_filter_en.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter English language data.
# This particular cascade of filters is intended to filter English language data.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
- name: nemo_curator.filters.heuristic_filter.NonAlphaNumericFilter
@@ -14,16 +14,16 @@ filters:
params:
max_number_to_text_ratio: 0.15
- name: nemo_curator.filters.heuristic_filter.UrlsFilter
params:
params:
max_url_to_text_ratio: 0.2
- name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
params:
params:
max_white_space_ratio: 0.25
- name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
params:
params:
max_parentheses_ratio: 0.1
- name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
params:
params:
remove_if_at_top_or_bottom: True
max_boilerplate_string_ratio: 0.4
- name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
@@ -46,18 +46,18 @@ filters:
params:
max_num_sentences_without_endmark_ratio: 0.85
- name: nemo_curator.filters.heuristic_filter.WordsWithoutAlphabetsFilter
params:
params:
min_words_with_alphabets: 0.8
- name: nemo_curator.filters.heuristic_filter.CommonEnglishWordsFilter
params:
min_num_common_words: 2
stop_at_false: True
- name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
params:
max_mean_word_length: 10
max_mean_word_length: 10
min_mean_word_length: 3
- name: nemo_curator.filters.heuristic_filter.LongWordFilter
params:
params:
max_word_length: 1000
- name: nemo_curator.filters.heuristic_filter.EllipsisFilter
params:
@@ -102,4 +102,4 @@ filters:
max_repeating_duplicate_ngram_ratio: 0.10
- name: nemo_curator.filters.heuristic_filter.BulletsFilter
params:
max_bullet_lines_ratio: 0.9
max_bullet_lines_ratio: 0.9
18 changes: 9 additions & 9 deletions config/heuristic_filter_non-en.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
# This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
- name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter
@@ -11,16 +11,16 @@ filters:
params:
max_number_to_text_ratio: 0.15
- name: nemo_curator.filters.heuristic_filter.UrlsFilter
params:
params:
max_url_to_text_ratio: 0.2
- name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
params:
params:
max_white_space_ratio: 0.25
- name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
params:
params:
max_parentheses_ratio: 0.1
- name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
params:
params:
remove_if_at_top_or_bottom: True
max_boilerplate_string_ratio: 0.4
- name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
@@ -39,17 +39,17 @@ filters:
params:
min_words: 50
max_words: 100000
# NOTE: This filter tends to remove many documents and will need to
# NOTE: This filter tends to remove many documents and will need to
# be tuned per language
- name: nemo_curator.filters.heuristic_filter.PunctuationFilter
params:
max_num_sentences_without_endmark_ratio: 0.85
- name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
params:
max_mean_word_length: 10
max_mean_word_length: 10
min_mean_word_length: 3
- name: nemo_curator.filters.heuristic_filter.LongWordFilter
params:
params:
max_word_length: 1000
- name: nemo_curator.filters.heuristic_filter.EllipsisFilter
params:
@@ -94,4 +94,4 @@ filters:
max_repeating_duplicate_ngram_ratio: 0.10
- name: nemo_curator.filters.heuristic_filter.BulletsFilter
params:
max_bullet_lines_ratio: 0.9
max_bullet_lines_ratio: 0.9
2 changes: 1 addition & 1 deletion config/lm_tasks.yaml
@@ -1,6 +1,6 @@
tasks:
# The Python modules below define language model downstream evaluation
# task data. If one of the below tasks is specified, N-grams will
# task data. If one of the below tasks is specified, N-grams will
# be constructed from the documents that make up the task data
# using the script prepare_task_data.
# find_matching_ngrams will then search for these N-grams
2 changes: 1 addition & 1 deletion config/pii_config.yaml
@@ -13,4 +13,4 @@ pii_config:
#type: 'hash'
#hash_type: 'sha256'

#type: 'redact'
#type: 'redact'
2 changes: 1 addition & 1 deletion config/wikipedia_builder.yaml
@@ -12,4 +12,4 @@ format:
id: str
url: str
language: str
source_id: str
source_id: str
15 changes: 15 additions & 0 deletions conftest.py
@@ -0,0 +1,15 @@
import pytest


def pytest_addoption(parser):
parser.addoption(
"--cpu", action="store_true", default=False, help="Run tests without gpu marker"
)


def pytest_collection_modifyitems(config, items):
if config.getoption("--cpu"):
skip_gpu = pytest.mark.skip(reason="Skipping GPU tests")
for item in items:
if "gpu" in item.keywords:
item.add_marker(skip_gpu)
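A hypothetical test module showing how this hook interacts with the `gpu` marker: the first test is skipped when `pytest --cpu` is passed, the second always runs.

```python
import pytest


@pytest.mark.gpu  # skipped under `pytest --cpu` by the conftest hook above
def test_needs_gpu():
    pass


def test_cpu_only():
    # Runs in both GPU and CPU-only sessions.
    assert 1 + 1 == 2
```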
2 changes: 1 addition & 1 deletion docs/user-guide/CPUvsGPU.rst
@@ -95,4 +95,4 @@ Every SLURM cluster is different, so make sure you understand how your SLURM clu
``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.

Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
4 changes: 2 additions & 2 deletions docs/user-guide/DistributedDataClassification.rst
@@ -8,7 +8,7 @@ Background

When preparing text data to be used in training a large language model (LLM), it is useful to classify
text documents in various ways, to enhance the LLM's performance by making it able to produce more
contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
help a user run inference with pre-trained models on large amounts of text documents. We achieve
this by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to
accelerate the classification task in a distributed way. In other words, because the classification of
@@ -68,4 +68,4 @@ The key difference is that it operates on the GPU instead of the CPU.
Therefore, the Dask cluster must be started as a GPU one.
And, ``DomainClassifier`` requires ``DocumentDataset`` to be on the GPU (i.e., have ``backend=cudf``).
It is easy to extend ``DistributedDataClassifier`` to your own model.
Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.