Commit

Add citation
Signed-off-by: Ryan Wolf <[email protected]>
ryantwolf committed Mar 26, 2024
1 parent ea680b0 commit 72230e2
Showing 150 changed files with 7,353 additions and 5,909 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
name: Test Python package
on:
push:
branches:
- main
pull_request:
workflow_dispatch:

# When this workflow is queued, automatically cancel any previous running
# or pending jobs from the same branch
concurrency:
group: test-${{ github.ref }}
cancel-in-progress: true

jobs:
build_and_test:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest]
python-version: ["3.10"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install NeMo-Curator and pytest
# TODO: Remove pytest when optional test dependencies are added to setup.py

# Installing wheel beforehand due to fasttext issue:
# https://github.com/facebookresearch/fastText/issues/512#issuecomment-1837367666
# Explicitly install cython: https://github.com/VKCOM/YouTokenToMe/issues/94
run: |
pip install wheel cython
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com .
pip install pytest
- name: Run tests
# TODO: Remove env variable when gpu dependencies are optional
run: |
RAPIDS_NO_INITIALIZE=1 python -m pytest -v --cpu
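The test step above can be reproduced locally with a small sketch like the following (assuming NeMo Curator and pytest are already installed, as in the workflow):

```python
import os

# Mirror the CI step: RAPIDS_NO_INITIALIZE=1 keeps the GPU-capable
# dependencies from initializing eagerly during a CPU-only test run.
os.environ["RAPIDS_NO_INITIALIZE"] = "1"

# Equivalent to `python -m pytest -v --cpu` in the workflow step above.
PYTEST_ARGS = ["-v", "--cpu"]

if __name__ == "__main__":
    import pytest  # assumed installed, per the "Install NeMo-Curator and pytest" step

    raise SystemExit(pytest.main(PYTEST_ARGS))
```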
47 changes: 47 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,47 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

default_language_version:
python: python3

ci:
autofix_prs: true
autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions'
autoupdate_schedule: quarterly

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: check-added-large-files
args: ['--maxkb=1000']
- id: check-case-conflict
- id: check-yaml
- id: detect-private-key
- id: end-of-file-fixer
- id: requirements-txt-fixer
- id: trailing-whitespace

- repo: https://github.com/psf/black
rev: 24.3.0
hooks:
- id: black
name: Format code

- repo: https://github.com/PyCQA/isort
rev: 5.13.2
hooks:
- id: isort
name: Format imports
exclude: docs/
3 changes: 0 additions & 3 deletions .style.yapf

This file was deleted.

25 changes: 25 additions & 0 deletions CITATION.cff
@@ -0,0 +1,25 @@
cff-version: 1.0.0
message: "If you use this software, please cite it as below."
title: "NeMo-Curator: a toolkit for data curation"
repository-code: https://github.com/NVIDIA/NeMo-Curator
authors:
- family-names: Jennings
given-names: Joseph
- family-names: Patwary
given-names: Mostofa
- family-names: Subramanian
given-names: Sandeep
- family-names: Prabhumoye
given-names: Shrimai
- family-names: Dattagupta
given-names: Ayush
- family-names: Jawa
given-names: Vibhu
- family-names: Liu
given-names: Jiwei
- family-names: Wolf
given-names: Ryan
- family-names: Yurick
given-names: Sarah
- family-names: Singh
given-names: Varun
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -52,7 +52,7 @@ We use ``black`` as our style guide. To fix your format run `pip install pre-com
1. Minimize the use of ``**kwargs``.
1. ``RaiseError`` is preferred to ``assert``. Write: ```if X: raise Error``` instead of ```assert X```.
1. Classes are preferred to standalone methods.
1. Methods should be atomic. A method shouldn't be longer than 75 lines, e.g. can be fit into the computer screen without scrolling.
1. Methods should be atomic. A method shouldn't be longer than 88 lines, e.g. can be fit into the computer screen without scrolling.
1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability.
1. Add ``__init__.py`` for every folder.
1. F-strings are preferred to formatted strings.
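A minimal, hypothetical function illustrating several of the rules above (explicit `raise` instead of `assert`, one argument per line, f-strings):

```python
def words_per_line(
    num_words: int,
    num_lines: int,
) -> float:
    # Explicit raise rather than `assert`, per the guideline above.
    if num_lines <= 0:
        raise ValueError(f"num_lines must be positive, got {num_lines}")
    # An f-string is preferred for any formatted output.
    return num_words / num_lines
```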
12 changes: 7 additions & 5 deletions README.md
@@ -14,7 +14,7 @@ We currently support the following data-curation modules. For more details on ea
- [Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
- Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
- [Quality filtering](docs/user-guide/QualityFiltering.rst)
- Multilingual heuristic-based filtering
- Multilingual heuristic-based filtering
- Classifier-based filtering via [fastText](https://fasttext.cc/)
- [Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
- Both exact and fuzzy deduplication are accelerated using cuDF and Dask.
@@ -43,7 +43,9 @@ NeMo Curator can be installed manually by cloning the repository and installing
```
pip install --extra-index-url https://pypi.nvidia.com .
```
NeMo Curator is available in the [NeMo Framework Container](https://registry.ngc.nvidia.com/orgs/ea-bignlp/teams/ga-participants/containers/nemofw-training) which can be applied for [here](https://developer.nvidia.com/nemo-framework). It comes preinstalled in the container.
### NeMo Framework Container

NeMo Curator is available in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo). The NeMo Framework Container provides an end-to-end platform for development of custom generative AI models anywhere. The latest release of NeMo Curator comes preinstalled in the container.

## Usage

@@ -77,7 +79,7 @@ Note: This is not the only way to run NeMo Curator on SLURM. There are example s

## Module Ablation and Compute Performance

The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series
of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation
pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator
@@ -87,7 +89,7 @@ lead to improved model zero-shot downstream task performance.
<img src="./docs/user-guide/images/zeroshot_ablations.png" alt="drawing" width="700"/>
</p>

In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
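This throughput claim can be sanity-checked with quick arithmetic on the quoted figures (1.1 trillion tokens, 1.8 hours, 64 GPUs):

```python
# Back-of-envelope throughput implied by the fuzzy-deduplication figure above.
tokens = 1.1e12          # 1.1 trillion tokens (Red Pajama)
seconds = 1.8 * 3600     # 1.8 hours
gpus = 64                # A100 GPUs

overall_rate = tokens / seconds      # roughly 1.7e8 tokens/s across the cluster
per_gpu_rate = overall_rate / gpus   # roughly 2.7e6 tokens/s per GPU
```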

Additionally, using the CPU-based modules the table below shows the time required and resulting data size reduction of each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):

@@ -126,4 +128,4 @@ Additionally, using the CPU-based modules the table below shows the time require

As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).
At the core of NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-accelerated exact and fuzzy deduplication.
At the core of NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-accelerated exact and fuzzy deduplication.
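An illustrative sketch of the wrapper idea (not the real class): `DocumentDataset` mostly delegates to the underlying dataframe, stood in for here by a plain list of records so the example is dependency-free:

```python
class DocumentDatasetSketch:
    """Toy stand-in for DocumentDataset: a thin wrapper delegating to an
    underlying 'dataframe' (a list of dicts here; Dask/cuDF in NeMo Curator)."""

    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def filter(self, predicate):
        # A real Dask dataframe would distribute this across partitions/nodes.
        return DocumentDatasetSketch([row for row in self.df if predicate(row)])


docs = DocumentDatasetSketch([{"text": "hello"}, {"text": ""}])
nonempty = docs.filter(lambda row: bool(row["text"]))
```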
2 changes: 1 addition & 1 deletion SECURITY.md
@@ -21,4 +21,4 @@ While NVIDIA currently does not have a bug bounty program, we do offer acknowled

## NVIDIA Product Security

For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
4 changes: 2 additions & 2 deletions config/arxiv_builder.yaml
@@ -1,11 +1,11 @@
download_module: nemo_curator.download.arxiv.ArxivDownloader
download_params: {}
iterator_module: nemo_curator.download.arxiv.ArxivIterator
iterator_params:
iterator_params:
log_frequency: 1000
extract_module: nemo_curator.download.arxiv.ArxivExtractor
extract_params: {}
format:
text: str
id: str
source_id: str
source_id: str
2 changes: 1 addition & 1 deletion config/cc_warc_builder.yaml
@@ -9,4 +9,4 @@ format:
language: str
url: str
warc_id: str
source_id: str
source_id: str
2 changes: 1 addition & 1 deletion config/heuristic_filter_code.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter Python code data.
# This particular cascade of filters is intended to filter Python code data.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
# Change this based on the language of the data
18 changes: 9 additions & 9 deletions config/heuristic_filter_en.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter English language data.
# This particular cascade of filters is intended to filter English language data.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
- name: nemo_curator.filters.heuristic_filter.NonAlphaNumericFilter
@@ -14,16 +14,16 @@ filters:
params:
max_number_to_text_ratio: 0.15
- name: nemo_curator.filters.heuristic_filter.UrlsFilter
params:
params:
max_url_to_text_ratio: 0.2
- name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
params:
params:
max_white_space_ratio: 0.25
- name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
params:
params:
max_parentheses_ratio: 0.1
- name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
params:
params:
remove_if_at_top_or_bottom: True
max_boilerplate_string_ratio: 0.4
- name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
@@ -46,18 +46,18 @@ filters:
params:
max_num_sentences_without_endmark_ratio: 0.85
- name: nemo_curator.filters.heuristic_filter.WordsWithoutAlphabetsFilter
params:
params:
min_words_with_alphabets: 0.8
- name: nemo_curator.filters.heuristic_filter.CommonEnglishWordsFilter
params:
min_num_common_words: 2
stop_at_false: True
- name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
params:
max_mean_word_length: 10
max_mean_word_length: 10
min_mean_word_length: 3
- name: nemo_curator.filters.heuristic_filter.LongWordFilter
params:
params:
max_word_length: 1000
- name: nemo_curator.filters.heuristic_filter.EllipsisFilter
params:
@@ -102,4 +102,4 @@ filters:
max_repeating_duplicate_ngram_ratio: 0.10
- name: nemo_curator.filters.heuristic_filter.BulletsFilter
params:
max_bullet_lines_ratio: 0.9
max_bullet_lines_ratio: 0.9
18 changes: 9 additions & 9 deletions config/heuristic_filter_non-en.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
# This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
- name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter
@@ -11,16 +11,16 @@ filters:
params:
max_number_to_text_ratio: 0.15
- name: nemo_curator.filters.heuristic_filter.UrlsFilter
params:
params:
max_url_to_text_ratio: 0.2
- name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
params:
params:
max_white_space_ratio: 0.25
- name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
params:
params:
max_parentheses_ratio: 0.1
- name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
params:
params:
remove_if_at_top_or_bottom: True
max_boilerplate_string_ratio: 0.4
- name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
@@ -39,17 +39,17 @@ filters:
params:
min_words: 50
max_words: 100000
# NOTE: This filter tends to remove many documents and will need to
# NOTE: This filter tends to remove many documents and will need to
# be tuned per language
- name: nemo_curator.filters.heuristic_filter.PunctuationFilter
params:
max_num_sentences_without_endmark_ratio: 0.85
- name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
params:
max_mean_word_length: 10
max_mean_word_length: 10
min_mean_word_length: 3
- name: nemo_curator.filters.heuristic_filter.LongWordFilter
params:
params:
max_word_length: 1000
- name: nemo_curator.filters.heuristic_filter.EllipsisFilter
params:
@@ -94,4 +94,4 @@ filters:
max_repeating_duplicate_ngram_ratio: 0.10
- name: nemo_curator.filters.heuristic_filter.BulletsFilter
params:
max_bullet_lines_ratio: 0.9
max_bullet_lines_ratio: 0.9
2 changes: 1 addition & 1 deletion config/lm_tasks.yaml
@@ -1,6 +1,6 @@
tasks:
# The Python modules below define language model downstream evaluation
# task data. If one of the below tasks is specified, N-grams will
# task data. If one of the below tasks is specified, N-grams will
# be constructed from the documents that make up the task data
# using the script prepare_task_data.
# find_matching_ngrams will then search for these N-grams
2 changes: 1 addition & 1 deletion config/pii_config.yaml
@@ -13,4 +13,4 @@ pii_config:
#type: 'hash'
#hash_type: 'sha256'

#type: 'redact'
#type: 'redact'
2 changes: 1 addition & 1 deletion config/wikipedia_builder.yaml
@@ -12,4 +12,4 @@ format:
id: str
url: str
language: str
source_id: str
source_id: str
15 changes: 15 additions & 0 deletions conftest.py
@@ -0,0 +1,15 @@
import pytest


def pytest_addoption(parser):
parser.addoption(
"--cpu", action="store_true", default=False, help="Run tests without gpu marker"
)


def pytest_collection_modifyitems(config, items):
if config.getoption("--cpu"):
skip_gpu = pytest.mark.skip(reason="Skipping GPU tests")
for item in items:
if "gpu" in item.keywords:
item.add_marker(skip_gpu)
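A hypothetical test module showing how this hook interacts with the `gpu` marker: the first test is skipped when `pytest --cpu` is passed, the second always runs.

```python
import pytest


@pytest.mark.gpu  # skipped under `pytest --cpu` by the conftest hook above
def test_needs_gpu():
    pass


def test_cpu_only():
    # Runs in both GPU and CPU-only sessions.
    assert 1 + 1 == 2
```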
2 changes: 1 addition & 1 deletion docs/user-guide/CPUvsGPU.rst
@@ -95,4 +95,4 @@ Every SLURM cluster is different, so make sure you understand how your SLURM clu
``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.

Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
4 changes: 2 additions & 2 deletions docs/user-guide/DistributedDataClassification.rst
@@ -8,7 +8,7 @@ Background

When preparing text data to be used in training a large language model (LLM), it is useful to classify
text documents in various ways, to enhance the LLM's performance by making it able to produce more
contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
help a user run inference with pre-trained models on large amounts of text documents. We achieve
this by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to
accelerate the classification task in a distributed way. In other words, because the classification of
@@ -68,4 +68,4 @@ The key difference is that it operates on the GPU instead of the CPU.
Therefore, the Dask cluster must be started as a GPU one.
And, ``DomainClassifier`` requires ``DocumentDataset`` to be on the GPU (i.e., have ``backend=cudf``).
It is easy to extend ``DistributedDataClassifier`` to your own model.
Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.