Update get_all_files_paths_under examples to include keep_extensions #450

Open

wants to merge 3 commits into base: main
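Across every touched file the change follows the same pattern: rather than listing every file under a directory and, in some places, filtering by extension afterwards, the examples now pass keep_extensions so the helper returns only the matching files. A minimal before/after sketch (the directory name is just a placeholder taken from the docs):

from nemo_curator.utils.file_utils import get_all_files_paths_under

# Before: list everything under the directory, then filter by hand
files = get_all_files_paths_under("books_dataset/")
files = [f for f in files if f.endswith(".jsonl")]

# After: ask the helper to keep only JSONL files
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")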
16 changes: 8 additions & 8 deletions docs/user-guide/distributeddataclassification.rst
@@ -61,7 +61,7 @@ Let's see how ``DomainClassifier`` works in a small excerpt taken from ``example

from nemo_curator.classifiers import DomainClassifier

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
@@ -83,7 +83,7 @@ Using the ``MultilingualDomainClassifier`` is very similar to using the ``Domain

from nemo_curator.classifiers import MultilingualDomainClassifier

files = get_all_files_paths_under("japanese_books_dataset/")
files = get_all_files_paths_under("japanese_books_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

multilingual_domain_classifier = MultilingualDomainClassifier(
@@ -106,7 +106,7 @@ Here's an example of how to use the ``QualityClassifier``:

from nemo_curator.classifiers import QualityClassifier

files = get_all_files_paths_under("web_documents/")
files = get_all_files_paths_under("web_documents/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

quality_classifier = QualityClassifier(filter_by=["High", "Medium"])
@@ -134,7 +134,7 @@ NeMo Curator provides an easy way to annotate and filter your data using the saf

.. code-block:: python

files = get_all_files_paths_under("unsafe_documents/")
files = get_all_files_paths_under("unsafe_documents/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

token = "hf_1234" # Replace with your user access token
@@ -181,7 +181,7 @@ Here is a small example of how to use the ``InstructionDataGuardClassifier``:

# The model expects instruction-response style text data. For example:
# "Instruction: {instruction}. Input: {input_}. Response: {response}."
files = get_all_files_paths_under("instruction_input_response_dataset/")
files = get_all_files_paths_under("instruction_input_response_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

token = "hf_1234" # Replace with your user access token
@@ -210,7 +210,7 @@ To use the FineWeb Educational Content Classifier, you can follow this example:

from nemo_curator.classifiers import FineWebEduClassifier

files = get_all_files_paths_under("web_documents/")
files = get_all_files_paths_under("web_documents/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

edu_classifier = FineWebEduClassifier(
@@ -247,7 +247,7 @@ Let's see how ``ContentTypeClassifier`` works in a small excerpt taken from ``ex

from nemo_curator.classifiers import ContentTypeClassifier

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

content_type_classifier = ContentTypeClassifier(filter_by=["Blogs", "News"])
@@ -269,7 +269,7 @@ Here's an example of how to use the ``PromptTaskComplexityClassifier``:

from nemo_curator.classifiers import PromptTaskComplexityClassifier

files = get_all_files_paths_under("my_dataset/")
files = get_all_files_paths_under("my_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

classifier = PromptTaskComplexityClassifier()
4 changes: 2 additions & 2 deletions docs/user-guide/documentdataset.rst
@@ -43,7 +43,7 @@ You could read, filter the dataset, and write it using the following methods
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.filters import WordCountFilter

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
books = DocumentDataset.read_json(files, add_filename=True)

filter_step = nc.ScoreFilter(
@@ -58,7 +58,7 @@ You could read, filter the dataset, and write it using the following methods

Let's walk through this code line by line.

* ``files = get_all_files_paths_under("books_dataset/")`` This retrieves a list of all files in the given directory.
* ``files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")`` This retrieves a list of all files in the given directory, then filters the list to include only files ending with ".jsonl".
In our case, this is equivalent to writing

.. code-block:: python
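The code block referenced above is collapsed in this view, so it is not reproduced here. As an editor's sketch only (assuming a recursive walk and standard-library calls, not the documentation's own snippet), the filtered listing behaves roughly like:

import os

files = []
for root, _dirs, names in os.walk("books_dataset/"):
    for name in names:
        # Keep only files ending with ".jsonl", mirroring keep_extensions="jsonl"
        if name.endswith(".jsonl"):
            files.append(os.path.join(root, name))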
2 changes: 1 addition & 1 deletion docs/user-guide/qualityfiltering.rst
@@ -35,7 +35,7 @@ Let's examine this small example:
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.filters import WordCountFilter

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
books = DocumentDataset.read_json(files, add_filename=True)

filter_step = nc.ScoreFilter(
2 changes: 1 addition & 1 deletion docs/user-guide/sparkother.rst
@@ -91,4 +91,4 @@ The following code snippet demonstrates how to read output from a Spark DataFram
stories_dataset = DocumentDataset.read_parquet(processed_files, backend="pandas")

It is worth noting that Spark typically tends to create checksum and other marker files which can vary by Spark distribution,
so it is advisable to ignore them when reading data into a NeMo Curator ``DocumentDataset``.
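One way to follow that advice, sketched here with a hypothetical Spark output directory, is to rely on the same keep_extensions argument this PR adds to the examples, so checksum and _SUCCESS marker files never make it into the file list:

from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under

# Spark writes bookkeeping files (e.g. _SUCCESS, *.crc) next to its output;
# keeping only the .parquet part files sidesteps them.
processed_files = get_all_files_paths_under(
    "spark_output_dir/", keep_extensions="parquet"
)
stories_dataset = DocumentDataset.read_parquet(processed_files, backend="pandas")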
2 changes: 1 addition & 1 deletion docs/user-guide/taskdecontamination.rst
@@ -28,7 +28,7 @@ Let's examine this small example:
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.tasks import Winogrande, Squad, TriviaQA

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
books = DocumentDataset.read_json(files, add_filename=True)

downstream_tasks = [
2 changes: 1 addition & 1 deletion examples/classifier_filtering.py
@@ -27,7 +27,7 @@


def load_dataset(input_data_dir):
files = list(get_all_files_paths_under(input_data_dir))
files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
dataset = DocumentDataset(raw_data)

3 changes: 1 addition & 2 deletions examples/exact_deduplication.py
@@ -17,8 +17,7 @@

from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
from nemo_curator.utils.distributed_utils import get_client, read_data, write_to_disk
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.utils.distributed_utils import get_client, write_to_disk
from nemo_curator.utils.script_utils import ArgumentHelper


2 changes: 1 addition & 1 deletion examples/identify_languages_and_fix_unicode.py
@@ -28,7 +28,7 @@


def load_dataset(input_data_dir):
files = list(get_all_files_paths_under(input_data_dir))
files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
dataset = DocumentDataset(raw_data)

2 changes: 1 addition & 1 deletion examples/task_decontamination.py
@@ -44,7 +44,7 @@


def load_dataset(input_data_dir):
files = list(get_all_files_paths_under(input_data_dir))
files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
dataset = DocumentDataset(raw_data)

5 changes: 3 additions & 2 deletions nemo_curator/scripts/find_exact_duplicates.py
@@ -55,8 +55,9 @@ def main(args):
if num_files is not None and num_files <= 0:
logger.info(f"Processed {num_files}... quitting")
break
files = get_all_files_paths_under(root=data_path, recurse_subdirectories=False)
files = [f for f in files if f.endswith(".jsonl")]
files = get_all_files_paths_under(
root=data_path, recurse_subdirectories=False, keep_extensions="jsonl"
)
df = read_data(
files[:num_files] if num_files else files,
file_type="jsonl",
5 changes: 3 additions & 2 deletions nemo_curator/scripts/fuzzy_deduplication/compute_minhashes.py
@@ -70,8 +70,9 @@ def main(args):
print(f"Processed {args.num_files}... quitting")
break

files = get_all_files_paths_under(root=data_path, recurse_subdirectories=False)
files = [f for f in files if f.endswith(".jsonl")]
files = get_all_files_paths_under(
root=data_path, recurse_subdirectories=False, keep_extensions="jsonl"
)
df = read_data(
files[:num_files] if num_files else files,
file_type="jsonl",
4 changes: 3 additions & 1 deletion nemo_curator/scripts/prepare_fasttext_training_data.py
@@ -32,7 +32,9 @@ def sample_rows(df, n, seed):
def main(args):
client = get_client(**ArgumentHelper.parse_client_args(args))
# Get local path
files = list(get_all_files_paths_under(args.input_data_dir))
files = list(
get_all_files_paths_under(args.input_data_dir, keep_extensions="jsonl")
)
raw_data = read_data(files, file_type="jsonl", backend="pandas")
dataset = DocumentDataset(raw_data)
text_field = args.input_json_field
2 changes: 1 addition & 1 deletion nemo_curator/utils/file_utils.py
@@ -446,7 +446,7 @@ def reshard_jsonl(
# Output file size in bytes
blocksize = parse_str_of_num_bytes(output_file_size)

input_files = list(get_all_files_paths_under(input_dir))
input_files = list(get_all_files_paths_under(input_dir, keep_extensions="jsonl"))

# Read in the dask bag
b = db.read_text(input_files, blocksize=blocksize)
1 change: 0 additions & 1 deletion tests/test_read_data.py
@@ -9,7 +9,6 @@
read_data_blocksize,
read_data_files_per_partition,
)
from nemo_curator.utils.file_utils import get_all_files_paths_under

NUM_FILES = 5
NUM_RECORDS = 100
14 changes: 1 addition & 13 deletions tutorials/dapt-curation/code/utils.py
@@ -12,13 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
import re

import dask.dataframe as dd
import pandas as pd
import yaml

from nemo_curator import (
ExactDuplicates,
@@ -33,7 +27,6 @@
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
DocumentFilter,
RepeatedLinesFilter,
RepeatedParagraphsFilter,
RepeatingTopNGramsFilter,
UrlsFilter,
@@ -46,12 +39,7 @@
from nemo_curator.modifiers import DocumentModifier
from nemo_curator.modifiers.pii_modifier import PiiModifier
from nemo_curator.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.pii.constants import DEFAULT_LANGUAGE, DEFAULT_MAX_DOC_SIZE
from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.utils.file_utils import (
expand_outdir_and_mkdir,
get_all_files_paths_under,
)
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir


class QuotationUnifier(DocumentModifier):
@@ -121,31 +121,19 @@
"source": [
"import os\n",
"import time\n",
"from dask.distributed import Client\n",
"import warnings\n",
"import dask.dataframe as dd\n",
"import dask_cudf\n",
"import cudf\n",
"import gzip\n",
"import json\n",
"import dask.bag as db\n",
"import glob\n",
"from dask.distributed import wait\n",
"import numpy as np\n",
"\n",
"from nemo_curator import get_client\n",
"from nemo_curator.datasets import DocumentDataset\n",
"from nemo_curator.utils.distributed_utils import (\n",
" get_num_workers,\n",
" read_data,\n",
" write_to_disk,\n",
")\n",
"from nemo_curator.utils.file_utils import (\n",
" expand_outdir_and_mkdir, \n",
" get_all_files_paths_under, \n",
" separate_by_metadata,\n",
" get_batched_files,\n",
")\n",
"\n",
"warnings.filterwarnings('ignore')\n",
"base_dir = \"/path/to/data\""
@@ -1473,8 +1461,9 @@
}
],
"source": [
"files = get_all_files_paths_under(root=input_data_dir, recurse_subdirectories=False)\n",
"files = [f for f in files if f.endswith(\".jsonl\")]\n",
"files = get_all_files_paths_under(\n",
" root=input_data_dir, recurse_subdirectories=False, keep_extensions=\"jsonl\"\n",
")\n",
"df = read_data(\n",
" files,\n",
" file_type=\"jsonl\",\n",
7 changes: 3 additions & 4 deletions tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
@@ -122,15 +122,13 @@
},
"outputs": [],
"source": [
"import argparse\n",
"import os\n",
"\n",
"from nemo_curator.utils.distributed_utils import get_client,get_num_workers\n",
"from nemo_curator.utils.file_utils import get_all_files_paths_under, separate_by_metadata\n",
"from nemo_curator.utils.distributed_utils import read_data,write_to_disk\n",
"from nemo_curator.datasets import DocumentDataset\n",
"\n",
"import sys\n",
"import pandas as pd\n",
"import time\n",
"import cudf\n",
@@ -1142,8 +1140,9 @@
"print(f\"Computing minhashes for {minhash_data_path}\")\n",
"\n",
"# Load data. Only the [minhash_id_field, text_field] columns are needed\n",
"files = get_all_files_paths_under(root=minhash_data_path, recurse_subdirectories=False)\n",
"files = [f for f in files if f.endswith(\".jsonl\")]\n",
"files = get_all_files_paths_under(\n",
" root=minhash_data_path, recurse_subdirectories=False, keep_extensions=\"jsonl\"\n",
")\n",
"df = read_data(\n",
" files,\n",
" file_type=\"jsonl\",\n",
6 changes: 3 additions & 3 deletions tutorials/tinystories/main.py
@@ -176,9 +176,9 @@ def run_curation_pipeline(args: Any, jsonl_dir: str) -> None:
client = get_client(**ArgumentHelper.parse_client_args(args))
print(f"Running curation pipeline on '{jsonl_dir}'...")
files = [
fp
for fp in get_all_files_paths_under(jsonl_dir, recurse_subdirectories=False)
if fp.endswith(".jsonl")
]
files = get_all_files_paths_under(
jsonl_dir, recurse_subdirectories=False, keep_extensions="jsonl"
)
print("Reading the data...")
orig_dataset = DocumentDataset.read_json(files, add_filename=True)
3 changes: 1 addition & 2 deletions tutorials/zyda2-tutorial/1_fuzzy_dedup/0_minhash.py
@@ -13,8 +13,7 @@


def read_folder(input_folder, columns=["nemo_id", "text"]):
data_paths = get_all_files_paths_under(input_folder)
data_paths = [f for f in data_paths if f.endswith(".parquet")]
data_paths = get_all_files_paths_under(input_folder, keep_extensions="parquet")
data_paths.sort()
logging.info(f"Number of files being read: {len(data_paths)}")
text_ddf = dask_cudf.read_parquet(