-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added example notebook for translation with ct2 model. #262
base: main
Are you sure you want to change the base?
Added example notebook for translation with ct2 model. #262
Conversation
1b7e864
to
5ecd4fe
Compare
Signed-off-by: Ahmed Umair <[email protected]>
5ecd4fe
to
cef1c34
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a initial review, thanks for working on this tutorial.
"/home/uahmed/Desktop/NeMo-Curator/.ndc/lib/python3.10/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator LinearRegression from version 1.4.2 when using version 1.5.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n", | ||
"https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n", | ||
" warnings.warn(\n", | ||
"GPU: 0, Part: 0: 0%| | 0/10 [05:33<?, ?it/s]1.74s/it]\n", | ||
"GPU: 0, Part: 0: 100%|██████████| 10/10 [00:17<00:00, 1.77s/it]\n" | ||
] | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given its a tutorial, we should show the output that writing it out, maybe try doing both .
"source": [ | ||
"import os\n", | ||
"\n", | ||
"os.environ[\"DASK_DATAFRAME__QUERY_PLANNING\"] = \"False\"\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this, no need to add it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this should be removed.
"metadata": {}, | ||
"source": [ | ||
"# Indic Translation\n", | ||
"This notebook demonstrate an example use of nemo-curator for Indic language generation via translation from English language. This workflow is accelarated by CrossFit, a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would suggest mentioning that we support large scale translation using multi node multi gpu setup
"# Indic Translation\n", | ||
"This notebook demonstrate an example use of nemo-curator for Indic language generation via translation from English language. This workflow is accelarated by CrossFit, a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets. \n", | ||
"This example uses ctransalte2 model from [here](https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip), taken from IndicTrans2 github repo, [here](https://github.com/AI4Bharat/IndicTrans2/tree/main?tab=readme-ov-file#multilingual-translation-models)" | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a clear demarkation of import section here
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Define output directory\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"## Define output directory\n" | |
"##### Define output directory\n" |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a table of contents here.
See https://github.com/NVIDIA/NeMo-Curator/blob/cef1c3408684e9ce96e8125f041378407daf0a7b/tutorials/image-curation/image-curation.ipynb example for input
Signed-off-by: Ahmed Umair <[email protected]>
Hi @VibhuJawa / @sarahyurick , |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can leave a more detailed review for focusing on grammar nits, but for now can you please implement the code changes from my review, then rerun the entire notebook (executing from start to bottom and in order)?
"source": [ | ||
"import os\n", | ||
"\n", | ||
"os.environ[\"DASK_DATAFRAME__QUERY_PLANNING\"] = \"False\"\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this should be removed.
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"@dataclass\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about removing this entire cell and creating a new script in https://github.com/NVIDIA/NeMo-Curator/tree/main/nemo_curator/classifiers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cant really have another classifier just yet, we have to have this as a tutorial as we dont want to support this globally just yet. We can probably create a utils file here to make this cleaner though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok I see, using a utils file sounds good to me.
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"class IndicTranslation(DistributedDataClassifier):\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above, why don't we move all of this logic into https://github.com/NVIDIA/NeMo-Curator/tree/main/nemo_curator/classifiers?
Description
This PR is adding a translation example to hindi languge via ct2 model.
Issue here
This example depends on CrossFit's PR
Checklist