MemoryError While Preprocessing Tokenizer for Urdu Language #4

HSultankhan opened this issue Oct 4, 2023 · 0 comments
Hello, I want to create a tokenizer for the Urdu language, and I used this command:

(tpu_data) D:>python IndicBERT/tokenization/build_tokenizer.py --input "D:\IndicBERT\ur.txt" --output "D:\IndicBERT\output" --vocab_size 250000

[screenshot]

After this, as per the instructions, I used this command:

(tpu_data) D:>python IndicBERT/process_data/create_mlm_data.py --input_file="D:\IndicBERT\ur.txt" --output_file="D:\IndicBERT\output" --input_file_type=monolingual --tokenizer="D:\IndicBERT\output\config.json"
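I am not sure where exactly `create_mlm_data.py` holds the whole file in memory, but assuming it does, one workaround I am considering is splitting `ur.txt` into shards and processing them one at a time. A minimal stdlib sketch (the `split_corpus` helper and the shard naming are my own, not part of IndicBERT):

```python
import os
import tempfile

def split_corpus(path, lines_per_shard, out_dir):
    """Split a one-sentence-per-line corpus into shards small enough to fit in RAM."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths, out, count, shard_idx = [], None, 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if out is None:
                shard_path = os.path.join(out_dir, f"shard_{shard_idx:05d}.txt")
                out = open(shard_path, "w", encoding="utf-8")
                shard_paths.append(shard_path)
            out.write(line)
            count += 1
            if count >= lines_per_shard:
                out.close()
                out, count, shard_idx = None, 0, shard_idx + 1
    if out is not None:
        out.close()
    return shard_paths

# Demo on a tiny temporary file standing in for ur.txt.
with tempfile.TemporaryDirectory() as d:
    corpus = os.path.join(d, "ur.txt")
    with open(corpus, "w", encoding="utf-8") as f:
        f.write("".join(f"jumla {i}\n" for i in range(5)))
    shards = split_corpus(corpus, lines_per_shard=2, out_dir=os.path.join(d, "shards"))
    num_shards = len(shards)
    with open(shards[0], encoding="utf-8") as f:
        first_shard_text = f.read()
    print(num_shards)  # 3
```

Each shard could then be passed to `create_mlm_data.py` separately, so peak memory is bounded by the shard size rather than the full corpus.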

[screenshots of the MemoryError]

This error occurred multiple times.

This whole pipeline is not using the GPU.
Here are my specs:

Processor: Intel i7-9700K @ 3.6 GHz
RAM: 32 GB
GPU: NVIDIA GTX 1660 Ti (6 GB)

I actually have two questions:

First: how can I resolve this memory error? Is there a way to use the GPU, since this preprocessing does not utilize it, or should I just use Google Colab?

Second: since I only need a tokenizer for the Urdu language, will I have the tokenizer JSON file after the Preprocess Data step?
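For the memory question, another idea is feeding the corpus to the trainer lazily instead of loading it all at once, assuming the underlying trainer accepts an iterator (for example, SentencePiece's `sentence_iterator` parameter does). A minimal sketch; the `iter_sentences` helper is my own, not part of IndicBERT:

```python
import os
import tempfile

def iter_sentences(path, encoding="utf-8"):
    """Yield one non-empty sentence per line without loading the whole corpus into RAM."""
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# Demo on a tiny temporary file standing in for ur.txt.
with tempfile.TemporaryDirectory() as d:
    corpus = os.path.join(d, "ur.txt")
    with open(corpus, "w", encoding="utf-8") as f:
        f.write("pehla jumla\n\ndusra jumla\n")
    sentences = list(iter_sentences(corpus))
    print(sentences)  # ['pehla jumla', 'dusra jumla']
```

Since the file object is iterated line by line, only one line is held in memory at a time, which should keep the 32 GB of RAM from being exhausted by the corpus itself.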
