MemoryError While Preprocessing Tokenizer for Urdu Language #4

HSultankhan opened this issue Oct 4, 2023 · 0 comments
Hello, I want to create a tokenizer for the Urdu language, and I used this command:

(tpu_data) D:>python IndicBERT/tokenization/build_tokenizer.py --input "D:\IndicBERT\ur.txt" --output "D:\IndicBERT\output" --vocab_size 250000

[screenshot]

After this, as per the instructions, I used this command:

(tpu_data) D:>python IndicBERT/process_data/create_mlm_data.py --input_file="D:\IndicBERT\ur.txt" --output_file="D:\IndicBERT\output" --input_file_type=monolingual --tokenizer="D:\IndicBERT\output\config.json"
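I am not sure where exactly `create_mlm_data.py` holds the whole file in memory, but assuming it does, one workaround I am considering is splitting `ur.txt` into shards and processing them one at a time. A minimal stdlib sketch (the `split_corpus` helper and the shard naming are my own, not part of IndicBERT):

```python
import os
import tempfile

def split_corpus(path, lines_per_shard, out_dir):
    """Split a one-sentence-per-line corpus into shards small enough to fit in RAM."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths, out, count, shard_idx = [], None, 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if out is None:
                shard_path = os.path.join(out_dir, f"shard_{shard_idx:05d}.txt")
                out = open(shard_path, "w", encoding="utf-8")
                shard_paths.append(shard_path)
            out.write(line)
            count += 1
            if count >= lines_per_shard:
                out.close()
                out, count, shard_idx = None, 0, shard_idx + 1
    if out is not None:
        out.close()
    return shard_paths

# Demo on a tiny temporary file standing in for ur.txt.
with tempfile.TemporaryDirectory() as d:
    corpus = os.path.join(d, "ur.txt")
    with open(corpus, "w", encoding="utf-8") as f:
        f.write("".join(f"jumla {i}\n" for i in range(5)))
    shards = split_corpus(corpus, lines_per_shard=2, out_dir=os.path.join(d, "shards"))
    num_shards = len(shards)
    with open(shards[0], encoding="utf-8") as f:
        first_shard_text = f.read()
    print(num_shards)  # 3
```

Each shard could then be passed to `create_mlm_data.py` separately, so peak memory is bounded by the shard size rather than the full corpus.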

[screenshots of the MemoryError]

This error occurred multiple times.

This whole pipeline is not using the GPU.
Here are my specs:

Processor: Intel i7-9700K @ 3.6 GHz
RAM: 32 GB
GPU: NVIDIA GTX 1660 Ti (6 GB)

I actually have two questions:

First: how can I resolve this memory error? Is there a way to use the GPU, since this preprocessing does not utilize it, or should I just use Google Colab?

Second: since I only need a tokenizer for the Urdu language, will I have the tokenizer JSON file after the Preprocess Data step?
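For the memory question, another idea is feeding the corpus to the trainer lazily instead of loading it all at once, assuming the underlying trainer accepts an iterator (for example, SentencePiece's `sentence_iterator` parameter does). A minimal sketch; the `iter_sentences` helper is my own, not part of IndicBERT:

```python
import os
import tempfile

def iter_sentences(path, encoding="utf-8"):
    """Yield one non-empty sentence per line without loading the whole corpus into RAM."""
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# Demo on a tiny temporary file standing in for ur.txt.
with tempfile.TemporaryDirectory() as d:
    corpus = os.path.join(d, "ur.txt")
    with open(corpus, "w", encoding="utf-8") as f:
        f.write("pehla jumla\n\ndusra jumla\n")
    sentences = list(iter_sentences(corpus))
    print(sentences)  # ['pehla jumla', 'dusra jumla']
```

Since the file object is iterated line by line, only one line is held in memory at a time, which should keep the 32 GB of RAM from being exhausted by the corpus itself.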
