Hello, I want to create a tokenizer for the Urdu language, and I used this command:
(tpu_data) D:>python IndicBERT/tokenization/build_tokenizer.py --input "D:\IndicBERT\ur.txt" --output "D:\IndicBERT\output" --vocab_size 250000
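(For context, what I am ultimately after is just a subword vocabulary for Urdu. Outside the repo's scripts, the same idea could be sketched with the sentencepiece library; the paths, vocab size, and model type below are placeholders, not what build_tokenizer.py actually does.)

```python
# Illustrative only: train a subword vocabulary on the same corpus with the
# sentencepiece library. This is NOT IndicBERT's build_tokenizer.py; the
# paths, vocab_size, and model_type here are assumptions for the sketch.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="D:/IndicBERT/ur.txt",                # raw Urdu corpus, one sentence per line
    model_prefix="D:/IndicBERT/output/ur_sp",   # writes ur_sp.model and ur_sp.vocab
    vocab_size=32000,                           # a single-language vocab rarely needs 250k entries
    model_type="unigram",                       # BPE would work as well
    character_coverage=1.0,                     # keep all Urdu characters
)
```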
After this, as per the instructions, I used this command:
(tpu_data) D:>python IndicBERT/process_data/create_mlm_data.py --input_file="D:\IndicBERT\ur.txt" --output_file="D:\IndicBERT\output" --input_file_type=monolingual --tokenizer="D:\IndicBERT\output\config.json"
This memory error occurred multiple times, and the whole pipeline is not using the GPU. Here are my specs:
Processor: Intel i7-9700K @ 3.6 GHz
RAM: 32 GB
GPU: Nvidia GTX 1660 Ti (6 GB)
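One workaround I am considering for the memory error is to split ur.txt into smaller shards and run create_mlm_data.py once per shard. A rough sketch of the splitting step (the shard size, directory, and file names are arbitrary choices, not anything the repo prescribes):

```python
# Hypothetical helper: split a large corpus into smaller shards so the
# preprocessing script can be run on each shard separately. Shard size and
# output names are arbitrary choices, not part of the IndicBERT repo.
import os

def shard_corpus(src="D:/IndicBERT/ur.txt",
                 out_dir="D:/IndicBERT/shards",
                 lines_per_shard=500_000):
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, buf = 0, []
    with open(src, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) >= lines_per_shard:
                with open(f"{out_dir}/ur_{shard_idx:03d}.txt", "w", encoding="utf-8") as out:
                    out.writelines(buf)
                shard_idx, buf = shard_idx + 1, []
    if buf:  # write the last, partially filled shard
        with open(f"{out_dir}/ur_{shard_idx:03d}.txt", "w", encoding="utf-8") as out:
            out.writelines(buf)

if __name__ == "__main__":
    shard_corpus()
```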
I actually have two questions:
How can I resolve this memory error? Is there a way to use the GPU, since this preprocessing is not utilizing it, or should I use Google Colab instead?
Secondly, since I only require a tokenizer for the Urdu language, will I have the tokenizer JSON file after the "Preprocess Data" step?
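To be clear about what I mean by "tokenizer JSON file": I would like to end up with something I can load and test roughly like this (a sketch that assumes D:\IndicBERT\output becomes loadable as a HuggingFace-compatible tokenizer directory, which is exactly what I am asking about):

```python
# Sketch of how I hope to use the finished tokenizer. It assumes the output
# directory can be loaded as a HuggingFace tokenizer directory -- which is
# precisely what my second question is about.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("D:/IndicBERT/output")
print(tokenizer.tokenize("یہ ایک مثال ہے"))  # "This is an example" in Urdu
```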