Project Baraat: Empowering Regional Languages in India 🇮🇳 with AI

View the project on Hugging Face.

Project Baraat 🎉

Project Baraat is an open-source initiative to leverage the power of LLMs on Indic-NLP tasks. We aim to build continually pre-trained, task-specific language models in a Mixture of Experts (MoE) setup. We plan to build a multilingual and cross-lingual LLM that is:

1) Pre-trained on a large text corpus drawn from various sources of knowledge, including crawled Wikipedia articles, textbooks, news, social media sites, and magazines.
2) Fine-tuned on different downstream tasks. We first train a 7B LLaMa-2 model on a text corpus in the target language and save it as a base model. We have considered the following downstream tasks for the fine-tuning process:

Machine Translation
Mathematical and Logical Reasoning
Question Answering
Instruct Fine-Tuning

Note

This list is subject to change and a few tasks may be added over time.

Model Tutorial	Notebook Link
Baraat-hindi-experts

About Project Baraat 📖

Project Baraat is dedicated to making indigenous (regional) languages more accessible. With a focus on the rich linguistic diversity of India, this project aims to break language barriers and promote inclusivity through technology.

Roadmap 🎯

Pre-trained Language Models and Datasets

Model Name	Description	Dataset Link
Baraat-hindi-pretrained	Base model pre-trained on a diverse collection of datasets: • IndicCorp: A multilingual corpus covering 9 major Indic languages for various NLP tasks. • Hindi Wikipedia Articles (172K): A dataset containing 172,000 Hindi Wikipedia articles. • Hindi Corpus from Leipzig University: A Hindi corpus provided by the University of Leipzig. • Animals: A Visual Encyclopedia: An encyclopedia of general animal sentences. • Augmented rows using Bing AI to include worldly knowledge such as fruits, vegetables, animals.	Link
Baraat-kannada-pretrained	Base model pre-trained on a diverse collection of datasets: • IndicCorp: A multilingual corpus covering 9 major Indic languages for various NLP tasks. • Kannada Corpus from Leipzig University: A Kannada corpus provided by the University of Leipzig.	Link

Key Features ✨

Tokenizers for Indian Languages: Robust tokenization tools tailored for the unique structures of regional Indian languages.
Fine-tuned Language Models: Leveraging the power of Large Language Models (LLMs) fine-tuned for Indian languages to understand and generate text with high accuracy.
Open Source Collaboration: We believe in the collective power of the community to drive innovation and inclusivity. 🤝
High Quality Datasets: Take a look at our suite of cleaned datasets ready for your own downstream training purposes.

Architecture ✏️

Our Vision 🌟

To promote the spirit of building accessible models in native languages, fostering a world where technology speaks everyone's language. 🗣️

Roadmap 🛣️

✅ Prepare and setup dataset
✅ Prepare and setup tokenizers
✅ Start pre-training
✅ Fine-tune models
✅ Implement gating mechanism
✅ Implement MoE
✅ Simple Demo

Foundational model: LLaMa-2 7B

Small Demo of the project

P.S. The project is still in its early stages and this is a Proof of Concept implementation for Hindi.

Baraat.Small.Demo.mp4

We can see here that the model is sensitive to the prompts passed to it, a trait prevalent in a wide variety of LLMs today. We aim to train our suite of models for a longer period of time with evaluation steps.
The project is being worked on actively and is currently undergoing an update. All utility files are provided in the source directory.

Future Scope 🔜

Extending Support for Images and Audio

In the future, we aim to expand Project Baraat's capabilities beyond text to include support for images and audio, enabling multimodal learning techniques.

Pipeline for Automated Dataset Cleaning

We plan to develop a pipeline for dataset cleaning, leveraging small models like stabilityai/stablelm-zephyr-3b or microsoft/phi-2 for automated data cleaning processes.

Enhanced Reasoning Ability in Fine-Tuning

We intend to introduce an additional step in fine-tuning to enhance the model's reasoning ability, integrating techniques for logical reasoning and inference using datasets like meta-math/MetaMathQA or microsoft/orca-math-word-problems-200k. We plan to release translated versions of these datasets to facilitate research in mathematical reasoning and question answering across diverse linguistic communities.

Contribute to Project Baraat 🛠️

We welcome open-source contributions! Whether you're a coder, a linguist, or just someone passionate about language accessibility, there's a place for you in Project Baraat. Here's how you can get involved:

Star and Fork: Give us a star ⭐ on GitHub and fork the repository to start contributing.
Issue Tracker: Report bugs or suggest new features by creating an issue.
Pull Requests: Submit your pull requests with new features, bug fixes, or documentation enhancements.

Check out our CONTRIBUTING.md for more detailed guidelines.

Additional Contributions:

Sentence Chunking for Enhanced Pretraining:

We group sentences from datasets into chunks of a predetermined maximum word count. This yields longer rows, which is intended to make continual pretraining more effective. The approach can be applied to any dataset to combine sentences and produce a new dataset with more content per row.
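The idea can be illustrated with a minimal sketch. This is not the project's actual utility: the function name `chunk_sentences`, the greedy merging strategy, and whitespace-based word counting are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def chunk_sentences(sentences: Iterable[str], max_words: int) -> Iterator[str]:
    """Greedily merge consecutive sentences into rows of at most `max_words` words.

    Hypothetical helper: words are counted by splitting on whitespace, and a
    single sentence longer than `max_words` is kept whole as its own row.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")

    current: List[str] = []  # words collected for the row being built
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # skip empty rows
        # Emit the current row if this sentence would push it over the limit.
        if current and len(current) + len(words) > max_words:
            yield " ".join(current)
            current = []
        current.extend(words)

    if current:
        yield " ".join(current)


if __name__ == "__main__":
    rows = ["a b c", "d e", "f g h i", "j"]
    for row in chunk_sentences(rows, max_words=5):
        print(row)
```

Because the function takes any iterable of strings, the same sketch could be run over one text column of any dataset to build a new dataset with fewer, longer rows.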

Token Counting for Diverse Tokenizers and Datasets:

A token counting mechanism has been integrated, capable of quantifying the number of tokens within any given dataset for any given tokenizer. This feature serves as a fundamental tool for analyzing token distributions and comprehending vocabulary dimensions across datasets. We built this by modifying Sayak Paul's count-tokens-hf-datasets project. We no longer require Google Cloud as a component to count tokens, and the entire process can be performed locally.
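A rough sketch of local token counting is shown below. It is not the project's script: `count_tokens`, `summarize`, and `whitespace_tokenize` are hypothetical names, and the whitespace tokenizer is only a stand-in so the example runs without downloading a model. In practice, any callable that maps a string to a list of tokens could be passed instead, for example the `tokenize` method of a Hugging Face tokenizer.

```python
from typing import Callable, Dict, Iterable, List


def count_tokens(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> List[int]:
    """Return the number of tokens in each text for the given tokenizer callable."""
    return [len(tokenize(text)) for text in texts]


def summarize(counts: List[int]) -> Dict[str, float]:
    """Aggregate per-row token counts into a few dataset-level statistics."""
    total = sum(counts)
    return {
        "rows": len(counts),
        "total_tokens": total,
        "mean_tokens": total / len(counts) if counts else 0.0,
        "max_tokens": max(counts, default=0),
    }


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


if __name__ == "__main__":
    texts = ["नमस्ते दुनिया", "a b c", ""]
    counts = count_tokens(texts, whitespace_tokenize)
    print(counts)
    print(summarize(counts))
```

Keeping the tokenizer as a plain callable means the same counting code can be reused to compare several tokenizers on one dataset, entirely on a local machine.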

Token Distribution Visualization and Binning:

We also visualize token distributions within individual sentences of datasets. Additionally, a binning process has been implemented to enhance the interpretability of token distribution patterns. These enhancements provide valuable insights into the structural characteristics of textual data, benefiting both researchers and practitioners.
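As an illustration of the binning step, here is a small sketch that groups per-sentence token counts into fixed-width bins and prints a text histogram. The function names, the fixed bin width, and the text-based rendering are assumptions for this example; the project's own visualization may use a plotting library instead.

```python
from collections import Counter
from typing import Dict, Iterable


def bin_token_counts(counts: Iterable[int], bin_width: int) -> Dict[str, int]:
    """Group token counts into fixed-width bins labelled "start-end" (inclusive)."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # Map each count to the lower edge of its bin, then tally per bin.
    bins = Counter((count // bin_width) * bin_width for count in counts)
    return {f"{start}-{start + bin_width - 1}": bins[start] for start in sorted(bins)}


def render_histogram(binned: Dict[str, int], width: int = 40) -> str:
    """Render binned counts as a text histogram, scaled so the largest bin spans `width`."""
    peak = max(binned.values(), default=0)
    lines = []
    for label, n in binned.items():
        bar = "#" * (round(width * n / peak) if peak else 0)
        lines.append(f"{label:>9} | {bar} {n}")
    return "\n".join(lines)


if __name__ == "__main__":
    token_counts = [3, 12, 15, 27, 8]  # made-up per-sentence token counts
    print(render_histogram(bin_token_counts(token_counts, bin_width=10)))
```

Wider bins smooth out noise in the distribution, while narrower bins make it easier to spot where most sentences fall relative to a model's context length.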

License 📄

Project Baraat is released under the MIT License.

Show Your Support 🌈

If you like Project Baraat, please consider starring the repository and sharing it with your network!

Made with ❤️ by Team Baraat,
Akash Kamalesh, Anirudh Lakhotia, and Tanistha Hota, PES University, Bengaluru.


asphytheghoul/Baraat
