Skip to content

asphytheghoul/Baraat

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

74 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Project Baraat: Empowering Regional Languages in India ๐Ÿ‡ฎ๐Ÿ‡ณ with AI

View the project on huggingface over here

Project Baraat ๐ŸŽ‰

baraat

Project Baraat is an open-source initiative to leverage the power of LLMs on Indic-NLP tasks. We aim to build Continually pre-trained, Task Specific Language Models in a Mixture of Experts (MoE) setup. We plan on making a multilingual and cross-lingual LLM that is :

  • 1) Pre-trained on a large text corpus containing various sources of knowledge including crawled wikipedia articles, textbooks, news, social media sites, magazines etc.

  • 2) Fine-tuned on different downstream tasks. We first train a 7B LLaMa-2 model on a text corpus in the target language and save it as a base model. We have considered the following tasks as downstream tasks that will be incorporated in the fine-tuning process:

  1. Machine Translation
  2. Mathematical and Logical Reasoning
  3. Question Answering
  4. Instruct Fine-Tuning

Note

This list is subject to change and a few tasks may be added over time.

Model Tutorial Notebook Link
Baraat-hindi-experts Open In Colab

About Project Baraat ๐Ÿ“–

Project Baraat is dedicated to making indigenous (regional) languages more accessible. With a focus on the rich linguistic diversity of India. This project aims to break language barriers and promote inclusivity through technology.

Roadmap ๐ŸŽฏ

image image

Pre-trained Language Models and Datasets

Model Name Description Dataset Link
Baraat-hindi-pretrained Base model pre-trained on a diverse collection of datasets:

โ€ข IndicCorp: A multilingual corpus covering 9 major Indic languages for various NLP tasks.
โ€ข Hindi Wikipedia Articles (172K): A dataset containing 172,000 Hindi Wikipedia articles.
โ€ข Hindi Corpus from Leipzig University: A Hindi corpus provided by the University of Leipzig.
โ€ข Animals: A Visual Encyclopedia: An encyclopedia of general animal sentences.
โ€ข Augmented rows using Bing AI to include worldly knowledge such as fruits, vegetables, animals.
Link
Baraat-kannada-pretrained Base model pre-trained on a diverse collection of datasets:

โ€ข IndicCorp: A multilingual corpus covering 9 major Indic languages for various NLP tasks.
โ€ข Kannada Corpus from Leipzig University: A Kannada corpus provided by the University of Leipzig.
Link

Key Features โœจ

  • Tokenizers for Indian Languages: Robust tokenization tools tailored for the unique structures of regional Indian languages.
  • Fine-tuned Language Models: Leveraging the power of Large Language Models (LLMs) fine-tuned for Indian languages to understand and generate text with high accuracy.
  • Open Source Collaboration: We believe in the collective power of the community to drive innovation and inclusivity. ๐Ÿค
  • High Quality Datasets: Take a look at our suite of cleaned datasets ready for your own downstream training purposes.

Architecture โœ๏ธ

Architecture

Our Vision ๐ŸŒŸ

To promote the spirit of building accessible models in native languages, fostering a world where technology speaks everyone's language. ๐Ÿ—ฃ๏ธ

Roadmap ๐Ÿ›ฃ๏ธ

  • โœ… Prepare and setup dataset
  • โœ… Prepare and setup tokenizers
  • โœ… Start pre-training
  • โœ… Fine-tune models
  • โœ… Implement gating mechanism
  • โœ… Implement MoE
  • โœ… Simple Demo

Foundational model: LLaMa-2 7B

Small Demo of the project

P.S. The project is still in its early stages and this is a Proof of Concept implementation for Hindi.

Baraat.Small.Demo.mp4
  • We can see here that the model is sensitive to the prompts that are being passed to it and this is a feature prevelant in a wide variety of LLMs today. We aim to train our suite of models for a longer period of time with evaluation steps.
  • The project is being worked on actively and is currently undergoing an update. All utility files are provided in the source directory.

Future Scope ๐Ÿ”œ

  • Extending Support for Images and Audio

In the future, we aim to expand Project Baraat's capabilities beyond text to include support for images and audio, enabling multimodal learning techniques.

  • Pipeline for Automated Dataset Cleaning

We plan to develop a pipeline for dataset cleaning, leveraging small models like stabilityai/stablelm-zephyr-3b or microsoft/phi-2 for automated data cleaning processes.

  • Enhanced Reasoning Ability in Fine-Tuning

We intend to introduce an additional step in fine-tuning to enhance the model's reasoning ability, integrating techniques for logical reasoning and inferencin using datasets like meta-math/MetaMathQA or microsoft/orca-math-word-problems-200k. We plan to release translated versions of the datasets to facilitate research in mathematical reasoning and question answering across diverse linguistic communities.

Contribute to Project Baraat ๐Ÿ› ๏ธ

We welcome open-source contributions! Whether you're a coder, a linguist, or just someone passionate about language accessibility, there's a place for you in Project Baraat. Here's how you can get involved:

  1. Star and Fork: Give us a star โญ on GitHub and fork the repository to start contributing.
  2. Issue Tracker: Report bugs or suggest new features by creating an issue.
  3. Pull Requests: Submit your pull requests with new features, bug fixes, or documentation enhancements.

Check out our CONTRIBUTING.md for more detailed guidelines.

Additional Contributions:

  • Sentence Chunking for Enhanced Pretraining:

We partition sentences from datasets into chunks of predetermined maximum word count. This approach allows for the creation of extended sentences, thereby significantly augmenting the efficacy of the continual pretraining process. This can be applied to any dataset to combine sentences and produce a new dataset with more content per row.

  • Token Counting for Diverse Tokenizers and Datasets:

A token counting mechanism has been integrated, capable of quantifying the number of tokens within any given dataset for any given tokenizer. This feature serves as a fundamental tool for analyzing token distributions and comprehending vocabulary dimensions across datasets. We built this by modifying Sayak Paul's count-tokens-hf-datasets project. We no longer require Google Cloud as a component to count tokens, and the entire process can be performed locally.

  • Token Distribution Visualization and Binning:

We also visualize token distributions within individual sentences of datasets. Additionally, a binning process has been implemented to enhance the interpretability of token distribution patterns. These enhancements provide valuable insights into the structural characteristics of textual data, benefiting both researchers and practitioners.

License ๐Ÿ“„

Project Baraat is released under the MIT License.

Show Your Support ๐ŸŒˆ

If you like Project Baraat, please consider starring the repository and sharing it with your network!


Made with โค๏ธ by Team Baraat,
Akash Kamalesh , Anirudh Lakhotia and Tanistha Hota, PES University, Bengaluru.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages