This repository serves as a place for administrative matters as well as for generic todos and issues. If you’re new to BigCode, we recommend reading the sections below to see how you can contribute to the project.
📚Dataset: We’ve collected 130M+ GitHub repositories and ran a license detector on them. We will soon release a large dataset of files from repositories with permissive licenses. Beyond GitHub, we would also like to add more datasets and create a Code Dataset Catalogue.
Open tickets:
- Suggest datasets for the Code Dataset Catalogue
- Redact Personally Identifiable Information (PII) from code datasets (see the sketch after this list)
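As a rough illustration of what PII redaction can look like, here is a minimal sketch that masks e-mail addresses and IPv4 addresses with placeholder tokens. The regexes and placeholder names are hypothetical and far simpler than a production pipeline would need; the actual BigCode redaction approach is what the ticket above is about.

```python
import re

# Hypothetical, illustrative patterns -- not the BigCode PII pipeline.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(source: str) -> str:
    """Replace e-mail addresses and IPv4 addresses with placeholder tokens."""
    source = EMAIL_RE.sub("<EMAIL>", source)
    source = IPV4_RE.sub("<IP_ADDRESS>", source)
    return source

print(redact_pii("# contact: jane.doe@example.com, host 192.168.0.1"))
# -> "# contact: <EMAIL>, host <IP_ADDRESS>"
```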
We encourage you to join #wg-dataset if you are interested in discussions about data governance (e.g. the ethical and legal concerns around the training data, or OpenRAIL licenses for code applications).
🕵🏻‍♀️Evaluation: We’ve started working on an evaluation harness to easily evaluate code generation models on a wide range of tasks (a sketch of the standard pass@k metric follows the tickets below).
Open tickets:
- Suggest tasks for the Evaluation Harness
- Add selected tasks to the Evaluation Harness
- Design prompts for few-shot evaluation tasks
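For context, most code-generation benchmarks report pass@k. Below is a minimal sketch of the standard unbiased estimator from the Codex paper (Chen et al., 2021): given n generated samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k randomly drawn samples passes. This is illustrative and not necessarily the exact code used in the harness.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), computed stably as a product."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 of them pass the tests, report pass@10.
print(pass_at_k(n=200, c=37, k=10))
```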
Please join #wg-evaluation for all discussions on the evaluation of code LLMs.
💪Training: We’ve been training smaller models (350M-1B parameters) on the ServiceNow cluster through a fork of Megatron-LM.
- We’ve ported ALiBi to support longer sequences at inference time (see the sketch after this list)
- We’ve implemented multi-query attention to speed up incremental decoding
- The goal is to scale to a ~15B parameter model. We will, however, first run several ablation studies on a smaller scale. We will soon release our experiment plan and ask for your feedback!
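To make the ALiBi item above concrete, here is a minimal sketch of how the linear attention biases can be built (not the Megatron-LM fork’s implementation): each head gets a slope from a geometric sequence, and the bias grows linearly with the query–key distance, which is what lets the model extrapolate to longer sequences at inference time. The slope formula is exact for power-of-two head counts.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence 2^(-8/n), 2^(-16/n), ... from the ALiBi paper
    # (exact for power-of-two head counts).
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Bias of shape (num_heads, seq_len, seq_len) added to attention scores
    # before the softmax; the upper triangle is removed by the causal mask.
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]  # key index minus query index
    return alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]

print(alibi_bias(num_heads=8, seq_len=4)[0])  # bias for the first head
```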
We encourage you to get in touch with us in #wg-training if you have experience with large-scale transformer training in a multi-node setup.
🏎 Inference: We’ve implemented multi-query attention in Transformers and Megatron-LM. While others have reported up to a 10x decoding speed-up over a multi-head attention baseline, we’ve so far only seen modest improvements of ~25% (a short sketch of the mechanism follows at the end of this section).
Please go to the #wg-inference channel for technical discussions on improving the inference speed of code LLMs. You can find a summary of all open tickets here.
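To illustrate why multi-query attention helps incremental decoding, here is a minimal sketch (hypothetical shapes, not the Transformers or Megatron-LM implementation): all query heads share a single key/value head, so the KV cache that must be read at every decoding step is num_heads times smaller than in multi-head attention.

```python
import torch
import torch.nn.functional as F

def multi_query_attention(q, k, v):
    # q: (batch, num_heads, q_len, head_dim); k, v: (batch, 1, kv_len, head_dim).
    # The single key/value head is broadcast across all query heads.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (batch, num_heads, q_len, kv_len)
    return F.softmax(scores, dim=-1) @ v                   # (batch, num_heads, q_len, head_dim)

batch, num_heads, head_dim, kv_len = 1, 16, 64, 128
q = torch.randn(batch, num_heads, 1, head_dim)  # one new token during incremental decoding
k = torch.randn(batch, 1, kv_len, head_dim)     # cached keys: 1 head instead of 16
v = torch.randn(batch, 1, kv_len, head_dim)     # cached values: 1 head instead of 16
print(multi_query_attention(q, k, v).shape)     # torch.Size([1, 16, 1, 64])
```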
- The Megatron-LM fork contains the code used for training the BigCode models.
- The BigCode analysis repository is a place for all kinds of analysis reports.
- The BigCode evaluation harness is being developed to evaluate language models for code on several benchmarks.
- The BigCode website repository contains the source of the website.