bigcode-project/admin
This repository serves as a place for administrative matters, generic to-dos, and issues. If you're new to BigCode, we recommend reading the section below to see how you can contribute to the project.

What has been done and how can you help?

📚Dataset: We’ve collected 130M+ GitHub repositories and run a license detector on them. We will soon release a large dataset with files from repositories with permissive licenses. Besides GitHub, we would like to add more datasets and create a Code Dataset Catalogue.
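
To give a flavor of the filtering step, here is a deliberately simplified sketch of keeping only permissively licensed repositories. The license list and helper function are hypothetical stand-ins, not our actual pipeline.

```python
# Hypothetical sketch: keep a repository only if every license the
# detector found on it is permissive. Not BigCode's actual pipeline.
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}

def keep_repo(detected_licenses: set[str]) -> bool:
    # Drop repos with no detected license or any non-permissive license.
    return bool(detected_licenses) and detected_licenses <= PERMISSIVE

print(keep_repo({"mit"}))             # True
print(keep_repo({"mit", "gpl-3.0"}))  # False
```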

Open tickets:

We encourage you to join #wg-dataset if you are interested in discussions about data governance (e.g. regarding the ethical and legal concerns of the training data, OpenRAIL licenses for code applications, etc).

🕵🏻‍♀️Evaluation: We started working on an evaluation harness to easily evaluate code generation models on a wide range of tasks.
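
For context, the metric most commonly reported on code generation benchmarks is pass@k: the probability that at least one of k sampled completions passes the unit tests. Below is a minimal sketch of the standard unbiased estimator from the Codex paper (Chen et al., 2021); it is illustrative, not the harness's actual code.

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k).
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # n: samples generated per problem, c: samples that passed the tests.
    if n - c < k:
        return 1.0  # fewer than k failures: every k-subset contains a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=200, c=37, k=1))  # 0.185, i.e. c/n when k = 1
```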

Open tickets:

Please join #wg-evaluation for all discussions on the evaluation of code LLMs.

💪Training: We’ve been training smaller models (350M-1B parameters) on the ServiceNow cluster through a fork of Megatron-LM.

  • We’ve ported ALiBi to support longer sequences at inference time (a minimal sketch follows this list)
  • We’ve implemented multi-query attention to speed up incremental decoding (see the sketch in the inference section below)
  • The goal is to scale to a ~15B parameter model. We will, however, first run several ablation studies on a smaller scale. We will soon release our experiment plan and ask for your feedback!
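
For readers less familiar with ALiBi, here is a minimal, self-contained sketch of the linear bias it adds to the attention logits in place of positional embeddings. The function names and shapes are illustrative, not taken from our Megatron-LM fork.

```python
# A minimal sketch of ALiBi (Attention with Linear Biases).
# Illustrative only -- not BigCode's actual Megatron-LM code.
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric head-specific slopes 2^(-8/n), 2^(-16/n), ..., as in the
    # ALiBi paper (assumes n_heads is a power of two for simplicity).
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope_h * (i - j): each query penalizes keys in
    # proportion to their distance, which is what lets the model run on
    # sequences longer than those seen during training.
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)  # causal distances
    return -alibi_slopes(n_heads)[:, None, None] * distance  # (heads, q, k)

# Usage: the bias is added to the attention logits before the softmax,
# e.g. scores = q @ k.transpose(-1, -2) / head_dim**0.5 + alibi_bias(...)
```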

We encourage you to get in touch with us in #wg-training if you have experience with large-scale transformer training in a multi-node setup.

🏎 Inference: We’ve implemented multi-query attention in Transformers and Megatron-LM. While others have reported up to a 10x decoding speed-up over a multi-head attention baseline, we have so far only seen a more modest ~25% improvement.
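
To make the mechanism concrete, here is a minimal sketch of one incremental decoding step with multi-query attention. The function signature and cache layout are illustrative assumptions, not the Transformers or Megatron-LM implementation.

```python
# A minimal sketch of multi-query attention (MQA) during incremental
# decoding. Illustrative only -- shapes and names are assumptions.
import torch
import torch.nn.functional as F

def mqa_decode_step(q, k_new, v_new, kv_cache):
    # q:        (batch, n_heads, 1, head_dim) -- one query per head
    # k_new:    (batch, 1, 1, head_dim)       -- a SINGLE shared key head
    # v_new:    (batch, 1, 1, head_dim)       -- a SINGLE shared value head
    # kv_cache: (k, v) tensors of shape (batch, 1, seen_len, head_dim)
    k = torch.cat([kv_cache[0], k_new], dim=2)
    v = torch.cat([kv_cache[1], v_new], dim=2)
    # The single K/V head broadcasts across all query heads, so the cache
    # is n_heads times smaller than with multi-head attention.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v  # (batch, n_heads, 1, head_dim)
    return out, (k, v)
```

Incremental decoding is dominated by streaming the KV cache from memory at every step, so shrinking the cache by a factor of n_heads is where the reported speed-ups come from; how much of that is realized in practice depends on the implementation and hardware.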

Open tickets:

Please go to the #wg-inference channel for technical discussions on how to improve the inference speed of LLMs. You can find a summary of all open tickets here.

Overview of our repositories

The BigCode Megatron-LM fork is used for training the BigCode models.

The BigCode analysis repository is a place for all kinds of analysis reports.

The BigCode evaluation harness is developed to evaluate language models for code on several benchmarks.

The BigCode website contains the source of the website.
