Commit bc757c3

Update README.md
1 parent 80bd3aa commit bc757c3


README.md

Lines changed: 9 additions & 6 deletions
@@ -1,6 +1,6 @@
 # OpenLLMs: Less is More for Open-source Models

-OpenLLMs is a series of open-source language models fine-tuned on a small, yet diverse and high-quality dataset of multi-round conversations. Specifically, we utilize only ~6K GPT-4 conversations directly filtered from the ~90K ShareGPT conversations. Despite the small size of the dataset, OpenLLMs has demonstrated remarkable performance.
+OpenLLMs is a series of open-source language models fine-tuned on a small, yet diverse and high-quality dataset of multi-round conversations. Specifically, we utilize only ~6K GPT-4 conversations directly filtered from the ~90K ShareGPT conversations. Despite the small size of the dataset, OpenLLMs has demonstrated remarkable performance. We hope this project offers some insights, such as the importance of data quality and the bias of evaluation benchmarks.

 **🔥 80.9% win-rate, rank #1 of open-source models on [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/)**

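The paragraph above describes the dataset-construction step: only ~6K GPT-4 conversations are kept out of ~90K ShareGPT conversations. The commit does not include the filtering code; the sketch below only illustrates the idea, and the JSON layout and `model` field name are assumptions, not the project's actual schema.

```python
import json

def filter_gpt4_conversations(in_path: str, out_path: str) -> None:
    """Keep only the conversations marked as generated by GPT-4.

    Hypothetical sketch of the filtering step described in the README;
    the record layout and the "model" key are assumed, not confirmed.
    """
    with open(in_path, encoding="utf-8") as f:
        records = json.load(f)  # e.g., the ~90K ShareGPT conversations

    gpt4_only = [r for r in records if r.get("model") == "GPT-4"]

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(gpt4_only, f, ensure_ascii=False, indent=2)

filter_gpt4_conversations("sharegpt_90k.json", "sharegpt_gpt4_6k.json")
```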
@@ -39,7 +39,11 @@ OpenLLMs is a series of open-source language models fine-tuned on a small, yet d

 ## Model Evaluation

-We have evaluated our models using the two most popular evaluation benchmarks, including Vicuna GPT-4 and AlpacaEval benchmarks. The evaluation results are presented in the following figures.
+We have evaluated our models using two of the most popular evaluation benchmarks, Vicuna GPT-4 and AlpacaEval. Additionally, we are evaluating our models on MT-bench, as advised by the lmsys team. The evaluation results are presented in the following figures.
+
+### MT-bench (In progress)
+Please wait for our results :)
+

 ### Vicuna Evaluation

@@ -53,7 +57,6 @@ Considering that our fine-tuning dataset is produced by GPT-4, we use both GPT-4

 ![gpt35eval](assets/vicuna_gpt35.svg)

-
 ### AlpacaEval

 Here we list the minimal version of AlpacaEval with our released models. The full version of AlpacaEval can be found on this [page](https://tatsu-lab.github.io/alpaca_eval/).
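For context on the AlpacaEval numbers: win-rate is the share of benchmark instructions on which a judge model prefers the evaluated model's response over a reference model's response. A minimal sketch of that bookkeeping, with the judge stubbed out (this is not AlpacaEval's implementation, and tie handling is omitted):

```python
from typing import Callable, Iterable, Tuple

def win_rate(
    pairs: Iterable[Tuple[str, str]],   # (model_output, reference_output)
    judge: Callable[[str, str], bool],  # True if the model output is preferred
) -> float:
    """Percentage of head-to-head comparisons won by the model."""
    pairs = list(pairs)
    wins = sum(judge(model, ref) for model, ref in pairs)
    return 100.0 * wins / len(pairs)

# An 80.9% win-rate means the judge preferred the model's answer on
# roughly 80.9% of the benchmark instructions.
```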
@@ -80,9 +83,9 @@ Here we list the minimal version of AlpacaEval with our released models. The ful

 Due to the limitations of Vicuna GPT-4 Evaluation and AlpacaEval, we are trying to use extensive standard benchmarks to evaluate the performance of OpenLLMs.

-| Models                                                                          | LLaMA-13B BFloat16 | OpenChat  | OpenChat8192 |
-|---------------------------------------------------------------------------------|--------------------|-----------|--------------|
-| MMLU [(chain-of-thought hub)](https://github.com/FranxYao/chain-of-thought-hub) | 46.66              | **48.53** | 45.16        |
+| Models                                                                          | LLaMA-13B BFloat16 | OpenChat  | OpenChat8192 | gpt-3.5-turbo |
+|---------------------------------------------------------------------------------|--------------------|-----------|--------------|---------------|
+| MMLU [(chain-of-thought hub)](https://github.com/FranxYao/chain-of-thought-hub) | 46.66              | **48.53** | 45.16        | 67.3          |

 ## Installation
