README.md (+9 −6)
@@ -1,6 +1,6 @@
# OpenLLMs: Less is More for Open-source Models
-OpenLLMs is a series of open-source language models fine-tuned on a small, yet diverse and high-quality dataset of multi-round conversations. Specifically, we utilize only ~6K GPT-4 conversations directly filtered from the ~90K ShareGPT conversations. Despite the small size of the dataset, OpenLLMs has demonstrated remarkable performance.
+OpenLLMs is a series of open-source language models fine-tuned on a small, yet diverse and high-quality dataset of multi-round conversations. Specifically, we utilize only ~6K GPT-4 conversations directly filtered from the ~90K ShareGPT conversations. Despite the small size of the dataset, OpenLLMs has demonstrated remarkable performance. We hope this project offers some insights, such as the importance of data quality and the biases of evaluation benchmarks.
**🔥 80.9% win-rate, rank #1 of open-source models on [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/)**
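
As a rough illustration of the filtering step described in the paragraph above — selecting GPT-4 multi-round conversations out of a larger ShareGPT dump — here is a minimal sketch. It assumes a JSON export where each record carries a `model` field and a list of turns; the field names and file names are assumptions for illustration, not the project's actual schema or pipeline.

```python
import json

# Minimal sketch: keep only conversations marked as GPT-4 and with more than
# one assistant turn, mirroring the "small but high-quality multi-round" idea.
# Field names ("model", "items", "from") are assumptions about the dump format.
def filter_gpt4_conversations(path: str, min_assistant_turns: int = 2):
    with open(path, encoding="utf-8") as f:
        conversations = json.load(f)

    kept = []
    for conv in conversations:
        if conv.get("model") != "GPT-4":
            continue
        assistant_turns = [t for t in conv.get("items", []) if t.get("from") == "gpt"]
        if len(assistant_turns) >= min_assistant_turns:
            kept.append(conv)
    return kept

if __name__ == "__main__":
    # Hypothetical input/output paths for the ~90K dump and the ~6K subset.
    subset = filter_gpt4_conversations("sharegpt_90k.json")
    print(f"kept {len(subset)} conversations")
    with open("sharegpt_gpt4_subset.json", "w", encoding="utf-8") as f:
        json.dump(subset, f, ensure_ascii=False, indent=2)
```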
@@ -39,7 +39,11 @@ OpenLLMs is a series of open-source language models fine-tuned on a small, yet d
## Model Evaluation
-We have evaluated our models using the two most popular evaluation benchmarks, including Vicuna GPT-4 and AlpacaEval benchmarks. The evaluation results are presented in the following figures.
+We have evaluated our models on two of the most popular evaluation benchmarks, Vicuna GPT-4 and AlpacaEval. Additionally, we are evaluating our models on MT-bench, as advised by the lmsys team. The evaluation results are presented in the following figures.
+
+### MT-bench (In progress)
+Please wait for our results :)
+
### Vicuna Evaluation
@@ -53,7 +57,6 @@ Considering that our fine-tuning dataset is produced by GPT-4, we use both GPT-4

-
### AlpacaEval
Here we list the minimal version of AlpacaEval with our released models. The full version of AlpacaEval can be found on this [page](https://tatsu-lab.github.io/alpaca_eval/).
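
For context, AlpacaEval's headline number is a win-rate: the fraction of instructions on which a judge prefers the model's answer over a reference answer. The sketch below shows that arithmetic only; it is not the alpaca_eval implementation, and the judgment labels and tie handling are assumptions.

```python
# Minimal win-rate sketch: given per-instruction judge preferences, compute the
# percentage of comparisons won by the candidate model. Ties count as half a win,
# which is one common convention (an assumption, not necessarily AlpacaEval's).
def win_rate(preferences: list[str]) -> float:
    """preferences: one of "model", "reference", or "tie" per instruction."""
    if not preferences:
        raise ValueError("no judgments provided")
    score = sum(1.0 if p == "model" else 0.5 if p == "tie" else 0.0 for p in preferences)
    return 100.0 * score / len(preferences)

# Example: 7 wins, 2 losses, 1 tie -> 75.0
print(win_rate(["model"] * 7 + ["reference"] * 2 + ["tie"]))
```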
@@ -80,9 +83,9 @@ Here we list the minimal version of AlpacaEval with our released models. The ful
Due to the limitations of Vicuna GPT-4 Evaluation and AlpacaEval, we are also working to evaluate OpenLLMs on a broader set of standard benchmarks.