index.html (9 additions, 11 deletions)
@@ -143,27 +143,25 @@ <h2>Automatic Open-style Question Filtering and Generation:</h2>
     <li><b>Stage 2: Confidence Score Assignment:</b> Convertible questions are then assigned a confidence score (1-10) indicating the likelihood of being answered in an open-style format. Questions below a certain threshold are excluded, ensuring only the most suitable questions are selected.</li>
   </ul>
 </li>
 <li><b>Open-style Question Answer Evaluation:</b> We designed customized prompts to evaluate the correctness of LLM responses to open-style questions. The evaluation involves:
   <ul>
     <li>Using the correct MCQ answer as the ground truth.</li>
-    <li>Ensuring responses do not contain conflicting statements and accommodating minor differences in phrasing.</li>
   </ul>
 </li>
 <li><b>Validation of Automatic Evaluation Strategy:</b> To validate our approach, we manually checked a random sample of 100 results from the automatic evaluation, confirming an error rate of less than 3%.</li>
-<li><b>Comprehensive Analysis and Ranking:</b> We conducted a thorough assessment of five well-recognized LLMs (including GPT-4, ChatGPT, Claude-3 Opus, Gemini-Pro, and Mistral-Large) using our benchmark. The results indicate a high correlation between rankings from our open-style benchmark and those derived from human assessments, with GPT-4 emerging as the strongest LLM.</li>
+<li><b>Comprehensive Analysis and Ranking:</b> We conducted a thorough assessment of well-recognized LLMs (including GPT-4o, GPT-4, GPT-3.5, Claude-3 Opus, Gemini-Pro, and Mistral-Large) using our benchmark. The overall performance of GPT-4o demonstrates its leading edge, with an accuracy of 70.15%, which
+indicates its robustness in open-style question answering tasks compared to other models. It is followed
+by GPT-4-1106-preview with 65.93% and Claude-3 Opus with 62.68%. These results highlight the
+advanced capabilities of the GPT-4 series. Mid-tier models like Mistral Large and GPT-3.5 perform
+well but are not on par with the top performers. On the other hand, models like Gemini 1.0 Pro and
+Llama3-70b-Instruct lag behind in their ability to answer open-style questions.
+The performance evaluation of smaller-scale LLMs reveals that Qwen1.5 leads with an overall.</li>
 </ul>
 </div>

 </div>

-<!-- <div class="key-findings">
-  <h2>Results and Analysis:</h2>
-  <ol>
-    <li><b>Benchmark Performance:</b> The Open-LLM-Benchmark evaluated several LLMs, including GPT-4, ChatGPT, Claude-3 Opus, and Gemini-Pro. Our analysis shows that GPT-4 remains the strongest LLM, despite claims by other leaderboards about Claude-3 Opus.</li>
-    <li><b>Leaderboard Insights:</b> On average, the accuracy of LLMs on open-style questions is about 25% lower than on MCQs, highlighting the difficulty of open-style question answering but providing a more genuine evaluation of LLM capabilities.</li>
-    <li><b>Dataset Diversity:</b> Our benchmark includes questions from diverse datasets like MMLU, ARC, and MedMCQA, ensuring a comprehensive evaluation across various domains.</li>
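The Stage 2 step described in the list above (assign a 1-10 confidence score, drop questions below a threshold) can be sketched as follows. This is a minimal illustration under assumptions, not code from this repository: the prompt wording and the threshold value of 7 are invented for the example, and `llm` stands for any callable that sends a prompt to a model and returns its reply as text.

```python
# Minimal sketch of Stage 2 confidence scoring and threshold filtering.
# The prompt text and the threshold of 7 are assumptions for illustration.

def confidence_score(llm, question: str) -> int:
    """Ask an LLM to rate (1-10) how answerable the question is without choices."""
    prompt = (
        "Rate from 1 to 10 how likely the following question can be answered "
        "correctly in an open-style format, without seeing any answer choices.\n"
        f"Question: {question}\n"
        "Reply with a single integer."
    )
    return int(llm(prompt).strip())

def filter_convertible(llm, questions: list[str], threshold: int = 7) -> list[str]:
    """Keep only questions whose confidence score meets the (assumed) threshold."""
    return [q for q in questions if confidence_score(llm, q) >= threshold]
```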
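The open-style answer evaluation above (a customized judge prompt, the correct MCQ answer as ground truth, tolerance for minor phrasing differences) might look roughly like this. The prompt wording and function name are illustrative assumptions; only the grading criteria come from the text, and `judge_llm` is again any prompt-to-text callable.

```python
# Sketch of judging an open-style response against the MCQ ground truth.
# Prompt wording is an assumption; the criteria are taken from the text above.

def judge_open_answer(judge_llm, question: str, ground_truth: str, response: str) -> bool:
    """Return True if the judge deems the open-style response correct."""
    prompt = (
        "You are grading an open-style answer.\n"
        f"Question: {question}\n"
        f"Ground-truth answer (from the original MCQ): {ground_truth}\n"
        f"Model response: {response}\n"
        "Reply 'correct' if the response conveys the same answer, allowing minor "
        "differences in phrasing and containing no conflicting statements; "
        "otherwise reply 'incorrect'."
    )
    return judge_llm(prompt).strip().lower().startswith("correct")
```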
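The validation step reduces to a small agreement computation between the automatic judge and a manual spot check. A sketch under the assumption that automatic results are keyed by question ID; the 100-item sample size matches the text, and a disagreement rate below 0.03 corresponds to the reported sub-3% error rate.

```python
# Sketch of validating the automatic evaluation against a manual spot check.
# Assumes automatic judgments are stored as {question_id: is_correct};
# everything except the 100-item sample size is illustrative.
import random

def sample_for_manual_check(auto_results: dict[str, bool], sample_size: int = 100) -> list[str]:
    """Pick a random subset of automatically graded items to review by hand."""
    return random.sample(list(auto_results), k=min(sample_size, len(auto_results)))

def disagreement_rate(auto_results: dict[str, bool], manual_labels: dict[str, bool]) -> float:
    """Fraction of manually reviewed items where the automatic judge disagreed."""
    checked = list(manual_labels)
    errors = sum(auto_results[qid] != manual_labels[qid] for qid in checked)
    return errors / len(checked)
```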