
Commit 250b3f6

Update index.html
1 parent 3d8647d commit 250b3f6

File tree

1 file changed (+9, -11 lines)


index.html

Lines changed: 9 additions & 11 deletions
@@ -143,27 +143,25 @@ <h2>Automatic Open-style Question Filtering and Generation:</h2>
 <li><b>Stage 2: Confidence Score Assignment:</b> Convertible questions are then assigned a confidence score (1-10) indicating the likelihood of being answered in an open-style format. Questions below a certain threshold are excluded, ensuring only the most suitable questions are selected.</li>
 </ul>
 </li>
-<li><b>Open-style Question Answer Evaluation:</b> We designed customized prompts to evaluate the correctness of LLM responses to open-style questions. The evaluation involves:
+<li><b>Open-style Question Answer Evaluation:</b> We designed customized prompts to evaluate the correctness of LLM responses to open-style questions. The evaluation involves:
 <ul>
 <li>Using the correct MCQ answer as the ground truth.</li>
-<li>Ensuring responses do not contain conflicting statements and accommodating minor differences in phrasing.</li>
 </ul>
 </li>
 <li><b>Validation of Automatic Evaluation Strategy:</b> To validate our approach, we manually checked a random sample of 100 results from the automatic evaluation, confirming an error rate of less than 3%.</li>
-<li><b>Comprehensive Analysis and Ranking:</b> We conducted a thorough assessment of five well-recognized LLMs (including GPT-4, ChatGPT, Claude-3 Opus, Gemini-Pro, and Mistral-Large) using our benchmark. The results indicate a high correlation between rankings from our open-style benchmark and those derived from human assessments, with GPT-4 emerging as the strongest LLM.</li>
+<li><b>Comprehensive Analysis and Ranking:</b> We conducted a thorough assessment of well-recognized LLMs (including GPT-4o, GPT-4, GPT-3.5, Claude-3 Opus, Gemini-Pro, and Mistral-Large) using our benchmark. GPT-4o leads overall with an accuracy of 70.15%,
+indicating its robustness on open-style question answering compared to the other models. It is followed
+by GPT-4-1106-preview with 65.93% and Claude-3 Opus with 62.68%. These results highlight the
+advanced capabilities of the GPT-4 series. Mid-tier models such as Mistral Large and GPT-3.5 perform
+well but are not on par with the top performers. On the other hand, models such as Gemini 1.0 Pro and
+Llama3-70b-Instruct lag behind in their ability to answer open-style questions.
+Among the smaller-scale LLMs, Qwen1.5 leads overall.</li>
 </ul>
 </div>
 
 </div>
 
-<!-- <div class="key-findings">
-<h2>Results and Analysis:</h2>
-<ol>
-<li><b>Benchmark Performance:</b> The Open-LLM-Benchmark evaluated several LLMs, including GPT-4, ChatGPT, Claude-3 Opus, and Gemini-Pro. Our analysis shows that GPT-4 remains the strongest LLM, despite claims by other leaderboards about Claude-3 Opus.</li>
-<li><b>Leaderboard Insights:</b> On average, the accuracy of LLMs on open-style questions is about 25% lower than on MCQs, highlighting the difficulty of open-style question answering but providing a more genuine evaluation of LLM capabilities.</li>
-<li><b>Dataset Diversity:</b> Our benchmark includes questions from diverse datasets like MMLU, ARC, and MedMCQA, ensuring a comprehensive evaluation across various domains.</li>
-</ol>
-</div> -->
 
 <div class="key-findings" id="misc">
 <h4 class="widget-title"><span><b>Citation.</b></span></h4>
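
The filtering and grading procedure described in the hunk above (confidence-scored filtering of MCQs, then LLM-graded open-style answers checked against the original MCQ answer) can be summarized as a minimal Python sketch. This is illustrative only and not the repository's implementation: the ask_llm helper, the prompt wording, the threshold of 7, and the benchmark record format are assumptions introduced for this example.

JUDGE_THRESHOLD = 7  # assumed cut-off on the 1-10 confidence scale


def ask_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call that returns the judge's reply."""
    raise NotImplementedError("wire this up to an LLM client of your choice")


def convertibility_confidence(mcq_question: str) -> int:
    """Stage 2 (assumed prompt): score 1-10 how well the MCQ works open-style."""
    reply = ask_llm(
        "On a scale of 1-10, how confidently can the following multiple-choice "
        f"question be answered in an open-style format?\n\n{mcq_question}\n\n"
        "Reply with a single integer."
    )
    return int(reply.strip())


def is_open_style_correct(question: str, response: str, gold_answer: str) -> bool:
    """Judge an open-style response against the correct MCQ answer (ground truth),
    tolerating minor phrasing differences but rejecting conflicting statements."""
    reply = ask_llm(
        f"Question: {question}\nReference answer: {gold_answer}\n"
        f"Model response: {response}\n\n"
        "Does the response agree with the reference answer, allowing minor wording "
        "differences but no conflicting statements? Answer YES or NO."
    )
    return reply.strip().upper().startswith("YES")


def evaluate(benchmark, generate_answer):
    """Keep only questions above the confidence threshold, then compute accuracy.

    `benchmark` is assumed to be a list of dicts with "question" and "answer" keys;
    `generate_answer` is the model under evaluation, called on the question text.
    """
    kept = [q for q in benchmark
            if convertibility_confidence(q["question"]) >= JUDGE_THRESHOLD]
    correct = sum(
        is_open_style_correct(q["question"], generate_answer(q["question"]), q["answer"])
        for q in kept
    )
    return correct / len(kept) if kept else 0.0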
