The "anwser" for some examples in "qasper.jsonl" is strange #67

Zcchill · 2024-07-09T11:48:59Z

I download the data from the offcial url and I found that the "answers" of several examples in "qasper.jsonl" are confusing. Here are several examples:
{"pred": "No", "answers": ["Yes", "No"], "all_classes": null, "length": 2317, "input": "Does this method help in sentiment classification task improvement?", "_id": "bcfe56efad9715cc714ffd2e523eaa9ad796a453e7da77a6"}
{"pred": "unanswerable", "answers": ["Yes", "Unanswerable"], "all_classes": null, "length": 2284, "actual_length": 3533, "input": "Is jiant compatible with models in any programming language?", "_id": "e5d1d589ddb30f43547012f04b06ac2924a1f4fdcf56daab"}
{"pred": "BERTBase", "answers": ["BERTbase", "BERTbase"], "all_classes": null, "length": 3852, "actual_length": 5701, "input": "What BERT model do they test?", "_id": "2a51c07e65a9214ed2cd3c04303afa205e005f4e1ccb172a"}

Zcchill · 2024-07-10T06:31:58Z

Another example: "_id": "d1aa1132439bd292965634095bf1c9943e062bb6645ff78c".
The query is "how many tags do they look at?"
The given answer seem to source from "We employ two sources of e-book annotation data: (i) editor tags, and (ii) Amazon search terms. For editor tags, we collect data of 48,705 e-books from 13 publishers, namely Kunstmann, Delius-Klasnig, VUR, HJR, Diogenes, Campus, Kiwi, Beltz, Chbeck, Rowohlt, Droemer, Fischer and Neopubli."
But I think the answer of "30 tags" based on "As shown in Table TABREF3 , we collect Amazon review keywords for 2,896 e-books (publishers: Kiwi, Rowohlt, Fischer, and Droemer), which leads to 33,663 distinct review keywords and on average 30 keyword assignments per e-book.\nTag Recommendation Approaches" is better.

bys0318 · 2024-07-10T07:50:22Z

Thanks for your keen observation. We sample the data directly from the test data of Qasper, we suggest you ask the authors of Qasper.

Zcchill · 2024-07-10T16:56:39Z

Besides, I would like to replicate the results of "GPT-3.5-Turbo-16k" in paper but get results not so close with the results reported in the paper. I wonder the possible reasons since there is no official code for api method.
The results I get are as followed:
{
"2wikimqa": {
"0-4k": 57.09,
"4-8k": 42.82,
"8k+": 32.71
},
"hotpotqa": {
"0-4k": 68.44,
"4-8k": 57.25,
"8k+": 55.38
},
"multi_news": {
"0-4k": 28.57,
"4-8k": 23.34,
"8k+": 22.31
},
"qasper": {
"0-4k": 47.3,
"4-8k": 43.97,
"8k+": 28.35
},
"multifieldqa_en": {
"0-4k": 57.15,
"4-8k": 51.67,
"8k+": 57.52
},
"gov_report": {
"0-4k": 31.79,
"4-8k": 28.82,
"8k+": 27.34
}
}
Experiment setting:

I use the api supported by AzureOpenAI.
The system prompt is None. [{"role":"system","content":''}, {"role":"user","content":prompt}]
inference hyper-parameters:
completion = client.chat.completions.create(
model="gpt-35-turbo-16k",
messages=input,
temperature=0.0,
max_tokens=max_tokens,
stop=stop_token,
)
response = completion.choices[0].message.content

bys0318 · 2024-07-11T08:09:30Z

This might be due to the model iteration. We tested the GPT-3.5-Turbo-16k at August, 2023. I think it has a different version now.

Zcchill · 2024-07-15T05:41:48Z

"You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {context}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:" The instruction for qasper tasks in dataset2prompt seems redundent, is this a mistake or a deliberate strategy to emphasize the task at both the beginning and the end of a long text (due to position bias)?

bys0318 · 2024-07-15T06:59:46Z

You're right. We want to emphasize the task instruction, so we insert the instruction at both the start and the end of the input.

Zcchill mentioned this issue Jul 10, 2024

The "anwser" for some examples is confusing allenai/qasper-led-baseline#27

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The "anwser" for some examples in "qasper.jsonl" is strange #67

The "anwser" for some examples in "qasper.jsonl" is strange #67

Zcchill commented Jul 9, 2024

Zcchill commented Jul 10, 2024

bys0318 commented Jul 10, 2024

Zcchill commented Jul 10, 2024

bys0318 commented Jul 11, 2024 •

edited

Loading

Zcchill commented Jul 15, 2024

bys0318 commented Jul 15, 2024

The "anwser" for some examples in "qasper.jsonl" is strange #67

The "anwser" for some examples in "qasper.jsonl" is strange #67

Comments

Zcchill commented Jul 9, 2024

Zcchill commented Jul 10, 2024

bys0318 commented Jul 10, 2024

Zcchill commented Jul 10, 2024

bys0318 commented Jul 11, 2024 • edited Loading

Zcchill commented Jul 15, 2024

bys0318 commented Jul 15, 2024

bys0318 commented Jul 11, 2024 •

edited

Loading