This repository contains the code and dataset of the research paper titled TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu, published at LREC-COLING 2024.
Headline Generation is the task of generating a relevant headline that represents the core information present in the news article. The key challenge is that irrelevant headlines in news articles scraped from the web often lead to sub-optimal performance.
To overcome this challenge, we propose that Relevance-based Headline Classification can greatly aid the task of generating relevant headlines.
Relevance-based Headline Classification is the task of categorizing a news headline based on its relevance to the corresponding news article into one of the three primary classes:
- Highly Relevant (HREL): The headline is highly related to the article.
- Moderately Relevant (MREL): The headline is moderately related to the article.
- Least Relevant (LREL): The headline is least related to the article.
This task has applications including News Recommendation, Incongruent Headline Detection, and Headline Stance Classification.
1. The TeClass dataset contains 26,178 article-headline pairs, each annotated for relevance-based headline classification with one of the three primary categories: HREL, MREL, or LREL.
2. We study the impact of fine-tuning headline generation models on different types of headlines (with varying degrees of relevance to the article).
We demonstrate that the task of relevant headline generation is best served when the headline generation models are fine-tuned only on highly relevant data, even if the highly relevant article-headline pairs are significantly fewer in number.
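
The selection step described above can be sketched as follows. This is a minimal illustration, not the repository's actual pipeline: the record layout (`article`, `headline`, `label` keys), the sample rows, and the helper name `select_by_relevance` are all hypothetical, and the real TeClass file format may differ.

```python
# Hypothetical record layout; the real TeClass schema may differ.
pairs = [
    {"article": "article text A", "headline": "headline A", "label": "HREL"},
    {"article": "article text B", "headline": "headline B", "label": "MREL"},
    {"article": "article text C", "headline": "headline C", "label": "LREL"},
    {"article": "article text D", "headline": "headline D", "label": "HREL"},
]


def select_by_relevance(records, keep=("HREL",)):
    """Return only the records whose relevance label is in `keep`."""
    allowed = set(keep)
    return [r for r in records if r["label"] in allowed]


# Keep only highly relevant pairs as fine-tuning data.
train_pairs = select_by_relevance(pairs)
print(len(train_pairs), "of", len(pairs), "pairs kept for fine-tuning")
```

Passing `keep=("HREL", "MREL")` instead would give a larger but noisier training set, which is the kind of trade-off the study above compares.
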
F1-scores by feature vector and classifier:

| Feature Vector | Classifier | HREL | MREL | LREL | Overall (Weighted) | Overall (Macro) |
|---|---|---|---|---|---|---|
| Without Feature Vector | LR | 0.57 | 0.50 | 0.59 | 0.55 | 0.55 |
| | SVM | 0.55 | 0.49 | 0.57 | 0.53 | 0.54 |
| | MLP | 0.55 | 0.49 | 0.58 | 0.54 | 0.54 |
| | Bagging | 0.55 | 0.47 | 0.57 | 0.52 | 0.53 |
| Cosine Similarity | LR | 0.58 | 0.50 | 0.59 | 0.55 | 0.56 |
| | SVM | 0.56 | 0.49 | 0.58 | 0.54 | 0.54 |
| | MLP | 0.56 | 0.49 | 0.56 | 0.53 | 0.54 |
| | Bagging | 0.56 | 0.47 | 0.58 | 0.53 | 0.54 |
| [Cosine Similarity, LEAD-1, Novel 1-gram %] | LR | 0.61 | 0.53 | 0.59 | 0.58 | 0.58 |
| | SVM | 0.60 | 0.52 | 0.58 | 0.57 | 0.57 |
| | MLP | 0.60 | 0.54 | 0.55 | 0.56 | 0.56 |
| | Bagging | 0.60 | 0.51 | 0.59 | 0.56 | 0.57 |
| [Cosine Similarity, LEAD-1, Novel 1-gram %, Novel 2-gram %, EXT-ORACLE] | LR | 0.62 | 0.53 | 0.59 | 0.58 | 0.58 |
| | SVM | 0.60 | 0.52 | 0.58 | 0.57 | 0.57 |
| | MLP | 0.60 | 0.50 | 0.61 | 0.56 | 0.57 |
| | Bagging | 0.60 | 0.51 | 0.58 | 0.56 | 0.56 |
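
To give a feel for two of the feature names in the table above, here is a rough sketch of how a cosine-similarity score and a "novel n-gram %" score between a headline and its article might be computed. The definitions are assumptions, not the paper's: cosine similarity is taken over whitespace-tokenized bag-of-words counts (the paper may use embeddings instead), and novel n-gram % is taken as the share of headline n-grams absent from the article. LEAD-1 and EXT-ORACLE are not covered here, and the English sample text is purely illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_pct(headline, article, n=1):
    """Percentage of headline n-grams that do not occur in the article (assumed definition)."""
    head = ngrams(headline.split(), n)
    if not head:
        return 0.0
    seen = set(ngrams(article.split(), n))
    novel = sum(1 for g in head if g not in seen)
    return 100.0 * novel / len(head)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors (a stand-in representation)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Illustrative inputs only; the dataset itself is in Telugu.
article = "the council approved the new budget on monday"
headline = "council approved record budget"

features = [
    cosine_similarity(headline, article),
    novel_ngram_pct(headline, article, n=1),
    novel_ngram_pct(headline, article, n=2),
]
print(features)
```

A vector like `features` could then be appended to whatever text representation the classifier consumes.
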
F1-scores by pre-trained model (three classes: HREL, MREL, LREL):

| Pre-trained Model | HREL | MREL | LREL | Overall (Weighted) | Overall (Macro) |
|---|---|---|---|---|---|
| IndicBERT | 0.66 | 0.55 | 0.67 | 0.62 | 0.63 |
| mBERT | 0.66 | 0.50 | 0.62 | 0.59 | 0.59 |
| mDeBERTa | 0.65 | 0.59 | 0.67 | 0.63 | 0.64 |
| MuRIL | 0.66 | 0.55 | 0.62 | 0.61 | 0.61 |
| XLMRoBERTa | 0.67 | 0.53 | 0.65 | 0.61 | 0.62 |
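
The tables report both weighted and macro averages of the per-class F1-score. As a reminder of how the two differ, the sketch below computes them from scratch: macro averaging gives every class equal weight, while weighted averaging weights each class by its number of true instances. The label sequences are made up for illustration and are not drawn from TeClass.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """F1-score for each label, computed from true/false positives and false negatives."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1-scores."""
    return sum(scores.values()) / len(scores)


def weighted_f1(scores, y_true):
    """Mean of the per-class F1-scores, weighted by each class's support in y_true."""
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


labels = ["HREL", "MREL", "LREL"]
# Made-up gold labels and predictions, for illustration only.
y_true = ["HREL", "HREL", "HREL", "HREL", "MREL", "MREL", "LREL"]
y_pred = ["HREL", "HREL", "HREL", "HREL", "MREL", "LREL", "MREL"]

scores = per_class_f1(y_true, y_pred, labels)
print("macro:", round(macro_f1(scores), 3))
print("weighted:", round(weighted_f1(scores, y_true), 3))
```

When the class distribution is skewed, as in this toy example, a strong score on the majority class lifts the weighted average well above the macro average.
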
F1-scores by pre-trained model (two classes: Relevant, Less Relevant):

| Pre-trained Model | Relevant (FME+STC+FSE) | Less Relevant (WKC+MLC+SEN+CBT) | Overall (Weighted) | Overall (Macro) |
|---|---|---|---|---|
| IndicBERT | 0.86 | 0.66 | 0.79 | 0.76 |
| mBERT | 0.86 | 0.63 | 0.78 | 0.74 |
| mDeBERTa | 0.85 | 0.69 | 0.80 | 0.77 |
| MuRIL | 0.73 | 0.63 | 0.70 | 0.68 |
| XLMRoBERTa | 0.86 | 0.68 | 0.80 | 0.77 |
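
The two-class setting above merges the fine-grained labels according to the grouping named in the table's column headers. A minimal sketch of that mapping follows; the function name `to_binary` and the output strings are illustrative choices, not identifiers from the repository.

```python
# Grouping taken from the table's column headers.
RELEVANT = {"FME", "STC", "FSE"}
LESS_RELEVANT = {"WKC", "MLC", "SEN", "CBT"}


def to_binary(fine_label):
    """Map a fine-grained label to its binary relevance class."""
    if fine_label in RELEVANT:
        return "Relevant"
    if fine_label in LESS_RELEVANT:
        return "Less Relevant"
    raise ValueError(f"unknown fine-grained label: {fine_label!r}")


fine_labels = ["FME", "WKC", "STC", "CBT"]
binary_labels = [to_binary(label) for label in fine_labels]
print(binary_labels)
```

Raising on unknown labels, rather than defaulting to one class, keeps annotation typos from silently skewing the binary distribution.
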
Headline generation results by fine-tuning data (rows) and test class (columns):

| Fine-tuned on | FME | STC | FSE | WKC | SEN | CBT |
|---|---|---|---|---|---|---|
| Zero-shot inference (Mukhyansh) | 0.39 | 0.23 | 0.25 | 0.17 | 0.21 | 0.15 |
| FME | 0.45 | 0.28 | 0.31 | 0.21 | 0.25 | 0.17 |
| STC | 0.43 | 0.27 | 0.30 | 0.22 | 0.23 | 0.18 |
| FSE | 0.41 | 0.26 | 0.29 | 0.22 | 0.23 | 0.18 |
| WKC | 0.38 | 0.23 | 0.28 | 0.20 | 0.21 | 0.15 |
| SEN | 0.41 | 0.26 | 0.29 | 0.20 | 0.23 | 0.18 |
| CBT | 0.39 | 0.24 | 0.27 | 0.21 | 0.22 | 0.16 |
| Total (6-class) | 0.43 | 0.27 | 0.30 | 0.22 | 0.25 | 0.18 |
| 3-class (FME, FSE, STC) | 0.44 | 0.28 | 0.30 | 0.20 | 0.25 | 0.20 |
| 3-class (WKC, SEN, CBT) | 0.40 | 0.25 | 0.29 | 0.19 | 0.23 | 0.18 |
Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.
If you use any of the datasets, models or any part of this research work, please cite the following paper:
TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu (Kanumolu et al., LREC-COLING 2024)