ltrc/TeClass

TeClass

This repository contains the code and dataset of the research paper titled TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu, published at LREC-COLING 2024.

Task Introduction

Headline Generation is the task of generating a relevant headline that represents the core information in a news article. The key challenge is that irrelevant headlines in news articles scraped from the web often lead to sub-optimal performance.

To overcome this challenge, we propose that Relevance-based Headline Classification can greatly aid the task of generating relevant headlines.

Relevance-based Headline Classification is the task of categorizing a news headline, based on its relevance to the corresponding news article, into one of three primary classes:

  1. Highly Relevant (HREL): The headline is highly related to the article
  2. Moderately Relevant (MREL): The headline is moderately related to the article
  3. Least Relevant (LREL): The headline is least related to the article

Applications of this task include News Recommendation, Incongruent Headline Detection, and Headline Stance Classification.

Our Key Contributions

1. We present "TeClass", a large, diverse, and high-quality human-annotated dataset for Telugu.

It contains 26,178 article-headline pairs annotated for relevance-based headline classification with one of the three primary categories: HREL, MREL, and LREL.

2. We study the impact of fine-tuning headline generation models on different types of headlines (with varying degrees of relevance to the article).

We demonstrate that relevant headline generation is best served when the headline generation models are fine-tuned only on highly relevant data, even if the highly relevant article-headline pairs are significantly fewer in number.
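Selecting only the highly relevant pairs before fine-tuning can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the record keys (`article`, `headline`, `label`), the function name `select_by_relevance`, and the sample rows are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable, List


def select_by_relevance(
    pairs: Iterable[Dict[str, str]],
    keep: FrozenSet[str] = frozenset({"HREL"}),
) -> List[Dict[str, str]]:
    """Return only the article-headline pairs whose label is in `keep`."""
    return [pair for pair in pairs if pair["label"] in keep]


# Placeholder rows standing in for annotated article-headline pairs.
# The keys "article", "headline", and "label" are hypothetical, not the
# dataset's documented schema.
sample_pairs = [
    {"article": "article text 1", "headline": "headline 1", "label": "HREL"},
    {"article": "article text 2", "headline": "headline 2", "label": "MREL"},
    {"article": "article text 3", "headline": "headline 3", "label": "LREL"},
    {"article": "article text 4", "headline": "headline 4", "label": "HREL"},
]

hrel_only = select_by_relevance(sample_pairs)
print(len(hrel_only))  # 2
```

Passing a different `keep` set (for example `frozenset({"HREL", "MREL"})`) gives the other training subsets one might compare against.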

"TeClass" Dataset Statistics

[Image: TeClass_Stats_Table]

Headline Classification Model Results

Feature-based Machine Learning Models

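F1-scores by feature vector and classifier:

| Feature Vector | Classifier | HREL | MREL | LREL | Overall (Weighted) | Overall (Macro) |
|---|---|---|---|---|---|---|
| Without Feature Vector | LR | 0.57 | 0.50 | 0.59 | 0.55 | 0.55 |
| | SVM | 0.55 | 0.49 | 0.57 | 0.53 | 0.54 |
| | MLP | 0.55 | 0.49 | 0.58 | 0.54 | 0.54 |
| | Bagging | 0.55 | 0.47 | 0.57 | 0.52 | 0.53 |
| Cosine Similarity | LR | 0.58 | 0.50 | 0.59 | 0.55 | 0.56 |
| | SVM | 0.56 | 0.49 | 0.58 | 0.54 | 0.54 |
| | MLP | 0.56 | 0.49 | 0.56 | 0.53 | 0.54 |
| | Bagging | 0.56 | 0.47 | 0.58 | 0.53 | 0.54 |
| [Cosine Similarity, LEAD-1, Novel 1-gram %] | LR | 0.61 | 0.53 | 0.59 | 0.58 | 0.58 |
| | SVM | 0.60 | 0.52 | 0.58 | 0.57 | 0.57 |
| | MLP | 0.60 | 0.54 | 0.55 | 0.56 | 0.56 |
| | Bagging | 0.60 | 0.51 | 0.59 | 0.56 | 0.57 |
| [Cosine Similarity, LEAD-1, Novel 1-gram %, Novel 2-gram %, EXT-ORACLE] | LR | 0.62 | 0.53 | 0.59 | 0.58 | 0.58 |
| | SVM | 0.60 | 0.52 | 0.58 | 0.57 | 0.57 |
| | MLP | 0.60 | 0.50 | 0.61 | 0.56 | 0.57 |
| | Bagging | 0.60 | 0.51 | 0.58 | 0.56 | 0.56 |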
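Two of the feature names above, cosine similarity and novel n-gram percentage, can be illustrated with a short sketch. The exact definitions used in the paper are not given in this README, so the choices here are assumptions: whitespace tokenization, bag-of-words count vectors for the cosine, and "novel" meaning a headline n-gram that never occurs in the article. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_percent(headline: str, article: str, n: int = 1) -> float:
    """Percentage of headline n-grams that never occur in the article."""
    head = ngrams(headline.split(), n)
    if not head:
        return 0.0
    seen = set(ngrams(article.split(), n))
    novel = sum(1 for gram in head if gram not in seen)
    return 100.0 * novel / len(head)


def cosine_similarity(headline: str, article: str) -> float:
    """Cosine between bag-of-words count vectors (an assumed representation)."""
    a, b = Counter(headline.split()), Counter(article.split())
    dot = sum(count * b[tok] for tok, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


headline = "rain floods city"
article = "heavy rain hit the city today"
unigram_novelty = novel_ngram_percent(headline, article, n=1)  # 1 of 3 unigrams is new
bigram_novelty = novel_ngram_percent(headline, article, n=2)   # both bigrams are new
similarity = cosine_similarity(headline, article)
print(round(unigram_novelty, 2), bigram_novelty, round(similarity, 3))
```

Each function returns one scalar per article-headline pair, so the values can be concatenated into a feature vector for a classifier such as LR or SVM.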

Fine-tuning Pre-trained BERT-based Models

Three Class Classification

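F1-scores by pre-trained model:

| Pre-trained Model | HREL | MREL | LREL | Overall (Weighted) | Overall (Macro) |
|---|---|---|---|---|---|
| IndicBERT | 0.66 | 0.55 | 0.67 | 0.62 | 0.63 |
| mBERT | 0.66 | 0.5 | 0.62 | 0.59 | 0.59 |
| mDeBERTa | 0.65 | 0.59 | 0.67 | 0.63 | 0.64 |
| MuRIL | 0.66 | 0.55 | 0.62 | 0.61 | 0.61 |
| XLMRoBERTa | 0.67 | 0.53 | 0.65 | 0.61 | 0.62 |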
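The difference between the "Overall (Macro)" and "Overall (Weighted)" columns can be sketched as follows. Macro F1 is the unweighted mean of the per-class scores, while weighted F1 weights each class by its number of test examples. The per-class scores are the rounded IndicBERT values reported here; the support counts are hypothetical, since the test-set class sizes are not listed in this README.

```python
from typing import Sequence


def macro_f1(per_class_f1: Sequence[float]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


def weighted_f1(per_class_f1: Sequence[float], supports: Sequence[int]) -> float:
    """Mean of per-class F1 scores weighted by class support."""
    total = sum(supports)
    return sum(f1 * n for f1, n in zip(per_class_f1, supports)) / total


# HREL, MREL, LREL scores for IndicBERT as reported (rounded to 2 decimals).
indicbert_f1 = [0.66, 0.55, 0.67]
# HYPOTHETICAL class sizes, chosen only to make the example runnable.
hypothetical_supports = [400, 350, 250]

print(round(macro_f1(indicbert_f1), 2))  # 0.63
weighted = weighted_f1(indicbert_f1, hypothetical_supports)
```

The macro value matches the reported 0.63 for IndicBERT; the weighted value depends entirely on the assumed supports and is shown only to illustrate the formula.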

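Two Class Classification

F1-scores by pre-trained model:

| Pre-trained Model | Relevant (FME+STC+FSE) | Less Relevant (WKC+MLC+SEN+CBT) | Overall (Weighted) | Overall (Macro) |
|---|---|---|---|---|
| IndicBERT | 0.86 | 0.66 | 0.79 | 0.76 |
| mBERT | 0.86 | 0.63 | 0.78 | 0.74 |
| mDeBERTa | 0.85 | 0.69 | 0.80 | 0.77 |
| MuRIL | 0.73 | 0.63 | 0.70 | 0.68 |
| XLMRoBERTa | 0.86 | 0.68 | 0.80 | 0.77 |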
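The two-class setting groups the fine-grained label codes as shown in the column headers (FME+STC+FSE versus WKC+MLC+SEN+CBT). A minimal sketch of that grouping is given below; the function name `to_two_class` and the output strings are assumptions for illustration, and only the grouping itself comes from the results above.

```python
# Grouping taken from the two-class column headers.
RELEVANT = frozenset({"FME", "STC", "FSE"})
LESS_RELEVANT = frozenset({"WKC", "MLC", "SEN", "CBT"})


def to_two_class(fine_label: str) -> str:
    """Map a fine-grained label code to its two-class label (hypothetical helper)."""
    if fine_label in RELEVANT:
        return "Relevant"
    if fine_label in LESS_RELEVANT:
        return "Less Relevant"
    raise ValueError(f"Unknown fine-grained label: {fine_label!r}")


fine_labels = ["FME", "SEN", "STC", "MLC"]
two_class = [to_two_class(label) for label in fine_labels]
print(two_class)  # ['Relevant', 'Less Relevant', 'Relevant', 'Less Relevant']
```

Raising on unknown codes, rather than defaulting to one class, keeps annotation typos from silently skewing the class balance.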

Relevance-based Headline Generation Model Results

[Image: Relevance-based-HG]

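ROUGE-L scores of class-based fine-tuning of Headline Generation models (rows: data used for fine-tuning; columns: class tested on).

| Fine-tuned on | FME | STC | FSE | WKC | SEN | CBT |
|---|---|---|---|---|---|---|
| Zero-shot inference (Mukhyansh) | 0.39 | 0.23 | 0.25 | 0.17 | 0.21 | 0.15 |
| FME | 0.45 | 0.28 | 0.31 | 0.21 | 0.25 | 0.17 |
| STC | 0.43 | 0.27 | 0.3 | 0.22 | 0.23 | 0.18 |
| FSE | 0.41 | 0.26 | 0.29 | 0.22 | 0.23 | 0.18 |
| WKC | 0.38 | 0.23 | 0.28 | 0.2 | 0.21 | 0.15 |
| SEN | 0.41 | 0.26 | 0.29 | 0.2 | 0.23 | 0.18 |
| CBT | 0.39 | 0.24 | 0.27 | 0.21 | 0.22 | 0.16 |
| Total (6-class) | 0.43 | 0.27 | 0.3 | 0.22 | 0.25 | 0.18 |
| 3-class (FME, FSE, STC) | 0.44 | 0.28 | 0.3 | 0.2 | 0.25 | 0.2 |
| 3-class (WKC, SEN, CBT) | 0.4 | 0.25 | 0.29 | 0.19 | 0.23 | 0.18 |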
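For readers unfamiliar with the metric, ROUGE-L scores a generated headline by the longest common subsequence (LCS) it shares with the reference headline. The sketch below computes the sentence-level F-measure with whitespace tokenization; the tokenizer and the equal weighting of precision and recall are assumptions, and the evaluation setup behind the reported numbers may differ.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Sentence-level ROUGE-L F-measure on whitespace tokens (assumed setup)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "heavy rain floods the city"
candidate = "rain floods city streets"
score = rouge_l_f1(reference, candidate)  # LCS is "rain floods city" (3 tokens)
print(round(score, 3))  # 0.667
```

Here precision is 3/4 and recall is 3/5, giving an F-measure of 2/3.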

[Image: Relevance-based-HG-observations]

License

The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation

If you use any of the datasets, models or any part of this research work, please cite the following paper:

TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu (Kanumolu et al., LREC-COLING 2024)
