TeClass

This repository contains the code and dataset of the research paper titled TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu, published at LREC-COLING 2024.

Task Introduction

Headline Generation is the task of generating a relevant headline that represents the core information present in the news article. The key challenge is that the presence of irrelevant headlines in news articles scraped from the web often results in sub-optimal performance.

To overcome this challenge, we propose that Relevance-based Headline Classification can greatly aid the task of generating relevant headlines.

Relevance-based Headline Classification is the task of categorizing a news headline based on its relevance to the corresponding news article into one of the three primary classes:

Highly Relevant (HREL): The headline is highly related to the article.
Moderately Relevant (MREL): The headline is moderately related to the article.
Least Relevant (LREL): The headline is least related to the article.

This task has applications including News Recommendation, Incongruent Headline Detection, and Headline Stance Classification.

Our Key Contributions

1. We present "TeClass", a large, diverse, and high-quality human-annotated dataset for Telugu.

It contains 26,178 article-headline pairs annotated for relevance-based headline classification with one of the three primary categories: HREL, MREL, and LREL.

2. We study the impact of fine-tuning headline generation models on different types of headlines (with varying degrees of relevance to the article).

We demonstrate that relevant headline generation is best served when the headline generation models are fine-tuned only on highly relevant data, even if the highly relevant article-headline pairs are far fewer in number.
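As a rough illustration of what "fine-tuning only on highly relevant data" means on the data side, the sketch below keeps just the HREL pairs before training. The record layout and the field names `article`, `headline`, and `label` are assumptions made for this example, not the dataset's documented schema, and `select_by_label` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical records: the field names "article", "headline", and "label"
# are assumptions for illustration, not the dataset's documented schema.
pairs = [
    {"article": "article text 1", "headline": "headline 1", "label": "HREL"},
    {"article": "article text 2", "headline": "headline 2", "label": "LREL"},
    {"article": "article text 3", "headline": "headline 3", "label": "MREL"},
    {"article": "article text 4", "headline": "headline 4", "label": "HREL"},
]


def select_by_label(records, keep=("HREL",)):
    """Return only the records whose relevance label is in `keep`."""
    keep = set(keep)
    return [r for r in records if r["label"] in keep]


# Keep only highly relevant pairs as the fine-tuning subset.
train_subset = select_by_label(pairs)

print(Counter(r["label"] for r in pairs))
print(len(train_subset))
```

The resulting subset is smaller than the full set, which mirrors the trade-off described above: fewer training pairs, but all of them highly relevant.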

"TeClass" Dataset Statistics

Headline Classification Model Results

Feature-based Machine Learning Models

Feature Vector	Classifier	F1 (HREL)	F1 (MREL)	F1 (LREL)	F1 Overall (Weighted)	F1 Overall (Macro)
Without Feature Vector	LR	0.57	0.50	0.59	0.55	0.55
	SVM	0.55	0.49	0.57	0.53	0.54
	MLP	0.55	0.49	0.58	0.54	0.54
	Bagging	0.55	0.47	0.57	0.52	0.53
Cosine Similarity	LR	0.58	0.50	0.59	0.55	0.56
	SVM	0.56	0.49	0.58	0.54	0.54
	MLP	0.56	0.49	0.56	0.53	0.54
	Bagging	0.56	0.47	0.58	0.53	0.54
[Cosine Similarity, LEAD-1, Novel 1-gram %]	LR	0.61	0.53	0.59	0.58	0.58
	SVM	0.60	0.52	0.58	0.57	0.57
	MLP	0.60	0.54	0.55	0.56	0.56
	Bagging	0.60	0.51	0.59	0.56	0.57
[Cosine Similarity, LEAD-1, Novel 1-gram %, Novel 2-gram %, EXT-ORACLE]	LR	0.62	0.53	0.59	0.58	0.58
	SVM	0.60	0.52	0.58	0.57	0.57
	MLP	0.60	0.50	0.61	0.56	0.57
	Bagging	0.60	0.51	0.58	0.56	0.56
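Two of the features named in the table, cosine similarity and novel n-gram percentage, can be sketched in a few lines. This is a minimal sketch under stated assumptions: it tokenizes on whitespace and uses raw term-frequency vectors for the cosine, which may differ from the representation used in the paper. The function names are hypothetical, and LEAD-1 and EXT-ORACLE are left out because this README does not describe how they are computed.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_pct(headline, article, n=1):
    """Percentage of headline n-grams that do not occur in the article."""
    headline_grams = ngrams(headline.split(), n)
    if not headline_grams:
        return 0.0
    article_grams = set(ngrams(article.split(), n))
    novel = sum(1 for g in headline_grams if g not in article_grams)
    return 100.0 * novel / len(headline_grams)


def cosine_similarity(headline, article):
    """Cosine similarity of term-frequency vectors (assumed representation)."""
    h, a = Counter(headline.split()), Counter(article.split())
    dot = sum(h[t] * a[t] for t in h)
    norm = math.sqrt(sum(v * v for v in h.values())) * math.sqrt(
        sum(v * v for v in a.values())
    )
    return dot / norm if norm else 0.0


headline = "rain floods city"
article = "heavy rain floods the old city . rescue teams arrive"

features = [
    cosine_similarity(headline, article),
    novel_ngram_pct(headline, article, n=1),
    novel_ngram_pct(headline, article, n=2),
]
print(features)
```

A feature vector like `features` would then be passed to a classifier such as LR, SVM, MLP, or Bagging, as listed in the table.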

Fine-tuning Pre-trained BERT-based Models

Three Class Classification

Pre-trained Model	F1 (HREL)	F1 (MREL)	F1 (LREL)	F1 Overall (Weighted)	F1 Overall (Macro)
IndicBERT	0.66	0.55	0.67	0.62	0.63
mBERT	0.66	0.5	0.62	0.59	0.59
mDeBERTa	0.65	0.59	0.67	0.63	0.64
MuRIL	0.66	0.55	0.62	0.61	0.61
XLMRoBERTa	0.67	0.53	0.65	0.61	0.62
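The two "Overall" columns differ only in how the per-class F1 scores are averaged. The sketch below shows both: macro F1 is the unweighted mean over classes, while weighted F1 weights each class by its number of test examples. The per-class supports are not given in this README, so the `support` argument is a placeholder. Because the table values are rounded, a recomputed macro score may differ from the reported one in the last digit.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)


def weighted_f1(per_class_f1, support):
    """Mean of per-class F1 scores weighted by class support (example counts)."""
    total = sum(support[c] for c in per_class_f1)
    return sum(per_class_f1[c] * support[c] for c in per_class_f1) / total


# Per-class F1 for IndicBERT, copied from the table above.
indicbert = {"HREL": 0.66, "MREL": 0.55, "LREL": 0.67}

print(round(macro_f1(indicbert), 2))  # 0.63, matching the Macro column

# With equal supports (a placeholder, not the real class distribution),
# weighted F1 reduces to macro F1.
equal_support = {"HREL": 1, "MREL": 1, "LREL": 1}
print(round(weighted_f1(indicbert, equal_support), 2))
```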

Two Class Classification

Pre-trained Model	F1: Relevant (FME+STC+FSE)	F1: Less Relevant (WKC+MLC+SEN+CBT)	F1: Overall (Weighted)	F1: Overall (Macro)
IndicBERT	0.86	0.66	0.79	0.76
mBERT	0.86	0.63	0.78	0.74
mDeBERTa	0.85	0.69	0.80	0.77
MuRIL	0.73	0.63	0.70	0.68
XLMRoBERTa	0.86	0.68	0.80	0.77
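The two-class setting groups the fine-grained label codes shown in the table header into two buckets. A minimal sketch of that grouping follows. The codes are taken from the header as-is, since their expanded names are not given in this README, and the names `FINE_TO_BINARY` and `to_binary` are hypothetical.

```python
# Grouping taken from the table header above; the dictionary and function
# names are illustrative, not part of the released code.
FINE_TO_BINARY = {
    "FME": "Relevant",
    "STC": "Relevant",
    "FSE": "Relevant",
    "WKC": "Less Relevant",
    "MLC": "Less Relevant",
    "SEN": "Less Relevant",
    "CBT": "Less Relevant",
}


def to_binary(fine_label):
    """Map a fine-grained label code to its two-class label."""
    return FINE_TO_BINARY[fine_label]


labels = ["FME", "CBT", "STC", "MLC"]
print([to_binary(label) for label in labels])
```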

Relevance-based Headline Generation Model Results

ROUGE-L scores of class-based fine-tuning of Headline Generation models.

Fine-tuned on (rows) / Tested on (columns)	FME	STC	FSE	WKC	SEN	CBT
Zero-shot inference (Mukhyansh)	0.39	0.23	0.25	0.17	0.21	0.15
FME	0.45	0.28	0.31	0.21	0.25	0.17
STC	0.43	0.27	0.3	0.22	0.23	0.18
FSE	0.41	0.26	0.29	0.22	0.23	0.18
WKC	0.38	0.23	0.28	0.2	0.21	0.15
SEN	0.41	0.26	0.29	0.2	0.23	0.18
CBT	0.39	0.24	0.27	0.21	0.22	0.16
Total (6-class)	0.43	0.27	0.3	0.22	0.25	0.18
3-class (FME, FSE, STC)	0.44	0.28	0.3	0.2	0.25	0.2
3-class (WKC, SEN, CBT)	0.4	0.25	0.29	0.19	0.23	0.18
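For readers unfamiliar with the metric, ROUGE-L scores a generated headline by the longest common subsequence (LCS) it shares with the reference headline. The sketch below computes the LCS-based F1 with whitespace tokenization. It is an illustration of the metric only: the scores in the table may come from a different implementation, tokenizer, or precision/recall weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 between a candidate and a reference (whitespace tokens)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("a b c", "a b c"))    # identical headlines score 1.0
print(rouge_l_f1("a b c d", "a c e"))  # LCS is "a c": precision 2/4, recall 2/3
```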

License

The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation

If you use any of the datasets, models or any part of this research work, please cite the following paper:

TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu (Kanumolu et al., LREC-COLING 2024)
