Sequence to sequence language resources for Catalan and for two tasks, namely: Summarization and Machine Translation.

TeMU-BSC/seq-to-seq-catalan

Sequence-to-sequence Resources for Catalan

In this work, we introduce sequence-to-sequence language resources for Catalan, a moderately under-resourced language, for two tasks: Summarization and Machine Translation (MT). We present two new summarization datasets in the newswire domain. We also introduce a parallel Catalan-English corpus, paired with three brand-new test sets. Finally, we evaluate the data with competing state-of-the-art models, and we develop baselines for these tasks using a newly created Catalan BART. We release the resulting resources under an open license to encourage the development of language technology in Catalan.

Materials

We openly release the materials produced for this publication:

Summarization

  • CaSum, a Catalan abstractive summarization dataset
  • VilaSum, a Catalan abstractive summarization test set
  • BART-base-ca-casum, a Catalan abstractive summarization model

Machine Translation (soon)

  • GEnCaTA, a high-quality Catalan-English corpus for MT
  • Evaluation Resources for Catalan-English MT

Citation

If you use any of these resources (datasets or models) in your work, please cite our latest preprint:

@misc{degibert2022sequencetosequence,
      title={Sequence-to-Sequence Resources for Catalan}, 
      author={Ona de Gibert and Ksenia Kharitonova and Blanca Calvo Figueras and Jordi Armengol-Estapé and Maite Melero},
      year={2022},
      eprint={2202.06871},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

MIT License

Copyright (c) 2022 Text Mining Unit at BSC
