A collection of text simplification datasets with a focus on sentence/paragraph/document-level simplification. All contributions are welcome!
For browsing/sorting the datasets, you can use the interactive table.
Notes on the table columns:
- Kind refers to the way simplification instances were obtained. For parallel, this is usually through manual simplification according to specific guidelines. For comparable, this is by automatically mining pairs of complex/simple sentences with similar meaning from a large text corpus.
- Level can be lexical (lex), sentence (sent), paragraph (para) or document (doc).
- Refs refers to the number of references per instance (i.e., gold simplifications).
Dataset | Lang | Domain | Kind | Level | Instances | Refs. | Link |
---|---|---|---|---|---|---|---|
PWKP (Zhu et al., 2010) | EN | Wikipedia | Comparable | Sent | 108,016 paired sentences extracted from 65,133 articles. | 1 | Link |
C&K-1 (Coster and Kauchak, 2011) | EN | Wikipedia | Comparable | Sent | 137,000 paired sentences from 10,588 articles. | 1 | Link |
C&K-2 (Kauchak, 2013) | EN | Wikipedia | Comparable | Sent | 167,000 paired sentences. | 1 | Link |
LexMTurk (Horn et al., 2014) | EN | Wikipedia | Parallel | Lex | 500 | multiple | Link |
EW-SEW (Hwang et al., 2015) | EN | Wikipedia | Comparable | Sent | 150,000 full and 130,000 partial matches | 1 | Link (Archived, pre-processed version available here) |
sscorpus (Kajiwara and Komachi, 2016) | EN | Wikipedia | Comparable | Sent | 492,993 aligned sentences from 126K article pairs. | 1 | Link |
TurkCorpus (Xu et al., 2016) | EN | Wikipedia | Parallel | Sent | 2359 sentences (2000 dev, 359 test) | 8 | Link |
NNSEval (Paetzold and Specia, 2016) | EN | Wikipedia | Comparable | Lex | 239 | multiple | Link |
BenchLS (Paetzold and Specia, 2016) | EN | Wikipedia | Comparable | Lex | 929 | multiple | Link |
WikiLarge (Zhang and Lapata, 2017) | EN | Wikipedia | Comparable | Sent | 296,402 sentence pairs | 1 | Link |
WikiSmall (Zhang and Lapata, 2017) | EN | Wikipedia | Comparable | Sent | 89,042 sentence pairs | 1 | Link |
WikiSplit (Botha et al., 2018) | EN | Wikipedia | Parallel | Sent | 1 million sentences | 1 | Link |
Hsplit (Sulem et al., 2018) | EN | Wikipedia | Parallel | Sent | 359 sentences (test set of TurkCorpus) | 4 | Link |
ASSET (Alva-Manchego et al., 2020) | EN | Wikipedia | Parallel | Sent | 2359 sentences (2000 dev, 359 test) | 10 | Link |
Wiki-AUTO (Jiang et al., 2020) | EN | Wikipedia | Comparable | Sent | 488,332 train sentences from 138,095 article pairs (2019/09 dump). | 1 | Link (Part of GEM) |
Wikipedia (with context) (Sun et al., 2020) | EN | Wikipedia | Comparable | Sent | 116,020 sentences with context (includes preceding and following sentence) | 1 | Link |
D-Wikipedia (Sun et al., 2021) | EN | Wikipedia | Comparable | Doc | 143,546 article pairs | 1 | Link |
Klexikon (Aumiller and Gertz, 2022) | DE | Wikipedia | Comparable | Doc | 2898 article pairs | 1 | Link |
SWiPE (Laban et al., 2023) | EN | Wikipedia | Comparable | Doc | 145,161 article revision pairs, annotations of fine-grained edit operations on ~5000 articles | 1 | Link |
Dsim (Klerke and Søgaard, 2012) | DA | News | Parallel | Doc | 3,701 articles with 48,186 aligned sentences | 1 | n/a |
Newsela (Xu et al., 2015) | EN | News | Parallel | Doc | 1130 articles (original); 1911 articles (v2016-01-29); at 5 levels | 1 | Link |
Newsela-ES (Xu et al., 2015) | ES | News | Parallel | Doc | 243 articles (v2016-01-29) at 5 levels | 1 | Link |
OneStopEnglish (Vajjala and Lucic, 2018) | EN | News | Parallel | Doc | 189 articles at three levels. Automatic sentence alignment: 1.6K ELE-INT, 2.1K ELE-ADV, 3.1K INT-ADV. | 1 | Link |
Newsela-AUTO (Jiang et al., 2020) | EN | News | Parallel | Sent | 666,645 sentence pairs from 1932 articles at 5 levels | 1 | Link |
20 minutes (Rios et al., 2021) | DE | News | Parallel | Doc | 18,305 articles with simplified summaries. | 1 | Link |
SNIML (Hauser et al., 2022) | DE, EN, FI, FR, IT, SV | News | Simplified only | Doc | 13,447 documents | n/a | Link |
DEplain (Stodden et al., 2023) | DE | News | Parallel | Doc | 500 document pairs in News domain (13k aligned sentences), 150 document pairs in Web domain (2k aligned sentences) | 1 | Link |
SimpleGerman (Klaper et al., 2013) | DE | Web | Comparable | Sent | 7000 sentences from 256 articles. 78% of sentences have an alignment | 1 | n/a (Available on request) |
SimPA (Scarton et al., 2018) | EN | Web | Parallel | Sent | 1100 sentences with 3 lexical and 1 syntactic simplification each | 3, 1 | Link |
SimpleGerman V2.0 (Battisti et al., 2020) | DE | Web | Comparable | Doc | 5461 simple, unaligned documents and 378 aligned (complex-simple) documents (6217 docs in total). The document-aligned portion has 17,121 complex sentences and 21,072 simple sentences. No statistics on the sentence-alignments are reported. | 1 | n/a (Scraping code) |
Simple German V3.0 (Toborek et al., 2022) | DE | Web | Comparable | Doc | 708 documents | 1 | n/a (Scraping code) |
PPDB (Ganitkevitch et al., 2013) | EN | Mixed | Comparable | Sent | 221 million sentences | 1 | Link |
Simple-PPDB (Pavlick and Callison-Burch, 2016) | EN | Mixed | Comparable | Sent | 4.5 million sentences | 1 | Link |
WebSplit (Narayan et al., 2017) | EN | Mixed | Comparable | Sent | 1 million sentences | 1 | Link |
EASIER (Alarcon et al., 2021) | ES | Mixed | Parallel | Lex | 5153 | 1-3 | Link |
RuAdapt (Dmitrieva and Tiedemann, 2021) | RU | Books | Parallel | Doc | 457 documents | | Link |
CEFR (Uchida et al., 2018) | EN | Education | Comparable | Lex | 414 | 2.4 (avg) | Link |
SIMPLEX-PB-3.0 (Hartmann and Aluisio, 2021) | PT (BR) | Education | Parallel | Lex | 1582 | 7.3 (avg) | Link |
PSAT (Taylor et al., 2022) | EN | Education | Parallel | Doc | 112 documents, with a total of 1883 aligned sentences | 1 | Link |
Vikidia (Lee and Vajjala, 2022) | EN / FR | Education | Parallel | Doc | 6165 (for each language) | 1 | Link |
CEFR-SP (Arase et al., 2022) | EN | Education | CEFR-level | Sent | 17000 sentences from Newsela-Auto (upon request), Wiki-Auto, and SCoRE dataset | 1 | Link |
CLEAR (Grabar and Cardon, 2019) | FR | Medical | Comparable | Doc | 16190 documents | 1 | Link |
myTomorrows-Wiki (van den Bercken et al., 2019) | EN | Medical | Comparable | Sent | 5415 (manually aligned); 3797 (automatically aligned) | 1 | Link |
MSD-Manuals (Cao et al., 2020) | EN | Medical | Comparable | Sent | 2551 linked paragraphs (professionals <-> laymen) with an average of 10.4 and 11.3 sentences each. From a random sample of 1000 paragraphs, medical experts extracted 930 aligned sentences with equivalent meaning. | 1 | Link |
PharmMT (Li et al., 2020) | EN | Medical | Parallel | Sent | 380,000K aligned sentences. | 1 | n/a |
AutoMeTS (Van et al., 2020) | EN | Medical | Comparable | Sent | 3300 aligned sentences | 1 | Link |
Cochrane (Devaraj et al., 2021) | EN | Medical | Comparable | Para | 4459 paragraph pairs (<1024 tokens) | 1 | Link |
CLARA-MeD (Campillos-Llanos et al., 2022) | ES | Medical | Comparable | Doc | 24298 comparable documents and 3800 parallel sentences | | Link |
BioLaySumm (Goldsack et al., 2022) | EN | Medical | Parallel | Doc | 32353 document-plain abstract pairs | 1 | Link |
CELLS (Guo et al., 2022) | EN | Medical | Comparable | Para | 63000 | 1 | Link |
PLABA (Attal et al., 2023) | EN | Medical | Parallel | Doc | 750 documents with 7643 sentence pairs | 1 | Link |
MultiCochrane (Joseph et al., 2023) | EN, ES, FR, FA | Medical | Comparable | Sent | Cross-lingual pairs; 5K pairs (clean, semi-automatically aligned), 100K pairs (noisy) | 1 | Link |
CLARA-MeD-simp-sent (Campillos-Llanos et al., 2024) | ES | Medical | Parallel | Sent | 1200 manually-simplified sentences | 1 | Link |
SimpMedLexSp (Campillos-Llanos et al., 2024) | ES | Medical | Parallel | Lex | >14000 pairs of medical terms and the corresponding simplified synonym/definition. | 1 | Link |
JASMINE (Horiguchi et al., 2024) | JA | Medical | Parallel | Sent | 1425 sentences (0/425/1000 train/val/test) | 1 | Link (available on request) |
MedLane (Luo et al., 2022) | EN | Clinical | Parallel | Sent | 12,801/1,015/1,016 train/valid/test sentences (avg. 20/24 tokens in source/target) | 1 | Link |
MTSamples (Moramarco et al., 2022) | EN | Clinical | Parallel | Sent | 1250 sentence pairs. | 1 | Link |
SimplePatho (Trienes et al., 2022) | DE | Clinical | Parallel | Doc | 851 documents | 1 | n/a |
FestAbility (Chamovitz and Abend, 2022) | EN | Talks | Parallel | Sent | 321 sentence pairs | 1 | Link |
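
If you prefer to filter the table in a script rather than in the browser, the rows can be parsed as plain pipe-delimited text. The following is a minimal sketch, not part of this repository: the helper names `parse_row` and `filter_rows` are made up for illustration, and it assumes each row has the eight columns described above and ends with a trailing pipe.

```python
from typing import Dict, List

# Column names as they appear in the table header (trailing dot of "Refs." dropped).
COLUMNS = ["Dataset", "Lang", "Domain", "Kind", "Level", "Instances", "Refs", "Link"]


def parse_row(line: str) -> Dict[str, str]:
    """Split one pipe-delimited table row into a column -> value mapping."""
    cells = [cell.strip() for cell in line.strip().split("|")]
    # A trailing pipe produces one empty cell at the end; drop it.
    if cells and cells[-1] == "":
        cells = cells[:-1]
    # Pad short rows so that every column name is present.
    cells += [""] * (len(COLUMNS) - len(cells))
    return dict(zip(COLUMNS, cells))


def filter_rows(rows: List[Dict[str, str]], **criteria: str) -> List[Dict[str, str]]:
    """Keep rows whose columns match all criteria (case-insensitive, exact match)."""
    return [
        row
        for row in rows
        if all(row.get(col, "").lower() == value.lower() for col, value in criteria.items())
    ]


# Three rows copied from the table above.
SAMPLE = """
TurkCorpus (Xu et al., 2016) | EN | Wikipedia | Parallel | Sent | 2359 sentences (2000 dev, 359 test) | 8 | Link |
Klexikon (Aumiller and Gertz, 2022) | DE | Wikipedia | Comparable | Doc | 2898 article pairs | 1 | Link |
PLABA (Attal et al., 2023) | EN | Medical | Parallel | Doc | 750 documents with 7643 sentence pairs | 1 | Link |
"""

if __name__ == "__main__":
    rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
    for row in filter_rows(rows, Kind="parallel", Level="doc"):
        print(row["Dataset"], "-", row["Instances"])
```

Matching is exact per cell, so multi-language entries such as `DE, EN, FI, FR, IT, SV` would need a substring check instead.
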
New entries can be added to `data.yml`. Afterwards, run `python render.py` and submit a PR with the changes.
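
To give a rough idea of what rendering involves, here is a small sketch that turns one entry into a table row. It is not the actual `render.py`: the field names below are assumptions chosen to mirror the table columns, and the real schema of `data.yml` may differ, so check existing entries before adding a new one.

```python
from typing import Dict

# Hypothetical field names, one per table column (the real data.yml keys may differ).
FIELDS = ["dataset", "lang", "domain", "kind", "level", "instances", "refs", "link"]


def render_row(entry: Dict[str, object]) -> str:
    """Render one entry as a pipe-delimited row in the style of the table above."""
    missing = [field for field in FIELDS if field not in entry]
    if missing:
        raise ValueError(f"entry is missing fields: {', '.join(missing)}")
    # Escape literal pipes so that they do not split a cell.
    cells = [str(entry[field]).replace("|", "\\|") for field in FIELDS]
    return " | ".join(cells) + " |"


# Example entry mirroring an existing row of the table.
entry = {
    "dataset": "FestAbility (Chamovitz and Abend, 2022)",
    "lang": "EN",
    "domain": "Talks",
    "kind": "Parallel",
    "level": "Sent",
    "instances": "321 sentence pairs",
    "refs": 1,
    "link": "Link",
}

if __name__ == "__main__":
    print(render_row(entry))
```

Raising an error on missing fields keeps incomplete entries from silently producing rows with shifted cells.
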
This list has greatly benefitted from the surveys of Alva-Manchego et al. (2020) and Štajner (2021), as well as notes by Laura Vásquez-Rodríguez. Thanks! Also thanks to @tollefj for adding the interactive table.