Linguistic resources for several of the tools included in the Text Tonsorium

kuhumcst/texton-linguistic-resources

This repository contains linguistic resources for several of the tools included in the Text Tonsorium (https://github.com/kuhumcst/texton).

Sources, Licences, Credits

The resources can be traced back to many different sources. Some resources are straight copies of freely accessible data; others are data created by a training algorithm. The training data cannot be recreated from resources in the latter category.

The resources for the tokeniser (lists of abbreviations) are obtained from Wikipedia.

The list below is ordered according to language and tool.

Sources for multiple languages

CELEX

Cite: R.H. Baayen and R. Piepenbrock and L. Gulikers. (1995). CELEX. ELRA, 3.1, ISLRN 302-530-620-279-0.

Link: CELEX

MULTEXT-East free lexicons 4.0

Link: https://www.clarin.si/repository/xmlui/handle/11356/1041

Licence: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Cite: Erjavec, Tomaž; et al., 2010, MULTEXT-East free lexicons 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1041.

MULTEXT-East non-commercial lexicons 4.0

Link: https://www.clarin.si/repository/xmlui/handle/11356/1042

Licence: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Cite: Erjavec, Tomaž; et al., 2010, MULTEXT-East non-commercial lexicons 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1042.

Sources per language

af Afrikaans

lemmatiser

Link: https://github.com/UniversalDependencies/UD_Afrikaans-AfriBooms

bg Bulgarian

lemmatiser

MULTEXT-East free lexicons 4.0

cs Czech

lemmatiser

MULTEXT-East free lexicons 4.0

da Danish

tagger

Medieval: Middelaldertekster (Dansk Sprog- og Litteraturselskab), Clara-Kloster/Guldkorpus (University of Copenhagen)

Late modern, Contemporary: Parole corpus (Dansk Sprog- og Litteraturselskab)

lemmatiser

Medieval: Middelaldertekster (Dansk Sprog- og Litteraturselskab), Clara-Kloster/Guldkorpus (University of Copenhagen)

Late modern: Ordbog over det Danske Sprog (Dansk Sprog- og Litteraturselskab)

Contemporary: CST. (2004). STO: Sprogteknologisk orddatabase over det danske sprog. Center for Sprogteknologi, Department of Nordic Studies and Linguistics, University of Copenhagen. CLARIN-DK-UCPH Repository

de German

lemmatiser

CELEX

el Greek

lemmatiser

G. Petasis

Cite: Petasis, G., Karkaletsis, V., Farmakiotou, D., Androutsopoulos, I., and Spyropoulos, C. D. (2001). A Greek Morphological Lexicon and its Exploitation by a Greek Controlled Language Checker. In Proceedings of the 8th Panhellenic Conference on Informatics (PCI’01), pages 80–89, November 8–10.

Cite: Petasis, G., Karkaletsis, V., Farmakiotou, D., Androutsopoulos, I., and Spyropoulos, C. D. (2003). A Greek Morphological Lexicon and Its Exploitation by Natural Language Processing Applications. In Yannis Manolopoulos, et al., editors, Advances in Informatics - Post-proceedings of the 8th Panhellenic Conference in Informatics, volume 2563 of Lecture Notes in Computer Science, pages 401–419. Springer Berlin / Heidelberg.

Link: https://www.ellogon.org/petasis/

en English

lemmatiser

CELEX

Brill-tagger

Cite: Eric Brill. 1992. A simple rule-based part of speech tagger. In Proceedings of the third conference on Applied natural language processing (ANLC '92). Association for Computational Linguistics, Stroudsburg, PA, USA, 152-155. doi:10.3115/974499.974526

es Spanish

lemmatiser

lachica

Link: https://github.com/bumshmyak/lachica

Cite: Bum Shmyak. (2011). Spanish lemmatization.

et Estonian

lemmatiser

MULTEXT-East free lexicons 4.0

Cite: Alexander Tkachenko. (2015). Suffix Lemmatizer for Estonian. https://github.com/estnltk/suffix-lemmatizer.

fa Persian

lemmatiser

MULTEXT-East non-commercial lexicons 4.0

fr French

lemmatiser

Cite: Boris New and Christophe Pallier. (2005). Une Base de Données Lexicales Libre.

Link: www.lexique.org

Limsi

Link: https://perso.limsi.fr/anne/OLDlexique.txt

hr Croatian

lemmatiser

The SETimes.HR+ Croatian dependency treebank

Link: http://nlp.ffzg.hr/

Link: https://github.com/ffnlp/sethr

hu Hungarian

lemmatiser

MULTEXT-East free lexicons 4.0

is Icelandic

lemmatiser

the Icelandic Centre for Language Technology IFD

Link: http://malfong.is/?pg=ordtidnibok

Cite: Jörgen Pind and Friðrik Magnússon and Stefán Briem. (1991). IFD. the Icelandic Centre for Language Technology IFD.

it Italian

lemmatiser

Morph-it!

Link: https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it

Cite: Zanchetta, E. and Baroni, M. (2005). Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).

Cite: Marco Baroni and Eros Zanchetta. (2009). Morph-it! Department of Interpreting and Translation - Forlì Campus Corpora, Linguistics, Technology Research centre CoLiTec, 0.48.

Licence: Dual-licensed free software; you can redistribute it and/or modify it under the terms of the Creative Commons Attribution ShareAlike 2.0 License and the GNU Lesser General Public License.

la Latin

lemmatiser, tagger

James Artz and Calliopi Dourou and J. F. Gentile and Kenny Hickman and Alex Lessie and Viet Luong and Meg Luthin and Molly Miller and Robin Ngo and Skylar Neil and Tufts University LAT-181 class. (2008). Latin Dependency Treebank. Perseus Digital Library, 1.5.

Link: Perseus Digital Library

Ján Šipoš. (2015). Latin lemmata.

Link: Latin lemmata.

mk Macedonian

lemmatiser

MULTEXT-East non-commercial lexicons 4.0

nl Dutch

lemmatiser

CELEX

no Norwegian

lemmatiser

Cite: Koenraad de Smedt. (1999). Scarrie Lexicon. Meta Nord.

pl Polish

lemmatiser

Marcin Miłkowski and Dawid Weiss. (2016). Morfologik.

Link: Morfologik 1.5

pt Portuguese

lemmatiser

LABELLEX

Link: https://label.ist.utl.pt/en/labellex_en.php

Cite: Samuel Eleutério and Elisabete Ranchhod. (2014). LABEL-LEX MW. ELRA, ISLRN 502-837-497-805-9.

ro Romanian

lemmatiser

MULTEXT-East free lexicons 4.0

ru Russian

lemmatiser

Alexander Pankov and Arsen Gadjikurbanov and Sergey Bochenkov. (2011). libturglem.

Link: libturglem-0.2.30.

sk Slovak

lemmatiser

MULTEXT-East free lexicons 4.0

sl Slovene

lemmatiser

MULTEXT-East free lexicons 4.0

sr Serbian

lemmatiser

MULTEXT-East non-commercial lexicons 4.0

sv Swedish

lemmatiser

Cite: Språkrådet. (2007). Lexin. Institutet för språk och folkminnen and Kungliga Tekniska högskolan.

Licence: CC-BY (attribution)

uk Ukrainian

lemmatiser

MULTEXT-East free lexicons 4.0
