multilingual-dh/multilingual-corpora

Multilingual Corpora

Text corpora in languages other than English. Curated with an eye towards digital humanities use.

This is not a directory but a moderately opinionated, potentially one-time list of resources that might be of use to digital humanities folks working with languages other than English. That said, if you have suggestions, you can make a pull request or fill out this form.

Chinese

  • China Historical GIS: "comprehensive series of datasets related to the administrative geography of Chinese History. The data layers include nationwide coverages (for the years 1820 and 1911), and time series (for the Dynastic period from 221 BCE to 1911 CE). The administrative features include Provinces, Circuits, Prefectures, and Counties as they changed over time."
  • Chinese Biographical Database Project (CBDB): Harvard project, freely accessible relational database with biographical information about approximately 491,000 individuals as of May 2021, primarily from the 7th through 19th centuries.
  • CNKI - 中国知网: well supported (and funded) and easily accessible, though with some censorship and missing articles.
  • ctext.org: online open-access digital library, with the full text of various Chinese texts of philosophical, historical, or linguistic interest from the pre-Qin era through to the Han dynasty and beyond.
  • Scripta Sinica - 漢籍全文資料庫: 1,349 new titles and 754,200,198 characters of materials pertaining to the traditional Chinese classics.
  • The Bookshelf: images + text for rare and ancient books.

French

  • Epistemological Letters: correspondence in English, German, and French about the field of physics between November 1973 and October 1984.

German

  • EpiDat: database of Jewish tombstones (includes Jewish tombstones in Hebrew as well). As a database for Jewish gravestone epigraphy, EpiDat is used to inventory, document, edit, and present epigraphic holdings. Inscriptions from Jewish cemeteries spanning nine centuries and six countries are currently available through chronological, spatial, and thematic approaches.
  • Epistemological Letters: correspondence in English, German, and French about the field of physics between November 1973 and October 1984.

Hebrew

  • EpiDat: database of Jewish tombstones (includes Jewish tombstones in German as well). As a database for Jewish gravestone epigraphy, EpiDat is used to inventory, document, edit, and present epigraphic holdings. Inscriptions from Jewish cemeteries spanning nine centuries and six countries are currently available through chronological, spatial, and thematic approaches.

Japanese

  • Aozora Search: digitized texts with PhiloLogic text-mining tools.
  • SAT Daizōkyō Text Database: full text of 85 volumes of the Taishō Shinshū Daizōkyō (大正新脩大藏經). The digitizing and encoding project also encodes new characters.
  • Digital Tale of Genji
  • Organization: East Asia TEI Special Interest Group run by Kiyonori Nagasaki & A. Charles Muller with a wiki and GitHub.

(Ottoman) Turkish
