Corpora
Overview
Synchronic
German
-
DWDS Core Corpus
http://www.dwds.de/resource/kerncorpus/
Corpus of the Berlin-Brandenburgische Akademie der Wissenschaften, on which the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) is based.
-
Deutscher Wortschatz Project
http://wortschatz.uni-leipzig.de/
Deutscher Wortschatz Online. Contains 35 million sentences with 500 million words.
-
Hamburg Dependency Treebank
http://hdl.handle.net/11022/0000-0000-7FC7-2
The Hamburg Dependency Treebank is, to our knowledge, the largest dependency treebank available (as of its publication date). It consists of genuine dependency annotations, i.e., they have not been converted from phrase structures. The sentences were all sourced from the German news site heise.de, from articles published between 1996 and 2001. The mapping from sentences to articles and authors is retained, allowing, e.g., analysis of individual style. The creation of the treebank through manual annotation was largely interleaved with the creation of a standard for morphologically and syntactically annotating sentences as well as a constraint-based parser.
-
IDS-Corpora
http://www.ids-mannheim.de/kt/corpora.html
Corpora of the Institut für Deutsche Sprache. The world's largest collection of German-language text corpora for empirical linguistic research. Online search is possible with COSMAS.
-
LIMAS-Korpus
http://www.korpora.org/Limas/
Representative corpus of contemporary written German of the 1970s: 500 texts or fragments from various text genres, with a total of 1 million word forms. Can be searched online in its entirety.
-
Korpus Südtirol
http://www.korpus-suedtirol.it/index_EN
An initiative aiming at the collection, archiving and corpus-linguistic processing of South Tyrolean German texts.
English
-
British National Corpus (BNC)
http://www.natcorp.ox.ac.uk
The British National Corpus contains 100 million words of written and spoken language from various fields and aims to represent contemporary British English. Also available on CD.
-
American National Corpus (ANC)
http://americannationalcorpus.org/
The ANC aims to be the American equivalent of the BNC.
-
Loyola Computer-Mediated Communication Corpus
http://cmccorpus.cs.loyola.edu/
900 text samples of computer-mediated communication from Loyola College in Baltimore, Maryland (USA).
-
Michigan Corpus of Academic Spoken English: MiCASE
http://quod.lib.umich.edu/m/micase/
Freely available, online search function, flat annotation. Comprises 152 transcriptions (1,848,364 words).
-
International Corpus of English (ICE)
http://ice-corpora.net/ice/
Corpora of regional varieties of English. Each corpus consists of one million words of spoken and written English produced after 1989. Common corpus design and scheme for grammatical annotation. Many of the corpora are free for non-commercial academic research.
French
-
Corpus de Référence du Français parlé
http://sites.univ-provence.fr/delic/corpus/index.html
440,000 words, 134 recordings, over 36 hours of spoken language.
-
Un corpus d’entretiens spontanés
http://www.llas.ac.uk/resources/mb/80
95 conversations/speakers
Spanish
-
Arthus
http://www.bds.usc.es/corpus.html
Various text types. Contemporary. All scanned.
Italian
-
CORpus di Italiano Scritto (CORIS)
http://corpora.dslo.unibo.it/coris_eng.html
100 million words.
-
Banca dati dell'italiano parlato (BADIP)
http://languageserver.uni-graz.at/badip/badip/home.php
Various corpora of spoken Italian.
-
Corpus OVI dell'Italiano antico (corpus TLIO)
http://www.vocabolario.org/
21,817,929 words in 1978 texts
Catalan
-
Corpus del català contemporani
http://www.ub.edu/cccub/
Corpus of contemporary colloquial Catalan.
Swedish
-
The Bank of Swedish
http://spraakbanken.gu.se/
A linguistic reference databank at the University of Gothenburg.
Czech
-
Cesky Národní Korpus (CNK)
http://ucnk.ff.cuni.cz
Czech national corpus. Queries can be made online or via the GUI "Bonito".
Finnish
-
The Advanced Finnish Learners’ Corpus
http://www.hum.utu.fi/oppiaineet/suomi/en/research/Siitonen_Ivaska.html
Longitudinal essay corpus with texts written by students learning Finnish in MA courses.
Russian
-
Narusco
http://narusco.ru/
National Corpus of Written Russian
Turkish
-
Turkish National Corpus
http://www.tnc.org.tr/
Multilingual Corpora
-
OPUS - Open Source Parallel Corpus
http://opus.lingfil.uu.se/
OPUS comprises 30 million words in 60 languages. The corpus also includes OpenOffice documentation (OO), PHP manuals (PHP), and KDE manuals (KDEdoc) with KDE system messages.
-
Multext Project
http://www.lpl.univ-aix.fr/projects/multext/
Multilingual Text Tools and Corpora
-
Multext-East
http://nl.ijs.si/ME/
MULTEXT-East is a corpus of six languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovenian. English is the "hub language" of the project.
-
Bohemica.com
http://www.bohemica.com/index.php
Translation corpus annotated in Czech and English, containing 100,000 words (24 written documents of 1,000-4,000 words each). The corpus contains both fiction and non-fiction and is available for download.
-
RuN-Euro Corpus
http://www.nevmenandr.net/run/index.php#
Parallel corpus originally consisting of Norwegian and Russian texts and other European languages. The texts are aligned at the sentence level and have been tagged for grammatical information at the word level.
Diachronic
German
-
Bibliotheca Augustana
www.fh-augsburg.de/~harsch/augustana.html
litteraturae et artis collectio
-
Kali Korpus
www.kali.uni-hannover.de
The German Kali corpus (Kali: Korpusarbeit Linguistik, "corpus work in linguistics") is a partially annotated diachronic corpus designed for research and teaching. The project started at the end of 2003 for the German course at the University of Hannover under the supervision of Prof. Gabriele Diewald.
-
Text corpus of Thomas Gloning
http://www.uni-giessen.de/gloning/etexte.htm
Freely available.
-
Middle High German Corpus (Bochum)
http://www.ruhr-uni-bochum.de/wegera/archiv_1.htm
-
Middle High German Terms and Notions Data Bank (Mittelhochdeutsche Begriffsdatenbank, MHDBDB)
http://mhdbdb.sbg.ac.at
Contains 4.7 million words.
-
CEEC (Codices Electronici Ecclesiae Coloniensis)
http://www.ceec.uni-koeln.de
Digitised codices of the Archiepiscopal Diocesan and Cathedral Library in Cologne (DDB).
-
TITUS
http://titus.uni-frankfurt.de/indexd.htm
Indo-European thesaurus of text and language materials.
-
mediaevum
http://www.mediaevum.de
Links to historical texts.
English
-
Penn-Helsinki Parsed Corpus of Middle English
http://www.ling.upenn.edu/mideng
Syntactically annotated corpus of prose samples. Structures can be queried. CD-ROM.
-
Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
http://www-users.york.ac.uk/~sp20/corpus.html
Syntactically annotated prose samples. Structures can be queried. CD-ROM.
-
Lampeter Corpus of Early Modern English
http://khnt.hit.uib.no/icame/manuals/LAMPETER/LAMPHOME.HTM
Collection of texts from various fields, published between 1640 and 1740.
-
Corpus of Early English Correspondence (CEEC)
http://www.helsinki.fi/varieng/domains/CEEC.html
2.7 million words. Texts from 1417 to 1681.
-
The English language of the north-west in the late Modern English period: A Corpus of late 18c Prose
http://www.llc.manchester.ac.uk/subjects/lel/staff/david-denison/corpus-late-18th-century-prose/
Approx. 300,000 words. Letters written between 1761 and 1789.
-
Corpus of Early Modern Playtexts in English: KEMPE
http://corp.hum.sdu.dk
Can be queried online; freely available. Part-of-speech (POS) and syntactically annotated corpus of 8.9 million words.
Portuguese
-
O Corpus do Portugues
http://www.corpusdoportugues.org/
Corpus of 45 million words in 50,000 texts published between the 14th and 20th centuries. Lemmas and POS are annotated. A powerful web interface allows searching by text, register, dialect and time period. Statistical calculations based on the search results are also possible.
-
Tycho Brahe Parsed Corpus of Historical Portuguese
http://www.tycho.iel.unicamp.br/~tycho/corpus/index.html
Syntactically annotated. Downloadable.
French
-
Frantext
http://zeus.inalf.fr/frantext.htm
http://setis.library.usyd.edu.au/frantext (description)
Italian
-
Corpus OVI dell'Italiano antico (corpus TLIO)
http://www.vocabolario.org/
21,817,929 words in 1978 texts
Dutch
-
Taalbank
http://gtb.inl.nl/
Spanish
-
Corpus del espanol (RAE)
http://www.corpusdelespanol.org/
Date range: 1200-2000.
Further Resources
-
Technical Report "Eine vergleichende Analyse von historischen und diachronen digitalen Korpora"
http://www.deutschdiachrondigital.de/publikationen/TRHistorischeKorpora.pdf
Authors: Emil Kroymann, Sebastian Thiebes, Anke Lüdeling, Ulf Leser
-
Internet Grammar
http://www.tu-chemnitz.de/phil/english/InternetGrammar/shared/
German-English translation corpus. Texts from the last 15 years covering politics and tourism, as well as academic texts. 1 million words per language.
-
A Glossarial DataBase of Middle English
http://www.hti.umich.edu/english/gloss
-
Johnson's Dictionary
http://www.hti.umich.edu/english/johnson
Access available via password.
-
Dictionnaire du Moyen francais
http://atilf.atilf.fr/dmf.htm
-
Middle English
http://ets.umdl.umich.edu/m/mec/
Electronic version of the Middle English Dictionary.
-
The Perseus Digital Library
http://www.perseus.tufts.edu/
-
Celt Corpus of Electronic Texts
http://www.ucc.ie/celt/
Online resource for Irish history, literature and politics.
-
Medieval and Early Modern Data Bank (MEMDB)
http://www.scc.rutgers.edu/memdb/
-
The Thesaurus Linguae Graecae (TLG)
http://www.tlg.uci.edu/
-
The Early Modern English Dictionaries Database (EMEDD)
http://www.chass.utoronto.ca/~ian/emedd.html
-
The Patrologia Latina Database (PLD)
http://etext.virginia.edu/pld.html
Comprises the most influential works of Roman and Medieval theology, philosophy, history, and literature. Commercial.
-
A Dictionary of the Welsh Language
http://www.aber.ac.uk/~gpcwww/
-
Thesaurus Linguae Aethiopicae
http://www.uni-mainz.de/Organisationen/TLA/index.html
-
Latin and Greek texts
http://www.ulg.ac.be/cipl/bdlasla/
-
Wörterbuchnetz
http://www.woerterbuchnetz.de/
Network of dictionaries.
-
Electronic Text Corpus of Sumerian Literature (ETCSL)
http://etcsl.orinst.ox.ac.uk/
Transcriptions of clay tablets with over 350 literary works in Sumerian from Mesopotamia (present-day Iraq), dating from the late 3rd and early 2nd millennium BCE.