Documentation Version 1.0

Corpus Linguistics and Morphology | Documentation Version 1.0

Corpus Pipeline

Corpora are collected in several stages:

Obtain facsimile, usually from Google Books
Correct OCR or transcribe text, marking up structure with TEI
Tokenize, part-of-speech tag and lemmatize with TreeTagger
Add corpus-specific manual annotations using MS Excel
Export the merged corpus to persistent formats and the ANNIS search and visualization tool
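
As an illustration of the tagging stage, the following minimal sketch reads TreeTagger's tab-separated output (word form, part-of-speech tag, lemma, one token per line) into records. The function name parse_treetagger and the three sample lines are invented for this example and are not taken from the corpus or its tooling.

```python
def parse_treetagger(output):
    """Turn TreeTagger's three-column output into a list of token records."""
    rows = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            # Skip blank lines and any markup lines passed through unchanged.
            continue
        form, pos, lemma = fields
        rows.append({
            "form": form,
            "pos": pos,
            # TreeTagger writes "<unknown>" when it cannot find a lemma.
            "lemma": None if lemma == "<unknown>" else lemma,
        })
    return rows


# Invented sample output, for illustration only.
SAMPLE_OUTPUT = "Die\tART\tdie\nKräuter\tNN\tKraut\nwachsen\tVVFIN\twachsen\n"

for row in parse_treetagger(SAMPLE_OUTPUT):
    print(row["form"], row["pos"], row["lemma"])
```

Records of this kind would still need the manual correction mentioned under Token Annotations before being merged with the other layers.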

Corpus Design

For purposes of comparability, we try to select texts from a single scientific discipline that is ideally represented in a similar fashion throughout the early modern era. For the first RIDGES corpus we have selected the domain of herbology (Kräuterkunde). The timespan of interest has been divided into 30-year periods, currently with a minimum sample of one text per period. Texts vary somewhat in length, since older texts are more difficult to annotate. Each document is typically between 4,000 and 10,000 tokens long.
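
The division into 30-year periods can be illustrated with a short sketch. Only the period length comes from the description above; the function name period_for and the start year used in the usage line are hypothetical placeholders, since the exact boundaries of the timespan are not given here.

```python
def period_for(year, start_year, span=30):
    """Return the (first, last) year of the period that contains `year`.

    `start_year` is a hypothetical parameter: the first year of the
    timespan of interest, which this documentation does not specify.
    """
    if year < start_year:
        raise ValueError("year lies before the start of the timespan")
    index = (year - start_year) // span
    first = start_year + index * span
    return first, first + span - 1


# With an arbitrary start year of 1500, a text dated 1722 falls in 1710-1739.
print(period_for(1722, start_year=1500))
```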

Annotation Layers

The RIDGES corpora follow a multi-layer design. Annotation layers can be roughly divided into four kinds:

Token Annotations
TEI metadata
Structural TEI annotations
Corpus-specific annotations

Token Annotations

These annotations always apply to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected manually.

PAULA/relANNIS	TEI XML	Description
tok	(plain text)	The diplomatic transcription of the word form as found in the manuscript. Line breaks are marked as in the text, usually as '='.
norm	N/A	A normalized word form based on Modern German orthography. For words not found in Modern German, a modern orthography is assumed (e.g. beſchicht is normalized as beschieht, analogous to geschieht).
lemma	N/A	The normalized uninflected lexicon entry for each word form, using modern orthography (again, obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, analogous to geschehen).
pos	N/A	Part-of-speech annotation using the STTS tagset for German.
clean	N/A	Some texts may also have a partially normalized layer with consistent orthography from the relevant period, but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form.
hyperlemma	N/A	In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli or ráß as beißend).
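
To make the relationship between the layers concrete, the sketch below stores one token as a mapping from layer name to value, using the beſchicht example from the table, and shows the line-break part of the clean layer for wor=den. The helper join_line_break is a hypothetical name; it assumes that '=' occurs in a token only as a line-break marker and does not cover any other cleaning step.

```python
LINE_BREAK_MARKER = "="  # the usual line-break marker in the tok layer


def join_line_break(tok, marker=LINE_BREAK_MARKER):
    """Remove the line-break marker so that 'wor=den' becomes 'worden'.

    Assumes the marker never appears in a token for any other reason.
    """
    return tok.replace(marker, "")


# One token with three of its layers, values taken from the table above.
token = {"tok": "beſchicht", "norm": "beschieht", "lemma": "beschehen"}

print(join_line_break("wor=den"))
print(token["tok"], token["norm"], token["lemma"])
```

Note that the result for wor=den is worden, not the modern form geworden, which would belong to the norm layer.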

TEI Metadata

These annotations follow the TEI P5 guidelines.

PAULA/relANNIS	TEI XML	Description
meta::author	author	Name of the author (if known).
meta::bibl	bibl	Full bibliographical entry for the source including the page numbers annotated in the corpus.
meta::date	date	Date of publication, usually just the year (e.g. "1722").
meta::publisher	publisher	Publisher of the document (if known).
meta::pubPlace	pubPlace	Publication place of the document.
meta::title	title	Title of the work the document was extracted from.
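
The mapping in this table can be sketched with Python's standard xml.etree.ElementTree module. The sample header is a hypothetical, simplified TEI document with placeholder values (only the date "1722" echoes the example above), and extract_meta is an illustrative name, not part of the RIDGES tooling. The sketch takes the first matching element for each name, which may not be appropriate for headers that repeat an element.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # TEI P5 namespace
META_ELEMENTS = ["author", "bibl", "date", "publisher", "pubPlace", "title"]

# Hypothetical, simplified header with placeholder values.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Placeholder Title</title>
        <author>Placeholder Author</author>
      </titleStmt>
      <publicationStmt>
        <publisher>Placeholder Publisher</publisher>
        <pubPlace>Placeholder Place</pubPlace>
        <date>1722</date>
      </publicationStmt>
      <sourceDesc>
        <bibl>Placeholder bibliographic entry, pp. 1-10</bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_meta(tei_xml):
    """Map the first occurrence of each TEI element to its meta:: key."""
    root = ET.fromstring(tei_xml)
    meta = {}
    for name in META_ELEMENTS:
        element = root.find(f".//tei:{name}", TEI_NS)
        if element is not None and element.text:
            meta[f"meta::{name}"] = element.text.strip()
    return meta


for key, value in extract_meta(SAMPLE_TEI).items():
    print(key, "=", value)
```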

TEI Structural Annotations