Documentation Version 1.0
Corpus Pipeline
Corpora are collected in several stages:
- Obtain facsimile, usually from Google Books
- Correct OCR or transcribe text, marking up structure with TEI
- Tokenize, part-of-speech tag and lemmatize with TreeTagger
- Add corpus-specific manual annotations using MS Excel
- Export the merged corpus to persistent formats and the ANNIS search and visualization tool
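
As an illustration of the tagging stage, the following minimal sketch reads TreeTagger's tab-separated output (word form, tag, lemma on each line) into records that could later be merged with the other layers. The names `TaggedToken` and `parse_treetagger` are hypothetical and not part of the RIDGES tooling, and the sample lines are invented for the example.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    """One line of TreeTagger output (hypothetical record type)."""
    tok: str
    pos: str
    lemma: str


def parse_treetagger(output: str) -> list[TaggedToken]:
    """Parse tab-separated TreeTagger output: word form, tag, lemma."""
    tokens = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            # Blank lines and passed-through markup lines have no tag/lemma.
            continue
        tokens.append(TaggedToken(*parts))
    return tokens


# Invented sample output, not taken from the corpus.
sample = "Die\tART\tdie\nWurzel\tNN\tWurzel\nist\tVAFIN\tsein\n"
tagged = parse_treetagger(sample)
```

The automatic tags and lemmas produced this way are only a starting point, since they are corrected manually afterwards.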
Corpus Design
For purposes of comparability, we try to select texts from a single scientific discipline that is ideally represented in a similar fashion throughout the early modern era. For the first RIDGES corpus we have selected the domain of herbology (Kräuterkunde). The timespan of interest has been divided into 30-year periods, currently with a minimum sample of one text per period. Texts vary somewhat in length since older texts are more difficult to annotate. Each document is typically between 4000 and 10000 tokens long.
Annotation Layers
The RIDGES corpora follow a multi-layer design. Annotation layers can be roughly divided into four kinds:
Token Annotations
These annotations always apply to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected manually.
PAULA/relANNIS | TEI XML | Description |
---|---|---|
tok | (plain text) | The diplomatic transcription of the word form as found in the manuscript. Line breaks are marked as in the text, usually as '='. |
norm | N/A | A normalized word form based on Modern German orthography. For words not found in Modern German, a modern orthography is assumed (e.g. beſchicht is normalized as beschieht, by analogy with geschieht). |
lemma | N/A | The normalized uninflected lexicon entry for each word form, using modern orthography (again, obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, by analogy with geschehen). |
pos | N/A | Part-of-speech annotation using the STTS tagset for German. |
clean | N/A | Some texts may also have a partially normalized layer with consistent orthography from the relevant period, but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form. |
hyperlemma | N/A | In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli or ráß as beißend). |
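
To make the relationship between the token layers concrete, here is a minimal sketch of a token record with one field per layer, filled only with the examples given in the table. The `Token` class and the `strip_linebreak_marker` helper are hypothetical illustrations rather than part of the corpus tools, and the helper assumes that '=' is the only line-break marker, which may not hold for every text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """One token with its annotation layers (hypothetical structure)."""
    tok: str                          # diplomatic transcription
    norm: Optional[str] = None        # Modern German orthography
    lemma: Optional[str] = None       # modernized lexicon entry
    pos: Optional[str] = None         # STTS tag
    clean: Optional[str] = None       # period orthography, partially normalized
    hyperlemma: Optional[str] = None  # modern semantic equivalent


def strip_linebreak_marker(tok: str, marker: str = "=") -> str:
    """Remove a word-internal line-break marker (assumed to be '=')."""
    return tok.replace(marker, "")


# Values below come from the examples in the table; unlisted layers stay None.
beschicht = Token(tok="beſchicht", norm="beschieht", lemma="beschehen")
heumonat = Token(tok="Heümonat", hyperlemma="Juli")
worden = Token(tok="wor=den", clean=strip_linebreak_marker("wor=den"))
```

Note that the helper only covers the line-break part of cleaning; making period orthography consistent would require further rules that are not specified here.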
TEI Metadata
These annotations follow the TEI P5 guidelines.
PAULA/relANNIS | TEI XML | Description |
---|---|---|
meta::author | author | Name of the author (if known). |
meta::bibl | bibl | Full bibliographical entry for the source including the page numbers annotated in the corpus. |
meta::date | date | Date of publication, usually just the year (e.g. "1722"). |
meta::publisher | publisher | Publisher of the document (if known). |
meta::pubPlace | pubPlace | Publication place of the document. |
meta::title | title | Title of the work the document was extracted from. |
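
The correspondence between TEI header elements and `meta::` keys can be sketched as follows. This is only an illustration using Python's standard `xml.etree.ElementTree`; the function `extract_meta` and the sample header are hypothetical, every value in the sample except the year "1722" is a placeholder, and the sketch assumes each element holds plain text and that the first occurrence in the header is the relevant one.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
META_ELEMENTS = ["author", "bibl", "date", "publisher", "pubPlace", "title"]


def extract_meta(tei_xml: str) -> dict[str, str]:
    """Map the first matching teiHeader elements to meta:: keys."""
    root = ET.fromstring(tei_xml)
    header = root.find(f"{{{TEI}}}teiHeader")
    if header is None:
        return {}
    meta = {}
    for name in META_ELEMENTS:
        element = next(header.iter(f"{{{TEI}}}{name}"), None)
        # Skip elements that are absent or empty (e.g. unknown publisher).
        if element is not None and element.text and element.text.strip():
            meta[f"meta::{name}"] = element.text.strip()
    return meta


# Placeholder header for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Example Herbal</title>
        <author>Example Author</author>
      </titleStmt>
      <publicationStmt>
        <pubPlace>Example Town</pubPlace>
        <date>1722</date>
      </publicationStmt>
      <sourceDesc>
        <bibl>Example Author (1722): Example Herbal. Example Town, pp. 1-10.</bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

meta = extract_meta(SAMPLE_TEI)
```

Because the sample has no `publisher` element, the result simply lacks a `meta::publisher` key, mirroring the "if known" wording in the table.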
TEI Structural Annotations
These annotations follow the TEI P5 guidelines.
PAULA/relANNIS | TEI XML | Description |
---|---|---|
del | del | Area deleted in original text |
unclear | unclear | Unreadable or otherwise unclear text |
atLeast | unclear@atLeast | Minimum presumed length of unclear text in characters |
atMost | unclear@atMost | Maximum presumed length of unclear text in characters |
div1 - div5 | div | A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version |
div1_n - div5_n | div@n | A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1) |
div1_type - div5_type | div@type | The type of section or subsection. A section can correspond to the entire "book", a "chapter" or a smaller unit, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc. |
figure | figure | A graphic embedded in the original document. |
figure_rend | figure@rend | Description of the rendering of the figure. |
foreign | foreign | A foreign language area. |
foreign_rend | foreign@rend | Description of the rendering of the foreign language area (e.g. fonts like Antiqua, italics) |
lang | foreign@xml:lang | The language a foreign area is written in (three-letter ISO 639 language codes). |
head | head | A heading. |
head_n | head@n | The number of a heading. |
head_rend | head@rend | Description of the rendering of the heading. |
head_type | head@type | Type of heading used, e.g. "margin" for a marginal heading. |
hi | hi | Highlighted area. |
hi_rend | hi@rend | Description of the rendering of the highlighted area. |
lb | lb | Linebreak. |
list | list | A list of items. |
list_type | list@type | The type of list used. |
item | item | Item in a list. |
name | name | A proper name (annotated only in some documents). |
name_type | name@type | The type of proper name (e.g. "person", "herb"). |
note | note | A note in the original document (e.g. footnotes). |
p | p | A paragraph. |
p_n | p@n | The number of a numbered paragraph (this may also be a letter such as A). |
p_rend | p@rend | Description of the rendering of the paragraph. |
pb | pb | Pagebreak. |
pb_ana | pb@ana | Analysis of the pagebreak (e.g. in case of apparently incorrect page numbers). |
pb_n | pb@n | The number of the page (if marked explicitly). |
pb_rend | pb@rend | Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts). |
quote | quote | A quotation (in some documents only). |
reason | unclear@reason | Reason for annotation of the current area (usually describes form of unclear areas). |
ref | ref | Reference to a footnote. |
ref_target | ref@target | ID of the footnote being referred to. |
ref_type | ref@type | Type of reference (e.g. a TEI "noteAnchor"). |
w | w | A word annotated with additional attributes. |
xml_id | fZ (Z is a number) | ID given to a footnote. |
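
The naming convention in the table (nesting depth appended to `div`, attributes joined to the element name with an underscore, and a few attributes listed without a prefix) can be reproduced with a short sketch like the one below. It is not the project's actual converter: the function `annotation_names` is hypothetical, the sample markup and its attribute values are invented, the TEI namespace is omitted for brevity, and nesting deeper than five levels is not checked.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Attributes that the table lists without an element prefix.
BARE_ATTRIBUTES = {
    ("unclear", "atLeast"): "atLeast",
    ("unclear", "atMost"): "atMost",
    ("unclear", "reason"): "reason",
    ("foreign", XML_NS + "lang"): "lang",
}


def annotation_names(element, div_depth=0):
    """Yield (annotation name, value) pairs for an element and its children.

    Element spans are yielded with value None; attributes carry their value.
    """
    name = element.tag
    if element.tag == "div":
        div_depth += 1
        name = f"div{div_depth}"  # nesting depth becomes part of the name
    yield name, None
    for attr, value in element.attrib.items():
        default = f"{name}_{attr}"
        yield BARE_ATTRIBUTES.get((element.tag, attr), default), value
    for child in element:
        yield from annotation_names(child, div_depth)


# Invented sample markup for illustration only.
SAMPLE = """<div type="chapter" n="1">
  <head rend="bold">Example heading</head>
  <div type="place">
    <p>text <foreign xml:lang="lat" rend="antiqua">herba</foreign>
    <unclear reason="faded" atLeast="2" atMost="4">xx</unclear></p>
  </div>
</div>"""

root = ET.fromstring(SAMPLE)
names = dict(annotation_names(root))
```

In this sample the outer `div` yields `div1`, `div1_type` and `div1_n`, while the nested one yields `div2` and `div2_type`.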
Corpus-Specific Annotations
These annotations were developed by our students to annotate spans of tokens with properties of special interest.
PAULA/relANNIS | TEI XML | Description |
---|---|---|
definition | N/A | A definition. |
term | N/A | A technical term. |
property | N/A | Describes a reference to properties of a herb such as effect, smell etc. |
reader_ref | N/A | References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun. |
author_ref | N/A | References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun. |
uncertain | N/A | Annotator uncertain of lemma and/or normalization since no equivalent could be established. |
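
Unlike token annotations, these layers cover spans of one or more tokens. A minimal sketch of how such spans might be represented and queried is given below; the `Span` class, the helper functions and the example sentence are all hypothetical, and only the layer names and the value "pron1pl" are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """An annotation over a contiguous range of tokens (hypothetical)."""
    layer: str                   # e.g. "term", "author_ref"
    start: int                   # index of the first covered token
    end: int                     # index of the last covered token, inclusive
    value: Optional[str] = None  # e.g. "pron1pl" for author_ref


def covered(tokens: list[str], span: Span) -> list[str]:
    """Return the tokens a span covers."""
    return tokens[span.start:span.end + 1]


def spans_at(spans: list[Span], index: int) -> list[Span]:
    """Return all spans that include the token at the given index."""
    return [s for s in spans if s.start <= index <= s.end]


# Invented example sentence and annotations, not taken from the corpus.
tokens = ["wir", "nennen", "dieses", "Kraut", "Salbei"]
spans = [
    Span("author_ref", 0, 0, "pron1pl"),
    Span("definition", 0, 4),
    Span("term", 4, 4),
]
```

Because spans on different layers may overlap, a single token such as "Salbei" here can belong to both a `definition` and a `term` span.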