Documentation Version 4.0
Corpus pipeline
- Extension of version 3.0 with seven additional texts:
  - Gart der Gesundheit
  - Artzney Buchlein der kreutter
  - Contrafayt kreüterbuch
  - Paradeißgärtlein
  - Kräutterbuch des Edelen und hochgelehrten herren
  - Pflantz-Gart
  - Der Schweizerische Botanicus

  You can find a complete list of all documents of this version in the download section.
- Transcription and manual creation and correction of <dipl> and <norm>.
- Manual creation and correction of structural annotations and addition of content annotations; technical processing was facilitated by the Excel macros DeleteSpaces (Readme) and SearchAndMerge (Readme).
- Tokenize, part-of-speech tag and lemmatize with TreeTagger-Batch and TreeTagger. Please note: quotation marks can cause errors and need to be masked. Furthermore, TreeTagger deletes empty lines. Fill those lines with an arbitrary tag (e.g. <9>) and use the option -sgml while tagging. Lines that contain tags are not tagged and can be deleted afterwards.
- Semi-automatic correction of part of speech in <pos> with a modification of DECCA (Dickinson and Meurers 2003) and an additional script (Readme) (Dickinson and Meurers 2003, licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License).
- Semi-automatic creation of <clean> (Python-Script and Readme).
- Manual correction of <norm> and replacement of all pos annotations of unreadable tokens with "XY"; technical processing was facilitated by the MS Excel macro ReplacePosOfUnclear (Readme).
- Export the merged corpus to ANNIS.
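
The TreeTagger preparation described in the tagging step above can be sketched as follows. This is only a minimal illustration, not the RIDGES tooling: it assumes one token per line, the function names are made up for this example, and the mask string `QUOTE_MASK` is a hypothetical placeholder (the text does not say how quotation marks were masked). Only the `<9>` tag and the `-sgml` option come from the description above.

```python
PLACEHOLDER = "<9>"        # arbitrary tag from the text; skipped by TreeTagger when run with -sgml
QUOTE_MASK = "QUOTE_MASK"  # hypothetical mask string, an assumption of this sketch


def prepare_for_treetagger(lines):
    """Fill empty lines with a placeholder tag and mask quotation marks."""
    prepared = []
    for line in lines:
        token = line.rstrip("\n")
        if token.strip() == "":
            # Empty lines would be deleted by TreeTagger, so keep them as a tag.
            prepared.append(PLACEHOLDER)
        else:
            prepared.append(token.replace('"', QUOTE_MASK))
    return prepared


def restore_after_tagging(tagged_lines):
    """Remove the placeholder tags again and unmask quotation marks."""
    restored = []
    for line in tagged_lines:
        if line.strip() == PLACEHOLDER:
            # The tag line was not tagged; deleting the tag leaves the empty line.
            restored.append("")
        else:
            restored.append(line.replace(QUOTE_MASK, '"'))
    return restored
```

Whether the placeholder lines are turned back into empty lines (as here) or dropped entirely depends on how the output is merged with the other layers afterwards.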
Corpus design
For purposes of comparability, we try to select texts from one scientific discipline which is ideally represented in a similar fashion throughout the early modern era. For the first RIDGES corpus we have selected the domain of herbology (Kräuterkunde). The timespan of interest was originally divided into 30-year periods. Texts vary somewhat in length since older texts are more difficult to annotate. Each document is typically between 4000 and 10000 tokens long. New texts were added in versions 3 and 4, shortening the periods.
Annotation layers
The RIDGES corpora follow a multi-layer design. Annotation layers can be roughly divided into five kinds:
- Transcription/normalisation
- Linguistic annotations
- Structural annotations
- Content annotation
- Metadata
Transcription/normalisation
These annotations always apply to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected manually.
Annotation layer and value(s) | Description |
---|---|
dipl annotation value(s): | The diplomatic transcription of the word form as found in the manuscript. A Unicode table with special characters was used. |
clean annotation value(s): | Automatically produced normalization of graphical structures and special characters (e.g. "ſ" to "s"), without modernization to Modern German orthography. For example, a form with a line break like wor=den is cleaned to worden but not normalized to modern geworden, even where this would now be the appropriate form. The new documents in ridges-V4 put new demands on the clean tier with regard to vowels with macrons: the normalization of those characters could not be predicted, even when the context was taken into account. We therefore replace each token containing vowels with macrons by all potential forms of that token, separated by '|' (for example, 'auſzwēdig' becomes 'auszwemdig|auszwendig'). For a full overview of the replacements on the clean tier, see the Readme of the script that was used. |
norm annotation value(s): | A normalized word form based on Modern German orthography. Inflection is not normalized to modern forms. |
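
The clean-tier rules described above can be illustrated with a short sketch. It is not the project's Python script (see its Readme for the actual replacement table). It assumes that only "ſ", the line-break marker "=" and lower-case vowels with macrons need handling, and that a macron stands for a following "m" or "n", as in the 'auſzwēdig' example. The function name `clean_token` is hypothetical.

```python
import itertools
import unicodedata

# Assumption of this sketch: a vowel with a macron abbreviates vowel + "m" or vowel + "n".
MACRON_VOWELS = {"ā": "a", "ē": "e", "ī": "i", "ō": "o", "ū": "u"}


def clean_token(dipl):
    """Return the clean form of a diplomatic token, with alternatives joined by '|'."""
    # Compose base letter + combining macron into a single character first.
    token = unicodedata.normalize("NFC", dipl)
    # Replace long s and remove the line-break marker.
    token = token.replace("ſ", "s").replace("=", "")
    # Collect the possible replacements for every character.
    options = []
    for ch in token:
        if ch in MACRON_VOWELS:
            base = MACRON_VOWELS[ch]
            options.append([base + "m", base + "n"])
        else:
            options.append([ch])
    # Every combination of choices is one potential form of the token.
    return "|".join("".join(combo) for combo in itertools.product(*options))


print(clean_token("auſzwēdig"))  # auszwemdig|auszwendig
print(clean_token("wor=den"))    # worden
```

A token with several macrons yields every combination of expansions, so the number of alternatives doubles with each macron.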
Linguistic annotations
Annotation layer and value(s) | Description |
---|---|
pos annotation value(s): | Semi-automatically corrected (DECCA) part-of-speech annotation using the STTS tagset for German. |
lemma annotation value(s): | The normalized uninflected lexicon entry for each word form, using modern orthography (obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, analogous to geschehen). |
hyperlemma annotation value(s): | In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli or ráß as beißend). |
foreign annotation value(s): | Non-German text. |
foreign_trans annotation value(s): | Translation from and to German. |
lang annotation value(s): | Description of the target language and of the source language of a translation. |
Structural annotations
Annotation layer and value(s) | Description |
---|---|
lb annotation value(s): | Linebreak. |
brace annotation value(s): | Left or right parentheses marking text over multiple lines. |
brace_dir annotation value(s): | Direction of parentheses. |
p annotation value(s): | A paragraph. |
p_n annotation value(s): | The number of a numbered paragraph (this may also be a letter such as A). |
p_rend annotation value(s): | Description of the rendering of the paragraph. |
pb annotation value(s): | Pagebreak. |
pb_n annotation value(s): | The number of the page (if marked explicitly). |
pb_rend annotation value(s): | Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts). |
pb_ana annotation value(s): | Revision of the pagebreak (e.g. in case of apparently incorrect page numbers). |
div1 - div5 annotation value(s): | A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version. |
div1_type - div5_type annotation value(s): | The type of section or subsection. A section can correspond to the entire "book", a "chapter" or smaller sections, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc. |
div2_n - div3_n annotation value(s): | A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1). |
unclear annotation value(s): | Unreadable or otherwise unclear text. |
atLeast annotation value(s): | Minimum presumed length of unclear text in characters. |
atMost annotation value(s): | Maximum presumed length of unclear text in characters. |
interpretation annotation value(s): | Suggestions for unreadable or unclear text. |
figure annotation value(s): | A graphic embedded in the original document. |
figure_rend annotation value(s): | Description of the rendering of a figure. |
hi annotation value(s): | Highlighted area. |
hi_font annotation value(s): | Annotation of a change of font; the main font of the annotated text is set as the default value. |
hi_rend annotation value(s): | Description of the rendering of the highlighted area. |
head annotation value(s): | A heading. |
head_n annotation value(s): | The number of a heading. |
head_rend annotation value(s): | Description of the rendering of the heading. |
note annotation value(s): | A note in the original document (e.g. footnotes, margins). |
ref annotation value(s): | Reference to a footnote. |
ref_target annotation value(s): | ID of the footnote being referred to. |
ref_type annotation value(s): | Type of reference (e.g. a TEI "noteAnchor"). |
quote annotation value(s): | A quotation (in some documents only). |
list annotation value(s): | A list of items. |
list_type annotation value(s): | The type of list used. |
item annotation value(s): | Item in a list. |
xml_id annotation value(s): | ID given to a footnote. |
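
The pipeline sets the pos annotation of unreadable tokens to "XY". How this relates to the unclear layer can be sketched as follows. In the project this was done with the Excel macro ReplacePosOfUnclear; the Python below is only an illustrative stand-in. It assumes a hypothetical data model in which tokens are dictionaries with a "pos" key and unclear spans are pairs of token indices (end inclusive).

```python
def mark_unclear_pos(tokens, unclear_spans, tag="XY"):
    """Return a copy of tokens in which every token covered by an unclear span gets pos=tag.

    tokens: list of dicts with at least a "pos" key (hypothetical data model).
    unclear_spans: list of (start, end) token index pairs, end inclusive.
    """
    result = [dict(token) for token in tokens]  # copy, leave the input untouched
    for start, end in unclear_spans:
        for i in range(start, end + 1):
            result[i]["pos"] = tag
    return result


tokens = [
    {"dipl": "das", "pos": "ART"},
    {"dipl": "kr..t", "pos": "NN"},
    {"dipl": "iſt", "pos": "VAFIN"},
]
marked = mark_unclear_pos(tokens, [(1, 1)])
print([t["pos"] for t in marked])  # ['ART', 'XY', 'VAFIN']
```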
Content annotations
These annotations were developed by our students to annotate spans of tokens with properties of special interest.
Annotation layer and value(s) | Description |
---|---|
definition annotation value(s): | A definition. |
disease annotation value(s): | Mention of a disease, complete phrase. |
term annotation value(s): | A technical term, naming of a herb (h) or plant (p), naming of a disease (d). |
author_ref annotation value(s): | References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun. |
reader_ref annotation value(s): | References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun. |
plant annotation value(s): | Naming of a plant. |
property annotation value(s): | Description of properties like appearance, smell, etc. |
name annotation value(s): | A proper name (annotated only in some documents). |
name_type annotation value(s): | The type of proper name (e.g. "person", "herb"). |
Metadata
These annotations follow the TEI P5 guidelines.
Annotation layer and value(s) | Description |
---|---|
meta::author annotation value(s): | Name of the author (if known). |
meta::bibl annotation value(s): | Full bibliographical entry for the source including the page numbers annotated in the corpus. |
meta::date annotation value(s): | Date of publication, usually just the year (e.g. "1722"). |
meta::publisher annotation value(s): | Publisher of the document (if known). |
meta::pubPlace annotation value(s): | Publication place of the document. |
meta::title annotation value(s): | Title of the work the document was extracted from. |