Documentation Version 3.0

Corpus pipeline

Extension of version 2.0 with nine additional texts:
Wie sich meniglich
Hortulus Sanitatis
Wund-Arztney
Thesaurus sanitatis
Mysterivm Sigillorvm
Die Einleitung zu der Kräuterkenntnis
Die Eigenschaften aller Heilpflanzen
Vorlesungen über Kräuterkunde
Flora der preussischen Rheinlande
You can find a complete list of all documents of this version in the download section.
Transcription and manual creation and correction of <dipl>, <clean> and <norm>
Manual creation and correction of structural annotations and addition of content annotations; technical processing was facilitated by the Excel macros DeleteSpaces (Readme) and SearchAndMerge (Readme).
Tokenize, part-of-speech tag and lemmatize with TreeTagger. Please note: quotation marks can cause errors and need to be masked. Furthermore, TreeTagger deletes empty lines. Fill those lines with an arbitrary tag (e.g. <9>) and use the option -sgml while tagging. Lines that contain tags are not tagged and can be deleted afterwards.
Semi-automatic correction of part of speech in <pos_cor> with DECCA
Automatic creation of <clean_auto> (Python script and Readme)
Automatic creation of <norm_auto> (Bollmann/Petran/Dipper 2011). The quality of the automatic normalisation, lemmatisation and part-of-speech annotation is not as good as in V2.0. Our dictionary and training corpus for use with the automatic script seem to be too small.
Export the merged corpus to ANNIS
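
The handling of empty lines described in the TreeTagger step can be sketched in Python as follows. This is only an illustration, not the project's own tooling: the function names and the sample tokens are assumptions, and the placeholder <9> is simply the example tag mentioned above. The sketch assumes one token per line and that TreeTagger is run with -sgml, so that placeholder lines pass through untagged. Masking of quotation marks is not covered here.

```python
PLACEHOLDER = "<9>"  # example tag from the step above; assumed to be unused in the text


def mask_empty_lines(lines, placeholder=PLACEHOLDER):
    """Replace empty lines with a placeholder tag so they survive tagging."""
    return [placeholder if line.strip() == "" else line for line in lines]


def unmask_tagged_lines(tagged_lines, placeholder=PLACEHOLDER, keep_empty=True):
    """Turn placeholder lines back into empty lines, or drop them entirely."""
    result = []
    for line in tagged_lines:
        if line.strip() == placeholder:
            if keep_empty:
                result.append("")  # restore the original empty line
            continue  # with keep_empty=False the line is simply dropped
        result.append(line)
    return result


# Made-up input: one token per line, with one empty line in the middle.
tokens = ["Von", "Wermut", "", "Cap.", "I."]
masked = mask_empty_lines(tokens)
# `masked` can now be written to a file and passed to TreeTagger with -sgml.
```

After tagging, unmask_tagged_lines(output_lines) restores the empty lines, which keeps the tagger output aligned row by row with the original token column; keep_empty=False deletes the placeholder lines instead.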

Corpus design

For purposes of comparability, we try to select texts from one scientific discipline which is ideally represented in a similar fashion throughout the early modern era. For the first RIDGES corpus we have selected the domain of herbology (Kräuterkunde). The timespan of interest was originally divided into 30-year periods. Texts vary somewhat in length since older texts are more difficult to annotate. Each document is typically between 4000 and 10000 tokens long. New texts were added in version 3, which shortens the periods.

Annotation layers

The RIDGES corpora follow a multi-layer design. Annotation layers can be roughly divided into five kinds:

Transcription/normalisation
Linguistic annotations
Structural annotations
Content annotation
Metadata

Transcription/normalisation

These annotations always apply to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected manually.

Annotation layer and value(s)	Description
dipl annotation value(s): Text	The diplomatic transcription of the word form as found in the manuscript.
clean annotation value(s): Text	Normalizations regarding graphical structures and special characters (e.g. "ſ" to "s"), but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form.
clean_auto annotation value(s): Text	Computer-produced normalizations regarding graphical structures and special characters (e.g. "ſ" to "s"), but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form.
norm annotation value(s): Text	A normalized word form based on Modern German orthography. Inflection is not modernized.
norm_auto annotation value(s): Text	A computer-produced normalized word form based on Modern German orthography, generated with the algorithm of Bollmann et al. (2011). Inflection is not modernized. This normalization has not been checked manually and contains some inconsistencies.
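
The two cleaning rules used as examples in the table (long s to round s, and joining a form split at a line break) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's clean_auto script: the function name clean_token is hypothetical, and the real cleaning step covers more special characters than these two rules.

```python
import re


def clean_token(dipl):
    """Apply two illustrative cleaning rules to a diplomatic word form."""
    cleaned = dipl.replace("ſ", "s")  # long s becomes round s
    # Remove a "=" that splits a word at a line break (wor=den -> worden).
    cleaned = re.sub(r"(?<=\w)=(?=\w)", "", cleaned)
    return cleaned


print(clean_token("wor=den"))    # worden
print(clean_token("beſchicht"))  # beschicht
```

Note that the result is only cleaned, not normalized: worden is not turned into modern geworden, which is the job of the norm layer.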

Linguistic annotations

Annotation layer and value(s)	Description
pos annotation value(s): STTS	Part-of-speech annotation using the STTS tagset for German.
pos_cor annotation value(s): STTS	Manually corrected part-of-speech annotation using the STTS tagset for German.
lemma annotation value(s): Text (type)	The normalized uninflected lexicon entry for each word form, using modern orthography (again, obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, analogous to geschehen).
hyperlemma annotation value(s): Text	In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli and ráß as beißend).
foreign annotation value(s): foreign	Non-German text.
foreign_trans annotation value(s): trans_to_german trans_from_german trans_from_german_extended trans_to_german_extended	Translation from and to German.
lang annotation value(s): ISO 3166-1 alpha-3	Description of the target language and of the source language of a translation.

Structural annotations

Annotation layer and value(s)	Description
lb annotation value(s): lb	Linebreak.
brace annotation value(s): brLeft brRight	Left or right brace spanning multiple lines of text.
brace_dir annotation value(s): left	Direction of the brace.
p annotation value(s): p	A paragraph.
p_n annotation value(s): Number or letter	The number of a numbered paragraph (this may also be a letter such as A).
p_rend annotation value(s): initial capital big bold type	Description of the rendering of the paragraph.
pb annotation value(s): pb	Pagebreak.
pb_n annotation value(s): Number or Letter	The number of the page (if marked explicitly).
pb_rend annotation value(s): in header: Von Haſelwurtz. Cap. III. in header: Vorred in header: Von Chamillen. Cap. VIII. in header: Vorrede. in header Vorred, signature ´A io`at bottom of page in header: Von Staubwurtz. Cap. II in header: Von Eibisch. Cap. V. in header: Vorred, signature 'A ' at bottom of page in header Vorred, signature'A iiij' at bottom of page in header: Von Wermůt. Cap. I. in header: Vorred, signature 'A iij' at bottom of page in header: Von Drachenwurtz. Cap. IIII. in header: Vorred, signature 'A ij' at bottom of page Ohl zu machen. Zum beſten zu Diſtilliren. Waſſer auß Kräutern vnd dergleichen Auffs beſt zu Diſtilliren. Auß Kräutern vnd dergleichen signature 'A ' at bottom of page Auffs beſt zu Diſtilliren. Waſſer auß Kräutern vnd dergleichen Am beſten zu Diſtilliren.	Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts).
pb_ana annotation value(s): page number should be 7	Analysis of the pagebreak (e.g. in case of apparently incorrect page numbers).
div1 - div5 annotation value(s): div	A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version.
div1_type - div5_type annotation value(s): appendix book chapter description form herb names name nature parts_preparation_and_usus places place preface postscript power reproduction season section species title time utensils	The type of section or subsection. Section can correspond to the entire "book", a "chapter" or smaller sections, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc.
div1_n - div5_n annotation value(s): Number	A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1).
unclear annotation value(s): unclear	Unreadable or otherwise unclear text.
atLeast annotation value(s): Number	Minimum presumed length of the unclear text in characters.
atMost annotation value(s): Number	Maximum presumed length of the unclear text in characters.
interpretation annotation value(s): Text	Suggested reading for unreadable or unclear text.
figure annotation value(s): figure table	A graphic embedded in the original document.
figure_rend annotation value(s): Drawing of two jars Drawing of three jars Drawing of two glasses Drawing of three glasses Drawing of two alembics Drawing of an instrument Drawing of an EIBISCH. Drawing of a STAUBWURTZ. Drawing of a KAMILLE. Drawing of a HÜHNERDARM.	Description of the rendering of the figure.
hi annotation value(s): hi	Highlighted area.
hi_font annotation value(s): antiqua fracture	Font of the highlighted area.
hi_rend annotation value(s): italics bold underlined red inicap letter-spacing:1em	Description of the rendering of the highlighted area.
head annotation value(s): head	A heading.
head_n annotation value(s): Number	The number of a heading.
head_rend annotation value(s): red and black red brown	Description of the rendering of the heading.
note annotation value(s): note margin	A note in the original document (e.g. footnotes, marginal notes).
ref annotation value(s): ref	Reference to a footnote.
ref_target annotation value(s): #fZ (Z is a number)	ID of the footnote being referred to.
ref_type annotation value(s): noteAnchor	Type of reference (e.g. a TEI "noteAnchor").
quote annotation value(s): quote	A quotation (in some documents only).
list annotation value(s): list	A list of items.
list_type annotation value(s): simple	The type of list used.
item annotation value(s): item	Item in a list.
xml_id annotation value(s): fZ (Z is a number)	ID given to a footnote.
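
The relation between ref_target and xml_id in the table above (a reference #fZ points to the footnote with the ID fZ) can be sketched in Python as follows. This is an illustration only: the function name resolve_ref and the dictionary of notes are hypothetical and not part of the corpus tooling.

```python
def resolve_ref(ref_target, notes_by_id):
    """Return the note whose xml_id matches a ref_target such as '#f2'."""
    if not ref_target.startswith("#"):
        raise ValueError(f"ref_target must start with '#': {ref_target!r}")
    # The xml_id is the ref_target without the leading '#'.
    return notes_by_id.get(ref_target[1:])


# Made-up footnotes, keyed by their xml_id values.
notes_by_id = {"f1": "text of footnote 1", "f2": "text of footnote 2"}

print(resolve_ref("#f2", notes_by_id))  # text of footnote 2
```

A target without a matching xml_id returns None, which makes dangling references easy to spot.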

Content annotations

These annotations were developed by our students to annotate spans of tokens with properties of special interest.

Annotation layer and value(s)	Description
definition annotation value(s): fig expl	A definition.
disease annotation value(s): di	Mention of a disease, complete phrase.
term annotation value(s): t h d	A technical term, the name of a herb or plant, or the name of a disease.
reader_ref annotation value(s): pron1pl pron2sg pron3sg pron2pl address	References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.
author_ref annotation value(s): pron1pl pron1sg pron2sg pron3sg author	References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.
plant annotation value(s): pl	The name of a plant.
property annotation value(s): appearance effect smell preparation taste cultivation	Description of properties like appearance, smell, etc.
name annotation value(s): name	A proper name (annotated only in some documents).
name_type annotation value(s): herb scholar plant person flower tree gardener publisher	The type of proper name (e.g. "person", "herb").
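
Because content annotations apply to spans of tokens, several layers can cover the same token. A minimal Python sketch of such span annotations is given below. The Span class, the helper functions and the sample sentence are assumptions made for illustration; only the layer names and values (author_ref, plant, term) come from the table above, and the sketch does not reflect the internal data model of PAULA or ANNIS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str  # annotation layer, e.g. "plant"
    value: str  # annotation value, e.g. "pl"
    start: int  # index of the first covered token
    end: int    # index of the last covered token (inclusive)


def covered_text(tokens, span):
    """Return the tokens covered by a span as one string."""
    return " ".join(tokens[span.start:span.end + 1])


def spans_at(spans, token_index):
    """Return all spans that cover the token at the given index."""
    return [s for s in spans if s.start <= token_index <= s.end]


# Made-up example sentence and annotations.
tokens = ["wir", "nennen", "es", "Wermut"]
spans = [
    Span("author_ref", "pron1pl", 0, 0),
    Span("plant", "pl", 3, 3),
    Span("term", "h", 3, 3),
]

print([s.layer for s in spans_at(spans, 3)])  # ['plant', 'term']
```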

Metadata

These annotations follow the TEI P5 guidelines.

Annotation layer and value(s)	Description
meta::author annotation value(s): author	Name of the author (if known).
meta::bibl annotation value(s): bibl	Full bibliographical entry for the source including the page numbers annotated in the corpus.
meta::date annotation value(s): date	Date of publication, usually just the year (e.g. "1722").
meta::publisher annotation value(s): publisher	Publisher of the document (if known).
meta::pubPlace annotation value(s): pubPlace	Publication place of the document.
meta::title annotation value(s): title	Title of the work the document was extracted from.

Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology