Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

Documentation Version 4.1

Corpus pipeline

  1. The data basis for Ridges Herbology Version 4.1 is Ridges Herbology Version 4.0 including all documented work steps.
  2. The following annotations were added: <komp>, <komp_orth>, <prot>, <attr_gen>, <strD>, <pos_klein>, <Verbposition>, <Nebensatztyp> and <KOUS_Semantik>. In addition, some of the existing annotations were corrected.
  3. The following annotations were added in Gart der Gesundheit: <personenname>, <werkname>, <krankheitsname>, <form_Krankheit>, <kraeutername>, <kraeutername_normiert>, <sprache_kraeutername>, <form_kraeutername>, <kraeuterzubereitung>, <form_zubereitung>, <nomen>, <bemerkung> and <form_nomen>.
  4. The merged corpus was exported to ANNIS. For detailed instructions on the conversion workflow, see the conversion manual.

 

Corpus design

For purposes of comparability, we try to select texts from one scientific discipline which is ideally represented in a similar fashion throughout the early modern era. For the first RIDGES corpus we have selected the domain of herbology (Kräuterkunde). The timespan of interest was originally divided into 30-year periods. Strings vary somewhat in length since older text is more difficult to annotate. Each document is typically between 4000 and 10000 tokens long. New texts were added in versions 3 and 4, which shortened the periods. See the year-token overview for Ridges V4.
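
The period scheme can be illustrated with a minimal sketch that assigns a publication year to a fixed-width period. The function name `period_of` and the start year 1500 used in the example are hypothetical and chosen only for illustration; they are not the boundaries actually used for RIDGES.

```python
def period_of(year: int, start: int, width: int = 30) -> tuple[int, int]:
    """Return (first_year, last_year) of the fixed-width period containing `year`."""
    if year < start:
        raise ValueError("year lies before the first period")
    index = (year - start) // width      # zero-based index of the period
    first = start + index * width
    return first, first + width - 1


# Hypothetical start year; 30-year width as in the original design.
print(period_of(1722, start=1500))
```

A narrower `width` models the shorter periods that result from adding texts in later versions.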

 

Annotation layers

The following list outlines the annotation layers and their values. For an overview of all metadata, see the LAUDATIO repository. For further information about the LAUDATIO project, visit the LAUDATIO project homepage.

The RIDGES corpora follow a multi-layer design. Annotation layers can be roughly divided into five kinds:

  1. Transcription/normalisation
  2. Linguistic annotations
  3. Structural annotations
  4. Content annotation
  5. Metadata

 

Transcription/normalisation

These annotations always apply to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected manually.

Annotation layer and value(s) Description
dipl
annotation value(s):
  • String
The diplomatic transcription of the word form as found in the manuscript. A Unicode table of special characters was used. Line breaks are marked as in the text, usually as 'U+2E17'.
clean
annotation value(s):
  • String
Computer-produced normalizations of graphical structures and special characters (e.g. "ſ" to "s"), without modernization to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form. The new documents in Ridges V4 put new demands on the clean tier regarding vowels with macrons. The normalization of those characters became unpredictable, even when the context was taken into account. We therefore decided to replace each token containing vowels with macrons by all potential forms of that token, separated by '|' (for example: 'auſzwēdig' becomes 'auszwemdig|auszwendig'). For a full overview of the replacements for the clean tier, see the Script-Readme of the script that was used.
norm
annotation value(s):
  • String
A normalized word form based on Modern German orthography. Inflection is not modernized. For words not found in Modern German, a modern orthography is assumed (e.g. 'beſchicht' is normalized as 'beschieht', by analogy with 'geschieht').
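
The macron handling described for the clean layer can be sketched as follows. This is not the script documented in the Script-Readme; it is a minimal illustration that assumes, based only on the example 'auſzwēdig', that a macron over a vowel stands for a following "m" or "n". The function name `clean_candidates` is hypothetical, and line-break handling (wor=den) is not covered.

```python
import itertools
import unicodedata


def clean_candidates(dipl: str) -> str:
    """Return all potential clean forms of a dipl token, joined by '|'."""
    # Replace long s, then split precomposed characters such as "ē"
    # into base letter + combining macron (U+0304).
    decomposed = unicodedata.normalize("NFD", dipl.replace("ſ", "s"))
    slots = []
    for ch in decomposed:
        if ch == "\u0304":
            slots.append(["m", "n"])  # assumption: macron abbreviates m or n
        else:
            slots.append([ch])
    # One candidate per combination of choices at the macron positions.
    forms = ("".join(combo) for combo in itertools.product(*slots))
    return "|".join(unicodedata.normalize("NFC", form) for form in forms)


print(clean_candidates("auſzwēdig"))
```

A token with two macrons yields four candidates, which shows why the number of alternatives on the clean layer grows quickly with the number of abbreviation marks.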

 

Linguistic annotations

Annotation layer and value(s) Description
pos
annotation value(s):
  • STTS
Semi-automatically corrected (DECCA) part-of-speech annotation using the STTS tagset for German.
lemma
annotation value(s):
  • String (type)
The normalized uninflected lexicon entry for each word form, using modern orthography (again, obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, by analogy with geschehen).
hyperlemma
annotation value(s):
  • String
In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli or ráß as beißend).
foreign
annotation value(s):
  • foreign
Non-German text.
foreign_trans
annotation value(s):
  • trans_to_german
  • trans_from_german
  • trans_from_german_extended
  • trans_to_german_extended
Translation from and to German.
lang
annotation value(s):
The language a foreign passage is written in (three-letter codes according to ISO 639-2).
komp
annotation value(s):
  • k
Compound (k).
komp_orth
annotation value(s):
  • zs
  • gtr
  • bs
  • lb1
  • lb2
Annotation of the specific spelling of the annotated compounds (komp): zs: written together, gtr: written separately, bs: hyphenated (one line), lb1: separated by line break (hyphenless), lb2: separated by line break (hyphenated).
prot
annotation value(s):
  • prot1
  • prot2
  • prot3
Assigns a prototype to each compound of the komp layer: prot1: reliably identifiable as a compound, prot2: quite likely a compound, and prot3: case of doubt (not assigned in komp).
attr_gen
annotation value(s):
  • gprä
  • gpost
Annotation of noun phrases with a prenominal or postnominal genitive attribute. gprä = prenominal genitive, gpost = postnominal genitive.
strD
annotation value(s):
  • strD
Coordination of two compounds (such as: gelb⸗ und Waſſerſucht).
personenname
annotation value(s):
  • String
NA
werkname
annotation value(s):
  • String
NA
krankheitsname
annotation value(s):
  • String
NA
form_Krankheit
annotation value(s):
  • String
NA
kraeutername
annotation value(s):
  • String
NA
kraeutername_normiert
annotation value(s):
  • String
NA
sprache_kraeutername
annotation value(s):
  • String
NA
form_kraeutername
annotation value(s):
  • String
NA
kraeuterzubereitung
annotation value(s):
  • String
NA
form_zubereitung
annotation value(s):
  • String
NA
nomen
annotation value(s):
  • String
NA
form_nomen
annotation value(s):
  • String
NA
bemerkung
annotation value(s):
  • String
NA
pos_klein
annotation value(s):
  • reduced STTS
Reduced STTS annotation. Some tags, such as the punctuation tags $., $, and $(, were grouped.
Verbposition
annotation value(s):
  • V2
  • Vletzt
  • V?
  • V1
Verb position in a subordinate clause introduced by a subordinating conjunction; analyzed as a token feature at occurrences of pos=KOUS. V2: verb-second position. Vletzt: verb-final position. V?: unclear verb position. V1: verb-first position.
Nebensatztyp
annotation value(s):
  • Adverbial
  • Attribut
  • Komplement
Type of subordinate clause introduced by a subordinating conjunction; analyzed as a token feature at occurrences of pos=KOUS. Adverbial: adverbial function. Attribut: attributive function. Komplement: complement function.
KOUS_Semantik
annotation value(s):
  • additiv
  • final
  • k.a.
  • kausal
  • konditional
  • konsekutiv
  • konzessiv
  • modal
  • temporal
Semantics of the subordinating conjunction; analyzed at occurrences of pos=KOUS. additiv: additive. final: final. k.a.: not analyzable, due to the complement status of the subordinate clause. kausal: causal. konditional: conditional. konsekutiv: consecutive. konzessiv: concessive. modal: modal. temporal: temporal.
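
Because Verbposition, Nebensatztyp and KOUS_Semantik are analyzed only at tokens with pos=KOUS and have closed value sets, their consistency can be checked mechanically. The following is a minimal sketch, assuming that each token is available as a dictionary mapping layer names to values; this representation and the function name `check_token` are assumptions for illustration, not part of the RIDGES tooling.

```python
# Value sets as listed in the table above.
KOUS_LAYERS = {
    "Verbposition": {"V1", "V2", "Vletzt", "V?"},
    "Nebensatztyp": {"Adverbial", "Attribut", "Komplement"},
    "KOUS_Semantik": {
        "additiv", "final", "k.a.", "kausal", "konditional",
        "konsekutiv", "konzessiv", "modal", "temporal",
    },
}


def check_token(token: dict[str, str]) -> list[str]:
    """Return a list of problems found in the KOUS-related layers of one token."""
    problems = []
    for layer, allowed in KOUS_LAYERS.items():
        value = token.get(layer)
        if value is None:
            continue  # layer not annotated on this token
        if token.get("pos") != "KOUS":
            problems.append(f"{layer} annotated on a token with pos={token.get('pos')!r}")
        if value not in allowed:
            problems.append(f"{layer} has unexpected value {value!r}")
    return problems


print(check_token({"pos": "NN", "Verbposition": "V3"}))
```

An empty list means that the token passes both checks.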

 

Structural annotations

Annotation layer and value(s) Description
lb
annotation value(s):
  • lb
Linebreak.
brace
annotation value(s):
  • brLeft
  • brRight
Left or right brace marking text that spans multiple lines.
brace_dir
annotation value(s):
  • left
Direction of reading in the text with brackets.
p
annotation value(s):
  • p
A paragraph.
p_n
annotation value(s):
  • Integer or letter
The number or letter of the paragraph (if marked explicitly).
p_rend
annotation value(s):
  • initialCapital
  • bigBoldType
Description of the rendering of the paragraph.
pb
annotation value(s):
  • pb
Pagebreak.
pb_n
annotation value(s):
  • Integer or Letter
The number of the page (if marked explicitly).
pb_rend
annotation value(s):
  • vonHaſelwurtz.Cap.III.
  • vonChamillen.Cap.VIII.
  • vorrede.
  • vorred
  • vonStaubwurtz.Cap.II.
  • vonEibisch.Cap.V.
  • vonWermůt.Cap.I.
  • vonDrachenwurtz.Cap.IIII.
  • ohlZuMachen.
  • zumBeſtenZuDiſtilliren.
  • waſſerAußKräuternVndDergleichen
  • auffsBeſtZuDiſtilliren.
  • außKräuternVndDergleichen
  • amBeſtenZuDiſtilliren.
Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts).
pb_ana
annotation value(s):
  • Integer
Revision of the pagebreak (e.g. in case of apparently incorrect page numbers).
div1 - div5
annotation value(s):
  • divINT
A subsection of the document. Nesting depth is made explicit by the number after div (INT) in the PAULA/relANNIS version.
div1_type - div5_type
annotation value(s):
  • appendix
  • book
  • chapter
  • description
  • form
  • herb
  • names
  • name
  • nature
  • parts_preparation_and_uses
  • places
  • place
  • preface
  • power
  • reproduction
  • season
  • section
  • species
  • title
  • time
  • utensils
The type of section or subsection. A section can correspond to the entire "book", a "chapter" or smaller sections, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc.
div2_n - div3_n
annotation value(s):
  • Integer
A numbered subsection (the 'n' annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1).
unclear
annotation value(s):
  • unclear
Unreadable or otherwise unclear text.
atLeast
annotation value(s):
  • Integer
Minimum presumed length of unclear text in characters.
atMost
annotation value(s):
  • Integer
Maximum presumed length of unclear text in characters.
interpretation
annotation value(s):
  • String
Suggestions for unreadable or unclear text.
figure
annotation value(s):
  • figure
  • table
A graphic or table embedded in the original document.
figure_rend
annotation value(s):
  • drawingOfTwoJars
  • drawingOfThreeJars
  • drawingOfTwoGlasses
  • drawingOfThreeGlasses
  • drawingOfTwoAlembics
  • drawingOfAnInstrument
  • drawingOfAnEibisch
  • drawingOfAStaubwurtz
  • drawingOfAKamille
  • drawingOfAHühnerdarm
  • drawingOfAHelmet
  • drawingOfAFilter
  • drawingOfAWaldenburgischerKolben
  • drawingOfAHaselwurtz
  • drawingOfADrachenwurtz
  • drawingOfAGauchheyl
  • drawingOfADill
  • drawingOfAHauswurz
Description of the rendering of a figure.
hi
annotation value(s):
  • hi
Highlighted area.
hi_font
annotation value(s):
  • antiqua
  • gothic
Annotation of change of font, the main font of the annotated text is set as default value.
hi_rend
annotation value(s):
  • antiqua
  • bold
  • end
  • iniCap
  • italics
  • letter-spacing:1em
  • red
Description of the rendering of the highlighted area.
head
annotation value(s):
  • head
A heading.
head_n
annotation value(s):
  • Integer
The number of a heading.
head_rend
annotation value(s):
  • brown
Description of the rendering of the heading.
note
annotation value(s):
  • note
  • margin
  • end
A note in the original document (e.g. footnotes, marginal notes).
ref
annotation value(s):
  • ref
Reference to a footnote.
ref_target
annotation value(s):
  • #fINT
ID of the footnote being referred to.
ref_type
annotation value(s):
  • noteAnchor
Type of reference (e.g. a TEI "noteAnchor").
quote
annotation value(s):
  • quote
A quotation (in some documents only).
list
annotation value(s):
  • list
A list of items.
list_type
annotation value(s):
  • simple
The type of list used.
item
annotation value(s):
  • item
Item in a list.
xml_id
annotation value(s):
  • fINT
ID given to a footnote.
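
The relationship between ref_target (form "#fINT") and xml_id (form "fINT") can be sketched as a simple lookup. This is a minimal illustration that assumes the two value lists have already been extracted from a document; the function name `resolve_refs` and its parameters are hypothetical.

```python
def resolve_refs(ref_targets: list[str], note_ids: set[str]) -> dict[str, bool]:
    """Map each ref_target to True if a footnote with the matching xml_id exists."""
    resolved = {}
    for target in ref_targets:
        # "#f12" points to the footnote whose xml_id is "f12".
        note_id = target.removeprefix("#")
        resolved[target] = note_id in note_ids
    return resolved


print(resolve_refs(["#f1", "#f3"], {"f1", "f2"}))
```

Targets mapped to False indicate references whose footnote is missing from the annotated excerpt.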

 

Content annotations

These annotations were developed by our students to annotate spans of tokens with properties of special interest.

Annotation layer and value(s) Description
definition
annotation value(s):
  • fig
  • expl
A definition of a term or a description of a picture.
disease
annotation value(s):
  • di
Mention of a disease, complete phrase.
term
annotation value(s):
  • t
  • h
  • d
  • j
A technical term, naming of a herb (h) or plant (p), naming of a disease (d).
author_ref
annotation value(s):
  • author
  • include
  • other
  • proin1sg
  • pron1pl
  • pron1sg
  • pron2sg
  • pron3sg
  • self
References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.
reader_ref
annotation value(s):
  • address
  • adress
  • pron1pl
  • pron2pl
  • pron2sg
  • pron3sg
  • reader
References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.
plant
annotation value(s):
  • pl
Naming of a plant.
property
annotation value(s):
  • appearance
  • cultivation
  • effect
  • preparation
  • smell
  • taste
Description of properties like appearance, smell, etc.
name
annotation value(s):
  • name
A proper name (annotated only in some documents).
name_type
annotation value(s):
  • flower
  • gardener
  • herb
  • person
  • plant
  • publisher
  • scholar
  • tree
The type of proper name (e.g. "person", "herb").

 

Metadata

These annotations follow the TEI P5 guidelines.

Annotation layer and value(s) Description
meta::author
annotation value(s):
  • String
Name of the author (if known).
meta::bibl
annotation value(s):
  • String
Full bibliographical entry for the source including the page numbers annotated in the corpus.
meta::date
annotation value(s):
  • Integer
Date of publication, usually just the year (e.g. "1722").
meta::publisher
annotation value(s):
  • String
Publisher of the document (if known).
meta::pubPlace
annotation value(s):
  • String
Publication place of the document.
meta::title
annotation value(s):
  • String
Title of the work the document was extracted from.