Documentation Version 5.0

Corpus pipeline

Seven texts were added:
BuchDerNatur_1482_vonMegenberg
NewKreuetterBuch_1539_Bock
NewKreueterbuch_1563_Handsch
Phythologia_1662_Becher
TheatrumBotanicum_1696_Verzascha
ViridariumReformatum_1719_Valentini
Kraeuterbuch_1914_Losch
You can find a complete list of all documents of this version in the download section.
Transcription and tokenization of the seven new texts with TreeTagger.
Normalization of the seven new texts (<norm> layer). Correction of the <norm> layer and <dipl> layer of all texts published before 1652.
Part-of-speech tagging and lemmatization with TreeTagger-Batch and TreeTagger for all documents (4.1 and 5.0). Please note: quotation marks can cause errors and need to be masked. Furthermore, TreeTagger deletes empty lines. Fill those lines with an arbitrary tag (e.g. <9>) and use the option -sgml while tagging. Lines that contain tags are not tagged and can be deleted afterwards. After the TreeTagger output has been merged with the MS Excel file, the MS Excel macro SearchAndMerge (Readme) reconstructs the segmentation.
Manual creation and correction of structural and content annotations in MS Excel.
Automatic replacement of specific special characters in <dipl> of all documents with NormalizeDipl (e.g. all macrons were replaced by tildes).
Automatic creation of <clean> for all documents (Python script and Readme; this version of the script works with Python 2.x only).
Manual correction of <norm> and replacement of all pos annotations of unreadable tokens with "XY" in MS Excel; this step was facilitated by the macro ReplacePosOfUnclear (Readme).
Conversion from MS Excel to ANNIS format and PAULA format via Pepper.
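
The masking step described for TreeTagger above could be sketched roughly as follows. This is only an illustration, not the project's actual tooling: it assumes one token per line as input and tab-separated tagger output with the token in the first column. The placeholder <9> comes from the description above; the mask strings in QUOTE_MASKS and both function names are hypothetical, since the documentation does not say how quotation marks are masked.

```python
PLACEHOLDER = "<9>"  # placeholder tag named in the pipeline description

# Hypothetical mask strings; the real masking scheme is not documented.
QUOTE_MASKS = {'"': "QUOTDBL", "'": "QUOTSGL"}


def prepare_for_tagger(tokens):
    """Mask quotation marks and fill empty lines before tagging."""
    prepared = []
    for token in tokens:
        token = token.strip()
        if not token:
            # Empty lines would be deleted by the tagger, so fill them.
            prepared.append(PLACEHOLDER)
        else:
            prepared.append(QUOTE_MASKS.get(token, token))
    return prepared


def clean_tagger_output(tagged_lines):
    """Drop placeholder lines and unmask quotation marks in the token column."""
    unmask = {mask: quote for quote, mask in QUOTE_MASKS.items()}
    cleaned = []
    for line in tagged_lines:
        line = line.rstrip("\n")
        if line == PLACEHOLDER:
            # With -sgml, tag lines pass through untagged and can be removed.
            continue
        columns = line.split("\t")
        columns[0] = unmask.get(columns[0], columns[0])
        cleaned.append("\t".join(columns))
    return cleaned


print(prepare_for_tagger(["Der", "", '"', "Safft"]))
# ['Der', '<9>', 'QUOTDBL', 'Safft']
```

The prepared token list would be written to a file, tagged with the -sgml option, and the output passed through clean_tagger_output before it is merged with the MS Excel file.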

Corpus design

In order to study the development of the scientific language throughout the period of interest, we require a subject domain that is sufficiently well represented in all subperiods. That is why we have selected the domain of herbology (Kräuterkunde). Texts vary somewhat in length since older text is more difficult to annotate.

Annotation layers

The RIDGES corpus is designed as a multi-layer architecture. Annotation layers can be roughly divided into five kinds:

Transcription/normalisation
Linguistic annotations
Structural annotations
Content annotations
Metadata

Transcription/normalisation

Annotation layer and value(s)	Description
dipl independent segmentation annotation value(s): Text	The diplomatic transcription of the word form as found in the manuscript. A Unicode table of special characters is used.
clean independent segmentation annotation value(s): Text	Automatic normalization by a Python script, concerning graphical structures and special characters only (e.g. "ſ" to "s"). For example, a form with a line break like wor=den is cleaned to worden but not normalized to modern geworden, even where this would now be the appropriate form. Note for words that contain a line break: if the second part begins with a capital letter, this letter is changed to lower case in the clean layer (e.g. "Gelb- Sucht" to "Gelbsucht"). If all letters of the second part are capitals, they remain unchanged (e.g. "MON- TANUM" to "MONTANUM"). Dipl units containing vowels with macrons are replaced by each potential form of that token, separated by '\|' (for example: 'auſzwēdig' to 'auszwemdig\|auszwendig'). For a full overview of the replacements for the clean-tier see the Readme.
norm independent segmentation annotation value(s): Text	In this layer the segmentation, graphemics, inflection forms and lexemes are normalized. Graphemics: orthographic normalization according to Duden (e.g. kreutter -> Kräuter); phonology: please note the sound changes of the Early New High German period, such as diphthongization, monophthongization, syncope, apocope, etc. (e.g. lehret -> lehrt); morphology: in die Nasen -> in die Nase; lexicology: extinct lexical material is normalized according to modern orthography and, where applicable, described in the layer "erlaeuterung" (e.g. Vergeſz -> Vergess); word formation: extinct morphemes are normalized, if possible, according to modern orthography (e.g. halben -> halber or stachelecht -> stachelig). Currently, case has been normalized in only some documents.
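
The rules given for the clean layer can be illustrated with a minimal sketch. It is not the project's Python 2 script: the function names are hypothetical, only the single replacement "ſ" to "s" from the description is implemented (the full replacement table is in the Readme), and the input is assumed to be one dipl unit. The sketch also assumes that the separator written as '\|' above is an escaped pipe and emits a plain '|', and it expands every macron without checking that the base letter is a vowel.

```python
import itertools
import re
import unicodedata

MACRON = "\u0304"  # combining macron
SEPARATOR = "|"    # assumption: '\|' in the table is an escaped pipe

LINE_BREAK = re.compile(r"(.+?)[=-]\s*(.+)")


def join_line_break(form):
    """Join a unit split by a line break, e.g. 'wor=den' or 'Gelb- Sucht'."""
    match = LINE_BREAK.fullmatch(form)
    if match is None:
        return form
    first, second = match.groups()
    if not second.isupper():
        # A capitalized second part is lowered; an all-caps part is kept.
        second = second[0].lower() + second[1:]
    return first + second


def expand_macrons(form):
    """Replace each macron by 'm' and by 'n', listing every combination."""
    parts = unicodedata.normalize("NFD", form).split(MACRON)
    if len(parts) == 1:
        return form
    variants = []
    for fillers in itertools.product("mn", repeat=len(parts) - 1):
        pieces = [parts[0]]
        for filler, part in zip(fillers, parts[1:]):
            pieces.append(filler + part)
        variants.append(unicodedata.normalize("NFC", "".join(pieces)))
    return SEPARATOR.join(variants)


def clean(dipl):
    """Apply the three illustrated rules to a single dipl unit."""
    form = join_line_break(dipl)
    form = form.replace("ſ", "s")
    return expand_macrons(form)


print(clean("Gelb- Sucht"))  # Gelbsucht
print(clean("auſzwēdig"))    # auszwemdig|auszwendig
```

Note that the sketch treats any "=" or "-" inside a unit as a line-break marker, so it should only be applied to units already known to contain a line break.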

Linguistic annotations

Annotation layer and value(s)	Description
pos segmentation based on 'norm' annotation value(s): STTS	Automatic part-of-speech annotation using the STTS tagset for German.
lemma segmentation based on 'norm' annotation value(s): String	Automatic lemmatization by TreeTagger.
erlaeuterung segmentation based on 'dipl' annotation value(s): String	This is an unsystematic layer. In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent (e.g. Heümonat -> Juli) or an explanation can be given. This layer was originally called "hyperlemma" and was renamed as "erlaeuterung" in ridge-v5.
foreign segmentation based on 'dipl' annotation value(s): foreign	Non-German text.
foreign_trans segmentation based on 'dipl' annotation value(s): trans_to_german trans_from_german trans_from_german_extended trans_to_german_extended	Translation from and to German.
lang segmentation based on 'dipl' annotation value(s): ISO 639-2	The language in which a foreign passage is written (three-letter codes according to ISO 639-2).
komp segmentation based on 'dipl' annotation value(s): k	Nominal compound (with nominal head).
komp_orth segmentation based on 'dipl' annotation value(s): zs gtr bs lb1 lb2	Annotation of the specific spelling of the annotated compounds (komp): zs: written together, gtr: written separately, bs: hyphenated (one line), lb1: separated by line break (hyphenless), lb2: separated by line break (hyphenated).
prot segmentation based on 'dipl' annotation value(s): prot1 prot2 prot3	Assigns a prototype to each compound of the komp layer: prot1 = reliably identifiable as a compound; prot2 = quite likely a compound; prot3 = case of doubt (not assigned in (komp)).
attr_gen segmentation based on 'dipl' annotation value(s): gprä gpost	Annotation of nominal phrases with a prenominal or postnominal genitive attribute. gprä = prenominal genitive, gpost = postnominal genitive.
strD segmentation based on 'dipl' annotation value(s): strD	Coordination of compounds and parts of compounds (truncated morphemes and compounds such as gelb⸗ und Waſſerſucht).
personenname segmentation based on 'dipl' annotation value(s): String	Every name of a person to whom the author of a particular document refers is annotated. For every instance the name of the person is given in the nominative form.
werkname segmentation based on 'dipl' annotation value(s): String	Every book title to which the author of a particular document refers is annotated. For every instance the title of the book is given in the nominative form.
form_krankheit segmentation based on 'dipl' annotation value(s): String	NA
problem segmentation based on 'dipl' annotation value(s): String	NA
kraeutername_normiert segmentation based on 'dipl' annotation value(s): String	In this layer a systematic herbal name is given. Sometimes it is ambiguous; in this case you can find additional information in the "erlaeuterung" or in the "bemerkung_lexik" layer.
kraeuterzubereitung segmentation based on 'dipl' annotation value(s): String	This layer was made for the identification of preparations of herbs. Only those instances are included which are NPs or modifiers with a herb as head. The name is given in the nominative singular form and normalized according to modern orthography. Whitespaces are replaced by underscores. Compounds are always written together, regardless of their compound spelling in the facsimile. Everything is written in lower case letters (e.g. safft des weremuts -> saft_des_wermuts).
form_zubereitung segmentation based on 'dipl' annotation value(s): kompNN kompNNgetrennt phraseVON phraseGEN	In this layer preparations with herbs are described syntactically or morphologically. kompNN = NN compounds which are written together or hyphenated; kompNNgetrennt = nouns following each other which could be a compound (written separately); phraseVON = preparations with herbs containing a von-PP (e.g. safft von weremut); phraseGEN = preparations with herbs containing a genitive attribute (e.g. safft des weremuts).
nomen_nominativ segmentation based on 'dipl' annotation value(s): String	In this layer all nouns which are included in the text are given, namely in the first occurring spelling in the nominative singular form. If the first occurring form of "Saft" is safft, all further occurrences of "Saft" are given as safft. Everything is written in lower case letters. The purpose of this layer is to investigate the variation of noun spelling within one text.
form_nomen segmentation based on 'dipl' annotation value(s): simplex kompNN kompNNgetrennt kompAN kompVN kompPN derivat nom gri lat lex name	In this layer all nouns are morphologically annotated. kompNN = NN compounds, written together or hyphenated; kompNNgetrennt = all sequences of nouns which could be compounds but are written separately; kompAN = AN compounds; kompVN = VN compounds; kompPN = PN compounds; derivat = derivatives; nom = all implicit nominalizations (conversions, formations with ablaut, syntactic nominalizations, e.g. (die) sucht, (das) kalt); gri/lat/ara = clearly Greek/Latin/Arabic nouns (foreign material that is already integrated into German is treated like native words); lex = lexicalized herb names which were originally morphologically complex, e.g. Beifuß, Wermut, Stabwurz, and tausend guldin for "Tausendguldenkraut".
bemerkung_lexik segmentation based on 'dipl' annotation value(s): String	This is an unsystematic layer for comments and questions about lexis.
satztyp segmentation based on 'dipl' annotation value(s): rs padv rsx rsdem padvpart dem part	Annotation of sentence types. No hierarchical annotation. For nested sentences only the highest clause is annotated. In the layer "bemerkungen_syntax" you can find notes about the nestings. rs = clear relative clauses, both "w-relative clauses" and "d-relative clauses"; padv = clauses which are introduced by a pronominal adverb; rsx = relative clauses without main clause (this often occurs in headlines); rsdem = ambiguous cases: relative clause or demonstrative clause; padvpart = clauses with pronominal adverb and participle; dem = demonstrative clauses (all clauses with a demonstrative pronoun as subject); part = participles that behave like relative clauses.
position_im_satz segmentation based on 'dipl' annotation value(s): vor nach int	Position of the relative clause within the main clause. vor = preposed; nach = postposed; int = embedded.
position_zur_bezugskategorie segmentation based on 'dipl' annotation value(s): adja-v adja-n dist na	Position of the relative clause relative to the reference category. adja-v = adjacently preposed; adja-n = adjacently postposed; dist = distant; na = not applicable.
form_bezugskategorie segmentation based on 'dipl' annotation value(s): np d-pron p-pron satz null	Form of the reference category of the relative clause. np = non-pronominal NP; d-pron = der, die, das, dieser, etc.; p-pron = personal pronoun; satz = sentences (for continuative relative clauses which refer to the state of affairs in the whole reference clause); null = for free relative or asyndetic relative clauses with a covert correlate in the main clause.
verbstellung segmentation based on 'dipl' annotation value(s): v2 ve venf amb	Verb position within the relative clause. v2 = verb second; ve = verb end; venf = verb end with occupied postfield; amb = ambiguous: v2 or ve (e.g. for intransitive verbs).
form_des_relativpronomens segmentation based on 'dipl' annotation value(s): d-pron w-pron w-phras	Form of the category which introduces the relative clause. d-pron = all d-pronouns; w-pron = wer, welch-; w-phras = e.g. welch frau.
modifikation_bezugskategorie segmentation based on 'dipl' annotation value(s): relsatz	Annotated on pronouns, NPs or clauses, if modified by a relative clause. Not applicable for free relative clauses. The whole reference category is annotated as span.
pos_klein segmentation based on 'norm' annotation value(s): reduced STTS	Reduced STTS annotation. Some tags, such as the punctuation tags $., $, and $(, were grouped.
Verbposition segmentation based on 'norm' annotation value(s): V2 Vletzt V? V1	Verb position in a subordinate clause introduced by a subordinating conjunction; analyzed as a token feature at occurrences of pos=KOUS. V2: verb-second position; Vletzt: verb-final position; V?: unclear verb position; V1: verb-first position.
Nebensatztyp segmentation based on 'norm' annotation value(s): Adverbial Attribut Komplement	Type of subordinate clause introduced by a subordinating conjunction; analyzed as a token feature at occurrences of pos=KOUS. Adverbial: adverbial function; Attribut: attributive function; Komplement: complement function.
KOUS_Semantik segmentation based on 'norm' annotation value(s): additiv final k.a. kausal konditional konsekutiv konzessiv modal temporal	Semantics of the subordinating conjunction; analyzed at occurrences of pos=KOUS. additiv: additive; final: final; k.a.: not analyzable because the subordinate clause is a complement; kausal: causal; konditional: conditional; konsekutiv: consecutive; konzessiv: concessive; modal: modal; temporal: temporal.
diachronie segmentation based on 'dipl' annotation value(s): String	In this layer linguistic phenomena and word forms are unsystematically associated with a linguistic field (syntax, morphosyntax, phonology, graphemics, morphology). It gives good examples for didactic purposes.
dialekt segmentation based on 'dipl' annotation value(s): String	In this unsystematic layer, word forms which give information about the diatopic quality of the text are annotated and associated with a specific dialect area as accurately as possible.
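
The rule behind the nomen_nominativ layer (every occurrence of a noun receives the first spelling attested for it, in lower case) can be sketched as below. The sketch only illustrates the rule and makes no claim about how the layer was actually produced. The input format, pairs of a noun identity and its nominative singular spelling in text order, and the function name are assumptions.

```python
def first_spellings(nouns):
    """Return one nomen_nominativ value per noun occurrence.

    nouns: iterable of (noun, spelling) pairs in text order, where spelling
    is the nominative singular form of that occurrence (hypothetical input).
    """
    first_seen = {}
    layer = []
    for noun, spelling in nouns:
        # The first attested spelling of a noun is reused for all later ones.
        value = first_seen.setdefault(noun, spelling.lower())
        layer.append(value)
    return layer


occurrences = [
    ("Saft", "Safft"),
    ("Wermut", "weremut"),
    ("Saft", "Saft"),
    ("Saft", "safft"),
]
print(first_spellings(occurrences))
# ['safft', 'weremut', 'safft', 'safft']
```

Comparing these values with the dipl forms of the same tokens then shows where the spelling of a noun varies within one text.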

Structural annotations

Annotation layer and value(s)	Description
lb segmentation based on 'dipl' annotation value(s): lb	Linebreak.
brace segmentation based on 'dipl' annotation value(s): brLeft brRight	Left or right parentheses marking text over multiple lines.
brace_dir segmentation based on 'dipl' annotation value(s): left	Direction of reading in the text with brackets.
p segmentation based on 'dipl' annotation value(s): p	A paragraph.
p_n segmentation based on 'dipl' annotation value(s): Integer or letter	The number or letter of the paragraph (if marked explicitly).
p_rend segmentation based on 'dipl' annotation value(s): initialCapital bigBoldType	Description of the rendering of the paragraph.
pb segmentation based on 'dipl' annotation value(s): pb	Pagebreak.
pb_n segmentation based on 'dipl' annotation value(s): Integer or Letter	The number of the page (if marked explicitly).
pb_rend segmentation based on 'dipl' annotation value(s): vonHaſelwurtz.Cap.III. vonChamillen.Cap.VIII. vorrede. vorred vonStaubwurtz.Cap.II. vonEibisch.Cap.V. vonWermůt.Cap.I. vonDrachenwurtz.Cap.IIII. ohlZuMachen. zumBeſtenZuDiſtilliren. waſſerAußKräuternVndDergleichen auffsBeſtZuDiſtilliren. außKräuternVndDergleichen waſſerAußKräuternVndDergleichen amBeſtenZuDiſtilliren.	Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts).
pb_ana segmentation based on 'dipl' annotation value(s): Integer	Revision of the pagebreak (e.g. in case of apparently incorrect page numbers).
div1 - div5 segmentation based on 'dipl' annotation value(s): divINT	A subsection of the document. Nesting depth is made explicit by the number after div (INT) in the PAULA/relANNIS version.
div1_type - div5_type segmentation based on 'dipl' annotation value(s): appendix book chapter description form herb names name nature parts_preparation_and_uses places place preface power reproduction season section species title time utensils	The type of section or subsection. A section can correspond to the entire "book", a "chapter" or smaller sections, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc.
div2_n - div3_n segmentation based on 'dipl' annotation value(s): Integer	A numbered subsection (the 'n' annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1).
unclear segmentation based on 'dipl' annotation value(s): unclear	Unreadable or otherwise unclear text.
atLeast segmentation based on 'dipl' annotation value(s): Integer	Minimum presumed length of unclear text in characters.
atMost segmentation based on 'dipl' annotation value(s): Integer	Maximum presumed length of unclear text in characters.
interpretation segmentation based on 'dipl' annotation value(s): String	Suggestions for unreadable or unclear text.
figure segmentation based on 'dipl' annotation value(s): figure table	A graphic or table embedded in the original document.
figure_rend segmentation based on 'dipl' annotation value(s): drawingOfTwoJars drawingOfThreeJars drawingOfTwoGlasses drawingOfThreeGlasses drawingOfTwoAlembics drawingOfAnInstrument drawingOfAnEibisch drawingOfAStaubwurtz drawingOfAKamille drawingOfAHühnerdarm drawingOfAHelmet drawingOfAFilter drawingOfAWaldenburgischerKolben drawingOfAHaselwurtz drawingOfADrachenwurtz drawingOfAGauchheyl drawingOfADill drawingOfAHauswurz	Description of the rendering of a figure.
hi segmentation based on 'dipl' annotation value(s): hi	Highlighted area.
typeface segmentation based on 'dipl' annotation value(s): antiqua gothic gothicF gothicS mixed	Annotation of change of font, the main font of the annotated text is set as default value. gothicF = Gothic Fracture typeface (unsystematic, facultative information, subcategory of the value "gothic"); gothicS = Gothic Fracture typeface (unsystematic, facultative information, subcategory of the value "gothic").
hi_rend segmentation based on 'dipl' annotation value(s): antiqua bold end iniCap italics letter-spacing:1em red	Description of the rendering of the highlighted area.
head segmentation based on 'dipl' annotation value(s): head	A heading.
head_n segmentation based on 'dipl' annotation value(s): Integer	The number of a heading.
head_rend segmentation based on 'dipl' annotation value(s): brown	Description of the rendering of the heading.
note segmentation based on 'dipl' annotation value(s): note margin end	A note in the original document (e.g. footnotes, margins).
ref segmentation based on 'dipl' annotation value(s): ref	Reference to a footnote.
ref_target segmentation based on 'dipl' annotation value(s): #fINT	ID of the footnote being referred to.
ref_type segmentation based on 'dipl' annotation value(s): noteAnchor	Type of reference (e.g. a TEI "noteAnchor").
quote segmentation based on 'dipl' annotation value(s): quote	A quotation (in some documents only).
list segmentation based on 'dipl' annotation value(s): list	A list of items.
list_type segmentation based on 'dipl' annotation value(s): simple	The type of list used.
item segmentation based on 'dipl' annotation value(s): item	Item in a list.
xml_id segmentation based on 'dipl' annotation value(s): fINT	ID given to a footnote.

Content annotations

These annotations were developed by our students to annotate spans of tokens with properties of special interest.

Annotation layer and value(s)	Description
definition segmentation based on 'norm' annotation value(s): fig expl	A definition of a term or description of a picture.
disease segmentation based on 'norm' annotation value(s): di	Mention of a disease, complete phrase.
term segmentation based on 'norm' annotation value(s): t h d j	A technical term, naming of a herb (h) or plant (p), naming of a disease (d).
author_ref segmentation based on 'norm' annotation value(s): author include other proin1sg pron1pl pron1sg pron2sg pron3sg self	References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.
reader_ref segmentation based on 'norm' annotation value(s): address adress pron1pl pron2pl pron2sg pron3sg reader	References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.
plant segmentation based on 'norm' annotation value(s): pl	Naming of a plant.
property segmentation based on 'norm' annotation value(s): appearance cultivation effect preparation smell taste	Description of properties like appearance, smell, etc.
name segmentation based on 'norm' annotation value(s): name	A proper name (annotated only in some documents).
name_type segmentation based on 'norm' annotation value(s): flower gardener herb person plant publisher scholar tree	The type of proper name (e.g. "person", "herb").
referenz annotation value(s): String	This unsystematic layer is for referencing interpretations of all kinds.
citation annotation value(s): String	This unsystematic layer marks citations (e.g. from the Bible) and makes possible a diachronic comparison between passages that were originally based on identical lexical material.

Metadata

These annotations are loosely based on the TEI P5 guidelines. Furthermore, you can find the complete corpus metadata in TEI P5 here: HANDLE ID. All metadata are annotated for each document.

Annotation layer and value(s)	Description
autor annotation value(s): String NA	Name of the author (if known).
bibl annotation value(s): String	Full bibliographical entry for the source including the page numbers annotated in the corpus.
datum annotation value(s): Integer	Date of publication, usually just the year (e.g. "1722").
verlag annotation value(s): String NA	Publisher of the document (if known).
ort annotation value(s): String NA	Publication place of the document.
titel annotation value(s): String	Title of the work the document was extracted from.
uebersetzer annotation value(s): String NA	Translator of the text, if existing.
uebersetztAus annotation value(s): it lat NA	Language from which the text was translated.
herausgeber annotation value(s): String NA	Editor of the text, if known.
version annotation value(s): String	Version of the corpus.
auflage annotation value(s): Erstauflage Nichterstauflage	Erstauflage: first edition of the text; Nichterstauflage: not the first edition of the text.
band annotation value(s): Integer NA	Volume of the text, if known.
bereich annotation value(s): Wissenschaft Alltag	Wissenschaft: the text is about scientific topics; Alltag: the text is about everyday topics.
thema annotation value(s): Al As B G K M R S	One or more topics per text are given. Additive value in alphabetical order of the abbreviations. Al: alchemy, As: astronomy, B: botany, G: gardening, K: kitchen, M: medicine, R: religion, S: linguistics. Example values: "B", "BM" or "BKM".
register annotation value(s): Kraeuterkunde	Register of the text: Herbology.
einMehrspr annotation value(s): einsprachig mehrsprachig	mehrsprachig: the text is multilingual, which means that there are whole paragraphs written in another language than German (single translations of specialist terms do not count); einsprachig: the text is monolingual.
originaldatum annotation value(s): Integer NA	If a text is categorized as "Nichterstauflage" in "auflage", the original date of publication is given here (if known).
originalort annotation value(s): String NA	If a text is categorized as "Nichterstauflage" in "auflage", the original place of publication is given here (if known).
repositorium annotation value(s): URL	URL to the repository where you can find the facsimile of the text.
sprachtyp annotation value(s): mhd fnhd nhd	The language type is given. mhd: Middle High German; fnhd: Early New High German; nhd: New High German.
sprachgebiet annotation value(s): md obd NA	The language area is given. md: Central German (mitteldeutsch); obd: Upper German (oberdeutsch). If a text is a later and more standardised one, the value "NA" is given.
textgestaltung annotation value(s): Prosa Poesie gemischt	Declaration of the general text composition. Prosa: the text is prosaic, Poesie: the text is poetic; gemischt: the text is partly poetic and partly prosaic.
gestaltungselemente annotation value(s): Endreim Endreim, Metrik NA	If in "textgestaltung" the values "Poesie" or "gemischt" are given, you can find here the specific poetic elements that are used. Endreim: end rhyme; Metrik: metrics.
vorredeVorh annotation value(s): ja nein	ja: a preface is transcribed in a specific document; nein: no preface is transcribed in the document.
wermutVorh annotation value(s): ja nein	ja: there is a paragraph about the topic "Wermut" in this document; nein: there is no paragraph about the topic "Wermut" in this document.
kraeutermonographiesammlung annotation value(s): ja nein	ja: the document is a collection of herbal monographs, which means that different herbs are described in an ordered selection; nein: the document is not a collection of herbal monographs.
korpusdokumentation annotation value(s): URL	URL to the corpus documentation.
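
Because the thema value concatenates abbreviations of different lengths ("Al" and "As" next to single letters), splitting it back into topics takes a little care. The following sketch shows one way to do this. The abbreviation table is taken from the thema row above; the function name and the strictness of the validation (alphabetical order, no repeats) are assumptions.

```python
import re

THEMA = {
    "Al": "alchemy",
    "As": "astronomy",
    "B": "botany",
    "G": "gardening",
    "K": "kitchen",
    "M": "medicine",
    "R": "religion",
    "S": "linguistics",
}

# Try two-letter abbreviations before single letters.
_CODE = re.compile("|".join(sorted(THEMA, key=len, reverse=True)))


def parse_thema(value):
    """Split a thema value such as 'BKM' into its topic names."""
    codes = _CODE.findall(value)
    if "".join(codes) != value:
        raise ValueError("unknown abbreviation in thema value: %r" % value)
    if codes != sorted(codes) or len(set(codes)) != len(codes):
        raise ValueError("abbreviations not in alphabetical order: %r" % value)
    return [THEMA[code] for code in codes]


print(parse_thema("BKM"))  # ['botany', 'kitchen', 'medicine']
print(parse_thema("AlM"))  # ['alchemy', 'medicine']
```

A value such as "MB" is rejected by this sketch because the abbreviations are not in alphabetical order.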

Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology