Documentation Version 8.0

You can find the more detailed annotation guidelines here (in German).

Corpus pipeline

2 additional texts were added:
SonderbaresKraeuterbuch-21-36_1675_Anonymous
ThesaurusSanitatis_304-321_1673_Nasser
You can find a complete list of all documents of this version in the annotation guidelines.
Manual transcription of the new texts.
Tokenization of the transcription with TreeTagger.
Manual preparation of the normalization in <norm>.
Correction of the <norm>-layer in all documents that were added to version 7 of the corpus.
The preface of NochEinigeWorte_1840_Meyen was separated from the main text and transformed into a new document named NochEinigeWorte-VR_1840_Meyen.
Due to overlapping text parts in ArtzneyBuchleinDerKreutter-Centaurea_1532_Tallat and ArtzneyBuchleinDerKreutter_1532_Tallat the redundant part was deleted in ArtzneyBuchleinDerKreutter-Centaurea_1532_Tallat and the file was renamed ArtzneyBuchleinDerKreutter-Cretanus_1532_Tallat.
Minor isolated corrections in all documents. Consistent correction of the <pb_n> annotation (distinction between Arabic and Roman numerals). All values "strD" in the layer <morph_ellipsis> were replaced by "morph_ellipsis" (in all texts published after 1652, because this correction step had only been applied to the older texts in version 6).
Annotation of <figure> and <figure_p> in SonderbaresKraeuterbuch-1-11_1675_Anonymous and SonderbaresKraeuterbuch-11-21_1675_Anonymous.
The <comp_n_mod> annotation layer was added to all documents published until 1652 and already included in version 6 of the corpus.
Deletion of the following annotation layers in all documents: <figure_rend>, <item>, <nlp_morph>.
Part-of-speech tagging and lemmatization with TreeTagger-Batch and TreeTagger for all documents. Please note: Quotation marks can cause errors and need to be masked. Furthermore, TreeTagger deletes empty lines. Fill those lines with an arbitrary tag (e.g. <9>) and use the option -sgml while tagging. Lines that contain tags are not tagged and can be deleted afterwards. After the TreeTagger output has been merged with the MS Excel file, the MS Excel macro SearchAndMerge (Readme) reconstructs the segmentation.
Automatic creation of <clean> for all documents (Python-Script and Readme).
The following annotation layers were added to the corpus in the CoNLL format (for all documents of version 6 that were published until 1652): <deprel>, <morph>, <lemma> (in ANNIS & PAULA <lemma-deprel>), <pos> (in ANNIS & PAULA <pos-deprel>). They were prepared automatically with the Mate Tools. The annotations were corrected manually in the documents ContrafaytKreuterbuch_1532_Brunfels and HortulusSanitatis_1609_Uffenbach.
The following annotation layers were added to the corpus in the PTB format (for all documents of version 6 that were published until 1652): <cat> (in ANNIS & PAULA <cat-const>), <edgelabel> (in ANNIS & PAULA <func>), <lemma> (in ANNIS & PAULA <lemma-const>), <pos> (in ANNIS & PAULA <pos-const>). They were transformed automatically from CoNLL to PTB using the Berkeley Parser.
All formats (Excel 2013, CoNLL, PTB) were converted together to PAULA and ANNIS using Pepper. The following importers were used: SpreadsheetImporter, PTBImporter, CoNLLImporter. The layers were merged with the Merger module and exported with ANNISExporter and PAULAExporter.
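The TreeTagger note above (empty lines are dropped unless they are filled with a placeholder tag and the tagger is run with -sgml) could be sketched as follows. This is only a minimal illustration, not the project's actual script: the function names are hypothetical, `<9>` is the example placeholder from the note, and the tagger output is assumed to contain one token per line with tag lines passed through unchanged. Whether placeholder lines are blanked again or dropped depends on the subsequent merge step.

```python
PLACEHOLDER = "<9>"  # example placeholder tag taken from the note above


def fill_empty_lines(lines, placeholder=PLACEHOLDER):
    """Replace empty lines with a placeholder tag so the tagger keeps them."""
    return [placeholder if line.strip() == "" else line for line in lines]


def restore_empty_lines(tagged_lines, placeholder=PLACEHOLDER):
    """Turn placeholder lines in the tagger output back into empty lines."""
    return ["" if line.strip() == placeholder else line for line in tagged_lines]


# One token per line; the empty line marks a row that must stay aligned.
tokens = ["Das", "", "Kraut"]
prepared = fill_empty_lines(tokens)

# Hand-written stand-in for tagger output (token, tag, lemma), for illustration.
fake_tagged = ["Das\tART\tdie", "<9>", "Kraut\tNN\tKraut"]
restored = restore_empty_lines(fake_tagged)
```

Because the number of lines is unchanged, the restored output can be aligned row by row with the original token column.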

Corpus design

In order to study the development of the scientific language throughout the period of interest, we require a subject domain that is sufficiently well represented in all subperiods. That is why we have selected the domain of herbology (Kräuterkunde). Texts vary somewhat in length since older text is more difficult to annotate.

Annotation layers

The RIDGES-Corpus consists of several annotations, which were created in different formats. All of those formats can be downloaded here. Not every format contains every annotation, and not all annotations are merged into PAULA and ANNIS. You can find detailed documentation of every single format in the LAUDATIO-Repository.

The RIDGES corpus is designed as a multi-layer architecture. Annotation layers can be roughly divided into five kinds:

Transcription/normalisation
Linguistic annotations
Structural annotations
Content annotation
Metadata

Transcription/normalisation

Annotation layer and value(s)	Description
dipl independent segmentation annotation value(s): Text	The diplomatic transcription of the word form as found in the manuscript. A Unicode table of special characters is used.
clean independent segmentation annotation value(s): Text	Automatic normalization by a Python script, covering graphical structures and special characters only (e.g. "ſ" to "s"). For example, a form with a line break like wor=den is cleaned to worden but not normalized to modern geworden, even where that would now be the appropriate form. For words that contain a line break, note that if the second part begins with a capital letter, this letter is lowercased in the clean layer (e.g. "Gelb- Sucht" to "Gelbsucht"). If all letters of the second part are capitals, they remain unchanged (e.g. "MON- TANUM" to "MONTANUM"). Dipl units containing vowels with macrons are replaced by each potential form of that token, separated by '|' (for example, 'auſzwēdig' becomes 'auszwemdig|auszwendig'). For a full overview of the replacements in the clean layer, see the Readme.
norm independent segmentation annotation value(s): Text	In this layer the segmentation, graphemics, inflectional forms and lexemes are normalized. Graphemics: orthographic normalization according to Duden (e.g. kreutter -> Kräuter); phonology: please note the sound changes of the Early New High German period, such as diphthongization, monophthongization, syncope, apocope, etc. (e.g. lehret -> lehrt); morphology: in die Nasen -> in die Nase; lexicology: extinct lexical material is normalized according to modern orthography and, where applicable, described in the layer "erlaeuterung" (e.g. Vergeſz -> Vergess); word formation: extinct morphemes are normalized, if possible, according to modern orthography (e.g. halben -> halber or stachelecht -> stachelig). Currently, case has been normalized in only some documents.
ocr independent segmentation annotation value(s): Text	This layer was prepared for the new documents of version 7 only. For more information click here.
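The rules described for the clean layer can be illustrated with a small sketch. It is not the project's Python script: the replacement tables below contain only the two examples given in the description (the Readme holds the full list), the line-break markers `=` and `- ` are taken from the examples and may not cover every notation in the corpus, and all names are hypothetical.

```python
import itertools
import re
import unicodedata

# Partial, illustrative tables built from the examples in the description only.
CHAR_MAP = {"\u017f": "s"}             # long s -> s
MACRON_MAP = {"\u0113": ("em", "en")}  # e with macron -> "em" or "en"
LINE_BREAK = re.compile(r"=|-\s+")     # assumed line-break markers


def join_line_break(form):
    """Join the two parts of a word that is split by a line break."""
    parts = LINE_BREAK.split(form, maxsplit=1)
    if len(parts) == 1:
        return form
    first, second = parts
    # Lowercase an initial capital unless the whole second part is capitalized.
    if second[:1].isupper() and not second.isupper():
        second = second[0].lower() + second[1:]
    return first + second


def clean(dipl):
    """Return the cleaned form; macron alternatives are joined with '|'."""
    form = unicodedata.normalize("NFC", dipl)
    form = join_line_break(form)
    form = "".join(CHAR_MAP.get(ch, ch) for ch in form)
    options = [MACRON_MAP.get(ch, (ch,)) for ch in form]
    variants = ["".join(combo) for combo in itertools.product(*options)]
    return "|".join(variants)
```

A form without macrons yields exactly one variant, so the result contains no '|'; each macron vowel multiplies the number of alternatives.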

Linguistic annotations

Annotation layer and value(s)	Description
pos segmentation based on 'norm' annotation value(s): STTS	Automatic part-of-speech annotation using the STTS tagset for German.
lemma segmentation based on 'norm' annotation value(s): String	Automatic lemmatization by TreeTagger.
comment segmentation based on 'dipl' annotation value(s): String	This is an unsystematic layer. In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent (e.g. Heümonat -> Juli) or an explanation can be given. This layer was originally called "hyperlemma" and was renamed as "erlaeuterung" in ridge-v5.
foreign segmentation based on 'dipl' annotation value(s): foreign	Non-German text.
foreign_trans segmentation based on 'dipl' annotation value(s): trans_to_german trans_from_german trans_from_german_extended trans_to_german_extended	Translation from and to German.
lang segmentation based on 'dipl' annotation value(s): ISO 639-2	The language in which a foreign passage is written (three-letter codes according to ISO 639-2).
comp segmentation based on 'dipl' annotation value(s): k	Nominal compound (with nominal head).
comp_orth segmentation based on 'dipl' annotation value(s): zs gtr bs lb1 lb2	Annotation of the specific spelling of the annotated compounds (komp): zs: written together, gtr: written separately, bs: hyphenated (one line), lb1: separated by line break (hyphenless), lb2: separated by line break (hyphenated).
prot segmentation based on 'dipl' annotation value(s): prot1 prot2 prot3	Assigns prototypes to each compound of the komp-layer: prot1: reliably identifiable as compound, prot2: quite likely a compound, and prot3: case of doubt (not assigned in (komp)).
comp_amb segmentation based on 'dipl' annotation value(s): a gpre	Annotation of word sequences that are potential compounds. a: adjective-noun sequences in which the adjective is uninflected and which would be realized as compounds in modern German (e.g. das edel geſteine). However, there are many AN sequences with uninflected adjectives that would be attributive adjectives in modern German as well. These adjectives were annotated in <adja_uninfl>. gpre: noun-noun sequences that could be either compounds or noun phrases with a prenominal attributive genitive. For the annotated instances, neither context nor inflection provided clues to the grammatical status of these constructions.
infl_fuge segmentation based on 'dipl' annotation value(s): y n yn NA	Annotation of inflectional elements, or linking elements ("Fugenelemente"), in word sequences that are potential compounds. y: There is an element between word1 and word2 (‚yes‘). n: There is no element between word1 and word2 (‚no‘). yn: Used for potential compounds with multiple lexical parts. In this case there might be an element between two of the words (e.g. Jungkfrawen har). The order of "y" and "n" is strict in this tagset. NA: No information about the existence or non-existence of elements between word1 and word2 can be given, mostly because one of the words is semantically opaque or originates from a foreign language (e.g. Latin).
comp_lex segmentation based on 'dipl' annotation value(s): lex n	lex: lexicalized compounds that can no longer be used as a syntagm, because the compound meaning differs from the sum of the meanings of its parts (affected topics: plant names, geographic terms, some diseases, body parts, animal species, star signs). If a lexicalized compound has one or more additional parts that do not belong to the lexicalized part, the whole compound does not count as lexicalized, e.g. Eisenkrautsaft or Beifußblumen (vs. Johannisblumen), Blutwassersucht. The same applies to the word "Baum", which can be used to describe a plant more specifically (Kirsche -> Kirschbaum, Eiche -> Eichenbaum) and hence does not belong to a lexicalized compound. But it can also be part of the lexicalized term (e.g. Schildkraut, Rutelkraut, Wunderbaum). n: This tag is used for all potential compounds that are not interpreted as lexicalized according to the definition above. You can find a list of individual decisions in the more detailed annotation guidelines.
comp_n segmentation based on 'dipl' annotation value(s): N A V ADV APPR CARD SUFF CONV X	Additive tags that give information about the morphological structure of nominal compounds, e.g. N_N for compounds that consist of two nouns. Suffixes are not included unless suffixation was the last process that applied (e.g. [[Kindbett]erin]). The tags for each lexical element are separated by underscores. If the morphological category cannot be identified, the placeholder "X" is used.
comp_n_mod segmentation based on 'dipl' annotation value(s): n art apprart adja piat pposat pdat card prelat NA	In this layer (inflectional) modifiers are annotated for compounds in the layer <comp_n>. Each compound gets a tag that describes the part of speech of the modifier(s) based on the STTS. For more than one modifier additive tags were assigned using underscores as separation between tags, e.g. 'art_adja'.
comp_n_graph segmentation based on 'dipl' annotation value(s): sep nospace hyph lb1 lb2 camel	This layer describes the graphemics of the compounds annotated in <comp_n>. sep: separated; nospace: written together; hyph: hyphenated; lb1: line break (without hyphen); lb2: line break (with hyphen); camel: camel case. Compounds that consist of more than two lexical parts can be annotated with additive tags, e.g. „nospace_sep“ for Saurampffer waſſer.
comp_a segmentation based on 'dipl' annotation value(s): N A ADV CARD SUFF CONV farb	Additive tags that give information about the morphological structure of adjective compounds, e.g. A_A for compounds that consist of two adjectives. Suffixes are not included unless suffixation was the last process that applied. The tags for each lexical element are separated by underscores. farb: the last lexical element contains the root „farb“, e.g. himmelfarben.
comp_a_graph segmentation based on 'dipl' annotation value(s): sep nospace hyph lb1 lb2	This layer describes the graphemics of the compounds annotated in <comp_a>. sep: separated; nospace: written together; hyph: hyphenated; lb1: line break (without hyphen); lb2: line break (with hyphen).
attr_gen segmentation based on 'dipl' annotation value(s): gpre gpost	Annotation of noun phrases with a prenominal or postnominal genitive attribute. gpre = prenominal genitive, gpost = postnominal genitive.
morph_ellipsis segmentation based on 'dipl' annotation value(s): morph_ellipsis	Coordination of compounds and parts of compounds (truncated morphemes and compounds such as: gelb⸗ und Waſſerſucht).
persname segmentation based on 'dipl' annotation value(s): String	Every name of a person to which the author of a particular document refers is annotated. For every instance the name of the person is given in the nominative form. There is a list of unified names now available in the annotation guidelines.
title segmentation based on 'dipl' annotation value(s): String	Every title of books to which the author of a particular document refers is annotated. For every instance the title of the book is given in the nominative form.
deprel segmentation based on 'norm' annotation value(s): -- AC AG AMS APP AVC CC CD CJ CM CP CVC DA DM EP JU MNR MO NG NK OA OA2 OC OG OP PAR PD PG PH PM PNC RC RE RS SB SBP SP SVP UC VO	Dependency annotation based on the TIGER annotation scheme (prepared with Mate Tools). You can find the tagset in the more detailed annotation guidelines. The dependencies were corrected manually in the documents ContrafaytKreuterbuch_1532_Brunfels and HortulusSanitatis_1609_Uffenbach.
cat-const segmentation based on 'norm' annotation value(s): AA AP AVP CAP CAVP CH CNP CO CPP CS CVP CVZ DL ISU NP PN PP PSEUDO ROOT S TOP VP VZ	Constituency annotation based on the TIGER annotation scheme (automatically transformed out of the dependency parses of the Mate Tools using the Berkeley Parser). You can find the tagset in the more detailed annotation guidelines.
func AC AG AMS APP AVC CC CD CJ CM CP CVC DA DH DM EP HD JU MNR MO NG NK OA OA2 OC OG PAR PD PG PH PM PNC RC RE RS SB SBP SVP UC VO	Edge label annotation based on the TIGER annotation scheme (automatically transformed out of the dependency parses of the Mate Tools using the Berkeley Parser). You can find the tagset in the more detailed annotation guidelines.
morph segmentation based on 'norm' annotation value(s): sg/pl neut/masc/fem nom/gen/dat/acc 1/2/3 pres/past pos/comp/sup ind/subj *	Morphological annotation with additive tags that encode case, number, gender, person, tense, comparison, and mood (depending on the part of speech). sg/pl: singular/plural; neut/masc/fem: neuter/masculine/feminine; nom/gen/dat/acc: nominative/genitive/dative/accusative; 1/2/3: 1st/2nd/3rd person; pres/past: present/past; pos/comp/sup: positive/comparative/superlative; ind/subj: indicative/subjunctive; *: placeholder.
cat segmentation based on 'norm' annotation value(s): S	Span annotation, that was created during the conversion process from CoNLL to ANNIS.
lemma-deprel segmentation based on 'norm' annotation value(s): normalized Lemma	Lemmatization with Mate Tools. The annotations were corrected manually in the documents ContrafaytKreuterbuch_1532_Brunfels and HortulusSanitatis_1609_Uffenbach.
pos-deprel segmentation based on 'norm' annotation value(s): $, $. $LRB ADJA ADJD ADV APPO APPR APPRART APZR ART CARD FM ITJ KOKOM KON KOUI KOUS NE NN PDAT PDS PIAT PIS PPER PPOSAT PPOSS PRELAT PRELS PRF PROAV PTKA PTKANT PTKNEG PTKVZ PTKZU PWAT PWAV PWS TRUNC VAFIN VAPP VMFIN VMINF VVFIN VVIMP VVINF VVIZU VVPP XY	Part-of-speech tagging with Mate Tools. The annotations were corrected manually in the documents ContrafaytKreuterbuch_1532_Brunfels and HortulusSanitatis_1609_Uffenbach.
pos-const segmentation based on 'norm' annotation value(s): $, $. $*LRB ADJA ADV APPO APPR APPRART APZR ART CARD FM ITJ KOKOM KON KOUI KOUS NE NN PDAT PDS PIAT PIS PPER PPOSAT PRELAT PRELS PRF PROAV PTKA PTKNEG PTKVZ PTKZU PWAT PWAV PWS TRUNC VAFIN VAINF VAPP VMFIN VMINF VVFIN VVIMP VVINF VVIZU VVPP XY	Part-of-speech tagging with Mate Tools and conversion into PTB-Format.
adja_uninfl segmentation based on 'norm' annotation value(s): uninfl	Annotation of uninflected adjectives that are positioned directly in front of a noun. Given there are several uninflected adjectives in front of the same noun only the last one is annotated.
form_disease segmentation based on 'dipl' annotation value(s): deriv derivat kompNN kompNNgetrennt lat phrase Phrase phraseDasIst phraseGen phraseGEN phraseGenannt phraseHS phraseRS phraseSubj phraseV1 phraseVP simplex wort	NA
ppk_e1 - ppk_e3 segmentation based on 'dipl' annotation value(s): ppk ppk_e2 ppk_e3 zwf ppk_rek	Prepositional constructions (prepositional attributive constructions or attributive adverbial phrases) are annotated. ppk: normal prepositional construction; ppk_e2: normal ppk inside a ppk in the layer <ppk_e1>; ppk_e3: normal ppk inside a ppk in the layer <ppk_e2>; zwf: case of doubt; ppk_rek: recursive (nested) ppk; attr_X: attributes that refer to an element inside a ppk without being linked to it within a syntactic sequence. X is a placeholder for the respective referent.
problem segmentation based on 'dipl' annotation value(s): String	NA
herbname_norm segmentation based on 'dipl' annotation value(s): String	In this layer a systematic herbal name is given. Sometimes it is ambiguous; in this case you can find additional information in the "erlaeuterung" or in the "bemerkung_lexik" layer.
herbprep segmentation based on 'dipl' annotation value(s): String	This layer identifies preparations of herbs. Only those instances are included which are NPs or modifiers with a herb as head. The name is given in the nominative singular form and normalized according to modern orthography. Whitespace is replaced by underscores. Compounds are always written together, regardless of their spelling in the facsimile. Everything is written in lower case letters (e.g. safft des weremuts -> saft_des_wermuts).
form_prep segmentation based on 'dipl' annotation value(s): kompNN kompNNgetrennt phraseVon phraseGen	In this layer preparations with herbs are described syntactically or morphologically. kompNN = NN compounds which are written together or hyphenated; kompNNgetrennt = nouns following each other which could be a compound (written separately); phraseVon = preparations with herbs containing a von-PP (e.g. safft von weremut); phraseGen = preparations with herbs containing a genitive attribute (e.g. safft des weremuts).
noun_nom segmentation based on 'dipl' annotation value(s): String	In this layer all nouns that occur in the text are given in the nominative singular, using the spelling of their first occurrence. If the first occurring form of "Saft" is safft, all further occurrences of "Saft" are given as safft. Everything is written in lower case letters. The purpose of this layer is to investigate the variation of noun spelling within one text.
form_noun segmentation based on 'dipl' annotation value(s): simplex kompNN kompNNgetrennt kompNEN kompNENgetrennt kompNNNgetrennt kompAN kompVN derivat nom gri lat lex	In this layer all nouns are morphologically annotated. kompNN = NN compound, written together or hyphenated; kompNNgetrennt = all sequences of two nouns which could be compounds, but are written separately; kompNEN = NE-N compound, written together or hyphenated; kompNENgetrennt = all sequences of NE and N which could be compounds, but are written separately; kompNNNgetrennt = all sequences of three nouns which could be compounds, but are written separately; kompAN = AN compounds; kompVN = VN compounds; derivat = derivatives; nom = implicit nominalization (conversion, ablaut, syntactic nominalization); gri/lat/ara = clearly Greek/Latin/Arabic nouns; foreign material that is already integrated into German is treated like native words; lex = lexicalized herb names which were originally morphologically complex, e.g. Beifuß, Wermut, Stabwurz, and tausend guldin for "Tausendguldenkraut".
comment_lex segmentation based on 'dipl' annotation value(s): String	This is an unsystematic layer for comments and questions about lexis.
clause_type segmentation based on 'dipl' annotation value(s): rs padv rsx rsdem padvpart dem part	Annotation of clause types. No hierarchical annotation. For nested sentences only the highest clause is annotated. In the layer "bemerkungen_syntax" you can find notes about the nestings. rs = clear relative clauses, both "w-relative clauses" and "d-relative clauses"; padv = clauses which are introduced by a pronominal adverb; rsx = relative clauses without main clause (this often occurs in headlines); rsdem = ambiguous cases: relative clause or demonstrative clause; padvpart = clauses with pronominal adverb and participle; dem = demonstrative clauses (all clauses with a demonstrative pronoun as subject); part = participles that behave similarly to relative clauses.
position_rel segmentation based on 'dipl' annotation value(s): vor nach int	Position of the relative clause within the main clause. vor = preposed; nach = postposed; int = embedded.
position_referent segmentation based on 'dipl' annotation value(s): adja-v adja-n dist na	Position of the relative clause relative to the reference category. adja-v = adjacently preposed; adja-n = adjacently postposed; dist = distant; na = not applicable.
form_referent segmentation based on 'dipl' annotation value(s): np d-pron p-pron null	Form of the reference category of the relative clause. np = non-pronominal NP; d-pron = der, die, das, dieser, etc.; p-pron = personal pronoun; null = for free relative or asyndetic relative clauses with a covert correlate in the main clause.
position_verb_rel segmentation based on 'dipl' annotation value(s): v2 ve venf	Verb position within the relative clause. v2 = verb second; ve = verb end; venf = verb end with occupied postfield.
form_relpron segmentation based on 'dipl' annotation value(s): d-pron w-pron w-phras	Form of the category which introduces the relative clause. d-pron = all d-pronouns; w-pron = wer, welch-; w-phras = e.g. welch frau.
mod_referent segmentation based on 'dipl' annotation value(s): relsatz d-pron m-padv m-part np	relsatz = Annotated on pronouns, NPs or clauses, if modified by a relative clause. Not applicable for free relative clauses. The whole reference category is annotated as span. d-pron, m-padv, m-part, np = NA.
position_verb segmentation based on 'norm' annotation value(s): V2 Vletzt V? V1	Verb position in subordinate clauses introduced by a subordinating conjunction, annotated as a token feature on tokens with pos=KOUS. V2: verb-second position; Vletzt: verb-final position; V?: unclear verb position; V1: verb-first position.
subclause_type segmentation based on 'norm' annotation value(s): Adverbial Attribut Komplement	Type of subordinate clause introduced by a subordinating conjunction, annotated as a token feature on tokens with pos=KOUS. Adverbial: adverbial function; Attribut: attributive function; Komplement: complement function.
KOUS_sem segmentation based on 'norm' annotation value(s): additiv final k.a. kausal konditional konsekutiv konzessiv modal temporal 0	Semantics of the subordinating conjunction, annotated on tokens with pos=KOUS. additiv: additive; final: final; k.a.: not analyzable, because the subordinate clause is a complement; kausal: causal; konditional: conditional; konsekutiv: consecutive; konzessiv: concessive; modal: modal; temporal: temporal; 0: NA.
sentence_end segmentation based on 'dipl' annotation value(s): S	Sentence endings are annotated. You can find the detailed annotation guidelines here.
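Several layers above (comp_n, comp_n_mod, comp_n_graph, comp_a) use additive tags joined by underscores. As a rough illustration of how such values could be split and checked against the documented tagsets, here is a minimal sketch; the function and constant names are hypothetical, and the tagsets are simply copied from the value lists above.

```python
# Tagsets copied from the value lists in the table above.
COMP_N_TAGS = {"N", "A", "V", "ADV", "APPR", "CARD", "SUFF", "CONV", "X"}
COMP_N_MOD_TAGS = {"n", "art", "apprart", "adja", "piat", "pposat",
                   "pdat", "card", "prelat", "NA"}
COMP_N_GRAPH_TAGS = {"sep", "nospace", "hyph", "lb1", "lb2", "camel"}


def split_additive(value, allowed):
    """Split an additive tag such as 'N_N' and reject undocumented parts."""
    parts = value.split("_")
    unknown = [part for part in parts if part not in allowed]
    if unknown:
        raise ValueError(f"unknown tag part(s) {unknown} in {value!r}")
    return parts


structure = split_additive("N_N", COMP_N_TAGS)
modifiers = split_additive("art_adja", COMP_N_MOD_TAGS)
spelling = split_additive("nospace_sep", COMP_N_GRAPH_TAGS)
```

Splitting on underscores works here only because none of the documented tag names for these layers contains an underscore itself.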

Structural annotations

Annotation layer and value(s)	Description
lb segmentation based on 'dipl' annotation value(s): lb	Linebreak.
pb segmentation based on 'dipl' annotation value(s): pb	Pagebreak.
pb_n segmentation based on 'dipl' annotation value(s): Integer or Letter	The number of the page (if marked explicitly).
pb_ana segmentation based on 'dipl' annotation value(s): Integer	Revision of the pagebreak (e.g. in case of apparently incorrect page numbers).
unclear segmentation based on 'dipl' annotation value(s): unclear	Unreadable or otherwise unclear text.
atLeast segmentation based on 'dipl' annotation value(s): Integer	Minimum presumed length of unclear text in characters.
atMost segmentation based on 'dipl' annotation value(s): Integer	Maximum presumed length of unclear text in characters.
interpretation segmentation based on 'dipl' annotation value(s): String	Suggestions for unreadable or unclear text.
figure segmentation based on 'dipl' annotation value(s): figure table	A graphic or table embedded in the original document.
figure_p segmentation based on 'dipl' annotation value(s): integer	Annotation of the original page in the facsimile on which a figure is printed.
quote segmentation based on 'dipl' annotation value(s): yes no	dipl-tokens that are part of a quote are annotated with "yes". The default value is "no".
column segmentation based on 'dipl' annotation value(s): l r	Annotation of all dipl-units that belong to one column on a page. This layer is only annotated when there is more than one column per page. l: left; r: right.
hi segmentation based on 'dipl' annotation value(s): hi	Highlighted area.
script segmentation based on 'dipl' annotation value(s): blackletter roman mixed	Annotation of change of font.
hi_rend segmentation based on 'dipl' annotation value(s): bold iniCap italics letter-spacing:1em red underlined	Description of the rendering of the highlighted area. iniCap: decorated initial capital (mostly at the beginning of a new chapter); letter-spacing:1em: spaced letters.
head segmentation based on 'dipl' annotation value(s): head	A heading.
note segmentation based on 'dipl' annotation value(s): note margin end	A note in the original document (e.g. footnotes, margins).
ref segmentation based on 'dipl' annotation value(s): ref	Reference to a footnote.
ref_target segmentation based on 'dipl' annotation value(s): #fINT	ID of the footnote being referred to.
ref_type segmentation based on 'dipl' annotation value(s): noteAnchor	Type of reference (e.g. a TEI "noteAnchor").

Content annotations

These annotations were developed by our students to annotate spans of tokens with properties of special interest.

Annotation layer and value(s)	Description
definition segmentation based on 'norm' annotation value(s): fig expl
author_ref segmentation based on 'norm' annotation value(s): author pron1sg pron1pl pron2pl pron3sg	References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.
reader_ref segmentation based on 'norm' annotation value(s): pron1pl pron2pl pron2sg pron3sg reader author	References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.
plant segmentation based on 'norm' annotation value(s): pl	Naming of a plant
property segmentation based on 'norm' annotation value(s): appearance cultivation effect preparation smell taste	Description of properties like appearance, smell, etc.
name segmentation based on 'norm' annotation value(s): name	A proper name (annotated only in some documents).
name_type segmentation based on 'norm' annotation value(s): flower gardener herb person plant publisher scholar tree	The type of proper name (e.g. "person", "herb").
reference annotation value(s): String	This unsystematic layer is for referencing interpretations of all kinds.

Metadata

These annotations are loosely based on the TEI P5 guidelines. Furthermore you can find the complete corpus meta data in TEI p5 here: HANDLE ID. All meta data are annotated for each document.

Annotation layer and value(s)	Description
author annotation value(s): String NA	Name of the author (if known).
bibl annotation value(s): String	Full bibliographical entry for the source including the page numbers annotated in the corpus.
date annotation value(s): Integer	Date of publication, usually just the year (e.g. "1722").
publisher annotation value(s): String NA	Publisher of the document (if known).
place annotation value(s): String NA	Publication place of the document.
title annotation value(s): String	Title of the work the document was extracted from.
translator annotation value(s): String NA	Translator of the text, if existing.
trans_from annotation value(s): it lat NA	Language from which the text was translated.
editor annotation value(s): String NA	Editor of the text, if known.
version annotation value(s): 1.0 2.0 3.0 4.0 5.0 6.0 7.0	Version in which the specific document was added to the corpus.
edition_first annotation value(s): yes no	yes: first edition of the text (Erstauflage); no: not the first edition of the text (Nichterstauflage).
issue annotation value(s): Integer NA	Volume of the text, if known.
maintopic annotation value(s): science non-science	science: the text is about scientific topics; non-science: the text is about everyday topics.
topic annotation value(s): Al As B G K L M R	One or more topics per text are given. Additive value in alphabetical order of the abbreviations. Al: alchemy, As: astronomy, B: botany, G: gardening, K: kitchen, L: linguistics, M: medicine, R: religion. Example values: "B", "BM" or "BKM".
register annotation value(s): herbology	Register of the text: Herbology.
lingualism annotation value(s): monoling multiling	multiling: the text is multilingual, which means that whole paragraphs are written in a language other than German (single translations of specialist terms do not count); monoling: the text is monolingual.
orig_date annotation value(s): Integer NA	If a text is not a first edition ("no" in "edition_first"), the original date of publication is given here (if known).
orig_place annotation value(s): String NA	If a text is not a first edition ("no" in "edition_first"), the original place of publication is given here (if known).
repository annotation value(s): URL	URL to the repository where you can find the facsimile of the text.
lang_type annotation value(s): fnhd nhd	The language type is given. fnhd: Early New High German, nhd: New High German
lang_area annotation value(s): md obd NA	The language area is given. md: Central German (Mitteldeutsch), obd: Upper German (Oberdeutsch). If a text is a later and more standardised one, the value "NA" is given.
text_type annotation value(s): prose lyric mixed	General composition of the text. prose: the text is prose; lyric: the text is poetry; mixed: the text is partly poetry and partly prose.
lyric_type annotation value(s): end_rhyme meter rhyme_meter NA	If "text_type" has the value "lyric" or "mixed", the specific poetic elements that are used are given here. end_rhyme: end rhyme; meter: meter.
wormwood annotation value(s): yes no	yes: there is a paragraph about the topic "Wermut" in this document; no: there is no paragraph about the topic "Wermut" in this document.
herb_sorting annotation value(s): yes no	yes: the document is a collection of herbal monographs, which means that different herbs are described in an ordered selection; no: the document is not a collection of herbal monographs.
deprelGold annotation value(s): yes no	yes: the dependency parses in <deprel> and the corresponding lemmatization and pos-tagging in <lemma-deprel> and <pos-deprel> were corrected manually; no: the dependency parses in this document were not corrected.
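The additive "topic" value described above (abbreviations concatenated without a separator, e.g. "BKM") can be decoded as in the following sketch. It is an illustration only: the function name is hypothetical, and the abbreviation table is copied from the description of the topic layer. Two-letter codes (Al, As) are tried before one-letter codes so that "Al" is not misread.

```python
import re

# Abbreviations as listed for the "topic" metadata layer.
TOPICS = {
    "Al": "alchemy", "As": "astronomy", "B": "botany", "G": "gardening",
    "K": "kitchen", "L": "linguistics", "M": "medicine", "R": "religion",
}

# Longer codes first, so that "Al" and "As" win over any one-letter code.
_TOPIC_RE = re.compile("|".join(sorted(TOPICS, key=len, reverse=True)))


def decode_topic(value):
    """Decode an additive topic value such as 'BKM' into topic names."""
    names = []
    pos = 0
    while pos < len(value):
        match = _TOPIC_RE.match(value, pos)
        if match is None:
            raise ValueError(f"cannot decode {value!r} at position {pos}")
        names.append(TOPICS[match.group()])
        pos = match.end()
    return names
```

For example, `decode_topic("BKM")` yields botany, kitchen and medicine, while a value containing an undocumented abbreviation raises a `ValueError`.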

Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology