Documentation Version 2.0
Corpus Pipeline
- Composition: RIDGES v1 without "Flora francisca redidiva". A complete list of all documents in this version can be found under Downloads.
- Manual correction of the transcription, the <clean> layer, and the normalization.
- Part-of-speech tagging and lemmatization with TreeTagger.
- Manual correction of structural and content annotations with MS Excel.
- Export of the corpus to persistent formats and to the search and visualization tool ANNIS.
Corpus Design
To ensure comparability, we select texts from a single scientific discipline that is ideally represented in a similar way throughout the entire period under investigation. For the first RIDGES corpus, we chose the field of herbology. The period under investigation was divided into 30-year periods, with currently one sample per period. Because processing older texts is more laborious, the length of the texts varies. Each document comprises approx. 4,000 to 10,000 word forms.
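The division into 30-year periods can be illustrated with a small sketch. The start year used here (`START_YEAR = 1500`) and the function names are assumptions made only for this example; the documentation does not state where the first period begins.

```python
from typing import Tuple

SPAN = 30          # length of one period in years (from the corpus design)
START_YEAR = 1500  # hypothetical start of the study period, for illustration only


def period_index(year: int, start_year: int = START_YEAR, span: int = SPAN) -> int:
    """Return the zero-based index of the period that contains `year`."""
    if year < start_year:
        raise ValueError("year lies before the start of the study period")
    return (year - start_year) // span


def period_bounds(index: int, start_year: int = START_YEAR, span: int = SPAN) -> Tuple[int, int]:
    """Return the first and last year (inclusive) of the period with this index."""
    first = start_year + index * span
    return first, first + span - 1


# A document published in 1722 falls into the period 1710-1739
# under the assumed start year.
idx = period_index(1722)
print(idx, period_bounds(idx))
```

With one sample per period, grouping documents by `period_index` would be one way to check that every period is covered.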
Annotation Layers
The annotation layers in the corpora are stored in a multi-layer architecture and can be divided into five groups.
- Transcription/Normalization
- Linguistic Annotations
- Structural Annotations
- Content Annotations
- Metadata
Transcription/Normalization
These annotations always correspond to exactly one token. Part-of-speech annotations and lemmatization were performed with TreeTagger and corrected by hand.
Annotation layer | Description |
---|---|
dipl | The diplomatic transcription of the word form as found in the manuscript. |
clean | Normalization of graphical structures and special characters (e.g. "ſ" to "s"), without modernizing to Modern German orthography. For example, a form with a line break like wor=den is cleaned to worden but not normalized to modern geworden, even where that would now be the appropriate form. |
norm | A normalized word form based on Modern German orthography. Inflection is not modernized. |
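The step from dipl to clean can be sketched as follows. This is a minimal illustration and not the project's actual procedure (the pipeline above describes the clean layer as manually corrected). Only the two rules named in the table are covered, and the sketch assumes that "=" occurs solely as a line-break hyphen; the function name `clean_form` is hypothetical.

```python
import re


def clean_form(dipl: str) -> str:
    """Derive a clean-style form from a diplomatic form (illustrative sketch).

    Covers only the two rules mentioned in the documentation:
    the long s is replaced by "s", and a word split by "=" at a
    line break is joined again.
    """
    # Assumption: "=" marks a line-break hyphen and nothing else.
    joined = re.sub(r"=\s*", "", dipl)
    return joined.replace("ſ", "s")


print(clean_form("wor=den"))    # joined across the line break
print(clean_form("beſchicht"))  # long s replaced
```

Note that the result is deliberately not modernized: worden stays worden, and mapping it to a modern form would belong to the norm layer.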
Linguistic Annotations
Annotation layer | Description |
---|---|
pos | Part-of-speech annotation using the STTS tagset for German. |
lemma | The normalized, uninflected lexicon entry for each word form, using modern orthography (obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, analogous to geschehen). |
hyperlemma | In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli, or ráß as beißend). |
foreign | Non-German text. |
foreign_trans | Translation from or into German. |
lang | The source and target language of a translation. |
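How lemma and hyperlemma relate can be sketched with a small token record. The class `Token` and the function `search_key` are hypothetical names for this illustration and are not part of the corpus tooling. The lemma given for Heümonat is an assumed value, since the documentation only states its hyperlemma.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    dipl: str                         # diplomatic transcription
    lemma: str                        # modernized, uninflected entry
    hyperlemma: Optional[str] = None  # modern semantic equivalent, if annotated


def search_key(token: Token) -> str:
    """Prefer the hyperlemma where one exists, otherwise fall back to the lemma."""
    return token.hyperlemma if token.hyperlemma is not None else token.lemma


tokens = [
    Token(dipl="beſchicht", lemma="beschehen"),
    # The lemma "Heumonat" is an assumption made for this example.
    Token(dipl="Heümonat", lemma="Heumonat", hyperlemma="Juli"),
]
for tok in tokens:
    print(tok.dipl, "->", search_key(tok))
```

A lookup like this would let a search for the modern word Juli also reach tokens whose lemma is an obsolete word.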
Structural Annotations
Annotation layer | Description |
---|---|
lb | Line break. |
brace | Left or right brace (parenthesis) marking text that spans multiple lines. |
brace_dir | Direction of the brace. |
p | A paragraph. |
p_n | The number of a numbered paragraph (this may also be a letter such as A). |
p_rend | Description of the rendering of the paragraph. |
pb | Page break. |
pb_n | The number of the page (if marked explicitly). |
pb_rend | Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts). |
pb_ana | Analysis of the page break (e.g. in case of apparently incorrect page numbers). |
div1 - div5 | A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version. |
div1_type - div5_type | The type of section or subsection. A section can correspond to the entire "book", a "chapter" or smaller sections, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc. |
div1_n - div5_n | A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1). |
unclear | Unreadable or otherwise unclear text. |
atLeast | Minimum presumed length of unclear text in characters. |
atMost | Maximum presumed length of unclear text in characters. |
interpretation | Suggestions for unreadable or unclear text. |
figure | A graphic embedded in the original document. |
figure_rend | Description of the rendering of the figure. |
hi | Highlighted area. |
hi_rend | Description of the rendering of the highlighted area. |
head | A heading. |
head_n | The number of a heading. |
head_rend | Description of the rendering of the heading. |
note | A note in the original document (e.g. footnotes, marginal notes). |
ref | Reference to a footnote. |
ref_target | ID of the footnote being referred to. |
ref_type | Type of reference (e.g. a TEI "noteAnchor"). |
quote | A quotation (in some documents only). |
list | A list of items. |
list_type | The type of list used. |
item | Item in a list. |
xml_id | ID given to a footnote. |
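The interplay of unclear, atLeast, atMost and interpretation can be sketched as a simple consistency check. The class `UnclearSpan` and its method are hypothetical and serve only to illustrate how the length bounds could be compared with a suggested reading; no such check is described in the documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnclearSpan:
    at_least: int                         # value of atLeast (characters)
    at_most: int                          # value of atMost (characters)
    interpretation: Optional[str] = None  # suggested reading, if any

    def __post_init__(self) -> None:
        if self.at_least < 0 or self.at_least > self.at_most:
            raise ValueError("expected 0 <= at_least <= at_most")

    def is_plausible(self) -> bool:
        """True if the suggested reading fits the presumed length range.

        A span without a suggestion has nothing to check and counts as plausible.
        """
        if self.interpretation is None:
            return True
        return self.at_least <= len(self.interpretation) <= self.at_most


print(UnclearSpan(at_least=2, at_most=4, interpretation="den").is_plausible())
print(UnclearSpan(at_least=2, at_most=4, interpretation="worden").is_plausible())
```

A check of this kind could flag suggestions that are longer or shorter than the damaged passage is presumed to be.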
Content Annotations
These annotations were developed by our students to mark spans of tokens with special properties.
Annotation layer | Description |
---|---|
definition | A definition of a figure. |
term | A technical term, the name of a herb or plant, or the name of a disease. |
property | A reference to properties of a herb such as effect, smell etc. |
reader_ref | References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun. |
author_ref | References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun. |
name | A proper name (annotated only in some documents). |
name_type | The type of proper name (e.g. "person", "herb"). |
Metadata
These annotations follow the TEI P5 guidelines.
Annotation layer | Description |
---|---|
meta::author | Name of the author (if known). |
meta::bibl | Full bibliographical entry for the source including the page numbers annotated in the corpus. |
meta::date | Date of publication, usually just the year (e.g. "1722"). |
meta::publisher | Publisher of the document (if known). |
meta::pubPlace | Publication place of the document. |
meta::title | Title of the work the document was extracted from. |