Research
Current projects
ANNIS |
ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation. ANNIS, which stands for Annotation of Information Structure, has been designed to provide access to the data of the SFB 632 "Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts". Since information structure interacts with linguistic phenomena on many levels, ANNIS2 addresses the SFB's need to concurrently annotate, query and visualize data from such varied areas as syntax, semantics, morphology, prosody, referentiality, lexis and more. For projects working with spoken language, support for audio/video annotations is also required. http://corpus-tools.org/ |
Collaborative Research Centre 1412 "Register" |
The CRC Register: Language Users’ Knowledge of Situational-Functional Variation investigates aspects of the register knowledge of the speakers of a language. |
LAUDATIO |
The management and archiving of digital research data is a field where linguistics, library and information science (LIS) and computer science overlap. These disciplines cooperate in the LAUDATIO project. The name LAUDATIO is an abbreviation for Long term Access and Usage of Deeply Annotated Information. The project is funded by the German Research Foundation from 2011 to 2014. The departments of Corpus Linguistics and Historical Linguistics and the Computer and Media Service (CMS) at Humboldt-Universität zu Berlin, together with the National Institute for Research in Computer Science and Control (INRIA France), are project partners cooperating with the Berlin School of Library and Information Science (BSLIS). LAUDATIO aims to build an open access research data repository for historical linguistic data that meets the requirements of historical corpus linguistics. To support access to and (re-)use of historical linguistic data, the LAUDATIO repository uses a flexible documentation schema based on a subset of TEI customized with TEI ODD. The extensive metadata schema contains information about the preparation and checking methods applied to the data, the tools, formats and annotation guidelines used in the project, bibliographic metadata, and information on the research context (e.g. the research project). To provide complex and comprehensive search in the linguistic annotation data, the linguistic search and visualization tool ANNIS will be integrated in the LAUDATIO repository infrastructure. http://www.laudatio-repository.org |
Hexatomic |
“A minimal infrastructure for the sustainable provision of extensible multi-layer annotation software for linguistic corpora” (Hexatomic) is a joint research project at Friedrich-Schiller-Universität Jena and Humboldt-Universität zu Berlin. https://hexatomic.github.io/ |
Mind Research Repository (MRR) |
The Mind Research Repository (MRR) provides access to publications along with data and scripts for analyses and figures reported in them. It is a further development of a project started as the Potsdam Mind Research Repository (PMR2) in August 2010.
SaltNPepper |
With SaltNPepper we provide two powerful frameworks for dealing with linguistically annotated data. SaltNPepper is an Open Source project developed at the Humboldt University of Berlin. In linguistic research a variety of formats exists, but no common way of dealing with them. Therefore we developed a metamodel called Salt which abstracts over linguistic data. Salt is based on a general graph structure and treats linguistic data as sets of nodes and edges. It is therefore usable in very different contexts of linguistic analysis. Pepper is a pluggable framework which makes it possible to plug in new modules (using OSGi). The architecture of Pepper is flexible and makes it possible to benefit from already existing modules. |
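The idea of treating linguistic data as nodes and edges can be illustrated with a minimal sketch. This is not the Salt API (Salt itself is a separate framework); the class names `Node`, `Edge` and `AnnotationGraph`, the edge types and the example annotations are all hypothetical and chosen only to show how tokens, a phrase node and two kinds of relations can live in one graph.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node (e.g. a token or a phrase) with key/value annotations."""
    node_id: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A typed, directed edge between two nodes, optionally annotated."""
    source: str
    target: str
    edge_type: str
    annotations: dict = field(default_factory=dict)


@dataclass
class AnnotationGraph:
    """Hypothetical container: linguistic data as sets of nodes and edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, **annotations):
        self.nodes[node_id] = Node(node_id, dict(annotations))
        return self.nodes[node_id]

    def add_edge(self, source, target, edge_type, **annotations):
        # Edges may only connect nodes that are already part of the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        edge = Edge(source, target, edge_type, dict(annotations))
        self.edges.append(edge)
        return edge

    def children(self, node_id, edge_type=None):
        """Targets of outgoing edges, optionally restricted to one edge type."""
        return [
            e.target
            for e in self.edges
            if e.source == node_id and (edge_type is None or e.edge_type == edge_type)
        ]


def build_example():
    """Two tokens, one phrase node, and two kinds of relations (illustrative)."""
    g = AnnotationGraph()
    g.add_node("tok1", text="Das", pos="ART")
    g.add_node("tok2", text="Haus", pos="NN")
    g.add_node("np1", cat="NP")
    g.add_edge("np1", "tok1", "dominance")
    g.add_edge("np1", "tok2", "dominance")
    g.add_edge("tok2", "tok1", "dependency", func="det")
    return g
```

Because constituency and dependency relations are just differently typed edges over the same nodes, annotations from very different formats can in principle be held in one structure, which is the property the paragraph above attributes to a graph-based metamodel.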
<tiger2/> |
<tiger2/> is a standard-conformant XML format serializing the ISO SynAF model (ISO 24615:2010) for expressing syntactic annotation for a wide variety of theoretical formalisms and corpus architectures. It is closely related to and develops the ideas found in TigerXML (http://www.ims.uni-stuttgart.de/projekte/TIGER/). The format is conceived as theory neutral: it is suited to both shallow and deep parsing in any number of theories and supports pure constituency trees, pure dependency trees, and combinations of the two. For more information (schemas, API, etc.) see: http://korpling.german.hu-berlin.de/tiger2/ |
Resources
BeMaTaC |
The Berlin Map Task Corpus (BeMaTaC) is a freely available corpus of spoken German. It consists of an L1 subcorpus recorded with native speakers of German and an identically designed L2 subcorpus with speakers of German as a foreign language. BeMaTaC uses a map-task design, where one speaker (the instructor) instructs another speaker (the instructee) to reproduce a route on a map with landmarks. The dialogues are recorded with two separately placed microphones and a video showing the drawing hand of the instructee. Transcriptions are consistently tokenized, time-aligned and annotated on a wide and easily extendable range of different layers. Extensive and anonymized metadata are provided with every dialogue. |
DDB (Deutsche Diachrone Baumbank) |
The DDB (Deutsche Diachrone Baumbank) is a small (ca. 8000 tokens), deeply syntactically annotated corpus consisting of three subcorpora from different language periods of German (Old High German, Middle High German, Early New High German). The design of the corpus mainly follows the TIGER corpus, one of the largest freely accessible treebanks of German. DDB was developed within the project "Interdisciplinary research network linguistics – bioinformatics for the computation of kinship and descent", supported by the Senate of Berlin. Homepage: http://korpling.german.hu-berlin.de/ddb-doku/index.htm Corpus: http://korpling.german.hu-berlin.de/ddd/search.html |
Fairy tales corpus (Märchenkorpus) |
The fairy tales corpus contains the 201 "Kinder- und Hausmärchen" and the 10 children's legends (Kinderlegenden) printed in the second volume of the Brothers Grimm's final edition. The corpus was designed, compiled and edited for the seminar "Drama pedagogy of fairy tales: Linguistics, Pedagogy and Theatre." The seminar, led by Maik Walter, took place in the summer term 2013 at the German Department of the University of Tübingen (see Maik Walter (in press): Es VERBte (ein)mal. Linguistisches Forschungstheater im Grimm-Jahr 2013. Zeitschrift für Theaterpädagogik 63. 29. Jahrgang. Themenheft: Forschung, Fachdiskurse & Labore). |
Falko |
Falko is a freely available error-annotated learner corpus of German as a foreign language. http://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/research/falko/ |
KanDeL |
KanDeL (Kansas Developmental Learner corpus) is a freely available longitudinal learner corpus of beginning to intermediate learners of German as a foreign language, constructed at the University of Kansas by Nina Vyatkina. http://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/research/kandel |
NoSta-D |
|
RIDGES |
The RIDGES project (Register in Diachronic German Science) is an investigation into the development of the German scientific language in the early modern and modern periods, ranging from the mid-16th to the late 19th century. |
Networks
INDUS network |
Individualized language learning (as a counterpart to standardized mass courses) has come within reach thanks to the latest developments in language technology. This makes it possible to cover not only widely spoken languages but also "small" ones. However, embedding these technologies in real learning situations raises many new questions that can only be answered by a research effort spanning many disciplines. To this end, the INDUS network brings together researchers from language technology, linguistics, educational research, psychology of learning, educational psychology, language acquisition research and language teaching methodology who have already worked on language learning within their own fields of expertise. Together they address concrete research questions that focus on aspects of individualization, e.g. modelling the learner, adapting teaching material to different starting conditions such as native language and prior knowledge, and generating helpful feedback. |
Kobalt-DAF network |
Annotation and analysis of argumentative learner texts: converging approaches to a written corpus of German as a foreign language. http://www.kobalt-daf.de/ |
Finished projects and networks
LangBank |
The LangBank (Digital Infrastructure to Support the Study of Latin and Historical German) project is dedicated to the creation of a resource of annotated texts in Classical Latin and Historical German. Access to a wide range of fully annotated texts is an important asset for research in the humanities as well as for language acquisition: while teachers and students need texts suited to both the intended illustrative purpose and the learner's proficiency level, scholars need to search several texts for specific language properties, such as grammatical constructions, vocabulary, spelling differences, etc. |
empirikom |
The network, which is funded by the German Research Foundation (DFG), combines skills from German Linguistics, Computational Linguistics, Computer Science and Psychology in order to achieve two goals: first, based on a set of concrete research questions, to compile suggestions for standards and procedures for processing linguistic data from German internet-based communication and, second, to develop methods and tools for their empirical computer-assisted analysis. The findings will be documented in publications, and the suggestions for standards and procedures will successively be provided online. http://www.empirikom.net |
Kompost |
Using methods from computational linguistics, this project will identify indicators of the quality of students’ texts in the German language. Special emphasis will be placed on the evolution of those quality indicators across competence levels, i.e. the development of observable parameter values over time as the students’ language skills improve. The study will be based on essays, test results, students’ attitudes and personal information from the city of Hamburg’s longitudinal KESS study, as well as material from other surveys. The core of this dataset comprises approximately 9000 essays which were rated along several dimensions. http://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/research/kompost |
This project seeks to systematically identify linguistic structures of German that pose a specific difficulty for the acquisition of German as a foreign language (GFL). Conventionally, this is done by observing learner errors (see Borin & Prütz 2004 or Westergren-Axelsson & Hahn 2001). However, if learners avoid difficult elements, this method fails. We claim that the relative underrepresentation of structures in learner data implies that these structures are difficult to acquire. Therefore, we propose a systematic study of underrepresented structures. http://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/research/learner-difficulties/WHIG-en |
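One simple way to make "relative underrepresentation" concrete is to compare the relative frequency of each structure in a learner corpus with its relative frequency in a native-speaker reference corpus. The sketch below is only an illustration under that assumption, not the project's actual method; the function name `underuse_ratios`, the structure labels and the counts are hypothetical.

```python
def underuse_ratios(learner_counts, native_counts):
    """Ratio of learner to native relative frequency per structure.

    A ratio below 1.0 means the structure is relatively rarer in the
    learner data than in the native reference data (illustrative measure).
    """
    learner_total = sum(learner_counts.values())
    native_total = sum(native_counts.values())
    if learner_total == 0 or native_total == 0:
        raise ValueError("both corpora must contain at least one observation")

    ratios = {}
    for structure, native_n in native_counts.items():
        if native_n == 0:
            continue  # no reference frequency to compare against
        learner_rel = learner_counts.get(structure, 0) / learner_total
        native_rel = native_n / native_total
        ratios[structure] = learner_rel / native_rel
    return ratios


# Made-up counts, for illustration only.
example_learner = {"main_clause": 80, "passive": 20}
example_native = {"main_clause": 60, "passive": 40}
example_ratios = underuse_ratios(example_learner, example_native)
```

In a real study the raw ratio would need to be complemented by a significance test and by controls for topic and text type, since a structure can be rare in learner texts for reasons other than difficulty.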