BeMaTaC

A deeply annotated multimodal map-task corpus of spoken learner and native German

This website is also available in German.

About

The Berlin Map Task Corpus (BeMaTaC) is a freely available corpus of spoken German. It consists of an L1 subcorpus recorded with native speakers of German and an identically designed L2 subcorpus with advanced speakers of German as a foreign language. BeMaTaC uses a map-task design, where one speaker (the instructor) instructs another speaker (the instructee) to reproduce a route on a map with landmarks. The speakers cannot see each other and are thus unable to communicate non-verbally. The dialogues are recorded with two separately placed microphones and a video showing the drawing hand of the instructee. Transcriptions are consistently tokenized, time-aligned and annotated on a wide and easily extendable range of different layers. Extensive and anonymized metadata are provided with every dialogue.

New release: The current 3.0 release contains the L1 subcorpus with 12 dialogues (66 minutes total, 8900 normalized tokens) as well as the L2 subcorpus with 5 dialogues (77 minutes total, 9223 normalized tokens).


[Images: instructor and instructee]

Access

BeMaTaC can be accessed using ANNIS, an open-source browser-based search and visualization tool for deeply annotated corpora.

Annotation

The 2.1 / 2013-02.1 release contains the following layers:

Loosely orthographic transcription including fillers, truncations, colloquial contractions and idiosyncratic pronunciations
Normalized orthographic transcription
Automatically generated lemmatization
Automatically generated part-of-speech tags using the STTS (Stuttgart-Tübingen-TagSet)
Syntactically motivated utterance spans
Backchanneling (in the L1 subcorpus, only the instructee's backchanneling)
Disfluencies: fillers (filled pauses), prolongations, mispronunciations, explicit editing terms and repetitions
Repairs: reparandum, interregnum, reparans
Repair subcategorizations: repetitions, substitutions, insertions
Extralinguistic events
Breaks (unfilled pauses)
Token length

The following data is available as part of the NoSta-D corpus:

Syntactic dependencies
Named entity recognition and disambiguation
Coreferences

We are currently working on the following annotations:

Automatic annotation of breaks, fillers and repetitions
Improved part-of-speech tagging by taking utterance spans into account
Semi-automatic normalization
Manually corrected part-of-speech tags (L1 subcorpus)

Long-term annotation plans:

Hyperlemma annotation for idiosyncratic lexical items
Manually corrected lemmatization
Manually corrected part-of-speech tags (L2 subcorpus)
Phonetic/phonological transcription/annotation
Syntactic features
Information structure

Documentation

The following documents apply to the most recent release; previous versions may contain data incompatible with these guidelines.

Download

BeMaTaC is licensed under a Creative Commons Attribution 3.0 Unported License.

If you are using our corpus for research or if you are planning on extending BeMaTaC with further annotations, please tell us about it.

L1 subcorpus: 2.1 / 2013-02.1 release

EXMARaLDA Partitur files (zip archive, 447 KB)
Audio files (WAVE) (zip archive, 565 MB)
Video files (QuickTime) (zip archive, 5.79 GB)
Video files (WebM) (zip archive, 147 MB)
Map files (zip archive, 32 MB)

L2 subcorpus: 2.1 / 2013-02.1 release

EXMARaLDA Partitur files (zip archive, 479 KB)
Audio files (WAVE) (zip archive, 525 MB)
Audio files (mp3) (zip archive, 70.2 MB)
Map files (zip archive, 22.8 MB)
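
The Partitur files listed above are EXMARaLDA basic transcriptions, which are XML documents with a common timeline and one tier per annotation layer. The following is a minimal sketch of how such a file could be read in Python. The inline sample, its tier categories ("tok", "pos") and its content are invented for illustration and are not taken from BeMaTaC; the actual tier names in the release files may differ.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for an EXMARaLDA basic transcription (.exb).
# Tier categories and events are assumptions, not BeMaTaC data.
SAMPLE_EXB = """<basic-transcription>
  <head/>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="0.4"/>
      <tli id="T2" time="0.9"/>
    </common-timeline>
    <tier id="TIE0" speaker="SPK0" category="tok" type="t">
      <event start="T0" end="T1">geh</event>
      <event start="T1" end="T2">links</event>
    </tier>
    <tier id="TIE1" speaker="SPK0" category="pos" type="a">
      <event start="T0" end="T1">VVIMP</event>
      <event start="T1" end="T2">ADV</event>
    </tier>
  </basic-body>
</basic-transcription>
"""


def read_tiers(exb_text):
    """Return {tier category: [(start_seconds, end_seconds, text), ...]}."""
    root = ET.fromstring(exb_text)

    # Map timeline item ids to absolute times; a tli may lack a time value.
    times = {}
    for tli in root.iter("tli"):
        raw = tli.get("time")
        times[tli.get("id")] = float(raw) if raw is not None else None

    tiers = {}
    for tier in root.iter("tier"):
        events = [
            (times.get(ev.get("start")), times.get(ev.get("end")), ev.text or "")
            for ev in tier.iter("event")
        ]
        tiers[tier.get("category")] = events
    return tiers


tiers = read_tiers(SAMPLE_EXB)
for (start, end, token), (_, _, tag) in zip(tiers["tok"], tiers["pos"]):
    print(f"{start:.1f}-{end:.1f}\t{token}\t{tag}")
```

To read a downloaded file instead of the inline sample, the file's text can be passed to read_tiers after opening it with UTF-8 encoding. Note that this sketch keys tiers by category only, so a file with one tier per speaker and category would need the speaker attribute added to the key.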

Other releases

Syntactic dependencies, named entities and coreferences are available as part of the NoSta-D corpus.
Previous releases are available for download in the release history section of this website.

Team & Contact

If you have any questions or requests, please contact Simon Sauer.
Associated: Malte Belz, Oxana Rasskazova
Former members: Linda Giesel, Daisy Krüger, Elisabeth Lühr, Isabelle Nunberger, Myriam Klapi, Rosalia Schultze-Kraft, Melanie Siemund and Albina Töws

Publications

How to cite BeMaTaC

Please always cite this website, using the following URL: http://u.hu-berlin.de/bematac
If mandated by your citation requirements, you may cite Simon Sauer as the primary editor.
If your citation requirements mandate a journal article, please cite
- Simon Sauer & Anke Lüdeling. 2016. Flexible Multi-Layer Spoken Dialogue Corpora. International Journal of Corpus Linguistics, Volume 21, Issue 3, 2016, Special Issue: Compilation, Transcription, Markup and Annotation of Spoken Corpora, 419–438.
In addition to the website, you may cite the following posters:
- Linda Giesel, Myriam Klapi, Daisy Krüger, Isabelle Nunberger, Oxana Rasskazova, Simon Sauer. 2013. Berlin Map Task Corpus – A deeply annotated multimodal map-task corpus of spoken learner and native German. DGfS-CL 2013.
  [http://korpling.german.hu-berlin.de/bematac/publications/Giesel-et-al_2013_DGfS-CL-2013.pdf]
- Simon Sauer & Oxana Rasskazova. 2014. BeMaTaC – eine digitale multimodale Ressource für Sprach- und Dialogforschung. Workshop Grenzen überschreiten – Digitale Geisteswissenschaft heute und morgen, Digital Humanities Berlin 2014.
  [http://korpling.german.hu-berlin.de/bematac/publications/Sauer-Rasskazova_2014_3WS-DHB.pdf]
When citing specific data from within the corpus, please refer to the subcorpus (L1 or L2), the corpus version (e.g. 2013-02.1), the specific document (e.g. 2011-12-14-A), and the token range as given in the tok layer.
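
The four components named above (subcorpus, corpus version, document, token range) can be assembled consistently with a small helper. This is only a sketch: the function name corpus_reference and the output wording are hypothetical, since this page does not prescribe an exact string format.

```python
def corpus_reference(subcorpus, version, document, first_tok, last_tok):
    """Build a data reference from the four components the corpus asks for.

    The output format is an assumption made for illustration.
    """
    if subcorpus not in ("L1", "L2"):
        raise ValueError("subcorpus must be 'L1' or 'L2'")
    if first_tok > last_tok:
        raise ValueError("token range must be ascending")
    return (
        f"BeMaTaC {subcorpus}, version {version}, "
        f"document {document}, tok {first_tok}-{last_tok}"
    )


# Example values: the version and document id are the ones given in the text,
# the token numbers are invented.
reference = corpus_reference("L1", "2013-02.1", "2011-12-14-A", 10, 25)
print(reference)
```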

2017

Malte Belz, Simon Sauer, Anke Lüdeling, Christine Mooshammer. 2017. Fluently disfluent? Pauses and Repairs of Advanced Learners and Native Speakers of German. International Journal of Learner Corpus Research, Volume 3, Issue 2, 2017, Special Issue: Segmental, Prosodic and Fluency Features in Phonetic Learner Corpora, 118–148. [https://doi.org/10.1075/ijlcr.3.2.02bel]

2016

Simon Sauer & Anke Lüdeling. 2016. Flexible Multi-Layer Spoken Dialogue Corpora. International Journal of Corpus Linguistics, Volume 21, Issue 3, 2016, Special Issue: Compilation, Transcription, Markup and Annotation of Spoken Corpora, 419–438. [pre-final version]

2015

Malte Belz, Simon Sauer, Anke Lüdeling, Christine Mooshammer. 2015. Repair Behaviour of Advanced German Learners in the Berlin Map Task Corpus. IFCASL Workshop on Phonetic Learner Corpora, satellite workshop of ICPhS2015, Glasgow, 12.08.2015.

Anke Lüdeling, Malte Belz, Hagen Hirschmann, Martin Klotz, Carolin Odebrecht, Laura Perlitz, Simon Sauer, Vivian Voigt. 2015. BeMaTaC, Falko, RIDGES. Linguistische Mehrebenenkorpora für Nichtstandard-Varietäten des Deutschen. Digital-Humanities-Tag 2015, Philosophische Fakultät II, Humboldt-Universität zu Berlin. [poster]

Simon Sauer. 2015. BeMaTaC: Ein tief annotiertes multimodales Map-Task-Korpus gesprochener Lerner- und Muttersprache. Gesprochene Fremdsprache Deutsch – Forschung und Vermittlung, Universidade de Lisboa, 26.–28.02.2015. [abstract]

2014

Malte Belz. 2014. Managing referential mismatches in German map task dialogues. RefNet Workshop, Edinburgh, 31.08.2014. [abstract]

Oxana Rasskazova, Simon Sauer, Christine Mooshammer. 2014. Berlin Dialog Corpus (BeDiaCo) – ein multimodales Korpus für Konvergenz- und Dialogforschung. Workshop Sprachdatenbanken – von der Aufnahme zur Publikation, CLARIN-D. [poster]

Simon Sauer & Oxana Rasskazova. 2014. BeMaTaC – eine digitale multimodale Ressource für Sprach- und Dialogforschung. Workshop Grenzen überschreiten – Digitale Geisteswissenschaft heute und morgen, Digital Humanities Berlin 2014. [poster]

Malte Belz. 2014. Repair disfluencies in German native and non-native speech. Linguistic Evidence 2014. [poster]

2013

Myriam Klapi. 2013. Disfluency Patterns: A Contrastive Corpus Study. Master's thesis. Humboldt-Universität zu Berlin, December 2013.

Malte Belz. 2013. Disfluencies und Reparaturen bei Muttersprachlern und Lernern – eine kontrastive Analyse. Master's thesis. Humboldt-Universität zu Berlin, November 2013. [online]

Oxana Rasskazova & Simon Sauer. 2013. BeMaTaC: ein multimodales Map-Task-Dialogkorpus. Pre-conference workshop Gesprochene Sprache und Sprachverarbeitung, GSCL 2013. [abstract]

Anke Lüdeling. 2013. Corpora of Spoken Language. Invited talk. From Hand to Mouth: A Dialogue between Spoken and Sign Language Research 2013. [slides]

Malte Belz & Myriam Klapi. 2013. Pauses following Fillers in L1 and L2 German Map Task Dialogues. Proceedings of Disfluency in Spontaneous Speech. DiSS 2013, 9–12. [online]

Clara Becker. 2013. Doing Backchanneling – Verhalten von Frauen und Männern beim Backchanneling im aufgabenorientierten Dialog. Bachelor's thesis. Humboldt-Universität zu Berlin, July 2013. [online]

Simon Sauer & Anke Lüdeling. 2013. BeMaTaC: A Flexible Multilayer Spoken Dialogue Corpus for Contrastive SLA Analyses. ICAME 34, 46–47. [abstract]

Linda Giesel, Myriam Klapi, Daisy Krüger, Isabelle Nunberger, Oxana Rasskazova, Simon Sauer. 2013. Gesprochene Muttersprache vs. Lernersprache – Aufbau und Auswertung eines Korpus. Forschendes Lernen an der Humboldt-Universität zu Berlin, 81–86. [online]

Linda Giesel, Myriam Klapi, Daisy Krüger, Isabelle Nunberger, Oxana Rasskazova, Simon Sauer. 2013. Berlin Map Task Corpus – A deeply annotated multimodal map-task corpus of spoken learner and native German. DGfS-CL 2013. [poster]

Teaching

A key aim of BeMaTaC is promoting the usage of corpora and teaching the necessary expertise. This is accomplished not only by using BeMaTaC data in linguistics courses but also by actively extending the corpus in class.

Winter term 2014/2015

TUT Nichtstandardvarietäten im Deutschen.
Julia Kostka, Pia Linscheid, Kristina Sommer, Humboldt-Universität zu Berlin.

Winter term 2013/2014

TUT Korpusdesign und gesprochene Sprache – BeMaTaC.
Oxana Rasskazova & Simon Sauer, Humboldt-Universität zu Berlin.

Summer term 2013

Q-TUT Berlin Map Task Corpus – Korpusdesign und gesprochene Sprache.
Myriam Klapi, Daisy Krüger, Isabelle Nunberger, Oxana Rasskazova, Simon Sauer, Humboldt-Universität zu Berlin.

Winter term 2012/2013

Q-TUT Gesprochene Muttersprache vs. Lernersprache – Aufbau und Auswertung eines Korpus.
Linda Giesel, Myriam Klapi, Daisy Krüger, Isabelle Nunberger, Oxana Rasskazova, Simon Sauer, Humboldt-Universität zu Berlin.
HS Corpus Annotation of Information Structure.
Kordula De Kuthy & Detmar Meurers, Eberhard Karls Universität Tübingen.

Winter term 2011/2012

SE Gesprochene Lernersprache.
Anke Lüdeling & Bernd Pompino-Marschall, Humboldt-Universität zu Berlin.

Tools & References

Original map-task design by HCRC
Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, Catherine Sotillo, Henry Thompson & Regina Weinert. 1991. The HCRC Map Task Corpus. Language and Speech 34, 351–366.

Original corpus design based on HAMATAC
Thomas Schmidt, Hanna Hedeland, Timm Lehmberg & Kai Wörner. 2010. HAMATAC – The Hamburg MapTask Corpus. [online]

Maps courtesy of IDS Mannheim
Caren Brinckmann, Stefan Kleiner, Ralf Knöbl & Nina Berend. 2008. German Today: an areally extensive corpus of spoken Standard German. Proceedings 6th International Conference on Language Resources and Evaluation. LREC 2008. [online]

Automatic segmentation and alignment: MAUS
Florian Schiel, Christoph Draxler & Jonathan Harrington. 2011. Phonemic Segmentation and Labelling using the MAUS Technique. Workshop New Tools and Methods for Very-Large-Scale Phonetics Research. University of Pennsylvania, 2011, January, 28–31. [online]

Manual alignment and normalization: Praat
Paul Boersma. 2001. Praat, a system for doing phonetics by computer. Glot International 5 (9/10), 341–345.

Annotation and metadata: EXMARaLDA
Thomas Schmidt & Kai Wörner. 2009. EXMARaLDA – Creating, analysing and sharing spoken language corpora for pragmatic research. Pragmatics (19:4), 565–582.

Lemmatization and part-of-speech tagging: TreeTagger
Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing. [online]

Part-of-speech tagset: STTS
Anne Schiller, Simone Teufel, Christine Stöckert & Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). [online]

Converter framework: SaltNPepper
Florian Zipser & Laurent Romary. 2010. A model oriented approach to the mapping of annotation formats using standards. Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. [online]

Search and visualization interface: ANNIS
Amir Zeldes, Julia Ritz, Anke Lüdeling & Christian Chiarcos. 2009. ANNIS: A Search Tool for Multi-Layer Annotated Corpora. Proceedings of Corpus Linguistics 2009, July, 20–23. [online]

Last update: 24 September 2017

Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology