Faculty of Language, Literature and Humanities - RUEG

RUEG corpus

The corpus

Survey methods

Languages and speakers

Data processing

Tools and editors

Subcorpora and size

Access to the corpus

Information on working with the RUEG corpus and ANNIS

Notes on citation

 

The corpus

The RUEG corpus contains parallel, yet naturalistic data from bilingual and monolingual speakers in English, German, Greek, Russian and Turkish. It supports systematic comparisons across languages, countries and societies, heritage and majority language contexts, mono- vs. bilingual speaker groups, age groups and communicative situations (formal and informal, spoken and written).

The corpus was developed between 2018 and 2024 within the context of the research unit "Emerging Grammars in Language-Contact Situations" (spokesperson: Heike Wiese, Humboldt University of Berlin). It was created to identify and compare nonstandard phenomena in different language contact constellations.

For collecting, processing, and publishing the data, we obtained the relevant ethics and data-handling approvals from Universität Potsdam, Humboldt-Universität zu Berlin, the German Society of Linguistics (DGfS), the Greek Ministry of Education, Religious Affairs and Sports and the Turkish Ministry of National Education. The US-based data collection was conducted under the auspices of the University of Maryland (Institutional Review Board, IRB 766233-4).

The corpus is available via open access.

 

Survey methods

The data was collected with the "LangSit" method, which elicits naturalistic and ecologically valid, yet controlled linguistic productions that capture speakers' repertoires across formal and informal communicative situations. Speakers watched a video of a traffic accident, imagined they had just witnessed it in real life, and then described it in four different fictitious situations:

  • formal-spoken: voice message to the police with verbal witness report
  • formal-written: written witness report for the police (digital, typed on laptop)
  • informal-spoken: voice message (via WhatsApp) to a friend
  • informal-written: text message (via WhatsApp) to a friend.

For each speaker, we collected meta-data through a questionnaire covering, inter alia, (language-)biographical data, language use, media usage and personal character traits.

RUEG provides open access to all aspects of the elicitation, including all stimuli, elicitation tools, and an instructional video for training elicitors, via our OSF page.

The templates of our consent forms can be found via this link.

 

Languages and speakers

Data was collected in different languages and language contact constellations in five countries:

 

USA (= English as majority language)

    • monolingual speakers (only English spoken regularly in the family): English data
    • multilingual speakers with German as heritage language: English and German data
    • multilingual speakers with Greek as heritage language: English and Greek data
    • multilingual speakers with Russian as heritage language: English and Russian data
    • multilingual speakers with Turkish as heritage language: English and Turkish data

 

Germany (= German as majority language)

    • monolingual speakers (only German spoken regularly in the family): German data
    • multilingual speakers with Greek as heritage language: German and Greek data
    • multilingual speakers with Russian as heritage language: German and Russian data
    • multilingual speakers with Turkish as heritage language: German and Turkish data

 

Greece (= Greek as majority language)

    • monolingual speakers (only Greek spoken regularly in the family): Greek data

 

Russia (= Russian as majority language)

    • monolingual speakers (only Russian spoken regularly in the family): Russian data

 

Turkey (= Turkish as majority language)

    • monolingual speakers (only Turkish spoken regularly in the family): Turkish data
    • multilingual speakers with Kurdish as heritage language: Turkish data

 

Speakers come from two age groups:

  • adults: 20 to 37 years of age
  • adolescents: 13 to 19 years of age

 

Data processing

Spoken productions were recorded and transcribed (formal- and informal-spoken data); written productions were saved as plain text files (formal-written data) or exported (informal-written data: WhatsApp messages). The data is organised in CUs (communicative units), each understood as an "independent clause with its modifiers" (Loban 1976: 9).

The corpus is lemmatised and contains several annotation layers:

  • normalisation layer
  • part-of-speech (POS) tagging
  • dependency tags for all languages, prosodic annotations for English and Russian data and hierarchical topological fields for German data
  • referent annotations for parts of the data
  • morphological annotations (case, gender, number, person, German complex verbs)

The corpus is gradually being expanded and versioned through improved and additional annotations.

The data was processed such that all annotations and meta-data are searchable in ANNIS.

Apart from the ANNIS data format, the corpus is available in the following formats, depending on the annotation:

 

Format          Annotations
EXMARaLDA-XML   span and token annotations (diplomatic and normalized tokens, communicative units, language, morphological annotations, referent annotations)
PRAAT TextGrid  transcriptions (all), prosodic annotations (English and Russian data)
CoNLL-U         automatic dependency parses, manually corrected lemmatization as well as universal and language-specific part of speech
PTB             hierarchical topological fields (German data)
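As a rough illustration of working with the CoNLL-U export, the following Python sketch reads CoNLL-U text into token dictionaries using only the standard library. The ten column names follow the general CoNLL-U specification; the function name read_conllu and the sample sentence are made up for this example and are not taken from the corpus. Multiword-token and empty-node lines are not treated specially here.

```python
from typing import Dict, Iterator, List

# The ten tab-separated CoNLL-U columns, in their standard order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Invented sample sentence (not corpus data).
SAMPLE = (
    "# text = The car stopped .\n"
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_\n"
    "2\tcar\tcar\tNOUN\tNN\t_\t3\tnsubj\t_\t_\n"
    "3\tstopped\tstop\tVERB\tVBD\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
)

for tokens in read_conllu(SAMPLE):
    print([(t["form"], t["lemma"], t["upos"], t["deprel"]) for t in tokens])
```

Each token dictionary keeps all values as strings, so numeric fields such as head need to be converted before use.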

Transcription guidelines:

  • Guidelines will be published here soon.

All data was anonymised. Speaker names were substituted with speaker codes.

All file names contain speaker codes and abbreviations of communicative situations, providing the following information:

 

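[Figure: structure of RUEG file names (speaker code and abbreviation of the communicative situation); image file RUEG2siglenGREEK-001.jpg]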
 

Tools and editors

The following tools were involved in creating and publishing the corpus (in alphabetical order):

 

Tool                                  Purpose
ANNIS (Krause et al. 2016)            Corpus search and visualization
EXMARaLDA (Schmidt & Wörner 2014)     Annotation
MyStem Tagger (Segalovich 2003)       POS-tagging and lemmatization
PRAAT (Boersma & Weenink 2019)        Transcription, annotation of prosody
Salt'n'Pepper (Zipser & Romary 2010)  Modelling and data conversion
TreeTagger (Schmid 1999)              POS-tagging and lemmatization
UDPipe (Straka & Straková 2017)       POS-tagging and dependency parsing

Subcorpora and size

The RUEG corpus consists of five subcorpora, one for each language (DE - German; EN - English; EL - Greek; RU - Russian; TR - Turkish).

Altogether, the current corpus version (0.4.0) contains linguistic productions from 720 speakers, of whom 349 are adults and 371 are adolescents.

 

RUEG 0.4:

Language  Tokens (norm)  CUs       Speakers
DE        ~ 164,000      ~ 21,000  260
EL        ~ 75,000       ~ 7,000   167
EN        ~ 20,000       ~ 19,082  287
RU        ~ 93,000       ~ 12,000  193
TR        ~ 67,000       ~ 13,000  188

 

Data distribution over subcorpora (note that bilingual speakers have provided data in two languages and hence will appear twice in the list):

Subcorpus  Majority language  Speaker group          Adults  Adolescents  Total
RUEG-DE    German             monolingual (German)   31      33           64
RUEG-DE    German             bilingual (h-Greek)    26      18           44
RUEG-DE    German             bilingual (h-Russian)  29      28           57
RUEG-DE    German             bilingual (h-Turkish)  33      32           65
RUEG-DE    English            bilingual (h-German)   7       23           30
RUEG-EN    English            monolingual (English)  32      32           64
RUEG-EN    English            bilingual (h-German)   7       27           34
RUEG-EN    English            bilingual (h-Greek)    32      32           64
RUEG-EN    English            bilingual (h-Russian)  33      33           66
RUEG-EN    English            bilingual (h-Turkish)  27      32           59
RUEG-EL    German             bilingual (h-Greek)    27      20           47
RUEG-EL    English            bilingual (h-Greek)    24      32           56
RUEG-EL    Greek              monolingual (Greek)    32      32           64
RUEG-RU    German             bilingual (h-Russian)  30      28           58
RUEG-RU    English            bilingual (h-Russian)  33      35           68
RUEG-RU    Russian            monolingual (Russian)  33      34           67
RUEG-TR    German             bilingual (h-Turkish)  33      32           65
RUEG-TR    English            bilingual (h-Turkish)  27      30           57
RUEG-TR    Turkish            monolingual (Turkish)  32      34           66
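The figures in the table above can be cross-checked with a few lines of Python. The sketch below copies the adult and adolescent counts per group from the table and sums them per subcorpus; the variable and function names are illustrative only. The resulting totals agree with the speakers column of the RUEG 0.4 overview, and their sum is larger than the 720 individual speakers because bilingual speakers are counted once per language in which they provided data.

```python
from typing import Dict, Tuple

# (subcorpus, majority language, speaker group) -> (adults, adolescents),
# copied from the data distribution table.
GROUPS: Dict[Tuple[str, str, str], Tuple[int, int]] = {
    ("RUEG-DE", "German", "monolingual (German)"): (31, 33),
    ("RUEG-DE", "German", "bilingual (h-Greek)"): (26, 18),
    ("RUEG-DE", "German", "bilingual (h-Russian)"): (29, 28),
    ("RUEG-DE", "German", "bilingual (h-Turkish)"): (33, 32),
    ("RUEG-DE", "English", "bilingual (h-German)"): (7, 23),
    ("RUEG-EN", "English", "monolingual (English)"): (32, 32),
    ("RUEG-EN", "English", "bilingual (h-German)"): (7, 27),
    ("RUEG-EN", "English", "bilingual (h-Greek)"): (32, 32),
    ("RUEG-EN", "English", "bilingual (h-Russian)"): (33, 33),
    ("RUEG-EN", "English", "bilingual (h-Turkish)"): (27, 32),
    ("RUEG-EL", "German", "bilingual (h-Greek)"): (27, 20),
    ("RUEG-EL", "English", "bilingual (h-Greek)"): (24, 32),
    ("RUEG-EL", "Greek", "monolingual (Greek)"): (32, 32),
    ("RUEG-RU", "German", "bilingual (h-Russian)"): (30, 28),
    ("RUEG-RU", "English", "bilingual (h-Russian)"): (33, 35),
    ("RUEG-RU", "Russian", "monolingual (Russian)"): (33, 34),
    ("RUEG-TR", "German", "bilingual (h-Turkish)"): (33, 32),
    ("RUEG-TR", "English", "bilingual (h-Turkish)"): (27, 30),
    ("RUEG-TR", "Turkish", "monolingual (Turkish)"): (32, 34),
}

# Speakers per subcorpus as listed in the RUEG 0.4 overview.
REPORTED = {"RUEG-DE": 260, "RUEG-EL": 167, "RUEG-EN": 287,
            "RUEG-RU": 193, "RUEG-TR": 188}


def totals_by_subcorpus(groups: Dict[Tuple[str, str, str], Tuple[int, int]]) -> Dict[str, int]:
    """Sum adults and adolescents over all speaker groups of each subcorpus."""
    totals: Dict[str, int] = {}
    for (subcorpus, _majority, _group), (adults, adolescents) in groups.items():
        totals[subcorpus] = totals.get(subcorpus, 0) + adults + adolescents
    return totals


totals = totals_by_subcorpus(GROUPS)
print(totals == REPORTED)    # True: both tables agree
print(sum(totals.values()))  # 1095 appearances across subcorpora, versus 720 individual speakers
```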


 


Access to the corpus

The RUEG corpus is openly accessible via the browser-based application ANNIS. ANNIS is an open source search and visualisation architecture for multi-layer corpora, developed at Humboldt University of Berlin, Georgetown University, and Potsdam University. It can be used to search for complex graph structures of annotated nodes and edges forming a variety of linguistic structures, such as constituent or dependency syntax trees, coreference, rhetorical structure and parallel alignment edges, span annotations and associated multi-modal data.

The current version of the corpus can also be fully downloaded via Zenodo.

 

Information on working with the RUEG corpus and ANNIS

Search queries in ANNIS are formulated in the “ANNIS Query Language” (AQL).

Tutorials for the formulation of search queries and the operation of ANNIS are available here.

Further information can be found in the ANNIS User Guide.
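As a minimal sketch of what an AQL query looks like, the Python helper below assembles a query that matches a single token carrying several annotations at once. In AQL, name="value" searches for an annotation, & combines search terms, and #1 _=_ #2 requires the first and second term to cover the same span. The helper name aql_same_token, the annotation names lemma and pos, and the values used here are assumptions for illustration; the actual layer names and tag sets should be checked in ANNIS.

```python
def aql_same_token(**annotations: str) -> str:
    """Build an AQL query for one token that carries all given annotations.

    Values are inserted verbatim, so they must not contain double quotes.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    # One search term per annotation, e.g. lemma="car".
    terms = [f'{name}="{value}"' for name, value in annotations.items()]
    # Tie every further term to the first one: identical text coverage.
    links = [f"#1 _=_ #{i}" for i in range(2, len(terms) + 1)]
    return " & ".join(terms + links)


# Hypothetical annotation names and values.
print(aql_same_token(lemma="car", pos="NN"))
# lemma="car" & pos="NN" & #1 _=_ #2
```

The printed string can be pasted into the ANNIS query field once the annotation names have been adapted to the layers of the selected subcorpus.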

If you have questions specifically about ANNIS, you can also contact the ANNIS team directly by e-mail.

 

Notes on citation

When you use the corpus, please cite it as:

Wiese, Heike, Alexiadou, Artemis, Allen, Shanley, Bunk, Oliver, Gagarina, Natalia, Iefremenko, Kateryna, Jahns, Esther, Klotz, Martin, Krause, Thomas, Labrenz, Annika, Lüdeling, Anke, Martynova, Maria, Neuhaus, Katrin, Pashkova, Tatiana, Rizou, Vicky, Tracy, Rosemarie, Schroeder, Christoph, Szucsich, Luka, Tsehaye, Wintai, Zerbian, Sabine, Zuban, Yulia. (2019). RUEG Corpus (Version 0.2.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3236069

 


 

Researchers who have been or are currently involved in RUEG and in the development of the RUEG corpus:

 

Artemis Alexiadou, Daria Alkhimchenkova, Shanley Allen, Chris Allison, Christian Anders, Simar Aybar, Yesim Bayram, Ricarda Bothe, Marlene Böttcher, Nina Bredereck, Marvin Brink, Olga Buchmüller, Oliver Bunk, Ryan Carroll, Franziska Cavar, Büşra Çiçek, Lea Coy, Claudia Czarniac, Leah Doroski, Sabine Eisele (Zerbian), Mary Elliott, Uğur Erdem, Natalia Gagarina, Yağmur Gök, Sabine Hainsfurth, Gajaneh Hartz, Luc Henriquez, Abigail Hodge, Josefine Hundelt, Kateryna Iefremenko, Yuliia Ivashchyk, Janie Johnson, Lydia Kampitsi, Foteini-Maria Karkaletsou, Mareike Keller, Marius Keller, Hanna Kim, Havin Kiye, Martin Klotz, Luisa Koch, Voula Kokolaki, Ioanna Kolokytha, Andrei Koniaev, Alexandra König, Thomas Krause, Annika Labrenz, Alexander Lehmann, Dimitris Lithoksoou, Anke Lüdeling, Iro Malta, Maria Martynova, Gökçe Nur Mercan, Tony Müller, Mark Murphy, Daniel Naumov, Mariia Naumovets, Murat Uskan Oğuz, Zeynep Özal, Onur Özsoy, Foteini Papageorgiou, Tatiana Pashkova, Nils Picksak, Sharon Rauschenbach, Vasiliki Rizou, Guendalina Reul, Myrto Rompaki, Albrun Roy, Anastasia Rozowa, Simge Sargın Kısacık, Amalia Savva, Sam Schirm, Alina Schöpf, Christoph Schroeder, Jasmine Segarra, Tjona Sommer, Selena Song, Luka Szucsich, Johanna Tausch, Charlott Thomas, Türkan Tosun, Simge Türe, Rosemarie Tracy, Wintai Tsehaye, Nikolas Tsokanos, Elena Unger, Yelizaveta Vlasova, Heike Wiese, Fiona Wong, Rojda Emine Yasık, Media Haji Younis, Yulia Zuban, Nadine Zürn.

 

The following universities were involved in creating the corpus in the first phase of RUEG:

Potsdam University, Humboldt University of Berlin, University of Kaiserslautern, University of Mannheim, University of Stuttgart.

RUEG is funded by the German Research Foundation (“Deutsche Forschungsgemeinschaft” / DFG; project number 313607803).

 

