Summary

Paris Stories is a corpus of spoken French collected and transcribed by linguistics students from Sorbonne Nouvelle and corrected by students from the Plurital Master's Degree in Computational Linguistics (Inalco, Paris Nanterre, Sorbonne Nouvelle) between 2017 and 2021. It contains monologues and dialogues from speakers living in the Paris region.

Introduction

For an assignment, students recorded a friend or relative sharing an anecdote on a given theme (meaningful encounters, vacations, interesting stories, etc.). The corpus was created to study contemporary spoken French and to train a syntactic parser for spoken French. All data have been morphosyntactically annotated following the SUD (Surface Syntactic Universal Dependencies) guidelines.

See SUD Guidelines : https://surfacesyntacticud.github.io/guidelines/u/

The Treebank can be found here : http://universal.grew.fr/?corpus=SUD_French-ParisStories@latest

The recordings can be downloaded via the URL given in the '# sound_url' metadata.
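As an illustration of how the '# sound_url' metadata might be read, here is a minimal sketch in Python. It assumes the treebank files use the usual CoNLL-U comment layout (`# key = value`); the function name `extract_sound_urls` and the example URL are made up for this sketch and are not part of the corpus or its tooling.

```python
from typing import Iterable, List


def extract_sound_urls(lines: Iterable[str]) -> List[str]:
    """Collect the distinct '# sound_url' values from CoNLL-U lines.

    Assumes metadata comments look like '# sound_url = <url>'.
    URLs are returned in order of first appearance.
    """
    urls: List[str] = []
    seen = set()
    for raw in lines:
        line = raw.strip()
        if not line.startswith("#"):
            continue  # token lines and blank lines carry no metadata
        # Split on the first '=' only, so URLs containing '=' stay intact.
        key, sep, value = line[1:].partition("=")
        if sep and key.strip() == "sound_url":
            url = value.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


if __name__ == "__main__":
    # Hypothetical excerpt; the URL below is a placeholder, not a real recording.
    sample = [
        "# sent_id = example_1",
        "# sound_url = https://example.org/audio/sample.wav",
        "1\tbonjour\tbonjour\tINTJ\t_\t_\t0\troot\t_\t_",
    ]
    for url in extract_sound_urls(sample):
        print(url)
```

To use it on a real file, pass an open file handle, for example `extract_sound_urls(open(path, encoding="utf-8"))`, where `path` points to one of the corpus `.conllu` files.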

Description

-- Paris Stories 2019 --

Creation Year : 2017

Annotation Year : 2019

Size :

19 samples
13951 tokens
709 sentences
approx. 1 hour of recordings

Topics : travels, funny/unusual stories

-- Paris Stories 2020 --

Creation Year : 2018

Annotation Year : 2020

Size :

16 samples
9064 tokens
553 sentences
approx. 30 minutes of recordings

Topics : vacation stories, funny/unusual stories

-- Paris Stories 2021 --

Creation Year : 2020

Annotation Year : 2021

Size :

14 samples
7825 tokens
499 sentences
approx. 45 minutes of recordings

Topics : first encounters, funny/unusual stories

Development

The corpus is maintained here in the SUD framework and automatically converted into UD_French-ParisStories using the Grew software with the conversion rules described here.

Acknowledgments

Annotation : Sylvain Kahane, Bruno Guillaume, Mariam Nakhlé, Vanessa Gaudray-Bouju, Menel Mahamdi

Annotation tools development : Kim Gerdes, Marine Courtin, Gaël Guibon

Conversion and handling of data validation : Bruno Guillaume

Direction of data collection : Cédric Gendrot, Kim Gerdes, Marine Courtin

We would like to thank all the students who participated in this project.

References

Sylvain Kahane, Bernard Caron, Emmett Strickland, Kim Gerdes. Annotation guidelines of UD and SUD treebanks for spoken corpora: A proposal. Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021).
