Big-Data-in-Linguistics

Supporting code for big-data analysis applied to linguistics.

PRESENTATION OF THE PROJECT

Big-Data-in-Linguistics is a joint endeavour motivated by the need to analyse big data from linguistics databases. The project started when it became clear that handling such a vast amount of data in Excel was too complicated and time-consuming. The code was written by Giovanni Merici, graduate student in biology and master's student in "Scienze Biomolecolari, Genomiche e Cellulari" at the Università degli Studi di Parma (Parma, Italy). It supports the first-year master's thesis (mémoire 1) titled "Conjonctions de subordination et locutions conjonctives françaises en diachronie: Étude d’un aspect de l’intégration syntaxique de la langue du latin au français classique dans le cadre de la constitution de la phrase complexe et du passage de la parataxe à l’hypotaxe" (in English: "French subordinating conjunctions in diachrony: study of an aspect of syntactic integration from Latin to Classical French as part of the construction of the complex sentence and of the transition from parataxis to hypotaxis") by Axelle Domingues, graduate student in modern letters and master's student in "Sciences du langage : linguistique française et générale" at Sorbonne Université (Paris, France). This repository is the result of that collaboration: at Axelle Domingues' request, Giovanni Merici wrote the code that considerably improved the data representation in the thesis. Its content and use are explained in depth in Chapter 2 of the finished work. We therefore limit ourselves here to presenting the content of the repository and the methodology used to put it together.

PRESENTATION OF THE DATA AND ITS USE

The code produces graphs that reveal, without data loss, the main tendencies in the evolution of several French subordinating conjunctions over the centuries: "pour que" and "parce que", of prepositional formation; "tandis que" and "aussitôt que", of adverbial formation; "à condition que" and "à mesure que", of nominal formation; and finally "moyennant que" and "attendu que", of present and past participial formation respectively. For a more in-depth presentation of the data and of the methodological approach, cf. §2.1 and §2.2 of the work. To sum up the main points, the graphs represent the diachronic evolution of these subordinating conjunctions from Old to Classical French, more precisely from 1125 to 1799. The data is extracted from two electronic corpora: the Base de Français Médiéval (BFM) http://bfm.ens-lyon.fr/ and Frantext https://www.frantext.fr. The BFM corpus extends from the 10th to the 15th century. The Frantext corpus goes all the way to the present day but does not reach as far back as the BFM. This difference explains possible gaps in the results between the two databases, and their complementarity is the reason both are used together. In total, 59 118 entries from both databases are processed by the code.

Both Jupyter Lab and Google Colab were used as computational environments. The libraries used were Pandas for handling dataframes, NumPy for working with arrays, and Matplotlib and Seaborn for plotting graphs. The code was first written on Jupyter Lab; the switch to Google Colab occurred when Seaborn started being used to plot subplots. Google's servers executed the code faster than our own computers, which were slow to plot subplots of such a large amount of data.
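As an illustration of how entries from the two corpora can be brought together and restricted to the period studied, here is a minimal sketch with Pandas. It is not the code of the repository: the column names "date", "form" and "source", the function name combine_corpora and the sample rows are all assumptions made up for the example.

```python
import pandas as pd

# Invented miniature exports; the real data comes from BFM and Frantext.
bfm = pd.DataFrame({
    "date": [1130, 1250, 1480],
    "form": ["pour que", "pour que", "parce que"],
})
frantext = pd.DataFrame({
    "date": [1550, 1700, 1850],
    "form": ["pour que", "parce que", "parce que"],
})


def combine_corpora(bfm, frantext, start=1125, end=1799):
    """Stack both exports, label their source and keep the studied period."""
    combined = pd.concat(
        [bfm.assign(source="BFM"), frantext.assign(source="Frantext")],
        ignore_index=True,
    )
    # between() is inclusive on both ends, so 1125 and 1799 are kept.
    in_period = combined["date"].between(start, end)
    return combined[in_period].sort_values("date").reset_index(drop=True)


corpus = combine_corpora(bfm, frantext)
print(corpus)
```

In this sample the 1850 entry falls outside the 1125 to 1799 window and is dropped, while the "source" column keeps track of which database each remaining entry came from.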

CONTENT OF THE REPOSITORY

-MAIN BRANCH

Since data can be exported from the databases' websites directly as an Excel file, it can be imported into the script and quickly sorted to obtain the desired graphs. This repository therefore contains not only the code used to plot the graphs presented in the figures of the work, but also code that stands as proof of concept for it. All of this can be found in the file called "Tools" in the main branch, written on Jupyter Lab using Pandas, NumPy and Matplotlib. In order of appearance in that file, it includes tools to: import data from BFM and Frantext; check the quality of a dataframe (that all data is numeric); create a dataframe with a specific set of forms; create a larger dataframe by manually adding specific forms in order to keep track of the progression; plot graphs (kdeplot, histplot, boxplot) for one dataframe; plot graphs (kdeplot, histplot, boxplot) for two or more dataframes; search for a specific date or form within one dataframe with an offline search tool; search for cases of co-presence within the same years; and count the number of years present within one dataframe. The two import tools must be used together with the files "BFMInstructions" and "FrantextInstructions" in the main branch, which explain step by step how to export data from each database to an ordered Excel file; only then can it be imported into the script with the tools. The last file, "SeabornStandardCode", was written by Giovanni Merici to give Axelle Domingues the resources needed to plot graphs with the Seaborn library. The result of that work appears in the thesis-specific branch. The aim of the main branch is to show proof of concept for the code written for the work and to provide extra tools that could be useful for future use of these databases.
For the specific code written and used for the master's thesis, cf. the thesis-specific branch.
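Three of the tools listed above (the quality check, the search for co-presence within the same years and the count of years) could look roughly like the sketch below. This is a hedged illustration, not the content of the "Tools" file: the function names, the column names "date" and "form" and the sample rows are hypothetical.

```python
import pandas as pd

# Invented sample: one row per attestation, with its year and its form.
df = pd.DataFrame({
    "date": [1350, 1350, 1402, 1402, 1402, 1610],
    "form": ["pour que", "parce que", "pour que",
             "pour que", "tandis que", "parce que"],
})


def dates_are_numeric(frame, column="date"):
    """Return True when every value of the column parses as a number."""
    parsed = pd.to_numeric(frame[column], errors="coerce")
    return bool(parsed.notna().all())


def copresence_years(frame):
    """Return the years in which at least two distinct forms are attested."""
    forms_per_year = frame.groupby("date")["form"].nunique()
    return forms_per_year[forms_per_year >= 2].index.tolist()


def count_years(frame):
    """Return the number of distinct years present in the dataframe."""
    return frame["date"].nunique()


print(dates_are_numeric(df))
print(copresence_years(df))
print(count_years(df))
```

With this sample, the check passes, the years 1350 and 1402 are reported as years of co-presence, and three distinct years are counted. A non-numeric entry in the date column would make the quality check return False.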

-THESIS-SPECIFIC BRANCH

This branch contains three files: a "README" that presents it in depth, "SeabornEntireCorpus" and "SeabornVariation". As the titles indicate, this code uses the Seaborn library instead of Matplotlib and was therefore written on Google Colab. The aim of this branch is to show the code that produced all of the graphs used in the first-year master's thesis of Axelle Domingues.

ABSTRACT OF THE FINISHED MASTER THESIS

"French subordinating conjunctions in diachrony". The construction of the complex sentence as a structured and conventional unit in the French language dates back only to the end of Classical French, that is, the end of the 18th century. From a diachronic perspective, the completion of this linguistic phenomenon is recent. Nonetheless, it is the result of the evolution of syntactic integration in French over more than six centuries, starting from the earliest stage of the language. This master's thesis focuses on a specific aspect of this area of study: subordinating conjunctions. It proposes an empirical analysis of their morphosyntactic evolution in diachrony from Latin to Classical French, and of their role as hypotactic links, in order to clarify the chronology of the tendencies leading to the construction of the complex sentence. We thus show that the object of this study indeed reveals the transition from parataxis to hypotaxis and the development of subordination in French. Consequently, it contributes to the primary aim of research in French diachronic linguistics, that is, the description, supported by empirical data, of the evolution of French from its mother language, Latin.

Keywords: subordinating conjunctions, hypotaxis, subordination, complex sentence, parataxis, Latin, French, diachrony, morphosyntax, empirical and analytical approach, functional-typological approach.
