tlin-taolin / A-study-of-linguistic-drift-on-Le-Temps-Newspaper-Corpus

Big Data Project 2015 - A study of linguistic drift on Le Temps Newspaper Corpus

A-study-of-linguistic-drift-on-Le-Temps-Newspaper-Corpus

EPFL - Big Data Project 2015 - A study of linguistic drift on Le Temps Newspaper Corpus

Project Description :

We have access to the archives of the newspaper Le Temps, which cover approximately 200 years of publication (from 1816 to 1998). The goal of this project is to use those archives to quantify, or otherwise represent, linguistic drift across the years. Language evolves: some words appear while others disappear, and we want to measure and interpret this change scientifically.

Project goals :

The first main goal of the project is to find a way to exploit the data we have and to find a good distance metric that lets us quantify and represent the drift between years and its evolution.
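As an illustration of what a distance between years could look like, here is a minimal sketch. It assumes each year is summarised by the relative frequencies of its words and compares two years with cosine distance. The choice of metric, the whitespace tokenisation, the toy sentences, and the names `word_frequencies` and `cosine_distance` are all assumptions made for this example; they are not taken from the project itself.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_distance(freq_a, freq_b):
    """1 - cosine similarity between two sparse frequency vectors."""
    dot = sum(value * freq_b.get(word, 0.0) for word, value in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here: an empty profile is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for the articles of each year (not real corpus data).
articles_by_year = {
    1850: "le roi et la reine et le peuple",
    1851: "le roi et le peuple et la ville",
    1990: "le train et la voiture et le métro",
}
profiles = {year: word_frequencies(text) for year, text in articles_by_year.items()}

d_close = cosine_distance(profiles[1850], profiles[1851])
d_far = cosine_distance(profiles[1850], profiles[1990])
print(f"1850 vs 1851: {d_close:.3f}")
print(f"1850 vs 1990: {d_far:.3f}")
```

On this toy data the two neighbouring years come out closer to each other than to the distant one. On the real corpus, very frequent function words would dominate such a raw frequency profile, so some weighting or filtering would probably be needed before the distances become informative.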

The second goal of this project is to apply machine learning techniques to part of the corpus (the training set) and then, given a text, estimate the year it belongs to, within a given precision threshold.
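One simple way to picture this dating task is a nearest-profile classifier: build one word-frequency profile per year from the training set, then assign a new text to the year whose profile it resembles most. The sketch below is only an illustration under that assumption. The helper names (`profile`, `train`, `predict_year`, `within_tolerance`), the toy training sentences, and the five-year tolerance are hypothetical and do not describe the method actually used in the project.

```python
import math
from collections import Counter


def profile(text):
    """Relative word frequencies of a text (lowercased, split on whitespace)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_texts):
    """Build one profile per year by pooling that year's training texts."""
    texts_by_year = {}
    for year, text in labelled_texts:
        texts_by_year.setdefault(year, []).append(text)
    return {year: profile(" ".join(texts)) for year, texts in texts_by_year.items()}


def predict_year(model, text):
    """Return the training year whose profile is most similar to the text."""
    query = profile(text)
    return max(model, key=lambda year: cosine_similarity(model[year], query))


def within_tolerance(predicted, actual, tolerance_years):
    """True when the prediction falls inside the accepted error margin."""
    return abs(predicted - actual) <= tolerance_years


# Toy training set (invented sentences, not real corpus data).
training_set = [
    (1850, "le roi et la reine visitent la ville"),
    (1850, "le roi parle au peuple"),
    (1990, "le train et la voiture et le métro"),
    (1990, "la télévision et le téléphone"),
]
model = train(training_set)
guess = predict_year(model, "la reine et le roi")
print(guess, within_tolerance(guess, 1852, tolerance_years=5))
```

The `within_tolerance` check reflects the precision threshold mentioned above: a prediction counts as correct when it lands within a chosen number of years of the true date.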

Team members :

  • Cynthia Oeschger (Team leader)
  • Farah Bouassida
  • Tao Lin
  • Jéremy Weber
  • Nicolas Bornand
  • Marc Schär
  • Gil Brechbühler
  • Malik Bougacha

Languages

  • Java 51.4%
  • TeX 15.5%
  • TypeScript 9.8%
  • Scala 8.7%
  • Shell 7.5%
  • JavaScript 3.0%
  • HTML 1.7%
  • Python 1.5%
  • PHP 0.4%
  • Limbo 0.2%
  • CSS 0.2%
  • Batchfile 0.1%
  • Ruby 0.1%