simongonzalez / reddit_ausEng

Data storage for the Australian Reddit Text DB

reddit_ausEng

Folder Structure

  • 1_reference: information on the indexes of all users/authors in the database
  • 2_raw: raw posts as extracted from Reddit
  • 3_counts: counts of all texts and their morphological/lexical components
  • 4_NLP: NLP tagging for each post
  • 5_frequencies: frequencies for the morphological/lexical components, across all topics/regions
  • 6_frequenciesOndividual: frequencies for the morphological/lexical components, for each individual user/author
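
As a rough illustration of the kind of counts and frequencies the 3_counts, 5_frequencies and 6_frequenciesOndividual folders describe, the sketch below tallies word tokens across all posts and per author. The repository's file formats are not described here, so the in-memory `posts` list, the `author` and `text` field names, and the simple regex tokeniser are all assumptions made for illustration, not the project's actual pipeline.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample records; the real field names in 2_raw may differ.
posts = [
    {"author": "user_1", "text": "Heading to the servo this arvo, mate."},
    {"author": "user_2", "text": "The footy was on this arvo."},
    {"author": "user_1", "text": "Mate, the servo was closed."},
]

# Simplistic tokeniser (an assumption): runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def count_words(posts):
    """Return (overall word counts, per-author word counts)."""
    overall = Counter()
    per_author = defaultdict(Counter)
    for post in posts:
        tokens = tokenize(post["text"])
        overall.update(tokens)
        per_author[post["author"]].update(tokens)
    return overall, dict(per_author)


def relative_frequencies(counts):
    """Convert raw counts into proportions of all tokens."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


overall, per_author = count_words(posts)
freqs = relative_frequencies(overall)
print(overall.most_common(3))
```

Keeping the overall and per-author tallies in separate `Counter` objects mirrors the split between corpus-wide and individual frequencies in the folder list. A real pipeline would likely replace the regex with the tokens produced by the NLP tagging step.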

Abstract

A Text Database from Reddit: A Case for Australian English

The advent of social media platforms has transformed the way people communicate and express themselves. These platforms generate an enormous amount of textual data, which can provide valuable insights into human behaviour, language use, and societal trends. One compelling reason for building text databases from this data is the ability to study language change in real time and to capture the dynamic nature of communication within online communities.

Despite the growing interest in text databases derived from social media platforms, there is a notable absence of such resources focused specifically on Australian English. This gap poses challenges for researchers seeking to analyse linguistic phenomena, sociocultural patterns, and regional variation within the Australian context. To address this limitation, we present a novel dataset of Reddit posts from Australia, which fills this gap and offers insights into language use in this variety of English.

The dataset consists of over 200K posts by over 10K users spread across all regions of Australia. These posts cover a variety of topics, including politics, entertainment, sports, and daily life. The dataset comprises a total of ~10M words, providing a compact yet broad representation of the Australian English lexicon in social media discourse. Of these, 300K are unique words, capturing the core vocabulary used by Australian English speakers in online communication.
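
The headline figures above imply a few rough per-post and per-user averages. The short calculation below is only a back-of-the-envelope sketch: it treats "over 200K", "over 10K" and "~10M" as exact round numbers, which is an assumption, so the results are order-of-magnitude estimates rather than reported statistics.

```python
# Round figures taken from the abstract; treated as exact for illustration only.
n_posts = 200_000          # "over 200K posts"
n_users = 10_000           # "over 10K users"
n_tokens = 10_000_000      # "~10M words"
n_types = 300_000          # "300K unique words"

words_per_post = n_tokens / n_posts    # average post length in words
posts_per_user = n_posts / n_users     # average number of posts per user
type_token_ratio = n_types / n_tokens  # share of unique words among all words

print(f"words per post:   {words_per_post:.0f}")
print(f"posts per user:   {posts_per_user:.0f}")
print(f"type/token ratio: {type_token_ratio:.2f}")
```

On these assumptions the averages come to about 50 words per post, about 20 posts per user, and a type/token ratio of about 0.03.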

Reddit's anonymity, pseudonymity, and lack of a restrictive per-post character limit often encourage users to express themselves more freely, leading to the emergence of novel linguistic phenomena such as internet slang, memes, and other innovative language practices. These innovations can be studied to better understand the rapid evolution of language in online spaces and its relationship with identity, social networks, and cultural trends.