natural-language-processing natural-language-understanding nlp regex regular-expression tokenization tokenizer tweet-preprocessing tweets

RegEx-Based Tweet Tokenizer

This project is a general-purpose, regular expression-based tokenizer for tweets. To highlight the power and limitations of a purely regex-based approach, tokenization is performed by pattern matching with a single regular expression; conditional statements and substitutions are deliberately not used.
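
To make the single-pattern idea concrete, here is a minimal sketch of the approach. The pattern is purely illustrative and is not the expression used in the notebook; the token categories, the name `TOKEN_PATTERN`, and the helper `tokenize` are assumptions made for this example. It uses the standard `re` module so that it runs without extra dependencies.

```python
import re

# Illustrative pattern only (an assumption, not the notebook's actual expression).
# Alternatives are tried left to right, so more specific token types come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+            # URLs
    | [@#]\w+               # mentions and hashtags
    | [:;=][-o]?[()DPp]     # a few Western-style emoticons
    | \w+(?:'\w+)*          # words, including contractions with an apostrophe
    | [^\w\s]               # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(tweet: str) -> list[str]:
    """Return every non-overlapping match of the single pattern, in order."""
    # No if-statements and no substitutions: matching alone produces the tokens.
    return TOKEN_PATTERN.findall(tweet)


tokens = tokenize("@user I don't like this :( #sad https://t.co/abc")
print(tokens)
# ['@user', 'I', "don't", 'like', 'this', ':(', '#sad', 'https://t.co/abc']
```

Because everything rests on the order of the alternatives, the sketch also hints at a limitation of the approach: the URL alternative would absorb a trailing punctuation mark, and fixing that without conditionals or substitutions means making the pattern itself more elaborate.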

All the code is in a Jupyter notebook, which also includes a detailed write-up covering the following:

- Definition of a token (and the underlying rationale)
- Design decisions in the implementation of the tokenizer
- Walkthrough of the implementation of the tokenizer
- Descriptive statistics of the corpus after tokenization
- Analysis of the power and limitations of the tokenizer
- Comparative analysis with the state-of-the-art NLTK TweetTokenizer
- Performance (running time) of the tokenizer
- Analysis of the most frequent tokens
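
As a rough sketch of how the descriptive statistics, running time, and most frequent tokens listed above could be computed, the snippet below uses only the standard library. The stand-in pattern `PATTERN` and the three-tweet `tweets` list are hypothetical and exist only for illustration; the notebook's own pattern, dataset, and measurement method may differ.

```python
import re
import time
from collections import Counter

# Stand-in tokenizer (an assumption): any single-pattern tokenizer could be used here.
PATTERN = re.compile(r"[@#]?\w+|[^\w\s]")

# Hypothetical mini-corpus; the real dataset is not reproduced here.
tweets = [
    "good morning #manila!",
    "good vibes only @friend",
    "morning traffic again...",
]

# Time the tokenization of the whole corpus.
start = time.perf_counter()
tokenized = [PATTERN.findall(tweet) for tweet in tweets]
elapsed = time.perf_counter() - start

# Count how often each token occurs across all tweets.
counts = Counter(token for tokens in tokenized for token in tokens)
total_tokens = sum(counts.values())   # number of token occurrences
vocabulary_size = len(counts)         # number of distinct tokens

print(f"{total_tokens} tokens, {vocabulary_size} distinct, {elapsed:.6f} s")
print(counts.most_common(1))
```

On a corpus this small the timing is not meaningful; it is included only to show where the measurement would sit.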

This project is a major course output for an introductory natural language processing class under Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University.

Built Using

This project is implemented as a Jupyter notebook and uses the following Python libraries and modules:

| Library/Module | Description | License |
| --- | --- | --- |
| `pandas` | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License |
| `csv` | Implements classes to read and write tabular data in CSV format | Python Software Foundation License |
| `regex` | Provides additional functionality over the standard `re` module while maintaining backwards-compatibility | Apache License 2.0 |
| `nltk` (for comparative analysis of the resulting tokenization) | Provides interfaces to corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning | Apache License 2.0 |

The descriptions are taken from the libraries' respective websites.

Author

Mark Edward M. Gonzales
mark_gonzales@dlsu.edu.ph
gonzales.markedward@gmail.com

The dataset of tweets was scraped by Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University. All the tweets in this dataset are public tweets collected via the Twitter API.
