Independent Study in MOOC Forum Mining

Study Background

This repo contains code and technical documentation for a research project on classifying posts in MOOC discussion forums. The MOOC was offered through UNC's School of Information and Library Science in Fall 2013 and was hosted on the Coursera platform.

The motivation for the study was to provide an experimental basis for an automated tool that alerts instructors to forum posts that may warrant manual intervention. MOOCs draw very high initial enrollment, with thousands of students signing up in a short amount of time, yet many MOOCs see only 5-10% of their students finish the course or earn a statement of accomplishment. Enrollment at this scale makes course management extremely challenging, and this challenge presents an opportunity for automated machine learning tools to help predict which posts instructors should focus on. More details are presented in the paper, located at paper_latex_files/shaffer_mooc_study.pdf.

Files and Repo Structure

The repo contains several directories, each holding a different type of file used in the study. Each is detailed below. For the background of the study and its results, read the paper in paper_latex_files; the LaTeX source is included there as well. The interface directory contains the HTML, JavaScript, and CSS files for the data collection interface used in the study. Finally, the code directory contains the Python code used for manipulating the raw forum data, extracting and engineering features, and running machine learning experiments.

Paper Files

Relevant files:

shaffer_mooc_study.pdf: final paper reporting on project and analyses run.
shaffer_mooc_study.tex: LaTeX code for generating final report.

Interface

This directory contains HTML, JavaScript and CSS used for building the data collection interface used in this study. Relevant files are detailed below.

index.html: Main HTML interface MTurk workers used to annotate our dataset.
instructions.html: HTML file with instructions given to MTurk workers on how to annotate the dataset and the definitions that would be used for the class labels we needed to collect.
thread.html: HTML file presenting MTurk workers with an individual thread and the outlined post to be annotated.

Code

This directory contains two sub-directories: one for processing data and constructing features, and one for running the machine learning experiments and the ablation analysis in the paper. To run any of this code (or code modified from this directory), you need to have a few extremely useful scientific Python packages installed.

Since many of these can be quite tricky to install (or get talking to one another) I also recommend the Anaconda scientific Python bundle. This should install all these modules (and more!) with much less frustration than doing each individually on your own.

Data Processing

ablation_data_prep.py: script for combining previously computed features, cleaning up the constructed dataset, and constructing train-test pairs for running the machine learning ablation analysis.
feature_extractor.py: code for processing raw data and extracting relevant features from data. Many of these were used in the raw features section in the paper in addition to LIWC linguistic count features.
liwc_text.py: code for extracting only text, removing markup and punctuation for use with LIWC software.
mooc_datareader.py: early script for reading in Excel version of dataset and converting it to JSON.
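
As a rough illustration of the kind of cleanup liwc_text.py is described as doing (keeping only text, with markup and punctuation removed), here is a minimal sketch using only the Python standard library. It is not the code from the repo: the function name clean_post, the regex-based tag stripping, and the choice to replace punctuation with spaces are all assumptions made for this example.

```python
import html
import re
import string

# Hypothetical helper names; none of these come from the repo itself.
TAG_RE = re.compile(r"<[^>]+>")
# Map every ASCII punctuation character to a space (non-ASCII punctuation is left alone).
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_post(raw):
    """Return plain text with HTML tags, entities, and ASCII punctuation removed."""
    text = TAG_RE.sub(" ", raw)         # drop anything that looks like a tag
    text = html.unescape(text)          # decode entities such as &amp;
    text = text.translate(PUNCT_TABLE)  # replace punctuation with spaces
    return " ".join(text.split())       # collapse runs of whitespace


if __name__ == "__main__":
    print(clean_post("<p>Hello, <b>world</b>! Week 2 &amp; beyond.</p>"))
```

Tags are stripped before entities are decoded so that an escaped angle bracket in a post is not mistaken for markup. A regex is a simplification; malformed forum HTML may call for a real parser.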

Machine Learning

ablation_analysis.py: code used to run machine learning experiments using Logistic Regression, evaluate models with Average Precision over 10-fold cross validation, and run a feature ablation analysis. This code generated the results section of the paper.
example_classifier.py: example script used early on to explore classifiers. Modified from an original post in the docs for Scikit-Learn.
label_assignment.py: script for assigning labels collected from MTurk workers to instances within the dataset.
ml_experiments.py: early take on simple supervised machine learning experiments with constructed dataset in preparation for ablation analysis. Results from these experiments were not reported in the paper.
results.txt: raw results from ablation analysis.
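
To make the evaluation vocabulary above concrete, the following is a small standard-library sketch of two pieces: Average Precision for a ranked list of predictions, and the leave-one-group-out feature sets that an ablation analysis iterates over. It is an illustration under stated assumptions, not the code in ablation_analysis.py: the function names, the group names, and the individual feature names are hypothetical, and the model training step (Logistic Regression over 10 folds in the study) is left out.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example.

    labels: iterable of 0/1 ground-truth labels.
    scores: iterable of model scores, higher meaning more likely positive.
    Tied scores keep their input order; no special tie handling is done.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def ablation_sets(feature_groups):
    """Yield (held_out_group, remaining_feature_names) once per feature group."""
    for held_out in feature_groups:
        remaining = [
            name
            for group, names in feature_groups.items()
            if group != held_out
            for name in names
        ]
        yield held_out, remaining


if __name__ == "__main__":
    # Hypothetical feature groups, for illustration only.
    groups = {
        "raw": ["num_words", "num_urls"],
        "liwc": ["posemo", "negemo"],
    }
    for held_out, remaining in ablation_sets(groups):
        print("without", held_out, "->", remaining)
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

In an ablation run, a model would be trained and scored once per yielded feature set, and the drop in Average Precision relative to the full feature set indicates how much the held-out group contributes.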

kylejshaffer / mooc_forum_mining