textual-analysis hadoop-mapreduce reddit-data reddit-sentiment-analysis classify-comments

Reddit-Sentiment-Analysis

2-part Mapreduce Program that performs textual analysis of Reddit data (approx. 300 GB of JSON data) preprocessed by another team member. This program performs textual sentiment analysis on reddit comments determined by preprocessing to be discussing either Donald Trump, Hillary Clinton, or both, and summarizes the data.

The preprocessing is assumed to have already screened comments by date and topic (Trump and Clinton). Per the specifications of the project, we limited our scope to comments made between July 19th, 2016, through November 8th, 2016.

PART 1:

TO COMPILE PROGRAM:

$ mkdir build

$ $HADOOP_HOME/bin/hadoop com.sun.tools.javac.Main *.java -d build -Xlint

$ jar -cvf SentimentAnalysis.jar -C build/ .

$ rm -r build

TO RUN:

This assumes you have all text files (ExampleInput.txt, negate-words.txt, pos-words.txt, and neg-words.txt) in /sentimentAnalysis directory in hdfs. Modify the paths to reflect any differences.

$HADOOP_HOME/bin/hadoop jar SentimentAnalysis.jar org.SentimentAnalysis.Driver /sentimentAnalysis/ExampleInput.txt /sentimentAnalysis/out -negation /sentimentAnalysis/negate-words.txt -pos /sentimentAnalysis/pos-words.txt -neg /sentimentAnalysis/neg-words.txt

As-is, it will take /sentimentAnalysis/ExampleInput.txt, run the program, and store the results in /sentimentAnalysis/out. This can be modified to a directory of input files by replacing sentimentAnalysis/ExampleInput.txt with /your-HDFS-Directory/

Part 2:

Part 2 takes the output from part 1, and summarizes the data. It is hardcoded to utilize the partitions defined in details.md, but could be altered easily to read partition data from a file, etc.

TO COMPILE PROGRAM:

$ mkdir build

$ $HADOOP_HOME/bin/hadoop com.sun.tools.javac.Main *.java -d build -Xlint

$ jar -cvf SentimentAnalysis.jar -C build/ .

$ rm -r build

TO RUN:

$ $HADOOP_HOME/bin/hadoop jar Summary.jar org.Summary.Driver /SentimentAnalysis/out /SentimentAnalysis/summary

About

Program that performs textual analysis of Reddit data (approx. 300 GB) preprocessed by another team member. Uses Hadoop's Mapreduce to classify comments as either positive or negative based on certain keywords, negation, etc.

textual-analysis hadoop-mapreduce reddit-data reddit-sentiment-analysis classify-comments

Languages

Language:Java 100.0%