TwitterDataAnalysisOnHPC-COMP90024
Analysing Twitter data to obtain sentiment of different blocks in Melbourne
Motivation
This project compares the overall sentiment of people in different parts of Melbourne. Because the data obtained from Twitter was 15 GB, we used Spartan, the University of Melbourne's high performance computer (HPC), and parallelized our program.
Results
Total number of nodes | Number of threads on each node | Execution time in seconds |
---|---|---|
1 | 1 | 991.4 |
1 | 8 | 189.6 |
2 | 4 | 193.1 |
Output
Area | Sentiment | Tweets |
---|---|---|
A1 | 763 | 2752 |
A2 | 4116 | 4904 |
A3 | 2679 | 5824 |
A4 | 54 | 381 |
B1 | 11614 | 21232 |
B2 | 32061 | 107386 |
B3 | 20211 | 34494 |
B4 | 5733 | 6643 |
C1 | 7551 | 10530 |
C2 | 191791 | 246828 |
C3 | 41434 | 69901 |
C4 | 19537 | 26097 |
C5 | 7551 | 5581 |
D3 | 7777 | 16220 |
D4 | 9698 | 16536 |
D5 | 3757 | 4705 |
Total | 361428 | 580014 |
Observation
Running the code in parallel on 8 threads on one node was about 5.2 times faster than running it on a single thread (189.6 s versus 991.4 s). Splitting the same 8 threads across 2 nodes gave a similar time of 193.1 s.
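The speedup figures can be reproduced from the Results table. The snippet below is a small illustrative calculation, not part of the repo; the `timings` dictionary and the `speedup_and_efficiency` helper are names introduced here. It assumes that the total thread count is nodes multiplied by threads per node.

```python
# Execution times from the Results table, keyed by (nodes, threads_per_node).
timings = {
    (1, 1): 991.4,
    (1, 8): 189.6,
    (2, 4): 193.1,
}

BASELINE = timings[(1, 1)]  # single node, single thread


def speedup_and_efficiency(nodes, threads_per_node):
    """Return (speedup, parallel efficiency) relative to the 1x1 run."""
    workers = nodes * threads_per_node
    speedup = BASELINE / timings[(nodes, threads_per_node)]
    efficiency = speedup / workers
    return speedup, efficiency


for config in timings:
    s, e = speedup_and_efficiency(*config)
    print(f"{config[0]} node(s) x {config[1]} thread(s): "
          f"speedup {s:.2f}, efficiency {e:.0%}")
```

An efficiency well below 100% on 8 threads suggests that part of the run (for example file reading or result gathering) does not scale with the number of threads, although the timings alone do not say which part.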
How to use?
Clone
- Clone this repo to your local machine or an HPC using https://github.com/arnavgarg123/TwitterDataAnalysisOnHPC-COMP90024.git
Setup
- Make sure you have Python 3 installed on your system.
- To run the script on Windows, install Microsoft MPI from this link.
- To run the script on Linux, run the following commands
sudo apt update
sudo apt install python3-mpi4py
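- Using a terminal (or cmd on Windows), navigate to the folder containing the files of this repo and run the command shown below. The value passed to -n is the number of MPI processes that mpiexec launches.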
Example
mpiexec -n <number_of_threads> python main.py <data_file_name> <area_file_name> <sentiment_analysis_keywords_with_score>
mpiexec -n 4 python main.py ./Data/smallTwitter.json ./Data/melbGrid.json ./Data/AFINN.txt
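For readers who want a feel for what each process does with the three input files, here is a minimal sketch in plain Python. It is not the code in main.py: the function names, the tweet fields (`text`, `x`, `y`), the cell keys (`id`, `xmin`, `xmax`, `ymin`, `ymax`) and the round-robin split are all assumptions made for illustration. In a real MPI run, `rank` and `size` would typically come from mpi4py's `MPI.COMM_WORLD.Get_rank()` and `MPI.COMM_WORLD.Get_size()`; here they are plain arguments so the sketch runs without MPI.

```python
import re


def load_afinn(lines):
    """Parse AFINN-style lines ("word<TAB>score") into a dict."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, score = line.rsplit("\t", 1)
        scores[word] = int(score)
    return scores


def score_text(text, scores):
    """Sum the scores of known words (simple lowercase word matching)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(scores.get(w, 0) for w in words)


def find_cell(x, y, cells):
    """Return the id of the first grid cell whose box contains (x, y)."""
    for cell in cells:
        if cell["xmin"] <= x <= cell["xmax"] and cell["ymin"] <= y <= cell["ymax"]:
            return cell["id"]
    return None  # tweet lies outside the grid


def process_share(tweets, rank, size, scores, cells):
    """Handle every size-th tweet starting at index rank (round-robin)."""
    totals = {}  # cell id -> (sentiment sum, tweet count)
    for i, tweet in enumerate(tweets):
        if i % size != rank:
            continue
        cell = find_cell(tweet["x"], tweet["y"], cells)
        if cell is None:
            continue
        sentiment, count = totals.get(cell, (0, 0))
        totals[cell] = (sentiment + score_text(tweet["text"], scores), count + 1)
    return totals


def merge(partials):
    """Combine the per-process totals into one result."""
    combined = {}
    for part in partials:
        for cell, (sentiment, count) in part.items():
            s, c = combined.get(cell, (0, 0))
            combined[cell] = (s + sentiment, c + count)
    return combined


# Tiny made-up inputs, for illustration only.
afinn = load_afinn(["good\t3", "bad\t-3", "awesome\t4"])
grid = [
    {"id": "A1", "xmin": 0, "xmax": 1, "ymin": 0, "ymax": 1},
    {"id": "A2", "xmin": 1, "xmax": 2, "ymin": 0, "ymax": 1},
]
sample = [
    {"text": "Good coffee, awesome day", "x": 0.5, "y": 0.5},
    {"text": "bad traffic", "x": 1.5, "y": 0.5},
    {"text": "good good", "x": 0.2, "y": 0.2},
    {"text": "nothing here", "x": 5.0, "y": 5.0},
]

# Simulate two processes, then merge as a root process might after a gather.
partials = [process_share(sample, r, 2, afinn, grid) for r in range(2)]
result = merge(partials)
print(result)
```

Each process only keeps small per-area totals, so the data exchanged at the end is tiny compared with the 15 GB input. That is one reason a split like this can scale reasonably, although the actual strategy used in this repo may differ.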
Contributing
Step 1
- Clone this repo to your local machine using https://github.com/arnavgarg123/TwitterDataAnalysisOnHPC-COMP90024.git
Step 2
- HACK AWAY!
Step 3
- Create a new pull request
License
This project is licensed under the MIT License. See the LICENSE.md file for details.