leon20121005/Foursquare

Implementation: Spark
Programming Language: Python 3

Environment: Windows 10
Setup: 1. Download spark-2.1.1-bin-hadoop2.7.tgz from https://spark.apache.org/downloads.html
2. Extract it to the D drive
3. Set the environment variables: SPARK_HOME, D:\spark-2.1.1-bin-hadoop2.7
PATH, D:\spark-2.1.1-bin-hadoop2.7\bin
4. Download the Windows utilities (winutils) from https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin
5. Extract them into D:\spark-2.1.1-bin-hadoop2.7\bin
6. Check the Python environment variables
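
The setup steps above can be sanity-checked with a short script. This is only a sketch: the function name check_spark_setup is hypothetical, and it assumes that the Spark bin folder should contain spark-submit.cmd (shipped with the Windows distribution) and winutils.exe (from step 5), and that the same bin folder should appear on PATH.

```python
import os


def check_spark_setup(env=None):
    """Return a list of problems found in the Spark setup (empty if none)."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    problems = []
    bin_dir = os.path.join(spark_home, "bin")

    # Files expected in the bin folder after steps 2 and 5 (assumption).
    for name in ("spark-submit.cmd", "winutils.exe"):
        if not os.path.isfile(os.path.join(bin_dir, name)):
            problems.append(f"{name} not found in {bin_dir}")

    # Compare PATH entries in normalized form so that case and trailing
    # separators do not matter on Windows.
    def norm(path):
        return os.path.normcase(os.path.normpath(path))

    entries = [norm(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if norm(bin_dir) not in entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for line in check_spark_setup() or ["Setup looks consistent"]:
        print(line)
```

If the script reports a problem, the environment variables from step 3 are the first thing to recheck.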

Source code: problem01.py
problem02.py
problem03.py
problem04.py

Run: 1. Open cmd and type cd /d PATH (where PATH is the directory containing the source files)
2. Type spark-submit FILE_NAME (e.g. spark-submit problem01.py)
3. Each script prints the top 20 results (the number of printed results can be changed by editing the source code)

The "Café" entries in venue_info.txt caused a decode error in Python 3.
On this Windows setup, Python 3 used 'cp950' (the Windows code page for Big5) to decode the byte '\xe9' (é) and raised an error.
We therefore created the file new_venue_info.txt, in which every "Café" is changed to "Cafe",
and used the new file to run problem 3.
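
The decode problem and the workaround can be reproduced with a minimal sketch. It assumes that venue_info.txt stores é as the single byte 0xE9 (as in Latin-1); the actual encoding of the file is not stated here. The helper names fails_under_cp950 and strip_e_acute are hypothetical and are not taken from the problem scripts.

```python
def fails_under_cp950(raw):
    """Return True if the bytes cannot be decoded with the cp950 codec."""
    try:
        raw.decode("cp950")
    except UnicodeDecodeError:
        return True
    return False


def strip_e_acute(src_path, dst_path, encoding="latin-1"):
    """Copy src_path to dst_path, replacing every 'é' with 'e'.

    The source encoding is an assumption (Latin-1 by default).
    newline="" keeps the original line endings unchanged.
    """
    with open(src_path, "r", encoding=encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line.replace("é", "e"))


if __name__ == "__main__":
    # In cp950, 0xE9 starts a two-byte character, so a following
    # newline makes the sequence invalid.
    print(fails_under_cp950(b"Caf\xe9\n"))
    print(fails_under_cp950(b"Cafe\n"))
```

Calling strip_e_acute("venue_info.txt", "new_venue_info.txt") would produce a file like the one described above. An alternative that leaves the data untouched would be to pass an explicit encoding wherever the script opens the file, instead of relying on the Windows default.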

Efficiency: problem01.py, 7.905024s
problem02.py, 7.228653s
problem03.py, 16.436811s
problem04.py, 9.705130s

The uploaded files include only the source code and new_venue_info.txt, which I created myself.
Please use new_venue_info.txt to run problem03.py. Thank you.

About

Practicing Spark MapReduce
