leon20121005 / Foursquare

Practicing Spark MapReduce

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Implement: Spark
Programming Language: Python3

Environment: Windows10
Setup: 1. Download spark-2.1.1-bin-hadoop2.7.tgz from https://spark.apache.org/downloads.html
       2. Extract it into D drive
       3. Set environmental variables: SPARK_HOME, D:\spark-2.1.1-bin-hadoop2.7
                                       PATH,       D:\spark\spark-1.6.1-bin-hadoop2.6\bin
       4. Download Windows Utilities form https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin
       5. Extract it into D:\spark\spark-1.6.1-bin-hadoop2.6\bin
       6. Check the Python environment variables

Source codes: problem01.py
              problem02.py
              problem03.py
              problem04.py

Compile: 1. Open the cmd and type cd /d PATH (the file exist)
         2. Type spark-submit FILE_NAME (ex: problem01.py)
         3. It will print the top 20 result (the number of the printed result can be changed by edit the source code)

Because that the Café in the venue_info.txt caused the decode error in Python3.
Python3 used 'cp950' (standard of Big5) to decode '\xe9' (é) and occurred error.
So we created the file new_venue_info.txt, changing all the Café into Cafe.
And used the new file to run problem 3.

Efficiency: problem01.py, 7.905024s
            problem02.py, 7.228653s
            problem03.py, 16.436811s
            problem04.py, 9.705130s

The uploaded files only include source codes and new_venue_info.txt, which is created by myself.
Please use new_venue_info.txt to run problem03.py. Thank you.

About

Practicing Spark MapReduce


Languages

Language:Python 100.0%