yiqin / Hackathon_Hacker_Data


Hackathon_Hacker_Data

http://104.197.20.219/yiqin/facebook_hackathon_hacker.html

Project Description

The data comes from Facebook through the Facebook Graph API. The dataset covers Hackathon Hackers (HH), which has become the biggest Facebook group for hackathon attendees. It currently has over 18k members and is a place to discuss hackathons, tech news, college, high school, and dank memes. It had 53 public subgroups. One of the HBase tables I built lists the top 20 active users in each of the 25 most popular subgroups.

For privacy reasons, Facebook restricts developers to public graph data. The dataset, stored as CSV, is 245.5 MB. It covers 7/1/2014 to 8/20/2015, a little more than one year. This is the largest dataset I could get from the Facebook Graph API. It has more than 1,300,000 items and includes information about members' likes, comments, and posts.

Files

I uploaded all files to svn. The files are also uploaded to the following directories in the cluster.

hadoop-m:/mnt/scratch/yiqin

  • create_Facebook_Graph_Data_Table.pig
    Creates the HBase table for the most popular subgroups.

  • top_user.pig Creates the HBase table for the top active users in each subgroup. I use a macro to reuse code.

  • general_information.pig Computes additional statistics. These are not stored in HBase tables due to time limits. It includes the type counts, the like ranking, and the likes and comments averages.

  • uber-yiqin-0.0.1-SNAPSHOT.jar Java application. It includes Thrift and SerializeFacebookGraphNode.

  • input Includes the .csv files, which are the raw data from the Facebook Graph API.
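
The actual ranking is done in top_user.pig, which is not reproduced here. As a rough illustration of the idea (top 20 active users in each of the 25 most popular subgroups), here is a minimal Python sketch. The (subgroup, user) row layout, the function name, and the choice to measure both popularity and activity by item count are assumptions, not taken from the Pig script.

```python
from collections import Counter, defaultdict


def top_active_users(rows, n_groups=25, n_users=20):
    """Rank users per subgroup.

    rows: iterable of (subgroup, user) pairs, one pair per like,
    comment, or post (hypothetical layout, not the real CSV schema).
    Returns {subgroup: [(user, item_count), ...]} for the n_groups
    subgroups with the most items.
    """
    group_sizes = Counter()           # items per subgroup
    per_group = defaultdict(Counter)  # items per user within a subgroup
    for subgroup, user in rows:
        group_sizes[subgroup] += 1
        per_group[subgroup][user] += 1

    result = {}
    for subgroup, _ in group_sizes.most_common(n_groups):
        result[subgroup] = per_group[subgroup].most_common(n_users)
    return result
```

Calling top_active_users(rows) with the defaults mirrors the 25 subgroups and 20 users mentioned in the project description.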

hadoop fs -ls /mnt/scratch/yiqin

Found 1 items
-rw-r--r-- 2 mpcs53013 hdfs 212250910 2015-12-08 04:46 /mnt/scratch/yiqin/Facebook_Graph_data

webserver:/var/www/html/yiqin

  • background.jpeg

  • elegant-aero.css

  • table.css These three files style the UI elements in the HTML page.

  • facebook_hackathon_hacker.html
    The HTML file served at the project URL. It has three parts: most popular subgroups, top active users, and submission of new Facebook graph node data.

webserver:/var/www/cgi-bin/yiqin

  • facebook_Graph_Data_Table.pl Gets the HBase table of the most popular subgroups.

  • top_user.pl Gets the HBase table of the top active users.

  • submit_new_data.pl
    Sends new data to Kafka.

Kafka topic

I created the topic yiqin-facebook on Kafka.

###############################################

Other information

Mainly for me to review the project, not for grading

Please ignore it.

###############################################

Current Progress

The batch layer is ready. The CSV dataset is uploaded to /mnt/scratch/yiqin/input in the cluster. facebookGraphNode.thrift was created to process the data. uber-yiqin-0.0.1-SNAPSHOT.jar is also in the cluster. create_Facebook_Graph_Data_Table.pig creates the table and displays it, so you can see that the data is stored.

Facebook graph data is difficult to deal with but contains interesting information. I'm going to finish the project in the next 4 days.

Subgroup

A subgroup must satisfy one of these prerequisites:

  • 1 week old, 5 posts, and 250 members
  • 3 weeks old, 10 posts, and 100 members
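
The two prerequisites can be read as a simple either/or check. The Python sketch below is only an illustration: the function and parameter names are hypothetical, and it assumes each number is a minimum ("at least 1 week old", and so on), which the rules above do not state explicitly.

```python
def qualifies_as_subgroup(age_weeks, posts, members):
    """Return True if a group meets either prerequisite.

    Assumption: every threshold is a lower bound.
    """
    # Rule 1: 1 week old, 5 posts, and 250 members
    rule_1 = age_weeks >= 1 and posts >= 5 and members >= 250
    # Rule 2: 3 weeks old, 10 posts, and 100 members
    rule_2 = age_weeks >= 3 and posts >= 10 and members >= 100
    return rule_1 or rule_2
```

Under this reading, a two-week-old group with 10 posts and 100 members does not qualify yet: it is too young for the second rule and too small for the first.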

(Hackathon Hackers,522870)
(HH: What Are You Working On?,40649)
(HH Design,39851)
(HH Hacker Problems,14539)
(Hackathon Hackers EU,14432)
(HH Data Hackers,13657)
(HH Webdev,9903)
(HH iOS,5349)
(HH Throw a Hackathon,5047)
(HH: Snackathon Snackers,4122)
(HH: VR,2902)
(HH Free Stuff,2772)
(HH Growthhacking,2409)
(HH CTF,1481)
(HH Canada Eh?,1290)
(HH Skillshare,1180)
(HH Blog Posts,1066)
(HH Connect,1042)
(HH FIRST + VEX,976)
(HH: Book Club,719)
(HH EdTech,695)
(HH South,542)
(HH Python,479)
(HH Texas,419)
(HH Social Good,392)
(HH Systems Programming,374)
(HH Africa,358)
(HH Hardware Hackers,301)
(HH Futurism,298)
(HH Internet of Things,197)
(HH: Code Reviews,191)
(Hackathon Hackers Asia,186)
(HH Constructive Debates,128)
(HH Product Launch,86)
(HH: Share Your Projects,67)
(HH λ,62)
(Hackathon Hackers South East Asia (SEA),58)

Types

(like,516458)
(comment,155369)
(status,11430)
(link,6118)
(photo,859)
(video,688)
(event,164)
(note,2)
(offer,1)

Like ranking

(Hackathon Hackers,1000,status,#hackerinchief)
(Hackathon Hackers,1000,status,Mac is now supporting Windows!)
(Hackathon Hackers,903,photo,git commit -m "Fixed interface issues." source: twitter)
(Hackathon Hackers,757,photo,Zuck actually checks his facebook.)
(Hackathon Hackers,645,status,)
(Hackathon Hackers,626,photo,Yo! This guy's license plate says "NODE JS" #paloalto)
(Hackathon Hackers,606,status,ohhhhhhhhhh babyyyyyy ;))
(Hackathon Hackers,586,link,Thinking of dropping out? I wrote a bit on what you'll go through.)
(Hackathon Hackers,540,status,Who's down to bring a hackathon to Ahmed's community? We must encourage that kid to keep building and educate those around him. #hellyeah?)
(Hackathon Hackers,517,status,Seems legit!)
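
A ranking like the one above amounts to keeping the ten items with the most likes. The real computation is in general_information.pig; the Python sketch below only illustrates the idea. It assumes each item is a (subgroup, likes, type, message) tuple, matching the order of the fields shown above, and the function name is hypothetical.

```python
import heapq


def like_ranking(items, n=10):
    """Return the n items with the most likes, highest first.

    items: iterable of (subgroup, likes, type, message) tuples
    (assumed layout, mirroring the tuples listed above).
    """
    # nlargest avoids sorting the full dataset when n is small.
    return heapq.nlargest(n, items, key=lambda item: item[1])
```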

Likes and comments average

We only consider status, link, photo, video, and comment items. Each tuple is (type, average likes, average comments).

(link,11.275253350768224,5.315299117358614)
(photo,28.21885913853318,8.679860302677533)
(video,9.795058139534884,4.125)
(status,10.356167979002624,9.236482939632547)
(comment,1.9140304693986574,0.0)
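
These averages are a group-by on item type followed by a mean over the like and comment counts. The Python sketch below shows one way to compute them. It is not the Pig code from general_information.pig, and the (type, likes, comments) row layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict

# Only these item types are considered, as stated above.
CONSIDERED_TYPES = {"status", "link", "photo", "video", "comment"}


def average_likes_and_comments(rows):
    """Return {type: (average likes, average comments)}.

    rows: iterable of (item_type, likes, comments) tuples
    (hypothetical layout, not the real CSV schema).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # likes, comments, item count
    for item_type, likes, comments in rows:
        if item_type not in CONSIDERED_TYPES:
            continue  # skip events, notes, offers, and so on
        entry = totals[item_type]
        entry[0] += likes
        entry[1] += comments
        entry[2] += 1
    return {
        item_type: (likes / count, comments / count)
        for item_type, (likes, comments, count) in totals.items()
    }
```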

Instruction

Upload the Perl script to the webserver on the cluster

gcloud compute copy-files facebook_Graph_Data_Table.pl webserver:/tmp

Move the Perl script to /var/www/cgi-bin/yiqin/

sudo mv /tmp/facebook_Graph_Data_Table.pl /var/www/cgi-bin/yiqin/

Change the file mode so the script is executable.

sudo chmod 755 /var/www/cgi-bin/yiqin/facebook_Graph_Data_Table.pl

Upload the jar to Hadoop on the cluster

gcloud compute copy-files uber-yiqin-0.0.1-SNAPSHOT.jar hadoop-m:/mnt/scratch/yiqin

Run the jar on Hadoop on the cluster

hadoop jar uber-yiqin-0.0.1-SNAPSHOT.jar edu.uchicago.yiqin.SerializeFacebookGraphNode /mnt/scratch/yiqin/input

Don't forget to specify the class that contains main.

Switch the Pig file to cluster mode.

Open the URL

http://104.197.20.219/cgi-bin/yiqin/facebook_Graph_Data_Table.pl

Kafka

Create a topic

kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic yiqin-facebook

Read the data

kafka-console-consumer.sh --zookeeper localhost:2181 --topic yiqin-facebook --from-beginning
