JoooostB / hva-data-scientist

Individual Assignment II

⚠️ This repo is not maintained!: Dependencies may be outdated or not compatible anymore!

Data Engineer & Data Scientist 2018

Using the application

To start the application we'll use the bokeh executable, so make sure Bokeh is installed in your Python 3 environment. Run the bokeh serve command with the --show option, followed by the directory that contains the Bokeh application:

bokeh serve --show bokeh_app

Individual Assignment II

In this assignment we’ll use the same dataset as before. Processing and analysing a Big Data set can be simulated on your own machine by using several libraries. The use and purpose of these libraries will be the topic of each lesson during this block. The main goal of the second assignment can be stated as: “Demonstrating how machine learning can be done in a Big Data environment”.

So, we want you to use:

  • Either all the data from the Kaggle dataset on hotel reviews
  • Or a dataset to be discussed with the teacher

Assignment goals:

  1. To obtain an attractive visual representation of all the data in the dataset, with interactive visual elements to support the so-called Visualisation mantra:
  • Overview: gain an overview of the entire collection
  • Zoom: zoom in on items of interest
  • Filter: filter out uninteresting items or filter in interesting items
  • Details on demand: select an item or group and get relevant information accordingly
  2. To simulate big data and RAM problems, additional libraries are used:
  • In case of R, for instance the library ffbase
  • In case of Python, for instance the library PyTables (after some initial selection and cleaning, the result should be written to Review_pos.csv and Review_neg.csv)
  3. All of the dataset is stored in a NoSQL database, for instance MongoDB. A live connection to filter data while the script is running should be implemented:
  • There should be a collection containing all of the data of the Kaggle dataset, having the following structure:
    • Hotel_Address text,
    • Hotel_Name text,
    • Lat double,
    • Lng double,
    • Average_Score double,
    • Total_Number_of_Reviews int,
    • Additional_Number_of_Scoring int,
    • Reviewer_Nationality text,
    • Review_Date text,
    • Review text,
    • Review_Word_Counts int,
    • Total_Number_of_Reviews_Reviewer_Has_Given int,
    • Reviewer_Score double,
    • Tags text,
    • Sentiment int, an additional field indicating a positive review (1) or a negative review (0)
  • There should be a collection holding a balanced set of reviews, for instance a collection consisting of 10,000 positive and 10,000 negative reviews, having at least the following structure:
    • Review text,
    • Sentiment int
  4. At least one more or less advanced feature should be implemented.
  • Either the student simulates parallel processing power. For example, a sentiment analysis of the hotel reviews is solved by using Spark.
  • Or the student demonstrates state-of-the-art algorithms. For example, a prediction of the sentiment of a review based on the length of the review, built using Keras.
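
As an illustration of the balanced collection described in the goals above, the following is a minimal sketch in plain Python, not the implementation used in this repo. It assumes each review is a dict with the Review and Sentiment fields listed above. The function name balanced_sample, the seed parameter, and the toy data are hypothetical and only serve the example.

```python
import random


def balanced_sample(records, n_per_class, seed=0):
    """Draw n_per_class positive and n_per_class negative reviews.

    Assumes every record is a dict with a "Review" string and a
    "Sentiment" int (1 = positive, 0 = negative).
    """
    positives = [r for r in records if r["Sentiment"] == 1]
    negatives = [r for r in records if r["Sentiment"] == 0]
    if len(positives) < n_per_class or len(negatives) < n_per_class:
        raise ValueError("not enough reviews in one of the classes")

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)
    rng.shuffle(sample)  # avoid all positives preceding all negatives

    # Keep only the two fields required for the balanced collection.
    return [{"Review": r["Review"], "Sentiment": r["Sentiment"]} for r in sample]


# Toy data standing in for the real dataset (hypothetical values).
reviews = [
    {"Review": f"review {i}", "Sentiment": 0 if i % 3 == 0 else 1, "Tags": ""}
    for i in range(30)
]
balanced = balanced_sample(reviews, n_per_class=5)
```

With the real data, n_per_class would be 10,000, and the resulting list of documents could then be written to its own MongoDB collection.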

License:GNU General Public License v3.0


Languages

Jupyter Notebook 91.0%, Python 9.0%