aronpalk / CEU-DV2

Materials for the "Data Visualization 2" class at CEU

This is the R script repository of the "Data Visualization 2: Practical Data Visualization with R" course in the 2019/2020 Winter term, part of the MSc in Business Analytics at CEU.

Schedule

3 x 2 x 100 mins on Jan 8, 15, 22:

  • 13:30 - 15:10 session 1
  • 15:10 - 15:30 break
  • 15:30 - 17:10 session 2

Syllabus

Please find it in the syllabus folder of this repository.

Technical Prerequisites

Please bring your own laptop and make sure to install the below items before attending the first class:

  1. Install R from https://cran.r-project.org
  2. Install RStudio Desktop (Open Source License) from https://www.rstudio.com/products/rstudio/download
  3. Register an account at https://github.com
  4. Enter the following commands in the R console (bottom left panel of RStudio) and make sure you see a plot in the bottom right panel and no errors in the R console:
    install.packages(c('ggplot2', 'gganimate', 'transformr', 'gifski'))
    library(ggplot2)
    library(gganimate)
    ggplot(diamonds, aes(cut)) + geom_bar() +
        transition_states(color, state_length = 0.1)

Optional steps that I highly recommend completing before attending the class if you plan to use git:

  1. Bookmark, watch or star this repository so that you can easily find it later

  2. Install git from https://git-scm.com/

  3. Verify that in RStudio you can see the path of the git executable in the "Git/SVN" tab of the Tools/Global Options menu. If not, you might have to restart RStudio (if you installed git after starting RStudio), or git may have been installed without being added to the PATH on Windows. In either case, browse for the "git executable" manually (look for the git executable file in a bin folder).

  4. Create an RSA key pair (optionally with a passphrase for increased security, which you then have to enter every time you push to or pull from GitHub). Copy the public key and add it to the SSH keys on your GitHub profile.

  5. Create a new project choosing "version control", then "git" and paste the SSH version of the repo URL copied from GitHub in the pop-up -- now RStudio should be able to download the repo. If it asks you to accept GitHub's fingerprint, say "Yes".

  6. If RStudio/git is complaining that you have to set your identity, click on the "Git" tab in the top-right panel, then click on the Gear icon and then "Shell" -- here you can set your username and e-mail address in the command line, so that RStudio/git integration can work. Use the following commands:

    $ git config --global user.name "Your Name"
    $ git config --global user.email "Your e-mail address"

    Close this window, commit, push changes, all set.

Find more resources in Jenny Bryan's "Happy Git and GitHub for the useR" tutorial if in doubt or contact me.

Class Schedule

Will be updated from week to week.

Week 1

  1. Warm-up exercise and security reminder: 1.R
  2. Intro / recap on R and ggplot2 from previous courses by introducing MDS: 1.R
  3. Scaling / standardizing variables: 1.R
  4. Simpson's paradox: 1.R
  5. Intro to data.table: 1.R
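
As a rough illustration of two of the topics above (MDS and standardizing variables), here is a minimal sketch. It is not taken from 1.R: it assumes only base R and the built-in eurodist and mtcars datasets, and the object names (mds, cars_scaled, cars_mds) are made up for this example.

```r
# Classical multidimensional scaling (MDS) on a built-in distance matrix.
# eurodist holds road distances between 21 European cities.
mds <- as.data.frame(cmdscale(eurodist, k = 2))
names(mds) <- c('x', 'y')
mds$city <- rownames(mds)

# Draw the cities at their MDS coordinates, labelled by name.
plot(mds$x, mds$y, type = 'n', xlab = 'MDS dimension 1', ylab = 'MDS dimension 2')
text(mds$x, mds$y, labels = mds$city, cex = 0.7)

# Standardizing variables before computing distances: scale() centers each
# column to mean 0 and rescales it to standard deviation 1, so that no single
# variable dominates the Euclidean distance just because of its units.
cars_scaled <- scale(mtcars)
cars_mds <- as.data.frame(cmdscale(dist(cars_scaled), k = 2))
names(cars_mds) <- c('x', 'y')
head(cars_mds)
```

Without the scale() step, variables measured on large scales (such as disp or hp in mtcars) would drive most of the distance between cars.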

Suggested reading:

Homework:

  1. Load bookings data from http://bit.ly/CEU-R-hotels-2018-prices and the hotel features from http://bit.ly/CEU-R-hotels-2018-features
  2. Count the number of 4-star hotels in Hungary
  3. Compute the average rating of 4 and 5 star hotels in Hungary and Germany
  4. Round up the previously computed average rating to 2 digits
  5. Do we have any bookings in unknown hotels (as per the features dataset)?
  6. Remove the bookings from unknown hotels from the bookings dataset and print the number of remaining bookings
  7. What's the average distance of hotels from the city center in Budapest?
  8. List all neighbourhoods in Budapest
  9. Compute the average distance from the city center for the neighbourhoods in Budapest
  10. Count the number of bookings in Hungary

Homework extra:

  1. Create a scatterplot on the iris dataset using the length and width of sepal + 4 linear models (3 colored lines per species, 1 black line fitted on the global dataset)

Submission: prepare an R markdown document that includes the exercise as a regular paragraph then the solution in an R code chunk (printing both the code and its output) and knit to HTML or PDF and upload to Moodle before Jan 14 midnight (CET)

Week 2

  1. Homework solutions: 2.R
  2. Hierarchical clustering, dendrograms: 2.R
  3. Revisit MDS with animation: 2.R
  4. Anscombe's quartet: 2.R
  5. Datasaurus: 2.R
  6. Geocoding and loading data from the Internet: 2.R
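
For orientation, hierarchical clustering and Anscombe's quartet can be sketched with base R alone. This is an assumption-laden example rather than the content of 2.R: it uses the built-in mtcars and anscombe datasets, and the names hc, clusters and quartet_stats are hypothetical.

```r
# Hierarchical clustering on standardized variables of a built-in dataset.
hc <- hclust(dist(scale(mtcars)))  # complete linkage is the default method
plot(hc)                           # draws the dendrogram
clusters <- cutree(hc, k = 3)      # cut the tree into 3 groups
table(clusters)

# Anscombe's quartet: four x/y pairs with near-identical summary statistics
# that look completely different when plotted.
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0('x', i)]]
  y <- anscombe[[paste0('y', i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
round(quartet_stats, 2)
```

The four columns of quartet_stats agree to about two decimals, which is the reason to plot the data instead of relying on summaries alone.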

Suggested reading:

Homework:

  1. Load the nycflights13 package and check what kind of datasets exist in the package, then copy the flights dataset into a data.table object called flight_data.
  2. Which destination had the lowest average arrival delay from LGA, with a minimum of 100 flights to that destination?
  3. Which destination's flights were the most on time (average arrival delay closest to zero) from LGA, with a minimum of 100 flights to that destination?
  4. Who is the manufacturer of the plane that flies the most to the CHS destination?
  5. Which airline (carrier) flew the most by distance?
  6. Plot the monthly number of flights with 20+ mins arrival delay!
  7. Plot the departure delay of flights going to IAH and the related day's wind speed on a scatterplot! Is there any association between the two variables? Try adding a linear model.
  8. Plot the airports as per their geolocation on a world map, mapping the number of flights going to that destination to the size of the symbol!

If in doubt about the results and outputs, see this example submission prepared by Misi.

Submission: prepare an R markdown document that includes the exercise as a regular paragraph then the solution in an R code chunk (printing both the code and its output) and knit to HTML or PDF and upload to Moodle before Jan 21 midnight (CET)

Week 3

  1. Homework solutions: 3.R
  2. Alternatives to boxplot: 3.R
  3. Creating new variables: numeric to factor: 3.R
  4. Multiple summaries: 3.R
  5. Tweaking ggplot2 themes: 3.R
  6. Introduction to interactive plots: 3.R
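
Two of the topics above, converting a numeric variable to a factor and computing multiple summaries per group, might look roughly like this in base R. The sketch is not the approach used in 3.R: the break points, the labels and the names cars, hp_group and summaries are arbitrary choices for illustration.

```r
# Numeric to factor: bin horsepower into three labelled categories.
# The intervals are (0, 100], (100, 200] and (200, Inf).
cars <- mtcars
cars$hp_group <- cut(cars$hp,
                     breaks = c(0, 100, 200, Inf),
                     labels = c('low', 'medium', 'high'))
table(cars$hp_group)

# Multiple summaries per group in one call: the function returns a named
# vector, so aggregate() stores the results as a matrix column called mpg.
summaries <- aggregate(mpg ~ hp_group, data = cars,
                       FUN = function(x) c(mean = mean(x),
                                           median = median(x),
                                           n = length(x)))
summaries
```

The new factor can then be used for grouping or coloring in plots, for example as the x variable of a boxplot or one of its alternatives.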

Final project:

Participate in the 4th week of #tidytuesday at https://github.com/rfordatascience/tidytuesday -- feel free to use the bundled spotify_songs.csv dataset (or provide your own data collected from Spotify), optionally merge external dataset(s), do data transformations that seem useful and generate data visualizations that make sense and are insightful, plus provide comments on those in plain English.

Submission: prepare an R markdown document that includes plain English text description of the dataset, problem/question you analyzed, actual R code chunks (printing both the code and its output) doing the analysis, comments and summary of results and knit to HTML or PDF and upload to Moodle before Feb 16 midnight (CET). Please don't leave the submission for the last minute and be sure to submit by Feb 9 if you would like to get some feedback before the final deadline.

Grading: reading the required data and doing some plots in an R Markdown document as per the above specs will get you a pass, but please actually spend time on getting familiar with the data, and do a proper analysis for better grades (there are no hard specs, though, so make use of your common sense).

Contact

File a GitHub ticket.
