TylerToren / activity04-data-pipelines

Activity 4 - Data Pipelines

It is assumed that you have read Sections 5.1 - 5.4 from R4DS and completed the Isolating Data with dplyr Primer.

In this activity, you will:

  • Isolate variables and observations of a dataset using {dplyr}.
  • Rearrange a dataset by variables using {dplyr}.
  • Use the pipe %>% from {magrittr} to build data pipelines.
  • Name R code chunks in an .Rmd report.
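
As a preview of the first three goals, here is a minimal sketch of a {dplyr} pipeline. It assumes the built-in `mtcars` dataset purely for illustration; the activity's .Rmd file may use a different dataset, and the name `four_cyl` is hypothetical.

```r
library(dplyr)  # also makes the %>% pipe available

# Assumption: mtcars stands in for whatever dataset the activity uses.
four_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%   # isolate variables (columns)
  filter(cyl == 4) %>%       # isolate observations (rows)
  arrange(desc(mpg))         # rearrange rows, highest mpg first

four_cyl
```

Each line performs one step, so the pipeline reads top to bottom in the order the work is done.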

☑️ Task 1: The Workflow

Remember to take these steps slowly, help each other out, and get a hold of your instructor when you have questions or issues.

  1. In this GitHub repo, click on the Fork icon near the upper-right-hand corner. You should be taken to a copy of this repo that is in your GitHub account. Your page title should be {username}/activity04-data-pipelines, where {username} is replaced with your GitHub username.
  2. Click on the green Code button.
     • Verify that the drop-down identifies that you are using the HTTPS method (this is probably the default view; otherwise, select “HTTPS”).
     • Click on the clipboard icon to copy the repo HTTPS information.
  3. Log in to the RStudio Server.
     • Verify that you are in an RStudio session (it doesn’t matter if it is a previous Project session or a “vanilla” RStudio session).
  4. Create a new Project. You can do this by clicking on the new project icon or through the menus (File > New Project…).
     • In the New Project Wizard pop-up, select Version Control on the Create Project screen, then select Git on the Create Project from Version Control screen.
     • On the Clone Git Repository screen, paste the HTTPS information from (2) into the Repository URL dialog box. It should look like: https://github.com/<username>/activity04-data-pipelines.git
     • The Project directory name dialog box should automatically populate with your repository name, but sometimes Macs have an issue with this (if so, click into this box and press the command key on your keyboard). It should look something like: activity04-data-pipelines
     • In the Create project as subdirectory of dialog box, click on Browse.
     • In the Choose Directory pop-up, navigate to your class-level folder (i.e., you were encouraged to create a folder named either STA418 or STA518). You were also encouraged to create an activities folder within your class-level folder to help organize our materials. Once you have navigated to the folder where you want this repo to be located, click Choose.
     • Verify that the Create project as subdirectory of dialog box contains the folder location that you previously specified, then click on Create Project.
     • You may be asked to log in with your GitHub credentials on a Clone Repository pop-up window. Provide your GitHub username and PAT (not your GitHub password) if prompted.
  5. After a few seconds, your RStudio session will refresh and you should be in your newly created RStudio Project!

Starting with Activity 5, I will no longer provide these steps and will simply say, “Fork this repo and clone it to a new RStudio Project.” Also, setting up the Activity repo will be part of your Preparation “Do” work. Remember that more detailed directions are provided in this Activity.

Planned Pause Point: If you have any questions, contact your instructor or another group.

The Pipe %>%

I strongly encourage you to use the pipe %>% throughout this semester. The pipe comes from {magrittr} and lets you rewrite nested function calls as a sequence of steps that is easier to read and write. Consider the following lyrics from Bradford’s college days:

To the left
Take it back now y’all
One hop this time
Right foot let’s stomp
Left foot let’s stomp
Cha cha real smooth

One way to write this out is with nested functions:

cha_cha(lets_stomp(lets_stomp(hops(take_it(to_the(direction = "left"), direction = "back", when = "now", who = "y'all"), this_time = 1), foot = "right"), foot = "left"), style = "real smooth")

If we start in the inner-most function, we can then work our way out to see what we are trying to do. Pipes, however, clean this up:

to_the(direction = "left") %>% 
  take_it(direction = "back", when = "now", who = "y'all") %>% 
  hops(this_time = 1) %>% 
  lets_stomp(foot = "right") %>% 
  lets_stomp(foot = "left") %>% 
  cha_cha(style = "real smooth")

Writing this using pipes gives the statement a more natural structure. I will refer to this process of piping functions as a “pipeline”.
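
The dance functions above are not real R functions, so neither version runs as written. The sketch below defines hypothetical stand-ins (each one simply appends a lyric line to a character vector) so that you can run both forms and confirm that they produce the same result. The function bodies are assumptions made only for this illustration.

```r
library(magrittr)

# Hypothetical stand-ins: every step adds one lyric line to the running vector.
to_the     <- function(direction) paste("To the", direction)
take_it    <- function(moves, direction, when, who) c(moves, paste("Take it", direction, when, who))
hops       <- function(moves, this_time) c(moves, paste(this_time, "hop this time"))
lets_stomp <- function(moves, foot) c(moves, paste(foot, "foot let's stomp"))
cha_cha    <- function(moves, style) c(moves, paste("Cha cha", style))

# Nested version: read from the inside out.
nested <- cha_cha(lets_stomp(lets_stomp(hops(take_it(to_the(direction = "left"), direction = "back", when = "now", who = "y'all"), this_time = 1), foot = "right"), foot = "left"), style = "real smooth")

# Piped version: read from top to bottom.
piped <- to_the(direction = "left") %>%
  take_it(direction = "back", when = "now", who = "y'all") %>%
  hops(this_time = 1) %>%
  lets_stomp(foot = "right") %>%
  lets_stomp(foot = "left") %>%
  cha_cha(style = "real smooth")

identical(nested, piped)
```

The pipe passes the result of each step as the first argument of the next step, which is why every stand-in after `to_the()` takes `moves` as its first argument.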

Starting with R version 4.1.0, there is now a native pipe |> in R! However, as introduced, this pipe lacks one of the key features of the {magrittr} pipe: with %>%, the result of the previous step does not have to be the first argument of the next step, because you can place it elsewhere with the . placeholder. (R 4.2.0 added a more limited _ placeholder for |>, which must be passed to a named argument.) Therefore, I encourage you to stick with the {magrittr} pipe.
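
To make that feature concrete, here is a small sketch of the {magrittr} `.` placeholder. It assumes the built-in `mtcars` dataset and uses `lm()`, whose first argument is a formula rather than the data, so the piped data frame has to go to a later argument.

```r
library(magrittr)

# lm() expects the formula first, so the piped data frame is sent
# to the `data` argument using the `.` placeholder.
fit <- mtcars %>% lm(mpg ~ wt, data = .)

coef(fit)
```

Without the placeholder, `mtcars` would be passed as the first argument of `lm()`, which is not where the data belongs.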

☑️ Task 2: Complete the RMarkdown File

The activity04-data-pipelines.Rmd file contains the directions for this activity. For the rest of this class period, you will complete the RMarkdown document with your neighbor(s). Your instructor will be circulating and available to help when needed.

Note that each person is working in their own repo. We are not worrying about collaborating for the time being and will instead focus on becoming more comfortable with the workflow between RStudio and GitHub.

However, do not continue in this README document until you and your neighbor(s) have completed your .Rmd files.

☑️ Task 3: Reflection

Take 5 minutes to write a reflection on what you feel confident in and what you need to spend some time better understanding. What is one thing you can do to help clarify your current misunderstandings?

Next: Activity 5 will focus on isolating rows and columns of datasets.
