- Read and annotate Recipe #7. You can refer back to this document at any point during this lab activity.
- Note: do your best to employ what you've learned and use other existing resources (R documentation, web searches, etc.).
- Gain experience with coding strategies for reshaping data using tidyverse functions and regular expressions.
- Practice reading and writing data from and to disk.
- Implement strategies for organizing and documenting a dataset in a reproducible fashion.
- Create a new R Markdown document. Title it "Lab 7" and add your name as the author.
- Edit the front matter to have rendered R Markdown documents print pretty tabular datasets.
- Delete all the material below the front matter.
- Add a code chunk directly below the header named 'setup' and add the code to load the following packages and any others you end up using in this lab report. Add `message=FALSE` to this code chunk's options to suppress messages.
  - tidyverse
  - readtext
  - tidytext
  - Also include `source()` to source the `functions/functions.R` file.
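A minimal setup chunk might look like the following (with `message=FALSE` set in the chunk options, and assuming the custom functions live in `functions/functions.R` as described above):

```r
# Load the packages used in this lab
library(tidyverse)  # data manipulation and tidying
library(readtext)   # reading corpus files from disk
library(tidytext)   # tokenizing text into sentences

# Source the custom functions provided for this lab
source("functions/functions.R")
```
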
- Create two level-1 header sections named: "Overview" and "Tasks".
- Under "Tasks" create four level-2 header sections named: "Orientation", "Tidy the data", "Write the dataset", and "Documentation".
- Follow the instructions below, adding the relevant prose descriptions and code chunks to the corresponding sections.
- Make sure to provide descriptions of your steps between code chunks and code comments within the code chunks!
- Read information about the ACTIV-ES Corpus.
- Provide a quick description of the `plain` data.
- View one or two of the `.run` files in the `data/original/actives/` directory.
- Propose an idealized tidy dataset structure (use the `tribble()` function) where the unit of analysis is 'sentence'.
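As a sketch, an idealized sentence-level structure could be proposed with `tribble()`. The column names and values here are illustrative placeholders, not a prescribed answer:

```r
# Idealized tidy dataset: one row per sentence (illustrative values)
data_ideal <- tribble(
  ~document_id, ~sentence_id, ~sentence,
  "doc_1",      1,            "First sentence of the transcript.",
  "doc_1",      2,            "Second sentence of the transcript.",
  "doc_2",      1,            "First sentence of another transcript."
)

data_ideal
```
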
- Read the `.run` corpus files into the R session using the `readtext()` function.
- Inspect and provide a prose description of the structure of the resulting object.
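One way to read all the `.run` files at once is to pass a glob pattern to `readtext()`; the path below assumes the `data/original/actives/` directory named above, and the object name `actives` is a choice of convenience:

```r
# Read all .run corpus files into a single data frame
actives <- readtext::readtext("data/original/actives/*.run")

# readtext() returns a data frame with a doc_id and a text column
glimpse(actives)
```
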
Metadata
- Curate the metadata found in the `doc_id` column of the data frame. Use the `separate()` function to segment the values found in `doc_id` by underscores (`_`) into seven new columns corresponding to the metadata found in each.
  - Preview the dataset structure to ensure that the process was successful.
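Assuming the file names encode seven underscore-delimited metadata fields, the `separate()` call might look like this. The column names in `into` are hypothetical placeholders; replace them after inspecting the actual `doc_id` values:

```r
# Split doc_id into seven metadata columns (names are placeholders)
actives <- actives |>
  separate(
    col = doc_id,
    into = c("language", "country", "year", "type",
             "genre", "title", "imdb_id"),
    sep = "_"
  )

# Verify the split worked
glimpse(actives)
```
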
- Next, clean the `title` and `imdb_id` columns:
  - `title` contains hyphens (`-`). Replace all the hyphens with whitespace. You will most likely use the `mutate()` function to create a new column (overwriting the existing column) and the `str_replace_all()` function to find the hyphens and replace them with whitespace.
  - `imdb_id` contains a trailing `.run` on each of the ids. Remove this information, leaving only the IMDb ID. Again use `mutate()` to overwrite the existing `imdb_id` column and use the `str_remove()` function to remove the `.run` characters. (Note: you will need to escape the `.` as it has a special meaning in regular expressions!)
  - Preview the dataset structure to ensure that the process was successful.
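The two cleaning steps above can be sketched in a single `mutate()` call with the stringr helpers; note the escaped `.` in the `str_remove()` pattern:

```r
actives <- actives |>
  mutate(
    # Replace all hyphens in titles with spaces
    title = str_replace_all(title, "-", " "),
    # Drop the trailing ".run" (the dot is escaped in the regex)
    imdb_id = str_remove(imdb_id, "\\.run")
  )

# Verify the cleaning worked
glimpse(actives)
```
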
Text
- Curate the `text` column by segmenting the individual TV/movie transcripts into sentences. You will use the `unnest_tokens()` function from the tidytext package.
  - Specify the input and output columns, use the `token = 'sentences'` argument-value to segment the text into sentences, and include the argument-value `to_lower = FALSE` to avoid lowercasing the text in the output.
  - Preview the dataset structure to ensure that the process was successful.
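A sketch of the sentence segmentation step, assuming the corpus object is named `actives` and choosing `sentence` as the output column name:

```r
# Segment each transcript into one row per sentence
actives_sentences <- actives |>
  unnest_tokens(
    output = sentence,    # new column holding each sentence
    input = text,         # column containing the full transcript
    token = "sentences",  # tokenize at the sentence level
    to_lower = FALSE      # keep the original casing
  )

# Verify the segmentation worked
glimpse(actives_sentences)
```
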
- Inspect the overall structure of the dataset using `glimpse()`. Report the number of rows (i.e. sentences) contained in the curated dataset.
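For example, assuming the curated object is named `actives_sentences`:

```r
glimpse(actives_sentences)  # overall structure: columns and types
nrow(actives_sentences)     # number of rows, i.e. sentences
```
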
- Write the curated dataset to disk as a `.csv` file. Add this file to the `data/derived/actives/` directory.
  - Note: you will need to create the subdirectory (`data/derived/actives/`) using the `fs::dir_create()` function.
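Creating the subdirectory and writing the file can be sketched as follows; the output file name is an assumption, so choose one that clearly identifies your dataset:

```r
# Create the output directory if it does not already exist
fs::dir_create("data/derived/actives/")

# Write the curated dataset to disk as a CSV file
write_csv(actives_sentences, "data/derived/actives/actives_sentences.csv")
```
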
You can add a preview of the structure of the `data/derived/` directory using the following code inside a code chunk.

```r
fs::dir_tree("data/derived/")
```
- Use the `data_dic_starter()` function that was sourced from the `functions/functions.R` file to create the starter documentation file. Be sure to name your documentation file so that it is clear that this data dictionary file corresponds to the curated dataset you've just created.
  - Make sure to add `eval=FALSE` to the code chunk that creates the documentation starter file. This will ensure that when you knit this R Markdown document in the future, it will not overwrite the updates to this file that you will perform in the next steps!
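Since `data_dic_starter()` is a custom function sourced from `functions/functions.R`, its exact signature depends on that file; a plausible call (argument names and file name are assumptions) might be:

```r
# NOTE: set eval=FALSE on this chunk so re-knitting does not
# overwrite your edited data dictionary.
# Argument names below are assumptions; check functions/functions.R.
data_dic_starter(
  data = actives_sentences,
  file_path = "data/derived/actives/actives_sentences_data_dictionary.csv"
)
```
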
- Download the starter documentation `.csv` file from RStudio Cloud to your computer and edit this `.csv` file in spreadsheet software (such as MS Excel or Apple Numbers), adding the relevant documentation information.
- After updating this `.csv` file in spreadsheet software, save it as a `.csv` and upload it to RStudio Cloud, overwriting the original starter documentation.
- Read the updated documentation `.csv` file and print the table structure to your R Markdown output.
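Reading the edited dictionary back in and printing it can be sketched as follows; the file name assumes it matches the one used when the starter file was created:

```r
# Read the updated data dictionary
data_dictionary <-
  read_csv("data/derived/actives/actives_sentences_data_dictionary.csv")

# Print the table; it renders per the front matter's table-printing option
data_dictionary
```
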
Now that you have conducted the steps to curate and document the ACTIV-ES corpus files, provide a prose overview of the goals of this script and of the resulting data structure and files created.
Add a level-1 section which describes your learning in this lab.
Some questions to consider:
- What did you learn?
- What was most/least challenging?
- What resources did you consult?
- What more would you like to know about?
- To prepare your lab report for submission on Canvas, you will need to Knit your R Markdown document to PDF or Word.
- Note: since the data in this lab include accented characters (Spanish), you will need to change the LaTeX engine if you knit this document to a PDF file. To do this, use the RStudio shortcut button to open 'Output Options...', select output format 'PDF', then select 'Advanced' and choose 'xelatex' as the LaTeX engine.
- Download this file to your computer.
- Go to the Canvas submission page for Lab #7 and submit your PDF/Word document as a 'File Upload'. Add any comments you would like to pass on to me about the lab in the 'Comments...' box in Canvas.