- Read/ annotate: Recipe #3. You can refer back to this document to help you at any point during this lab activity.
- Read datasets from packages and from plain-text files
- Inspect and report characteristics of datasets
- Write datasets to a plain-text file (.csv or .tsv)
- Create a new R Markdown document. Title it "Lab #3" and provide add your name as the author.
- Edit the front matter to have rendered R Markdown documents print pretty tabular datasets:
output:
html_document:
df_print: kable
- Delete all the material below the front matter.
- Add a code chunk directly below the header named 'setup' and add the code to load the following packages
- tidyverse
- tadr
- Create two level-1 header sections named: "Package data" and "Local data".
- Follow the instructions that follow adding the relevant prose description and code chunks to the corresponding sections.
Remember:
- Add code comments (
# code comments...
) to your code lines to clarify what each step of your code does. - Use Markdown syntax as necessary to format your responses
- You can use keyboard shortcuts inside code chunks/ the R Console such as
- ⌥ - ('option + -') for the
<-
operator - ⇧⌘M ('shift + command + M') for the
%>%
operator - ⇥ ('tab') for code completion hints
- ⌥ - ('option + -') for the
Package data
You've already loaded the tadr package in the R session, now inspect the swda
dataset. Apply your conceptual and R programming knowledge to address the following points:
- Describe the type of data that is represented in the
swda
dataset. - Provide an overview ('glimpse' hint, hint) of the structure of the dataset (rows, columns, column types).
- Describe what the values in
dialect_area
represent. - Subset the dataset and only select the
doc_id
,speaker_id
,sex
, anddialect_area
columns. Assign the result to a new object (swda_sex_dialect
). - Show the first 10 rows of the
swda_sex_dialect
object. (Useslice_head
.) - Find out how many women versus men there are in this corpus. (Isolate the rows for which there are distinct value pairs for
speaker_id
andsex
, group bysex
, and then count the grouped rows.) - Find out which dialect area has the most represented speakers in this corpus. (Isolate the rows for which there are distinct value pairs for
speaker_id
anddialect_area
, group bydialect_area
, and then counting the grouped rows. You may want to arrange the results so that the largest counts (n
) appear in descending order.)
Local data
Now we are going to read a local dataset in plain-text format (endangered_languages.csv
) into an R session. This dataset is in the data/
directory. Apply your conceptual and R programming knowledge to address the following points:
- Read the
endangered_languages.csv
dataset in your R session and assign it to a new object (end_langs
). - Provide an overview of the structure of the dataset (rows, columns, column types).
- Visit the source of this data. Describe what this dataset represents.
- Find the distinct values for the
degree_of_endangerment
column. - Based on the source site, briefly describe what each value of
degree_of_endangerment
means. - Filter the dataset to only return the 'Vulnerable' languages and assign the result to a new object (
end_langs_vulnerable
). - With the new object (
end_langs_vulnerable
), identify the five 'Vulnerable' languages with the most speakers in theend_langs
dataset. - Write the
end_langs_vulnerable
object to a plain-text file (.csv or .tsv) in thedata/
directory. Give it the same name as the object with the extension (.csv or .tsv). - Copy and paste the following code into a new code chunk and provide some interpretation of what this plot shows or suggests.
end_langs_factor <-
end_langs %>%
mutate(degree_of_endangerment = factor(
degree_of_endangerment,
levels = c(
"Vulnerable",
"Definitely endangered",
"Severely endangered",
"Critically endangered",
"Extinct"
)
))
world_map <-
map_data(map = "world") %>%
filter(region != "Antarctica")
ggplot(data = world_map, aes(long, lat)) +
geom_polygon(aes(group = group), fill = "white", color = "grey50") +
geom_point(
data = end_langs_factor,
aes(x = longitude,
y = latitude,
color = degree_of_endangerment),
size = .5
) +
scale_color_discrete(direction = -1) +
theme(legend.position = "bottom", legend.direction = "vertical") +
labs(
title = "Endangered Languages",
y = "",
x = "",
color = "Degree of endangerment"
)
Add a section which describes your learning in this lab.
Some questions to consider:
- What did you learn?
- What was most/ least challenging?
- What resources did you consult?
- What more would you like to know about?
- To prepare your lab report for submission on Canvas you will need to Knit your R Markdown document to PDF or Word.
- Note: you will have to add the front matter line to pretty-print tables under the
pdf_document:
orpdf_document2:
(if you want cross-references to tables or figures) output.
output:
pdf_document:
df_print: kable
- Download this file to your computer.
- Go to the Canvas submission page for Lab #3 and submit your PDF/Word document as a 'File Upload'. Add any comments you would like to pass on to me about the lab in the 'Comments...' box in Canvas.