- Read/annotate: Recipe #9. You can refer back to this document to help you at any point during this lab activity.
- Note: do your best to employ what you've learned and use other existing resources (R documentation, web searches, etc.).
- Gain experience working with coding strategies to prepare, assess, interrogate, evaluate, and report results from an inferential data analysis.
- Practice transforming datasets and visualizing relationships
- Implement strategies for organizing and reporting results in a reproducible fashion.
In this lab we will be working with the `sdac_transformed` dataset that we've seen earlier. This dataset is based on the Switchboard Dialogue Act Corpus. We have seen the process to curate and transform this dataset in previous chapters and recipes. An important feature of this dataset is that it includes utterance-level discourse annotation. The convention used to annotate each utterance is called DAMSL, which stands for Dialog Act Markup in Several Layers. Here is the annotation documentation for the Switchboard Dialogue Act Corpus for reference.
The aim of this lab will be to use the transformed dataset to analyze a particular alternative hypothesis:
Then, the null hypothesis is:
Background information for the analysis:
Lakoff (1973) argues that women express themselves tentatively without warrant or justification more often than men. This suggestion would predict that women will use more hedges than men. Holmes (1990) argues against this notion. What is a hedge? A hedge is used to diminish the confidence or certainty with which the speaker makes a statement or answers a question.
Examples of hedges:
(1) General example:
- I don't know if I'm making any sense or not.
(2) In context from the Switchboard Dialogue Act Corpus:
- You might try,
- I don't know,
- hold it down a little longer,
In the Switchboard Dialogue Act Corpus hedges are marked in the DAMSL tag annotation as `h` (hedge), `h^r` (repeated hedge), or `h^t` (hedge when talking about the task).
With our theoretical aim in mind we will want to prepare, assess, interrogate, evaluate, and report results from this analysis.
- Create a new R Markdown document. Title it "Lab 9" and add your name as the author.
- Edit the front matter so that the R Markdown document renders as you see fit (table of contents, numbered sections, etc.)
- Delete all the material below the front matter.
- Add a code chunk named 'setup' directly below the header and add the code to load the following packages and any others you end up using in this lab report. Add `message=FALSE` to this code chunk to suppress messages.
- tidyverse
- knitr
- skimr
- patchwork
- effectsize
- report
- Also include `source()` to source the `functions/functions.R` file. This will import the `print_pretty_table()` function.
NOTE Please pay attention to the formatting of your R Markdown output, particularly in terms of the code chunk options (`echo = FALSE`, `message = FALSE`, etc.). Also use the `print_pretty_table()` function for all of your table outputs. Include the following two arguments:
dataset %>% # dataset
print_pretty_table(caption = "<your caption here>") # pretty table with caption
- Create two level-1 header sections named: "Overview" and "Tasks".
- Under "Tasks" create six level-2 header sections named: "Orientation", "Preparation", "Descriptive assessment", "Statistical interrogation", "Evaluation" and "Reporting".
- Follow the instructions below, adding the relevant prose descriptions and code chunks to the corresponding sections.
- Make sure to provide descriptions of your steps between code chunks and code comments within the code chunks!
- Read the `sdac_transformed.csv` into an object called `sdac` and `sdac_transformed_data_dictionary.csv` as an object called `sdac_data_dictionary`.
  - Preview the data structure of `sdac` and provide a prose description of the dataset.
  - Print a table of the `sdac_data_dictionary` (use `print_pretty_table()`) and provide a prose description of the data dictionary.
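A minimal sketch of this orientation step, assuming the `readr` and `dplyr` packages (both part of the tidyverse). The location of the lab's CSV files is not given here, so the example writes a tiny stand-in file to a temporary path. The column names (`speaker_id`, `sex`, `age`, `damsl_tag`) are assumptions for illustration and should be checked against the data dictionary.

```r
library(readr) # read_csv(), write_csv()
library(dplyr) # glimpse()

# Hypothetical stand-in for sdac_transformed.csv (not the real data)
toy_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    speaker_id = c(1001, 1001, 1002),
    sex = c("FEMALE", "FEMALE", "MALE"),
    age = c(35, 35, 52),
    damsl_tag = c("sd", "h", "b") # assumed name for the DAMSL tag column
  ),
  toy_path
)

# In the lab, replace toy_path with the path to sdac_transformed.csv
sdac_toy <- read_csv(toy_path, show_col_types = FALSE)

# Column names, column types, and the first few values of each column
glimpse(sdac_toy)
```

`glimpse()` is one convenient way to preview structure; `str()` from base R would serve the same purpose.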
Note: we will include the variable `age` in our analysis as a control factor to ensure that we account for any variability due to the age of the speakers.
- Modify the `sdac` dataset and create a new object `sdac_hedges`. This object will create a new column `hedges` that is the result of counting all utterances in which hedges occur. You will use `mutate()` to create the new column and the `str_count()` function to match hedges. To help you out, the regular expression to match all hedges will be `^h(\\^r|\\^t)?`.
- Sum and normalize the number of hedges used by each speaker. To do this you will group the variables `speaker_id`, `sex`, and `age` and then use `summarize()` to create a variable `hedges_per_utt`. To sum and normalize the hedges, use `(sum(hedges) / n()) * 1000` inside the `summarize()` function.
- Preview the new `sdac_hedges` dataset using the `print_pretty_table()` function.
- Note that speaker 155 has incomplete `sex` and `age` information. Remove this observation by using `filter()` and overwrite `sdac_hedges` with the result. Our dataset will contain one fewer speaker now.
- Finally, convert the variables `sex` and `age` to factors using `mutate()` and `factor()`.
  - Preview the structure of the dataset `sdac_hedges`
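The counting and normalizing steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: `damsl_tag` is an assumed name for the column holding the DAMSL tags, and the toy values are made up. Because the regular expression is anchored with `^`, `str_count()` returns 1 for an utterance whose tag starts with a hedge tag and 0 otherwise.

```r
library(dplyr)   # tibble(), mutate(), group_by(), summarize(), %>%
library(stringr) # str_count()

# Toy utterance-level data; `damsl_tag` is an assumed column name
sdac_toy <- tibble(
  speaker_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  sex        = c(rep("FEMALE", 4), rep("MALE", 4)),
  age        = c(rep(30, 4), rep(45, 4)),
  damsl_tag  = c("sd", "h", "h^r", "b", "sv", "h^t", "aa", "sd")
)

hedges_toy <- sdac_toy %>%
  # 1 if the tag starts with h, h^r, or h^t; otherwise 0
  mutate(hedges = str_count(damsl_tag, "^h(\\^r|\\^t)?")) %>%
  # one row per speaker, keeping sex and age alongside the id
  group_by(speaker_id, sex, age) %>%
  # hedged utterances per 1,000 utterances for each speaker
  summarize(hedges_per_utt = (sum(hedges) / n()) * 1000, .groups = "drop")

hedges_toy
```

In this toy example speaker 1 hedges in 2 of 4 utterances (500 per 1,000) and speaker 2 in 1 of 4 (250 per 1,000). Normalizing this way keeps speakers with different numbers of utterances comparable.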
- Use `skim()` to look at the categorical variable `sex` (deselect the variable `speaker_id` as it is of no interest to our assessment).
  - Pull out only the factor-oriented information with `yank("factor")`.
  - Provide a prose description of the numeric results.
- Use the following custom skim function, which adds the IQR to the skim output, to look at the numeric variables `age` and `hedges_per_utt`: `num_skim <- skim_with(numeric = sfl(iqr = IQR))`
  - Pull out only the numeric-oriented information with `yank("numeric")`.
  - Provide a prose description of the numeric results.
- Explore the distribution of the dependent variable `hedges_per_utt` by creating a histogram and a density plot. Combine them in the plotting space by assigning each plot to a variable (e.g. `p1` and `p2`) and then use the `+` operator to display them both inline in the R Markdown output.
  - Provide a prose description of the visual results.
- As you will see, the distribution is right-skewed. But since `hedges_per_utt` is not discrete (not whole numbers) and can take a range of values (floating points, i.e. decimal places), let's explore a transformation of this variable known as the 'log transformation'. Create another density plot, but wrap the `hedges_per_utt` variable with the function `log()`.
  - Describe how the distribution has changed.
- (Optional) You may also want to create a QQ-Plot to see the distribution of `hedges_per_utt` compared to the theoretical normal distribution.
  - Describe the degree to which the distribution visually conforms to the normal distribution.
- Create a new variable in our dataset called `hedges_per_utt_log` that applies a log transformation to `hedges_per_utt`. Note that you will use the function `log()` to create this variable, but you will need to add one to all observations to avoid the undefined value produced by `log(0)`.
- Perform the Shapiro-Wilk Test of Normality on the new `hedges_per_utt_log` variable to verify whether it conforms to the normal distribution.
  - Hint: it will not, but we can visually see that the log transformation helps the distribution, so we will proceed with `hedges_per_utt_log` as our dependent variable.
- Create a numeric summary looking at the relationship between the variables we are going to add to our statistical model. Let's group our summaries by the categorical variable `sex` and use `num_skim()` and `yank("numeric")`.
  - Provide a prose description of the numeric results.
- Create a scatter plot with the mappings `x = age`, `y = hedges_per_utt_log`, and `color = sex`. The `geom_point()` function will create the points and `geom_smooth()` will create the trend line and confidence interval ribbons. Inside `geom_smooth()` add `method = "lm"` to create a linear trend line.
  - Provide a prose description of the visual results.
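The plotting, transformation, and normality-test steps above might look roughly like the following sketch. It runs on simulated, right-skewed stand-in data (not the lab dataset), so the shapes and test results will differ from what you see with `sdac_hedges`. The object names `toy`, `combined`, `sw`, and `p3` are placeholders.

```r
library(ggplot2)   # plotting
library(patchwork) # lets `+` place two ggplots side by side

# Simulated right-skewed stand-in for the per-speaker data (not real data)
set.seed(42)
toy <- data.frame(
  sex = factor(rep(c("FEMALE", "MALE"), each = 50)),
  age = round(runif(100, min = 20, max = 65)),
  hedges_per_utt = rexp(100, rate = 0.2)
)

# Histogram and density plot, combined with patchwork's `+`
p1 <- ggplot(toy, aes(x = hedges_per_utt)) + geom_histogram(bins = 20)
p2 <- ggplot(toy, aes(x = hedges_per_utt)) + geom_density()
combined <- p1 + p2
combined

# Add 1 before taking the log so that zero values stay defined
toy$hedges_per_utt_log <- log(toy$hedges_per_utt + 1)

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(toy$hedges_per_utt_log)
sw

# Scatter plot with a linear trend line and confidence ribbon for each sex
p3 <- ggplot(toy, aes(x = age, y = hedges_per_utt_log, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm")
p3
```

Base R's `log1p(x)` computes the same quantity as `log(x + 1)` and could be used instead.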
- Conduct an Ordinary Least Squares Regression with the `lm()` function. Assign the result to `m1`.
  - The formula will be `hedges_per_utt_log ~ age + sex`.
  - Return a summary of the results by running `summary()` on the `m1` object.
  - Provide a description of the results, focusing on the 'Coefficients' for our model variables. Remember that `age` is a control variable in this model!
- Calculate the effect size and confidence intervals for our model predictors (independent and control variables) by using the `effectsize()` function. Assign the result to `effects`.
  - Preview the results of `effects`.
- Evaluate the effect size of our only significant predictor (`age`) by using the `interpret_r()` function on the correct value from the `effects` object (i.e. `effects$Std_Coefficient[2]`).
  - Provide a prose description of the findings from our evaluation.
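The modeling and evaluation steps above can be sketched end to end on simulated data. The data below are invented with a built-in age effect purely so the code runs; they say nothing about the real corpus. The sketch assumes that `effectsize()` applied to an `lm` object returns standardized coefficients in a `Std_Coefficient` column, as the instructions above imply.

```r
library(effectsize) # effectsize(), interpret_r()

# Simulated stand-in data with a built-in age effect (not real data)
set.seed(123)
n <- 200
toy <- data.frame(
  age = round(runif(n, min = 20, max = 65)),
  sex = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE))
)
toy$hedges_per_utt_log <- 1 + 0.02 * toy$age + rnorm(n, sd = 0.5)

# Ordinary least squares: outcome ~ control variable + predictor of interest
m1 <- lm(hedges_per_utt_log ~ age + sex, data = toy)
summary(m1) # the 'Coefficients' table holds estimates and p-values

# Standardized coefficients with confidence intervals
effects <- effectsize(m1)
effects

# Row 1 is the intercept, so row 2 corresponds to `age`
interpret_r(effects$Std_Coefficient[2])
```

`interpret_r()` also accepts a `rules` argument for choosing a different set of interpretation thresholds. The default is used here.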
- Create the boilerplate text from our findings using the `report_text()` function on the statistical model `m1`
- Create a table of the model results using the `report_table()` function on the statistical model `m1`
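As a rough illustration of the two reporting functions, the sketch below fits a stand-in model on R's built-in `mtcars` data rather than the lab's `m1`. The names `cars`, `m_toy`, `txt`, and `tab` are placeholders.

```r
library(report) # report_text(), report_table()

# Stand-in model on built-in data (not the lab's model)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
m_toy <- lm(mpg ~ wt + am, data = cars)

# Prose summary of the model, which can be adapted for a write-up
txt <- report_text(m_toy)
txt

# Tabular summary of the model parameters and fit
tab <- report_table(m_toy)
tab
```

In the lab report itself, the table can then be passed to `print_pretty_table()` like any other table output.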
Now that you have conducted the steps to analyze the dataset, provide a prose overview of the goals of this script and the resulting findings at the beginning of your script in the 'Overview' section.
Add a level-1 section which describes your learning in this lab.
Some questions to consider:
- What did you learn?
- What was most/least challenging?
- What resources did you consult?
- What more would you like to know about?
- To prepare your lab report for submission on Canvas you will need to Knit your R Markdown document to PDF or Word.
- Note: since the analysis contains some special characters, you will need to change the LaTeX engine if you knit this document to a PDF file. To do this, use the RStudio shortcut button to open 'Output options...', select the output format 'PDF', then select 'Advanced' and choose 'xelatex' as the LaTeX engine.
- Download this file to your computer.
- Go to the Canvas submission page for Lab #9 and submit your PDF/Word document as a 'File Upload'. Add any comments you would like to pass on to me about the lab in the 'Comments...' box in Canvas.
Holmes, J. (1990). Hedges and boosters in women’s and men’s speech. Language & Communication, 10(3), 185–205. https://doi.org/10.1016/0271-5309(90)90002-S
Lakoff, R. (1973). Language and Woman’s Place. Language in Society, 2(1), 45–80.