Devlopment Moved to clustext Package -CLICK HERE-
hclustext is a collection of optimized tools for clustering text data via hierarchical clustering. There are many great R clustering tools to locate topics within documents. I have had success with hierarchical clustering for topic extraction. This package wraps many of the great R tools for clustering and working with sparse matrices to aide in the workflow associated with topic extraction.
The general idea is that we turn the documents into a matrix of words. After this we weight the terms by importance using tf-idf. This helps the more salient words to rise to the top. We then apply cosine distance measures to compare the terms (or features) of each document. Cosine distance works well with sparse matrices to produce distances metrics between the documents. The hierarchical clustering is fit to separate the documents into clusters. The user then may apply k clusters to the fit, clustering documents with similar important text features. The documents can then be grouped by clusters and their accompanying salient words extracted as well.
The main functions, task category, & descriptions are summarized in the table below:
Function | Category | Description |
---|---|---|
data_store |
data structure | hclustext's data structure (list of dtm + text) |
hierarchical_cluster |
cluster fit | Fits a hierarchical cluster model |
assign_cluster |
assignment | Assigns cluster to document/text element |
get_text |
extraction | Get text from various hclustext objects |
get_dtm |
extraction | Get tm::DocumentTermMatrix from various hclustext objects |
get_removed |
extraction | Get removed text elements from various hclustext objects |
get_terms |
extraction | Get clustered weighted important terms from an assign_cluster object |
get_documents |
extraction | Get clustered documents from an assign_cluster object |
To download the development version of hclustext:
Download the zip
ball or tar
ball, decompress
and run R CMD INSTALL
on it, or use the pacman package to install
the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
"trinker/textshape",
"trinker/gofastr",
"trinker/termco",
"trinker/hclustext"
)
You are welcome to:
- submit suggestions and bug-reports at: https://github.com/trinker/hclustext/issues
- send a pull request on: https://github.com/trinker/hclustext/
- compose a friendly e-mail to: tyler.rinker@gmail.com
if (!require("pacman")) install.packages("pacman")
pacman::p_load(hclustext, dplyr, textshape, ggplot2, tidyr)
data(presidential_debates_2012)
The data structure for hclustext is very specific. The
data_storage
produces a DocumentTermMatrix
which maps to the
original text. The empty/removed documents are tracked within this data
structure, making subsequent calls to cluster the original documents and
produce weighted important terms more robust. Making the data_storage
object is the first step to analysis.
We can give the DocumentTermMatrix
rownames via the doc.names
argument. If these names are not unique they will be combined into a
single document as seen below. Also, if you want to do stemming, minimum
character length, stopword removal or such this is when/where it's done.
ds <- with(
presidential_debates_2012,
data_store(dialogue, doc.names = paste(person, time, sep = "_"))
)
ds
## Text Elements : 10
## Elements Removed : 0
## Documents : 10
## Terms : 3,369
## Non-/sparse entries: 7713/25977
## Sparsity : 77%
## Maximal term length: 16
Next we can fit a hierarchical cluster model to the data_store
object
via hierarchical_cluster
.
myfit <- hierarchical_cluster(ds)
myfit
##
## Call:
## fastcluster::hclust(d = cosine_distance(x[["dtm"]]), method = method)
##
## Cluster method : ward.D
## Number of objects: 10
This object can be plotted with various k
or h
parameters specified
to experiment with cutting the dendrogram. This cut will determine the
number of clusters or topics that will be generated in the next step.
The visual inspection allows for determining how to cluster the data as
well as determining if a tf-idf, cosine, hierarchical cluster model is a
right fit for the data and task. By default plot
uses an approximation
of k
based on Can & Ozkarahan's (1990) formula (m * n)/t where
m and n are the dimensions of the matrix and t is the length of
the non-zero elements in matrix A.
- Can, F., Ozkarahan, E. A. (1990). Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems 15 (4): 483. doi:10.1145/99935.99938
Interestingly, in the plots below where k = 6
clusters, the model
groups each of the candidates together at each of the debate times.
plot(myfit)
##
## k approximated to: 4
plot(myfit, k=6)
plot(myfit, h = .75)
The assign_cluster
function allows the user to dictate the number of
clusters. Because the model has already been fit the cluster assignment
is merely selecting the branches from the dendrogram, and is thus very
quick. Unlike many clustering techniques the number of clusters is done
after the model is fit, this allows for speedy cluster assignment,
meaning the user can experiment with the number of clusters.
ca <- assign_cluster(myfit, k = 6)
ca
## CROWLEY_time 2 LEHRER_time 1 OBAMA_time 1 OBAMA_time 2
## 1 2 3 4
## OBAMA_time 3 QUESTION_time 2 ROMNEY_time 1 ROMNEY_time 2
## 5 6 3 4
## ROMNEY_time 3 SCHIEFFER_time 3
## 5 2
To check the number of documents loading on a cluster there is a
summary
method for assign_cluster
which provides a descending data
frame of clusters and counts. Additionally, a horizontal bar plot shows
the document loadings on each cluster.
summary(ca)
## cluster count
## 1 2 2
## 2 3 2
## 3 4 2
## 4 5 2
## 5 1 1
## 6 6 1
The user can grab the texts from the original documents grouped by
cluster using the get_text
function. Here I demo a 40 character
substring of the document texts.
get_text(ca) %>%
lapply(substring, 1, 40)
## $`1`
## [1] "Good evening from Hofstra University in "
##
## $`2`
## [1] "We'll talk about specifically about heal"
## [2] "Good evening from the campus of Lynn Uni"
##
## $`3`
## [1] "Jim, if I if I can just respond very qui"
## [2] "What I support is no change for current "
##
## $`4`
## [1] "Jeremy, first of all, your future is bri"
## [2] "Thank you, Jeremy. I appreciate your you"
##
## $`5`
## [1] "Well, my first job as commander in chief"
## [2] "Thank you, Bob. And thank you for agreei"
##
## $`6`
## [1] "Mister President, Governor Romney, as a "
As with many topic clustering techniques, it is useful to get the to
salient terms from the model. The get_terms
function uses the
min-max
scaled, tf-idf weighted,
DocumentTermMatrix
to extract the most frequent salient terms. These
terms can give a sense of the topic being discussed. Notice the absence
of clusters 1 & 6. This is a result of only a single document included
in each of the clusters. The term.cutoff
hyperparmeter sets the lower
bound on the min-max scaled tf-idf to accept. If you don't get any terms
you may want to lower this or reduce min.n
. Likewise, these two
parameters can be raised to eliminate noise.
get_terms(ca, .075)
## $`2`
## term n
## 1 each 2
## 2 gentlemen 2
## 3 go 2
## 4 leave 2
## 5 minutes 2
## 6 mister 2
## 7 night 2
## 8 romney 2
## 9 segment 2
## 10 segue 2
## 11 sir 2
## 12 statements 2
##
## $`3`
## term n
## 1 banks 2
## 2 board 2
## 3 care 2
## 4 federal 2
## 5 health 2
## 6 insurance 2
## 7 medicare 2
## 8 plan 2
## 9 republican 2
## 10 that's 2
## 11 they 2
##
## $`4`
## term n
## 1 coal 2
## 2 immigration 2
## 3 jobs 2
## 4 oil 2
## 5 production 2
## 6 sure 2
## 7 that's 2
## 8 women 2
##
## $`5`
## term n
## 1 home 2
## 2 iran 2
## 3 israel 2
## 4 military 2
## 5 nuclear 2
## 6 sanctions 2
## 7 stand 2
## 8 sure 2
## 9 they 2
## 10 threat 2
## 11 troops 2
Here I plot the clusters, terms, and documents (grouping variables) together as a combined heatmap. This can be useful for viewing & comparing what documents are clustering together in the context of the cluster's salient terms. This example also shows how to use the cluster terms as a lookup key to extract probable salient terms for a given document.
key <- data_frame(
cluster = 1:6,
labs = get_terms(ca, .085) %>%
bind_list("cluster") %>%
select(-n) %>%
group_by(cluster) %>%
summarize(term=paste(term, collapse=", ")) %>%
apply(1, paste, collapse=": ") %>%
c("1:", ., "6:")
)
ca %>%
bind_vector("id", "cluster") %>%
separate(id, c("person", "time"), sep="_") %>%
tbl_df() %>%
left_join(key) %>%
mutate(n = 1) %>%
mutate(labs = factor(labs, levels=rev(key[["labs"]]))) %>%
unite("time_person", time, person, sep="\n") %>%
select(-cluster) %>%
complete(time_person, labs) %>%
mutate(n = factor(ifelse(is.na(n), FALSE, TRUE))) %>%
ggplot(aes(time_person, labs, fill = n)) +
geom_tile() +
scale_fill_manual(values=c("grey90", "red"), guide=FALSE) +
labs(x=NULL, y=NULL)
## Joining by: "cluster"
The get_documents
function grabs the documents associated with a
particular cluster. This is most useful in cases where the number of
documents is small and they have been given names.
get_documents(ca)
## $`1`
## [1] "CROWLEY_time 2"
##
## $`2`
## [1] "LEHRER_time 1" "SCHIEFFER_time 3"
##
## $`3`
## [1] "OBAMA_time 1" "ROMNEY_time 1"
##
## $`4`
## [1] "OBAMA_time 2" "ROMNEY_time 2"
##
## $`5`
## [1] "OBAMA_time 3" "ROMNEY_time 3"
##
## $`6`
## [1] "QUESTION_time 2"
I like working in a chain. In the setup below we work within a
magrittr pipeline to fit a model, select clusters, and examine the
results. In this example I do not condense the 2012 Presidential Debates
data by speaker and time, rather leaving every sentence as a separate
document. On my machine the initial data_store
and model fit take ~5-8
seconds to run. Note that I do restrict the number of clusters (for
texts and terms) to a random 5 clusters for the sake of space.
.tic <- Sys.time()
myfit2 <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster()
difftime(Sys.time(), .tic)
## Time difference of 7.707485 secs
## View Document Loadings
ca2 <- assign_cluster(myfit2, k = 100)
summary(ca2) %>%
head(12)
## cluster count
## 1 7 692
## 2 3 368
## 3 33 133
## 4 5 106
## 5 59 67
## 6 8 57
## 7 61 51
## 8 53 48
## 9 13 47
## 10 27 41
## 11 38 40
## 12 12 37
## Split Text into Clusters
set.seed(3); inds <- sort(sample.int(100, 5))
get_text(ca2)[inds] %>%
lapply(head, 10)
## $`17`
## [1] "One last point I want to make."
## [2] "Now, the last point I'd make before|"
## [3] "They put a plan out."
## [4] "They put out a plan, a bipartisan plan."
## [5] "Let me make one last point."
## [6] "And Governor Romney's says he's got a five point plan?"
## [7] "Governor Romney doesn't have a five point plan."
## [8] "He has a one point plan."
## [9] "My five point plan does it."
## [10] "But the last point I want to make is this."
##
## $`32`
## [1] "I think this is a great example."
## [2] "I think something this big, this important has to be done on a bipartisan basis."
## [3] "Governor Romney said this has to be done on a bipartisan basis."
## [4] "This was a bipartisan idea."
## [5] "This is a this is an important election and I'm concerned about America."
## [6] "I I know this is bigger than an election about the two of us as individuals."
## [7] "It's an election about the course of America."
## [8] "Well, think about what the governor think about what the governor just said."
## [9] "This is not just a women's issue, this is a family issue, this is a middle class issue, and that's why we've got to fight for it."
## [10] "Mister President why don't you get in on this quickly, please?"
##
## $`38`
## [1] "I will make sure we don't hurt the functioning of our of our marketplace and our business, because I want to bring back housing and get good jobs."
## [2] "And hard pressed states right now can't all do that."
## [3] "And everything that I've tried to do, and everything that I'm now proposing for the next four years in terms of improving our education system or developing American energy or making sure that we're closing loopholes for companies that are shipping jobs overseas and focusing on small businesses and companies that are creating jobs here in the United States, or closing our deficit in a responsible, balanced way that allows us to invest in our future."
## [4] "But not just jobs, good paying jobs."
## [5] "I want to do that in industries, not just in Detroit, but all across the country and that means we change our tax code so we're giving incentives to companies that are investing here in the United States and creating jobs here."
## [6] "I expect those new energy sources to be built right here in the United States."
## [7] "This is about bringing good jobs back for the middle class of America, and that's what I'm going to do."
## [8] "And that's creating jobs."
## [9] "When you've got thousands of people right now in Iowa, right now in Colorado, who are working, creating wind power with good paying manufacturing jobs, and the Republican senator in that in Iowa is all for it, providing tax breaks to help this work and Governor Romney says I'm opposed."
## [10] "Candy, I don't have a policy of stopping wind jobs in Iowa and that they're not phantom jobs."
##
## $`58`
## [1] "And the question is this."
## [2] "Your question your question is one that's being asked by college kids all over this country."
## [3] "Mister President, the next question is going to be for you here."
## [4] "and the next question."
## [5] "And the next question is for you."
## [6] "Governor, this question is for you."
## [7] "And Mister President, the next question is for you, so stay standing."
## [8] "And it's Katherine Fenton, who has a question for you."
## [9] "Well, Katherine, that's a great question."
## [10] "But the president does get this question."
##
## $`80`
## [1] "Natural gas production is the highest it's been in decades."
## [2] "We have seen increases in coal production and coal employment."
## [3] "Look, I want to make sure we use our oil, our coal, our gas, our nuclear, our renewables."
## [4] "But what we don't need is to have the president keeping us from taking advantage of oil, coal and gas."
## [5] "This has not been Mister Oil, or Mister Gas, or Mister Coal."
## [6] "I was in coal country."
## [7] "The head of the EPA said, You can't build a coal plant."
## [8] "And natural gas isn't just appearing magically."
## [9] "And when I hear Governor Romney say he's a big coal guy, I mean, keep in mind, when Governor, when you were governor of Massachusetts, you stood in front of a coal plant and pointed at it and said, This plant kills, and took great pride in shutting it down."
## [10] "With respect to something like coal, we made the largest investment in clean coal technology, to make sure that even as we're producing more coal, we're producing it cleaner and smarter."
## Get Associated Terms
get_terms(ca2, term.cutoff = .07)[inds]
## $`17`
## term n
## 1 point 7
## 2 plan 6
## 3 president's 2
##
## $`32`
## term n
## 1 issue 4
## 2 please 4
## 3 election 3
## 4 think 3
## 5 bipartisan 2
## 6 mistake 2
##
## $`38`
## term n
## 1 investing 3
## 2 investments 3
## 3 companies 2
## 4 jobs 2
##
## $`58`
## term n
## 1 question 5
## 2 katherine 2
## 3 next 2
##
## $`80`
## term n
## 1 coal 4
## 2 natural 3
It seems to me that if the hierarchical clustering is function as expected we'd see topics clustering together within a conversation as the natural eb and flow of a conversation is to talk around a topic for a while and then move on to the next related topic. A Gantt style plot of topics across time seems like an excellent way to observe clustering across time. In the experiment I first ran the hierarchical clustering at the sentence level for all participants in the 2012 presidential debates data set. I then decided to use turn of talk as the unit of analysis. Finally, I pulled out the two candidates (President Obama and Romney) and faceted on their topic use over time.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, hclustext, textshape, ggplot2, stringi)
myfit3 <- presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
with(data_store(dialogue)) %>%
hierarchical_cluster()
plot(myfit3, 75)
Can & Ozkarahan's (1990) formula indicated a k = 259
. This number
seemed overly large. I used k = 75
for the number of topics as it
seemed unreasonable that there'd be more topics than this but with
k = 75
over half of the sentences loaded on one cluster. Note the use
of the attribute
join
from assign_cluster
to make joining back to
the original data set easier.
k <- 75
ca3 <- assign_cluster(myfit3, k = k)
presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
tbl_df() %>%
attributes(ca3)$join() %>%
group_by(time) %>%
mutate(
word_count = stringi::stri_count_words(dialogue),
start = starts(word_count),
end = ends(word_count)
) %>%
na.omit() %>%
mutate(cluster = factor(cluster, levels = k:1)) %>%
ggplot2::ggplot(ggplot2::aes(x = start-2, y = cluster, xend = end+2, yend = cluster)) +
ggplot2::geom_segment(ggplot2::aes(position="dodge"), color = 'white', size = 3) +
ggplot2::theme_bw() +
ggplot2::theme(panel.background = ggplot2::element_rect(fill = 'grey20'),
panel.grid.minor.x = ggplot2::element_blank(),
panel.grid.major.x = ggplot2::element_blank(),
panel.grid.minor.y = ggplot2::element_blank(),
panel.grid.major.y = ggplot2::element_line(color = 'grey35'),
strip.text.y = ggplot2::element_text(angle=0, hjust = 0),
strip.background = ggplot2::element_blank()) +
ggplot2::facet_wrap(~time, scales='free', ncol=1) +
ggplot2::labs(x="Duration (words)", y="Cluster")
## Joining by: "id_temporary"
Right away we notice that not all topics are used across all three
times. This is encouraging that the clustering is working as expected as
we'd expect some overlap in debate topics as well as some unique topics.
However, there were so many topics clustering on cluster 3 that I had to
make some decisions. I could (a) ignore this mass and essentially throw
out half the data that loaded on a single cluster, (b) increase k
to
split up the mass loading on cluster 3, (c) change the unit of analysis.
It seemed the first option was wasteful of data and could miss
information. The second approach could lead to a model that had so many
topics it wouldn't be meaningful. The last approach seemed reasonable,
inspecting the cluster text showed that many were capturing functions of
language rather than content. For example, people use "Oh." to
indicate agreement. This isn't a topic but the clustering would group
sentences that use this convention together. Combining this sentence
with other sentences in the turn of talk are more likely to get the
content we're after.
Next I used the textshape::combine
function to group turns of talk
together.
myfit4 <- presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
textshape::combine() %>%
with(data_store(dialogue, stopwords = tm::stopwords("english"), min.char = 3)) %>%
hierarchical_cluster()
plot(myfit4, k = 80)
The distribution of turns of talk looked much more dispersed across
clusters. I used k = 80
for the number of topics.
k <- 80
ca4 <- assign_cluster(myfit4, k = k)
presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
textshape::combine() %>%
tbl_df() %>%
attributes(ca4)$join() %>%
group_by(time) %>%
mutate(
word_count = stringi::stri_count_words(dialogue),
start = starts(word_count),
end = ends(word_count)
) %>%
na.omit() %>%
mutate(cluster = factor(cluster, levels = k:1)) %>%
ggplot2::ggplot(ggplot2::aes(x = start-2, y = cluster, xend = end+2, yend = cluster)) +
ggplot2::geom_segment(ggplot2::aes(position="dodge"), color = 'white', size = 3) +
ggplot2::theme_bw() +
ggplot2::theme(panel.background = ggplot2::element_rect(fill = 'grey20'),
panel.grid.minor.x = ggplot2::element_blank(),
panel.grid.major.x = ggplot2::element_blank(),
panel.grid.minor.y = ggplot2::element_blank(),
panel.grid.major.y = ggplot2::element_line(color = 'grey35'),
strip.text.y = ggplot2::element_text(angle=0, hjust = 0),
strip.background = ggplot2::element_blank()) +
ggplot2::facet_wrap(~time, scales='free', ncol=1) +
ggplot2::labs(x="Duration (words)", y="Cluster")
## Joining by: "id_temporary"
The plots looked less messy and indeed topics do appear to be clustering around one another. I wanted to see how the primary participants, the candidates, compared to each other in topic use.
In this last bit of analysis I filter out all participants except Obama and Romeny and facet by participant across time.
myfit5 <- presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
textshape::combine() %>%
filter(person %in% c("ROMNEY", "OBAMA")) %>%
with(data_store(dialogue, stopwords = tm::stopwords("english"), min.char = 3)) %>%
hierarchical_cluster()
plot(myfit5, 50)
Based on the dendrogram, I used k = 50
for the number of topics.
k <- 50
ca5 <- assign_cluster(myfit5, k = k)
presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
textshape::combine() %>%
filter(person %in% c("ROMNEY", "OBAMA")) %>%
tbl_df() %>%
attributes(ca5)$join() %>%
group_by(time) %>%
mutate(
word_count = stringi::stri_count_words(dialogue),
start = starts(word_count),
end = ends(word_count)
) %>%
na.omit() %>%
mutate(cluster = factor(cluster, levels = k:1)) %>%
ggplot2::ggplot(ggplot2::aes(x = start-10, y = cluster, xend = end+10, yend = cluster)) +
ggplot2::geom_segment(ggplot2::aes(position="dodge"), color = 'white', size = 3) +
ggplot2::theme_bw() +
ggplot2::theme(panel.background = ggplot2::element_rect(fill = 'grey20'),
panel.grid.minor.x = ggplot2::element_blank(),
panel.grid.major.x = ggplot2::element_blank(),
panel.grid.minor.y = ggplot2::element_blank(),
panel.grid.major.y = ggplot2::element_line(color = 'grey35'),
strip.text.y = ggplot2::element_text(angle=0, hjust = 0),
strip.background = ggplot2::element_blank()) +
ggplot2::facet_grid(person~time, scales='free', space='free') +
ggplot2::labs(x="Duration (words)", y="Cluster")
## Joining by: "id_temporary"
If you're curious about the heaviest weighted tf-idf terms in each
cluster the next code chunk provides the top five weighted terms used in
each cluster. If a cluster has ...
no terms met the minimum tf-idf
cut-off. Below this I provide a bar plot of the frequencies of clusters
to help put the other information into perspective.
invisible(Map(function(x, y){
if (is.null(x)) {
cat(sprintf("Cluster %s: ...\n", y))
} else {
m <- dplyr::top_n(x, 5, n)
o <- paste(paste0(m[[1]], " (", m[[2]], ")"), collapse="; ")
cat(sprintf("Cluster %s: %s\n", y, o))
}
}, get_terms(ca5, .02), names(get_terms(ca5, .02))))
## Cluster 1: medicare (5); back (2); topic (2)
## Cluster 2: can (3); get (3); just (3); candy (2); chance (2); going (2); indicated (2); major (2); mister (2); president (2); private (2); product (2); second (2); someone (2); spontaneous (2); still (2)
## Cluster 3: sorry (3)
## Cluster 4: absolutely (3)
## Cluster 5: dodd (3); frank (3); regulation (3); banks (2)
## Cluster 6: yes (3)
## Cluster 7: let (4); bob (2)
## Cluster 8: ...
## Cluster 9: first (3); one (3); way (3); become (2); came (2); cut (2); governor (2); israel (2); nation (2); number (2); office (2); sunday (2)
## Cluster 10: time (3); issue (2); used (2)
## Cluster 11: well (3)
## Cluster 12: respond (2)
## Cluster 13: ...
## Cluster 14: choice (2); economy (2); election (2); forward (2); whether (2)
## Cluster 15: companies (2); investing (2)
## Cluster 16: great (3)
## Cluster 17: done (3); leadership (3); get (2); role (2)
## Cluster 18: say (3); party (2)
## Cluster 19: energy (3)
## Cluster 20: yeah (3); good (2); thanks (2)
## Cluster 21: detroit (3); answer (2)
## Cluster 22: coal (3); oil (3); bunch (2); governor (2); iowa (2); jobs (2); wind (2)
## Cluster 23: cut (2); federal (2); land (2); licenses (2); permits (2); waters (2)
## Cluster 24: true (4)
## Cluster 25: cut (3); much (3)
## Cluster 26: question (5); answer (3)
## Cluster 27: right (2)
## Cluster 28: actually (3); got (2)
## Cluster 29: production (4); government (2); land (2)
## Cluster 30: governor (9); romney (2)
## Cluster 31: believe (2)
## Cluster 32: candy (5)
## Cluster 33: balance (2); budget (2); military (2); trillion (2)
## Cluster 34: women (3)
## Cluster 35: want (5); make (4); sure (4); immigration (2)
## Cluster 36: ...
## Cluster 37: lorraine (2)
## Cluster 38: pension (4); chinese (2); investments (2); looked (2); mister (2); outside (2); trust (2)
## Cluster 39: check (3); record (3)
## Cluster 40: pakistan (2)
## Cluster 41: act (3); attack (3); terror (3); day (2); garden (2); rose (2); said (2)
## Cluster 42: troops (4); agreement (2); forces (2); thought (2); thousand (2)
## Cluster 43: happy (3)
## Cluster 44: indicated (3)
## Cluster 45: syria (3)
## Cluster 46: ten (2); years (2)
## Cluster 47: iran (2)
## Cluster 48: industry (3); liquidate (3)
## Cluster 49: look (3); can (2); people (2)
## Cluster 50: wrong (4)
invisible(summary(ca5))
It appears that in fact the topics do cluster within segments of time as we'd expect. This is more apparent when turn of talk is used as the unit of analysis (document level) rather than each sentence.