title: Text Mining for Social Sciences author: Nandan Rao date: April, 2019 ...
- Information Retrieval
- NLP
- Preprocessing Preprocessing Preprocessing
When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need, but not a more general one. \hfill (Vladimir Vapnik)
Statistical Modelling: The Two Cultures
https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
- Information retrieval $\approx$ search.
- One of the basic, early problems of internet engineering and information organization.
- Many of the tools we use in NLP were created for this problem.
- You have a corpus of documents (for example: the internet). You have a user who wants a few of these documents. How do you design this system?
Let's say you are inventing search. Imagine someone searching for the term "People who see ghosts". How could you pick between the following?
- This is a document about people who see ghosts. Those people end up on TV shows.
- This is a document about seeing goats. Those people work on farms.
Let's try again with the term: "People who see ghosts"
"I don't believe people who see ghosts", said Mannie, before spitting into the wind and riding his bike down the street at top speed. He then went home and ate peanut-butter and jelly sandwiches all day. Mannie really liked peanut-butter and jelly sandwiches. He ate them so much that his poor mother had to purchase a new jar of peanut butter every afternoon.
We have collected a report of every resident in our community that has seen a ghost. Each resident was asked "how many ghosts have you seen?", "describe the last ghost you saw", and "tell us about your mother." Afterwards, we compared the ghost reports between the different individuals, and assessed whether or not they had actually seen these apparitions.
- Frequency matters!
- Let's try and count the frequency of each word
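As a minimal sketch of counting word frequencies, the snippet below uses only the Python standard library. The helper name `word_counts` and the shortened document strings are illustrative assumptions, not part of the original slides.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Shortened versions of the two example documents
doc_a = (
    '"I don\'t believe people who see ghosts", said Mannie, before spitting '
    "into the wind and riding his bike down the street at top speed."
)
doc_b = (
    "We have collected a report of every resident in our community that has "
    'seen a ghost. Each resident was asked "how many ghosts have you seen?"'
)

query = ["people", "who", "see", "ghosts"]
for name, doc in [("A", doc_a), ("B", doc_b)]:
    counts = word_counts(doc)
    # Count how often each query term appears verbatim in the document
    print(name, {term: counts[term] for term in query})
```

With exact matching, the second document scores zero for "see" even though it contains "seen" twice, which is one motivation for the preprocessing steps that follow.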
\textbf{Stop words} "seen a ghost"
\textbf{Stemming} "seen a ghost"
\textbf{Lemmatization} "saw ghosts"
\textbf{Tokenization} "see ghost"
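A rough sketch of these preprocessing steps, using only the standard library. The stop-word list, the `LEMMAS` lookup table, and `toy_stem` are tiny hypothetical stand-ins for what a real NLP library would provide.

```python
import re

# Tiny illustrative stop-word list (real lists contain hundreds of words)
STOP_WORDS = {"a", "an", "the", "who", "of", "and", "to", "in"}

# Hypothetical lookup table standing in for a real lemmatizer
LEMMAS = {"seen": "see", "saw": "see", "seeing": "see", "ghosts": "ghost"}

def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

def lemmatize(tokens):
    """Map each token to its dictionary form when the table knows it."""
    return [LEMMAS.get(t, t) for t in tokens]

def toy_stem(token):
    """Strip one common suffix; real stemmers (e.g. Porter) use many more rules."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then lemmatize."""
    return lemmatize(remove_stop_words(tokenize(text)))

for phrase in ["People who see ghosts", "seen a ghost", "saw ghosts"]:
    print(phrase, "->", preprocess(phrase))
```

Note that the suffix-stripping stemmer handles "ghosts" and "seeing" but leaves irregular forms such as "saw" and "seen" untouched, which is where a lemmatizer helps.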
We might need some concept of synonyms.
- ghost, apparitions, spook $\rightarrow$ ghost
- people, individuals, residents, folk $\rightarrow$ people
Are these actually synonyms?
Now let's try our tools on the following text:
People see incredible things. One time I saw some people talking about things they had seen, and those people were so much fun. They saw clouds and they saw airplanes. Can you believe the amount of seeing done by these people? People are the best.
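Let $df_v$ be the number of documents that contain the term $v$.
The inverse document frequency is
$$ \mathrm{idf}_v = \log \left( \frac{D}{df_v} \right) $$
where $D$ is the total number of documents in the corpus.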
Properties:
- Higher weight for words in fewer documents.
- Log dampens effect of weighting.
For words which are more common, we lower their weights.
(example)
Words which appear in \textit{many} of the documents are not going to help us pick \textit{one} document.
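A small sketch of inverse document frequency weighting. It assumes the common definition $\log(D / df_v)$, where $D$ is the number of documents and $df_v$ the number of documents containing term $v$; the function names and the three-document corpus are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def inverse_document_frequency(docs):
    """Return {term: log(D / df)}, where df counts the documents containing the term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # each document counts once per term
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(doc, idf):
    """Weight each term count in `doc` by the term's inverse document frequency."""
    counts = Counter(tokenize(doc))
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

corpus = [
    "People see ghosts",
    "People see goats on farms",
    "People are the best",
]
idf = inverse_document_frequency(corpus)
weights = tf_idf(corpus[0], idf)
print(weights)
```

Here "people" appears in every document and receives weight zero, while "ghosts" appears in only one and receives the highest weight.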
What is Natural Language Processing?
- https://en.wikipedia.org/wiki/Natural-language_processing#History
- https://www.cl.cam.ac.uk/archive/ksj21/histdw4.pdf
Two large challenges of Natural Language Processing:
- Put language into a metric space.
- Deal with the complex correlations between words in a sentence, and sentences in a document.
A metric space consists of a set (we'll call them documents in this context) and a distance metric between items in the set.
- What are some possible measures of "distance" between two documents?
(word count example)
Now that we have our data into a numeric form, how can we determine a distance?
What about the distance between these two documents?
- We have collected a report of every resident in our community that has seen a ghost. Each resident was asked "how many ghosts have you seen?", "describe the last ghost you saw", and "tell us about your mother." Afterwards, we compared the ghost reports between the different individuals, and assessed whether or not they had actually seen these apparitions.
- We ask each resident how many ghosts they've seen.
We might want a distance that ignores the "size" of the document.
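One option is to normalize our vectors to unit length. This has the advantage of keeping the "direction" while removing the size element. Once we normalize our vectors, the squared Euclidean distance becomes proportional to:
$$ 1 - \cos(\theta) $$
where $\cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|}$ is the cosine of the angle between the document vectors $x$ and $y$.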
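The effect of normalization can be checked numerically. This sketch uses hypothetical word-count vectors and verifies the identity $\|u - v\|^2 = 2(1 - \cos\theta)$ for unit-length $u$ and $v$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def normalize(u):
    """Scale a vector to unit length, keeping its direction."""
    length = norm(u)
    return [a / length for a in u]

def cosine_similarity(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

long_doc = [4, 2, 6, 0]   # hypothetical word counts
short_doc = [2, 1, 3, 0]  # same proportions, half the length

raw = squared_euclidean(long_doc, short_doc)
unit = squared_euclidean(normalize(long_doc), normalize(short_doc))
cos = cosine_similarity(long_doc, short_doc)
print(raw, unit, 2 * (1 - cos))
```

The raw distance between the two documents is large purely because one is longer; after normalization the distance is (numerically) zero, matching a cosine similarity of one.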
What about the similarity of these two documents:
People who see ghosts are full of crap. I don't believe a word they say. They didn't actually see any ghosts. No way! They are just seeing things.
We talked to lots of people who have seen ghosts. Each person was asked "how many ghosts have you seen?" They had a lot of interesting and disturbing stories about the ghosts in their lives.
With the previous example, stemming/lemmatization + NGrams + TF-IDF would yield a feature:
"see ghost"
which would most likely be very highly weighted (depending on the corpus). This would help these two documents to be very similar, even though they are not in the simple BOW space.
Similarly, in a 2-gram space these two documents separate:
Sometimes, at my job, I use text mining.
Sometimes, at my mining job, I text.
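A minimal sketch of how 2-grams could be extracted from the two sentences (without the leading "Sometimes"). The `ngrams` helper is an illustrative assumption; punctuation is simply dropped before sliding the window.

```python
import re

def ngrams(text, n=2):
    """Return the list of n-grams (as space-joined strings) in reading order."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = ngrams("at my job, I use text mining")
b = ngrams("at my mining job, I text")
shared = set(a) & set(b)

print(a)
print(b)
print(shared)
```

As single words, every word of the second sentence also appears in the first. As 2-grams, the two sentences share only "at my" and "job I".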
"at my job, I use text mining"
["at my", "my job", "job I",
"I use", "use text", "text mining"]
"at my mining job, I text"
["at my", "my mining", "mining job",
"job I", "I text"]
What about a continuous metric space?
Co-occurrences can be used as a proxy for semantic similarity.
(embedding example)
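One rough way to illustrate the idea is to represent each word by the counts of the words that appear near it, then compare those count vectors. The sentences, the window size, and the function names below are all illustrative assumptions; real embeddings are trained on far larger corpora.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    numerator = sum(u[k] * v[k] for k in set(u) & set(v))
    denominator = (math.sqrt(sum(x * x for x in u.values()))
                   * math.sqrt(sum(x * x for x in v.values())))
    return numerator / denominator if denominator else 0.0

sentences = [
    "residents saw a ghost last night",
    "residents saw a spook last night",
    "farmers feed a goat every morning",
]
vectors = cooccurrence_vectors(sentences)
print(cosine(vectors["ghost"], vectors["spook"]))
print(cosine(vectors["ghost"], vectors["goat"]))
```

"ghost" and "spook" never appear together, yet they end up close because they share the same neighbours; "ghost" and "goat" share only the word "a".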
Other ways to think about semantic structure of a sentence?
Grammar???
- In an attempt to create conversations, computer scientists brought in linguists.
- There was a need to understand the semantic content of sentences.
How can we differentiate between these documents?
- France: Migrant stabbed to death in Calais
- Afghan asylum seeker stabbed to death in London park
- Clashes in Istanbul after angry mourners of a Turkish man is stabbed to death by an Afghani refugee
- German woman stabbed to death by Syrian refugee on her doorstep
- In memory to Bangladeshi migrant #Manan stabbed to death 6y ago during pogrom orchestrated by Nona's
- great people? the people that kicked over jugs of water to let migrants die in the desert? those are not great people.
https://github.com/nandanrao/text-mining/blob/master/dependency-tree-example.ipynb
title: Classifying Text author: Nandan Rao date: April, 2019 ...
- Empirical risk minimization and $p(y,x)$
- $p(x)$ when $X$ is language
- Generative vs discriminative classifiers
- Hyperparameters in BOW framework
- Towards supervising the embedded space
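Consider an input space $X$, an output space $Y$, and a classifier $g: X \rightarrow Y$ evaluated with a loss function $\ell$.
The risk of the classifier $g$ is its expected loss: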
$$ R(g) = \mathbb{E}_{X,Y} [ \ell(g(x), y) ] $$
$$ R(g) = \sum_{y \in Y} \int_X \ell(g(x), y) \ p(x,y) \ dx $$
We estimate the risk with its finite-sample approximation, the empirical risk:
$$ R_n(g) = \frac{1}{n} \sum_{i=1}^{n} \ell(g(x_i), y_i) $$
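Note that this is only an approximation of $R(g)$ if the sample $(x_1, y_1), \dots, (x_n, y_n)$ is drawn from the joint distribution $p(x,y)$.
This implies that the data we train on and the data we predict on must share the same $p(x,y)$.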
Ways to deal with changes in any of the component parts of the joint distribution are covered in the literature of domain adaptation.
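The empirical risk discussed above is simply an average loss over a labeled sample. A minimal sketch with the 0-1 loss follows; the rule-based classifier `mentions_ghost` and the four labeled documents are hypothetical.

```python
def zero_one_loss(prediction, label):
    """Return 0 for a correct prediction and 1 for a mistake."""
    return 0 if prediction == label else 1

def empirical_risk(classifier, samples, loss=zero_one_loss):
    """Average loss of `classifier` over a list of (x, y) pairs."""
    return sum(loss(classifier(x), y) for x, y in samples) / len(samples)

def mentions_ghost(document):
    """Hypothetical classifier: predict 1 if the text contains 'ghost'."""
    return 1 if "ghost" in document.lower() else 0

samples = [
    ("People who see ghosts", 1),
    ("A report on ghost sightings", 1),
    ("Goats work on farms", 0),
    ("Apparitions in the old house", 1),  # missed: no literal 'ghost'
]
risk = empirical_risk(mentions_ghost, samples)
print(risk)
```

If the documents seen at prediction time come from a different distribution than `samples`, this number says little about the true risk.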
What is
Again, we will consider that each
Is the document space continuous?
Hopefully you see why for most tasks related to NLP, we want the space to be continuous.
Thus, in order to classify, we need to embed the documents into a continuous space, with a continuous distance metric.
Generative classifiers seek to model the joint probability, $p(x, y)$.
With a model built of the joint probability, the classification simply consists of applying Bayes' rule to the joint probability and picking the class $y$ with the highest posterior probability $p(y|x)$.
Discriminative classifiers either:
- Directly model the posterior, $p(Y|X)$
- Learn a direct mapping $g: X \rightarrow Y$
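Logistic regression models the posterior probability as:
$$ p(y = 1 | x; \beta) = \sigma(\beta^\top x) $$
where the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$ maps any real number into $(0, 1)$.
This can be fit via maximum likelihood ($\ell := -\log p(y|x; \beta)$) or by any other convex surrogate of the 0-1 error.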
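A bare-bones sketch of fitting logistic regression by gradient descent on the average negative log-likelihood. It assumes binary labels in $\{0, 1\}$ and a posterior of the form $\sigma(\beta^\top x)$; the feature layout, learning rate, and step count are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(beta, x):
    """Modelled probability that y = 1 given features x."""
    return sigmoid(sum(b * xi for b, xi in zip(beta, x)))

def negative_log_likelihood(beta, data):
    """Average of -log p(y | x; beta) over the data."""
    total = 0.0
    for x, y in data:
        p = predict_proba(beta, x)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(data)

def fit(data, n_features, learning_rate=0.5, steps=500):
    """Plain gradient descent; the gradient of the loss is mean((p - y) * x)."""
    beta = [0.0] * n_features
    for _ in range(steps):
        gradient = [0.0] * n_features
        for x, y in data:
            error = predict_proba(beta, x) - y
            for j, xj in enumerate(x):
                gradient[j] += error * xj / len(data)
        beta = [b - learning_rate * g for b, g in zip(beta, gradient)]
    return beta

# Hypothetical features: [bias, count("ghost"), count("goat")]; label 1 = about ghosts
data = [([1, 2, 0], 1), ([1, 1, 0], 1), ([1, 0, 2], 0), ([1, 0, 1], 0)]
beta = fit(data, n_features=3)
print(beta, negative_log_likelihood(beta, data))
```

Because this toy data is perfectly separable and no regularization is used, the coefficients keep growing with more steps; in practice a regularization term is usually added.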
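As mentioned, Naive Bayes predicts based on the posterior calculated from the modelled joint distribution:
$$ p(y|x) = \frac{p(x,y)}{p(x)} \propto p(x|y) \ p(y) $$
It's called "naive" because we make the (extreme) simplifying assumption that all the features $x_j$ are independent given the class: $p(x|y) = \prod_j p(x_j|y)$.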
What does this independence assumption amount to saying about language?
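To make the independence assumption concrete, here is a small from-scratch multinomial Naive Bayes. It treats every token as drawn independently given the class, so word order plays no role. The add-one smoothing (`alpha`) and the toy training set are assumptions added for illustration, not something the slides specify.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, alpha=1.0):
    """Estimate log p(class) and smoothed log p(word | class) from tokenized docs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    n_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

training = [
    (["people", "see", "ghost"], "ghosts"),
    (["ghost", "report", "resident"], "ghosts"),
    (["people", "see", "goat"], "farms"),
    (["goat", "farm", "work"], "farms"),
]
log_prior, log_likelihood = train_naive_bayes(training)
print(predict(["resident", "see", "ghost"], log_prior, log_likelihood))
```

Shuffling the tokens of any document leaves its prediction unchanged, which is exactly what the independence assumption says about language.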
Let's see how a multinomial naive Bayes can be related to logistic regression:
Where
Thus the binary multinomial naive Bayes is a linear classifier and, in particular, one in which the coefficients are equal to the log difference in probabilities that a particular feature shows up in each class.
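The linearity claim above can be checked with a short derivation. This is a sketch under assumed notation, none of which is taken from the original slides: $x_j$ is the count of word $j$ in the document, $\theta_{jc} = p(\text{word } j | y = c)$, and $\pi_c = p(y = c)$. The multinomial coefficient is the same for both classes and cancels in the ratio:

$$ \log \frac{p(y=1|x)}{p(y=0|x)} = \log \frac{\pi_1 \prod_j \theta_{j1}^{x_j}}{\pi_0 \prod_j \theta_{j0}^{x_j}} = \log \frac{\pi_1}{\pi_0} + \sum_j x_j \left( \log \theta_{j1} - \log \theta_{j0} \right) $$

Setting $\beta_0 = \log \frac{\pi_1}{\pi_0}$ and $\beta_j = \log \theta_{j1} - \log \theta_{j0}$ gives $p(y=1|x) = \sigma(\beta_0 + \sum_j \beta_j x_j)$, the same functional form as logistic regression, but with coefficients fixed by the estimated class-conditional word probabilities rather than chosen to minimize empirical risk.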
Logistic regression searches the space of linear classifiers in the feature space, finding the linear classifier with the lowest empirical risk in the training set.
Naive Bayes is a specific linear classifier that makes distributional assumptions over the data generating process
It should be clear that, asymptotically, logistic regression will converge to the optimal linear classifier. It's not clear that Naive Bayes will do the same, if the distributional assumptions do not hold. Thus: $$ R(g_{LR, \infty}) \leq R(g_{NB, \infty}) $$
However, making more assumptions allows Naive Bayes to reach its asymptotic risk more quickly, as shown by Andrew Ng and Michael Jordan:
https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
Let's try this on our data.
A hyperparameter is any parameter that the optimization process of our learning algorithm does not optimize over.
Thus in logistic regression, the regularization term, $\lambda$, is a hyperparameter.
Our training algorithm will pick the parameters that minimize the in-sample empirical risk.
We pick hyperparameters to minimize the out-of-sample empirical risk.
This usually is done either by changing the class of models searched by the algorithm or by changing the loss function to reduce the generalization error.
In our simple language models, however, it should be clear that we have hyperparameters that we want to tune that seem to have nothing to do with our model.
For example, all the parameters in the vectorizing step.
These parameters guide the way that our documents are embedded in the metric space.
We want this embedding to be the best that it can be, such that it minimizes our expected risk.
How can we pick good hyperparameters?
The simplest way to tune hyperparameters is to make no assumptions about them and just search the entire space, hoping to find ones that are relatively good.
This is, interestingly enough, very common!
It's called grid search.
It should be clear, however, that this is not ideal.
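A minimal sketch of grid search over vectorizer and model hyperparameters. The `validation_accuracy` function is a made-up stand-in for "fit on the training set, score on a held-out set", and the parameter names are hypothetical.

```python
from itertools import product

def grid_search(param_grid, score):
    """Evaluate `score` on every combination in `param_grid`; return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

def validation_accuracy(ngram_size, remove_stop_words, regularization):
    """Hypothetical stand-in: a real version would train and evaluate a model."""
    penalty = abs(ngram_size - 2) * 0.05 + abs(regularization - 1.0) * 0.02
    return 0.8 + (0.03 if remove_stop_words else 0.0) - penalty

param_grid = {
    "ngram_size": [1, 2, 3],
    "remove_stop_words": [True, False],
    "regularization": [0.1, 1.0, 10.0],
}
best_params, best_score = grid_search(param_grid, validation_accuracy)
print(best_params, best_score)
```

Even this small grid requires 3 x 2 x 3 = 18 full train-and-evaluate runs, and the count multiplies with every hyperparameter added.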
Is it possible to jointly optimize the way that we embed language into a metric space AND the classification function at the same time?
title: Text Mining for Social Sciences author: Nandan Rao date: April, 2019 ...
This is hot stuff.
Some examples:
\includegraphics[width=\textwidth]{assets/missingmigrants}
- The Missing Migrants project is essentially one of dataset creation.
- These statistics are not available from any single state.
- News agencies report the events. In plain text!
\includegraphics[width=\textwidth]{assets/newsfilter}
What is the effect of the explosion of freelancing websites on the labor market? Demand side:
- "I need an experienced Business Strategist who can write content explaining all the important moving parts and pieces of building a business plan and/or business model. You'll be explaining to first time entrepreneurs and small business owners and diving into the importance"
- "We are Ricardo Steak House Restaurant located in Harlem, New York. We are looking for an expert opinion and training on how to manage our accounting department"
- "We are an 8Mil per year trucking company based out of NJ. Due to negative loss-runs, we lost ideal market coverage for insurance and forced to use Progressive Commercial. We need someone with both an accounting background and deep knowledge of commercial insurance..."
How can we measure the growing share of artificial intelligence in worldwide innovation?
Corpora: Github, Patents.
One-class classification
Keep classification fixed, improve embedding.
How can we use online labor markets as a realtime view on shifts in jobs and skills demanded?
How do the returns to tasks change over time?
What are tasks? Created manually? Extract from text?