geraldiner / textual-analysis


Textual Analysis

This is a continuation of my last venture into learning more Python to figure out how to handle my Magical Movies project.

tl;dr: I want to take the list of magical movies suggested by Redditors, watch them, and see how 'magical' they actually turn out to be.

I followed this tutorial about textual analysis with the Natural Language Toolkit.

The tutorial's ultimate goal was to count the number of times each character appears in a text. The author used Harry Potter and the Sorcerer's Stone; I decided to use Heart of Darkness.

To do this, we had to:

  1. Read the text - taken from Project Gutenberg

  2. 'Tokenize' the text - break the text into individual words and punctuation (similar to a split function, except punctuation marks become tokens of their own)

  3. 'Tag' the text - associate each of the tokenized words with a tag (part of speech, punctuation, etc.) for the NLTK algorithm

  4. Find the proper nouns based on the tags in the previous step.

The tutorial made a good point that some proper nouns actually come as two words (like 'Mr. Kurtz'), so we had to add a check on whether the next token was also a proper noun.

  5. Finally, count the number of times each proper noun appears in the list.

  6. I added the extra step of writing the results to a text file that was easier to read.

  ’: 135
  kurtz: 50
  project gutenberg-tm: 48
  mr. kurtz: 47
  ‘: 34
  project gutenberg: 27
  foundation: 23
  ’ ‘: 21
  ’ “: 16
  united: 15
  kurtz ’: 14
  literary archive: 13
  license: 9
  well: 9
  don ’: 8
  europe: 8
  did: 8
  * *: 7
  my: 7
  ah: 7
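The pipeline above can be sketched roughly as follows. The merging of two consecutive NNP tags and the lowercased counting are my guesses at what the tutorial's code does, and `extract_proper_nouns` is a name I made up. The hand-tagged sample below stands in for the full Gutenberg text so the sketch runs without downloading NLTK's tagger model; on the real text the tuples would come from `nltk.pos_tag(nltk.word_tokenize(text))`.

```python
from collections import Counter

def extract_proper_nouns(tagged):
    """Collect NNP-tagged tokens, merging two consecutive NNPs
    (e.g. 'Mr. Kurtz') into a single lowercased name."""
    names = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "NNP":
            if i + 1 < len(tagged) and tagged[i + 1][1] == "NNP":
                # two proper nouns in a row -> treat as one name
                names.append(f"{word} {tagged[i + 1][0]}".lower())
                i += 2
                continue
            names.append(word.lower())
        i += 1
    return Counter(names)

# Hand-tagged sample; for the real text this would come from
# nltk.pos_tag(nltk.word_tokenize(text)) on the Project Gutenberg file.
tagged = [
    ("Mr.", "NNP"), ("Kurtz", "NNP"), ("was", "VBD"), ("a", "DT"),
    ("remarkable", "JJ"), ("man", "NN"), (".", "."),
    ("Marlow", "NNP"), ("spoke", "VBD"), ("of", "IN"), ("Kurtz", "NNP"), (".", "."),
]
counts = extract_proper_nouns(tagged)
print(counts.most_common())  # [('mr. kurtz', 1), ('marlow', 1), ('kurtz', 1)]
```

Writing the results file is then just a loop over `counts.most_common()`, writing one `name: count` line at a time.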

The tutorial's author gives their reasoning for not excluding punctuation at the start: it changes how the NLTK tagger reads a list of names. Quoting the tutorial:

An important note here: people sometimes remove punctuation when doing a textual analysis, but here’s why I didn’t. Consider this commonly occurring phrase in all of the Potter books: “Harry, Ron and Hermione”. The comma here is pretty important used in conjunction with NLP. If I leave the sentence as-is, NLP reads it as NNP — , — NNP — CC — NNP , where CC is a conjunction. But, if I take out the comma, I get NNP — NNP — CC — NNP . With this input, my algorithm above would read ‘Harry Ron’ as a single noun, like ‘Uncle Vernon’.

In other words, without the comma, 'Harry, Ron...' becomes 'Harry Ron...', which my code would pick up as a single proper noun (like someone with two first names).
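To make the quoted point concrete, here is a small sketch of what the two-word merge does to each tag sequence. The hand-written tuples stand in for `nltk.pos_tag` output, and `merge_consecutive_nnp` is a hypothetical helper of mine, not the tutorial's code; it merges runs of NNPs of any length rather than exactly two.

```python
def merge_consecutive_nnp(tagged):
    """Merge each run of consecutive NNP tags into one multi-word name."""
    names, run = [], []
    for word, tag in tagged:
        if tag == "NNP":
            run.append(word)
        else:
            if run:
                names.append(" ".join(run))
            run = []
    if run:  # flush a run that ends the sentence
        names.append(" ".join(run))
    return names

# Hand-tagged stand-ins for nltk.pos_tag(...) on the two phrasings.
with_comma = [("Harry", "NNP"), (",", ","), ("Ron", "NNP"),
              ("and", "CC"), ("Hermione", "NNP")]
without_comma = [("Harry", "NNP"), ("Ron", "NNP"),
                 ("and", "CC"), ("Hermione", "NNP")]

print(merge_consecutive_nnp(with_comma))     # ['Harry', 'Ron', 'Hermione']
print(merge_consecutive_nnp(without_comma))  # ['Harry Ron', 'Hermione']
```

The comma token breaks the NNP run, so the three names stay separate; removing it fuses 'Harry Ron' into one name, exactly as the tutorial warns.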

I'm wondering now if I could strip punctuation afterwards, from the finished list of proper nouns instead. As it stands, apostrophes and quotation marks rank as some of the most prominent 'characters' in the text; but if you've read Heart of Darkness, you'd know the most prominent character is probably Kurtz, who does come in second.
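Filtering punctuation out of the finished counts might look something like this. It's only a sketch: `is_punct` is my own hypothetical helper, and the curly-quote set is an assumption based on the characters NLTK's tokenizer produced from the Gutenberg text above.

```python
import string

def is_punct(name):
    """True if a counted 'name' is made up only of punctuation and spaces."""
    # string.punctuation covers ASCII; add the curly quotes that showed up
    # in the results above (an assumption about this particular text).
    punct = set(string.punctuation) | {"’", "‘", "“", "”"}
    return all(ch in punct or ch.isspace() for ch in name)

# A few entries from the results file above.
raw = {"’": 135, "kurtz": 50, "mr. kurtz": 47, "‘": 34, "’ ‘": 21}
cleaned = {name: n for name, n in raw.items() if not is_punct(name)}
print(cleaned)  # {'kurtz': 50, 'mr. kurtz': 47}
```

With the punctuation-only entries dropped, Kurtz moves to the top of the list, which matches the intuition about the book.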

How is this going to help me?

Right now I have a bunch of comments that may or may not have a movie title in them. I thought the IDs I got from Pushshift were only for top-level comments, but after looking over some, it seems like a few might have been replies to other comments (e.g. 'I really have to see that one!').

I could try to find proper nouns here too, but movie titles can have more than two nouns in them, and I'm not sure how to handle that yet.

For now, I'll keep reading the NLTK Book and looking for other examples of textual analysis.

Other Projects

Check out other stuff I've worked on:

Minute To Win It Games API & Wiki: https://github.com/geraldiner/min-to-win

Rehabitter: https://github.com/geraldiner/rehabitter

Snapchat Clone: https://github.com/geraldiner/snapchat-clone

K.K. Radio: https://github.com/geraldiner/kk-radio

Let's Connect!

Let's talk about self-taught programming, experience design, (computer science) education, and/or Animal Crossing:

Twitter: @GeraldineDesu

LinkedIn: in/GeraldineR

Email: hello [at] geraldiner [dot] com

Currently working full-time at Nom Nom, but always open to any cool, interesting projects!
