jncraton / shakespeare

Shakespeare Text Analysis

This project provides an opportunity to apply lists, dictionaries, and tuples to explore word use in the works of Shakespeare.

Tasks

You will create a program that can process an input text file and output the following:

  1. The total number of words in the text
  2. The number of unique words in the text

When counting unique words, strip any punctuation from the left and right sides of each word. The characters to strip are: ,.?!'":-&; (punctuation inside a word, such as the apostrophe in "don't", is left in place).

In addition to stripping punctuation, words should also be counted and compared in lower case so that "The" is considered to be the same word as "the".
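The stripping and lower-casing rules above can be sketched in Python as follows. This is only one possible approach, not a reference solution: the names `PUNCTUATION`, `normalize`, and `count_words` are made up for illustration, and the choice to drop tokens that consist entirely of punctuation is an assumption the task description does not spell out.

```python
# Characters to strip from either end of a word, as listed in the task.
PUNCTUATION = ",.?!'\":-&;"


def normalize(word):
    """Strip the listed punctuation from both ends and lower-case the word."""
    return word.strip(PUNCTUATION).lower()


def count_words(text):
    """Return a (total_words, unique_words) tuple for a string of text."""
    words = [normalize(w) for w in text.split()]
    # Assumption: a token that was nothing but punctuation (e.g. "--")
    # is not counted as a word.
    words = [w for w in words if w]
    return len(words), len(set(words))
```

Because `str.strip` only removes characters from the ends of a string, a word such as "don't" keeps its apostrophe.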

  3. The most common word in the text
  4. The top 5 words in the text excluding stop words

You are provided with a stopwords.txt file. This file contains a list of common words (a, the, an, etc.) with one word per line. For the last task, you will need to exclude the words found in this file from your word count.
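A minimal sketch of the stop-word step, assuming the words have already been stripped and lower-cased. The function names `load_stop_words` and `top_words` are hypothetical, and the handling of ties (words seen first win) is an assumption, since the task does not say how ties should be broken.

```python
def load_stop_words(path="stopwords.txt"):
    """Read one stop word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def top_words(words, stop_words, n=5):
    """Return the n most frequent (word, count) tuples, ignoring stop words."""
    counts = {}
    for word in words:
        if word not in stop_words:
            counts[word] = counts.get(word, 0) + 1
    # Sort the (word, count) tuples by count, largest first. The sort is
    # stable, so tied words stay in the order they were first seen.
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:n]
```

Calling `top_words(words, set(), n=1)` with an empty stop-word collection gives the most common word overall, so the same helper can cover both of the last two tasks.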

When run, your program should produce output something like the following:

There are 202646 words in the text
There are 13081 unique words in the text
The most common word is 'the' which occurs 6283 times
The top 5 words (excluding stop words) are:
thou (1403 uses)
thy (1059 uses)
king (887 uses)
shall (845 uses)
thee (760 uses)
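One way to produce output in the shape shown above is to build the lines from already-computed values. This is a sketch only: `format_report` and its parameters are illustrative names, with `most_common` assumed to be a `(word, count)` tuple and `top` a list of such tuples.

```python
def format_report(total, unique, most_common, top):
    """Build the report text from precomputed counts."""
    word, count = most_common
    lines = [
        f"There are {total} words in the text",
        f"There are {unique} unique words in the text",
        f"The most common word is '{word}' which occurs {count} times",
        f"The top {len(top)} words (excluding stop words) are:",
    ]
    for top_word, uses in top:
        lines.append(f"{top_word} ({uses} uses)")
    return "\n".join(lines)
```

Returning a string rather than printing directly keeps the formatting easy to check separately from the counting logic; the caller can simply `print` the result.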

Data Sources

Resources
