This repository uses the code and data set from https://norvig.com/spell-correct.html (with minor modifications).
This set of exercises introduces students to the detailed analysis of a small computer program. The goal is to teach students how to follow the flow of the program in order to understand what happens at each step of its execution. This, in turn, enables students to reason about potential weaknesses or flaws in the program.
Begin by reading the introductory section and the “How It Works: Some Probability Theory” section at https://norvig.com/spell-correct.html. Do not be troubled if you find some parts of the text difficult to understand. While it would be commendable to study external resources in order to fully understand the text, it is not required. Rather, be careful not to get stuck!
If you do not have Python on your machine, install it. Clone this repository onto your machine and open it with your favorite IDE (if you do not have one, consider PyCharm).
Answer the questions below (they are meant to be answered in the order in which they are asked):
1. Think about what a spellchecking program does in general and briefly explain it in your own words. In addition:
    - Briefly explain how you think a spellchecker should behave if it sees a perfectly correct word.
    - What should a spellchecker ideally do when it sees an incorrect word?
    - What does it mean for a word to be correct or incorrect?
2. Open the `spellcheck.py` file. Two modules are imported for use in the spelling program: `re` and `collections` (the `Counter` class). The first is a collection of classes and functions supporting the use of regular expressions. The second offers several advanced data containers, such as `Counter`. Find the documentation for the `collections` module and understand what the `Counter` class does. Explain what the expected input to the class constructor `Counter(input)` is and what the expected output is.
3. Look at the `words` function and understand how it works. Suppose we input the following short text into the function (show the exact output the function will return): All opinions are subject to modification and technical correction prior to official publication in the Connecticut Reports and Connecticut Appellate Reports.
4. With an understanding of the `Counter` class and the `words` function, explain what happens when the `WORDS` variable is initialized.
5. Looking at the `WORDS` variable once again, what do you think its role is in the context of the whole program?
6. How does the `known` function work? What does it expect as input and what does it provide as output?
7. The `edits1` function is probably the most complicated piece of the code. The assignment of the `letters` variable is straightforward. Understanding the rest will likely require a bit of effort:
    1. Understand what happens at the line where the `splits` variable is assigned. Assuming we input the word “artificial” into the `edits1` function, show the exact data the `splits` variable will hold.
    2. Understand what happens at the line where the `deletes` variable is assigned. Taking the output you arrived at in subquestion 7.i, show the exact data the `deletes` variable will hold. In the context of the whole code, why do you think this data is useful?
    3. Understand what happens at the line where the `transposes` variable is assigned. Taking the output you arrived at in subquestion 7.i, show the exact data the `transposes` variable will hold. In the context of the whole code, why do you think this data is useful?
    4. Understand what happens at the line where the `replaces` variable is assigned. Taking the output you arrived at in subquestion 7.i, show the first 10 elements of the exact data the `replaces` variable will hold. In the context of the whole code, why do you think this data is useful? Referring back to the `letters` variable, how does the method handle words that contain letters other than the standard English set of 26 characters? How could one solve this deficiency? What could be a possible problem if the set of characters we use were really large (say, several million)?
    5. Understand what happens at the line where the `inserts` variable is assigned. Taking the output you arrived at in subquestion 7.i, show the first 10 elements of the exact data the `inserts` variable will hold. In the context of the whole code, why do you think this data is useful? Referring back to the `letters` variable, how does the method handle words that contain letters other than the standard English set of 26 characters?
    6. Given your answers to the previous subquestions (7.i to 7.v), explain what the output of the `edits1` function is going to be.
8. Explain how the `edits2` function works. What does it expect as input and what does it provide as output?
9. Explain how the `candidates` function works. What does it expect as input and what does it provide as output?
10. Explain how the `correction` function works. What does it expect as input and what does it provide as output? HINT: Focus on the optional `key` argument of the `max` function.
11. Run the program (i.e., the `correction` function) on the words below, using the file `./data/big.txt` to initialize the `WORDS` variable. Indicate when the ideal output is obtained and when it is not.
    - advrnture (ideally returns “adventure”)
    - copyrlght (ideally returns “copyright”)
    - litogation (ideally returns “litigation”)
    - colusion (ideally returns “collusion”)
    - amlcus (ideally returns “amicus”)
    - caselpad (ideally returns “caseload”)
12. What is the reason for not getting the ideal output for some of the words in question 11? How could one eliminate or mitigate the issue?
13. Are we guaranteed to get the ideal output for the words where the issue identified in question 12 does not exist?
14. Open the `tests.py` file. The `unit_tests` function makes extensive use of the `assert` statement. What does the statement do?
15. Explain what is being tested by the first 9 tests (`assert` statements) in the `unit_tests` function. Then explain what is being tested by the following 8 tests (give a separate explanation for each).
16. Run the `unit_tests` function. Did all the tests pass? How did you determine that? What does it mean if all the tests pass?
17. Elaborate on what role functions like `unit_tests` could play in programming.
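
Several questions above rely on three Python features: the `Counter` class, the `key` argument of `max`, and the `assert` statement. If any of them is unfamiliar, the warm-up sketch below may help before you read `spellcheck.py`. It uses toy data that does not come from the repository, and the names `tokens`, `fruit_counts`, and `most_common_fruit` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Toy data (an assumption for illustration); not taken from spellcheck.py or big.txt.
tokens = ["pear", "apple", "pear", "fig", "apple", "pear"]

# Counter tallies how often each hashable item occurs in an iterable.
fruit_counts = Counter(tokens)
print(fruit_counts["pear"])    # 3
print(fruit_counts["banana"])  # 0: a missing key counts as zero, no KeyError

# max() with key= compares items by key(item) rather than by the items themselves.
# Iterating over a Counter yields its keys, so this picks the most frequent key.
most_common_fruit = max(fruit_counts, key=fruit_counts.get)
print(most_common_fruit)       # pear

# assert does nothing when its condition is true ...
assert most_common_fruit == "pear"

# ... and raises AssertionError (with the optional message) when it is false.
# Note: assertions are skipped entirely when Python runs with the -O flag.
try:
    assert fruit_counts["fig"] == 2, "fig occurs only once"
except AssertionError as err:
    print("assertion failed:", err)
```

Try changing the toy list and predicting the printed values before you run the script; the same habit of predicting and then checking is what the questions above ask you to apply to the spellchecker itself.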