Carleton SCSI 2019 Research Project: Natural Language Processing

Research Question

Can generated text, lightly edited to improve grammar and context, pass as legitimately human-written text?

Text Generation

The text is generated from a dictionary of bigram "seeds" built from the input text file(s). The text is first read in, and all phrases enclosed in braces or square brackets ({}, []) are removed. The remaining text is split into a list of strings, which is then processed once more so that each quoted or parenthesized phrase occupies a single index. A dictionary is built from these strings: each key is a tuple (a pair of consecutive strings), and every unique occurrence of the word that follows the pair is added to the list mapped to that key. Certain words are selected as "starter" words, which serve as the first seed for generation. Generation begins by randomly selecting a "starter" and then randomly selecting a value from the list of words mapped to it. That word is appended to the generated text, and the next key is formed by dropping the first element of the tuple, shifting the second element into the first position, and making the newly added word the second element.
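The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function names (tokenize, build_bigram_table, generate), the regular expressions, the sample sentence, and the choice to represent each "starter" as a pair of words are all assumptions made for the example.

```python
import random
import re
from collections import defaultdict


def tokenize(text):
    """Drop {...} and [...] phrases, then split the text into tokens.

    Quoted and parenthesized phrases are kept together as single tokens.
    (Hypothetical helper; the regexes are an assumption, not the project's.)
    """
    text = re.sub(r"\{[^{}]*\}|\[[^\[\]]*\]", "", text)
    return re.findall(r'"[^"]*"|\([^()]*\)|\S+', text)


def build_bigram_table(tokens):
    """Map each pair of consecutive tokens to the unique tokens that follow it."""
    table = defaultdict(list)
    for first, second, following in zip(tokens, tokens[1:], tokens[2:]):
        if following not in table[(first, second)]:
            table[(first, second)].append(following)
    return dict(table)


def generate(table, starters, max_words=30, rng=random):
    """Pick a random starter key, then repeatedly pick a random follower."""
    key = rng.choice(starters)
    output = list(key)
    while key in table and len(output) < max_words:
        next_word = rng.choice(table[key])
        output.append(next_word)
        # Slide the key: drop the first element, append the new word.
        key = (key[1], next_word)
    return " ".join(output)


sample = "the cat sat on the mat [1] and the cat ran to the door {note}"
tokens = tokenize(sample)
table = build_bigram_table(tokens)
starters = [("the", "cat")]  # assumed: starters stored as key pairs
print(generate(table, starters, max_words=12, rng=random.Random(0)))
```

Storing only unique followers, as the description states, means each possible next word is equally likely regardless of how often it appeared in the source text. Generation stops when the current key has no recorded follower or when max_words is reached.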

Author Spoofing

The generator can spoof an author's writing when it is given multiple works by the same author as input. Because the dictionary is built from that author's word sequences, the generated text follows the author's writing style. This could be used to fake an author's writing or reports.

DISCLAIMER: WE DO NOT CONDONE USING THIS TECHNIQUE LIBERALLY OR FOR ANY PERSONAL GAIN

Use

To select a single file, modify the directory path; to read multiple files, change the list of "reports". Enter each file name without the ".txt" extension. In the main function, uncomment the line(s) that correspond to your intended use: the for loop reads multiple files, while the single directory read line reads one file.

Requires:
nltk library: use "pip3 install nltk" on the command line

Modifications: Biased read --> removes all sentences that don't match the given bias. Uncomment the 'pos' option to make the generated text generally positive, or the opposite option to make it generally negative.

WARNING: This will be slow depending on the size of your file(s).
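To illustrate the idea behind the biased read, here is a rough sketch that keeps only the sentences matching a requested bias. It does not reproduce the project's nltk-based implementation, which is not described here. The word lists, the toy_score and biased_read names, the 'neg' label, and the regex sentence splitter are all hypothetical stand-ins, chosen so the example runs without downloading any nltk data.

```python
import re

# Hypothetical toy lexicons, for illustration only.
POSITIVE = {"good", "great", "improve", "success", "benefit"}
NEGATIVE = {"bad", "poor", "fail", "harm", "crisis"}


def toy_score(sentence):
    """Return (# positive words) - (# negative words) in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def biased_read(text, bias="pos", score=toy_score):
    """Keep only sentences whose score matches the bias ('pos' or 'neg')."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if bias == "pos":
        return [s for s in sentences if score(s) > 0]
    return [s for s in sentences if score(s) < 0]


text = "The plan was a great success. The rollout was a crisis. It rained."
print(biased_read(text, bias="pos"))
print(biased_read(text, bias="neg"))
```

Every sentence has to be scored before generation begins, which is one plausible reason the biased read slows down as the input files grow.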

Results

The bounding score, calculated by generating random guesses over 1,000,000 tries, is 1.5. If the average score is greater than or equal to this bound, the project is a success. After testing multiple participants with 6 texts, drawn from either communism.txt or global.txt, we conclude that the generated text can effectively pass as human-written. Our average test score was 1.5625, meaning that, overall, participants could not tell which texts were real and which were fake because of their similarity, and had to guess. Our testing table is displayed below. Green highlights mean that a fake text passed as a real one, yellow that a real text was mistaken for a fake one, and red that a fake text was correctly identified as fake.

Add screen shot of finalized research poster

References

arXiv:1708.05148
CS224D Stanford
White Paper on NLP
Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

Acknowledgements

Thanks to Caroline Lu, Charles Nykamp, Claire Wen, Dae’ Kevion Dicason, Daisy Wang, Justice Osondu, Kentron White, Michael Schultz, Nigel Webb, Rashad Philizaire, Roland Liu, Teresa Luo, Sabrina Ross, Serena Arora, and Sujan Arora for participating in our experiment.

Thanks also to our professor, Andy Exley, and our research assistant, Cole DiIanni, for their guidance and teaching.

TangyKiwi / Carleton