
Doyle Investigators - Vector Embeddings and Adversarial Analysis

This project constructs a binary classifier for Sir Arthur Conan Doyle using a dataset of Sherlock Holmes novels and short stories.

General Information

The authordetect package follows a modular object-oriented approach. The most relevant classes are:

  • Author (authordetect/author.py) - This class represents the corpus of a single author and provides capabilities to load and tokenize the corpus, partition it into documents, and create embedding models for the author and for each document. All these actions are part of the writer2vec algorithm (see Overleaf paper), and a method with the same name is provided that applies these transformations in a single step.
  • Tokenizer (authordetect/tokenizer.py) - This class represents a tokenizer for performing sentence segmentation and tokenization of an Author's corpus. It also contains a list of stopwords (from NLTK).
  • EmbeddingModel (authordetect/embedding.py) - This class represents a vector embedding model and is a wrapper over Gensim's Word2Vec, with added capabilities for saving/loading embeddings and ease of use. Embeddings with normalized vectors are used by default.
  • Classifier (authordetect/classifier.py) - This class represents an MLP classifier and is used to train on document vectors (with corresponding labels). Afterwards, it can provide predictions on new document vectors.

For reproducible results, set the seed parameter during training and prediction. Also, set the environment variable PYTHONHASHSEED to an integer prior to launching the Python interpreter process.
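For example, a minimal reproducibility setup might look like the following (the exact functions that accept the seed vary; see training/driver_train.py for the actual calls):

> export PYTHONHASHSEED=0
> python
>>> from authordetect import Author
>>> author = Author('data/Doyle_10.txt')
>>> # ... pass the same integer seed (e.g., seed=0) to the training and prediction steps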

Installation

  • The following packages are required (see requirements.txt):
    • Python 3.6 or greater
    • typing
    • configparser
    • unidecode
    • urllib3
    • smart_open
    • bs4
    • psutil
    • nltk
    • gensim
    • scikit-learn
    • pandas
    • seaborn
    • matplotlib
    • numpy

Local Install

  • Install package and dependencies on a local system

    > git clone https://github.com/edponce/DoyleInvestigators2.git
  • Create a virtual environment (Anaconda)

    > conda create -n authordetect python=3.7
    > conda activate authordetect
    > cd DoyleInvestigators2
    > pip install -e .
    > python setup_nltk.py
    > python
  • See Usage section below.

    >>> import authordetect
    >>> ...

Google Colab Install

  • See example notebook in drivers/AuthorDetect_AuthorEmbedding.ipynb. The code is downloaded directly from the GitHub repo and installed.
    >>> !pip install git+https://github.com/edponce/DoyleInvestigators2
    >>> # May need to restart runtime so that correct package versions are loaded
    Set up NLTK:
    >>> import nltk
    >>> nltk.download('stopwords')
    >>> nltk.download('punkt')  # sentencizer
    >>> nltk.download('averaged_perceptron_tagger')  # tagger
    >>> nltk.download('universal_tagset')  # universal POS tags
    >>> nltk.download('wordnet')  # lemmatizer
    For data files, mount Google Drive so that the shared folder with the corpus data is visible to the notebook.
    >>> from google.colab import drive
    >>> drive.mount('/content/gdrive')
    Now you should be able to run authordetect:
    >>> from authordetect import Author
    >>> infile = '/content/gdrive/My Drive/.../text.txt'
    >>> author = Author(infile)
    >>> ...

Usage

Example: Create an author's embedding matrix

>>> # Load an author's corpus
>>> from authordetect import Author, Tokenizer
>>> author = Author('data/Doyle_10.txt')
>>> author.corpus  # this is the raw text
>>>
>>> # Preprocess text without removing stopwords
>>> tokenizer = Tokenizer(use_stopwords=False)
>>> author.preprocess(tokenizer)
>>>
>>> # Create an author's word2vec embedding model
>>> author.embed()
>>> author.embedding.vocabulary  # access vocabulary from entire corpus
>>> author.embedding.vectors  # access non-normalized embedding matrix (NumPy 2D array)
>>> author.embedding.vectors_norm  # access normalized embedding matrix (NumPy 2D array)
>>> author.embedding['holmes']  # get vector associated with a word

Example: Save and load author's embedding model

  • Save Gensim's Word2Vec model:

    >>> author.embedding.save('my_embedding.bin')
  • Load existing Gensim's Word2Vec model:

    >>> from authordetect import Author, EmbeddingModel
    >>> embedding = EmbeddingModel()
    >>> embedding.load('my_embedding.bin')
    >>>
    >>> # Use the loaded embedding with an Author
    >>> author = Author('text.txt')
    >>> author.preprocess()
    >>> author.embed(embedding)

Datasets and Models

  • MLP classifier models and author embeddings were created with the training/driver_train.py script, setting seed=0, PYTHONHASHSEED=0, and remain_factor=350/<part_size>.

  • US/UK English translation was applied to the entire corpus.

    > cd lang_translation/
    > python driver_translate.py uk ../data/Rinehart_10.txt ../data/Rinehart_10_uk.txt

    To view in the web application, enable the tag option (the last argument):

    > python driver_translate.py uk ../data/Rinehart_10.txt ../data/Rinehart_10_uk_tag.txt 1
  • Synonym replacement was performed using the 50-dimensional embedding models and the corresponding document partition size.

    > cd synonyms/
    > python driver_synonyms.py 0 0.2 ../data/Rinehart_10.txt ../data/Rinehart_10_syn_350.txt ../training/doyle_50dim_350part.bin

    To view in the web application, enable the tag option (the first argument):

    > python driver_synonyms.py 1 0.2 ../data/Rinehart_10.txt ../data/Rinehart_10_syn_350_tag.txt
  • Test dataset JSON files were created by combining the 10% perturbed files. The script takes multiple text files with corresponding labels, partitions them into documents, shuffles them, and exports the list to a JSON file.

    > cd test_datasets/
    > python driver_create_json.py 350 perturbed_langtranslation_rinehart_350.json ../data/Doyle_10_uk.txt doyle ../data/Christie_10_uk.txt christie ../data/Rinehart_10_uk.txt rinehart
  • There are helper scripts to compute the frequency of perturbations for making plots. First, create a JSON file of the original corpus. For example, for language translation:

    > cd test_datasets/
    > python driver_create_json.py 350 original_rinehart_350.json ../data/Rinehart_10.txt rinehart

    Then, process it with the frequency script for the corresponding perturbation:

    > cd lang_translation/
    > python driver_freq_translate.py uk ../test_datasets/original_rinehart_3500.json perturb_rate_langtranslation_rinehart_3500.json

Novels and Short Stories

  • The selection should have 300K ± 10% words in total.
Type   Title                                 Words (N)
Novel  The Valley of Fear                    58,827
Novel  A Study in Scarlet                    43,862
Novel  The Sign of the Four                  43,705
Novel  The Hound of the Baskervilles         59,781
Story  The Boscombe Valley Mystery           9,722
Story  The Five Orange Pips                  7,388
Story  The Adventure of the Speckled Band    9,938
Story  The Adventure of the Cardboard Box    8,795
Story  The Musgrave Ritual                   7,642
Story  The Reigate Squires                   7,303
Story  The Adventure of the Dancing Men      9,776
Story  The Adventure of the Second Stain     9,800
Total  (Gensim tokenizer)                    276,539
  • Short stories
    • The Adventures of Sherlock Holmes
      • 4 - The Boscombe Valley Mystery
      • 5 - The Five Orange Pips
      • 8 - The Adventure of the Speckled Band
    • Memoirs of Sherlock Holmes (British version)
      • 2 - The Adventure of the Cardboard Box
      • 6 - The Musgrave Ritual
      • 7 - The Reigate Squires
    • The Return of Sherlock Holmes
      • 3 - The Adventure of the Dancing Men
      • 13 - The Adventure of the Second Stain

Pipeline Document

https://docs.google.com/document/d/1lYdSgOwpMAF2GGBTz4h0kvHQPEfisEoplJDX4_YUQSc/edit?usp=sharing

Preprocessing

  • Lowercase
  • Remove non-alpha symbols
  • Lemmatize (NLTK)
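
A minimal sketch of these preprocessing steps using NLTK directly (the Tokenizer class encapsulates the package's own version of this pipeline):

>>> import re
>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> text = "Mr. Holmes, they were the footprints of a gigantic hound!"
>>> text = text.lower()                    # lowercase
>>> text = re.sub(r'[^a-z\s]', ' ', text)  # remove non-alpha symbols
>>> [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(text)]  # lemmatize (NLTK)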

Sentence Segmentation

Type              Sentences (N)
NLTK line         18,616
NLTK punctuation  18,638
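
A sketch of the two segmentation strategies compared above: line-based splitting versus NLTK's punctuation-based (Punkt) sentence tokenizer. The counts in the table were produced with the project's Tokenizer; this only illustrates the distinction:

>>> import nltk
>>> text = open('data/Doyle_10.txt').read()
>>> line_sents = [s for s in text.splitlines() if s.strip()]  # line-based segmentation
>>> punkt_sents = nltk.sent_tokenize(text)                    # punctuation-based segmentation
>>> len(line_sents), len(punkt_sents)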

Word Embeddings

  • word2vec parameters: free choice
  • Construct models using embedding sizes: 50 and 300
  • For document embeddings, use the entire document (no random sampling of words as in the paper)
  • Unknown tokens are set to a zero vector
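
A sketch of how a single document vector can be assembled under these rules, mapping unknown tokens to a zero vector. The actual document embedding is produced inside authordetect's writer2vec pipeline; the example tokens and the membership test on the vocabulary are assumptions for illustration:

>>> import numpy as np
>>> from authordetect import Author, Tokenizer
>>> author = Author('data/Doyle_10.txt')
>>> author.preprocess(Tokenizer(use_stopwords=False))
>>> author.embed()
>>> emb = author.embedding
>>> dim = emb.vectors.shape[1]  # 50 or 300
>>> doc_tokens = ['holmes', 'observed', 'tokennotinvocab']  # illustrative document tokens
>>> word_vectors = [emb[t] if t in emb.vocabulary else np.zeros(dim) for t in doc_tokens]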

MLP

  • MLP parameters: free choice
  • For MLP input, average document embeddings into a single vector
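
A sketch of this step using scikit-learn's MLPClassifier directly on synthetic data (the authordetect Classifier wraps an MLP; its exact API is not shown here):

>>> import numpy as np
>>> from sklearn.neural_network import MLPClassifier
>>> rng = np.random.default_rng(0)
>>> docs = [rng.normal(size=(350, 50)) for _ in range(20)]  # stand-in word vectors: 20 docs x 350 words x 50 dims
>>> labels = [1] * 10 + [0] * 10                            # 1 = Doyle, 0 = non-Doyle
>>> X = np.vstack([doc.mean(axis=0) for doc in docs])       # average word vectors into one vector per document
>>> clf = MLPClassifier(random_state=0, max_iter=500).fit(X, labels)
>>> clf.predict(X[:5])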

Training, Validation, and Testing Datasets

  • Data unit - represents a contiguous collection of words that forms a "document" of the corresponding author. To create a data unit, always start at the beginning of a sentence and end when the word count is fulfilled (see the sketch after this list).
    • 1/2 page - 350 words
    • 2 page - 1,400 words
    • 5 page - 3,500 words
  • 90/10 using documents as the data unit
    • Split 90% into 50/25/25
    • 10% for testing
  • 90/10 split; share the 10% with the other groups to perturb
    • From the 10%, use an 80/20 split for the defeat dataset
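
A sketch of data unit construction, assuming the corpus is already split into tokenized sentences: each unit starts at a sentence boundary, whole sentences are added until the target word count would be exceeded, and units do not overlap:

>>> def make_data_units(sentences, target_words=350):
...     """Greedily pack whole sentences into non-overlapping units of ~target_words words."""
...     units, current, count = [], [], 0
...     for sent in sentences:
...         if count and count + len(sent) > target_words:
...             units.append(current)  # close the unit at a sentence boundary
...             current, count = [], 0
...         current.extend(sent)
...         count += len(sent)
...     if current:
...         units.append(current)
...     return units
>>> sentences = [['the', 'game', 'is', 'afoot'], ['elementary', 'my', 'dear', 'watson']]  # toy input
>>> make_data_units(sentences, target_words=4)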

Adversarial Techniques

  • Each group will apply at least 2 perturbations.
    • All groups will do synonym replacement - approach can differ (free choice)
    • Doyle - US/British English translation
    • Rinehart - contractions or pronouns
    • Christie - undecided
  • Apply perturbations to selected data
  • How much perturbation to apply to each document depends on the perturbation itself, since some approaches modify more text than others. We suggest limiting the perturbation effect to 20% of each document: if a perturbation changes less than 20%, keep all of its changes; if it exceeds 20%, cap it (see the sketch after this list).
  • For synonym perturbations: 20% upper limit per document
  • For the second perturbation: up to the group's discretion
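
A sketch of capping the perturbation rate at 20% of a document's tokens; this mirrors the 0.2 rate argument passed to driver_synonyms.py, but the budget logic here is only illustrative:

>>> def apply_capped(tokens, candidate_edits, max_rate=0.2):
...     """Apply at most max_rate * len(tokens) single-token replacements."""
...     budget = int(max_rate * len(tokens))
...     out = list(tokens)
...     for pos, replacement in candidate_edits[:budget]:  # drop edits beyond the budget
...         out[pos] = replacement
...     return out
>>> apply_capped(['the', 'hound', 'howled', 'at', 'night'], [(1, 'dog'), (2, 'cried')])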

Doyle Group Proposed Ideas

  • Language translation (US English to British) - Google Translate
  • Synonym replacement using word vector similarity, part of speech, and other model-agnostic qualities (see the sketch after this list)
  • Change tense - https://github.com/bendichter/tenseflow
  • Change singular and plural forms of words, change numbers and text - https://github.com/jazzband/inflect
  • Invert text and word order
  • Rearrange neighbor sentences
  • Introduce typos (letter flipping)
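
As a sketch of the word-vector-similarity idea: query a trained embedding for nearest neighbors and accept a replacement only when its part of speech matches the original word's. This uses Gensim and NLTK directly on a toy corpus; the project's actual implementation lives in synonyms/driver_synonyms.py:

>>> import nltk
>>> from gensim.models import Word2Vec
>>> sentences = [['holmes', 'observed', 'the', 'hound'],
...              ['watson', 'watched', 'the', 'hound']]  # toy corpus
>>> model = Word2Vec(sentences, vector_size=50, min_count=1, seed=0)  # use size= instead of vector_size= on Gensim < 4
>>> word = 'observed'
>>> pos = nltk.pos_tag([word], tagset='universal')[0][1]   # POS of the original word
>>> candidates = model.wv.most_similar(word, topn=5)       # nearest neighbors by cosine similarity
>>> [w for w, _ in candidates if nltk.pos_tag([w], tagset='universal')[0][1] == pos]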

Rinehart Group Proposed Ideas

Edmon's Comments on Proposed Ideas

  • Synonyms are good.
  • British to US English is OK, but tense change or typos are most likely not.
  • Tense can possibly change the meaning of the text, but if done carefully it could be fine. (Think of participles vs. simple past tense, e.g., he was in prison vs. he has been in prison.)
  • Character flipping can turn text into gibberish or can alter the meaning. There is no easy way to control it. (E.g. mud/mad, pea/pee, tea/tee, stop/step, and so on ...)
  • Plurals and singulars are tricky. He murdered a woman is not the same as he murdered women.
  • Re-arranging sentences how? You could consider changing active to passive voice. It is a reasonably safe way.
  • Changing numbers and text might be OK, but you could also squash meaning if done carelessly and automatically.

Notes from PM meeting on 10/14/20

  • All groups will select 4-5 crime novels (same as in Project 1) that contain a total of 300K ± 10% tokens. The 4 Doyle novels used in Project 1 have a total of ~203K words, and given that there are no more Holmes novels, we added 8 short stories.
  • We will use 3 data resolutions: 350 words (1/2 page), 1,400 words (2 pages), and 3,500 words (5 pages). The data units will be selected from the single merged text file by starting at the first word of a sentence and ending at the end of the sentence that brings the word count closest to the data unit size without exceeding it. These data units will be non-overlapping.
  • Groups will share perturbation ideas. Edmon will select a handful from these to assign to the groups. The perturbations will vary between groups.
  • The goal is to replicate the classification approach presented in the assigned paper. We will have 6 w2v models per author (Nx6) and 6 MLP heads.
  • We will use two embedding sizes for the vector embeddings: 50 and 300.
  • All groups will share text data as follows: Extract only the prose from all novels (no headings, no metadata) and merge together into a single file with no formatting changes except removing empty lines.
