NaNoGenMo / 2016

National Novel Generation Month, 2016 edition.

Home Page: https://nanogenmo.github.io

Pride and Prejudice and Innuendo

michelleful opened this issue

Time having run out for my other grander ideas, I am reduced to (once again) taking Jane Austen's great work and injecting puerile humour into it.

This time I tried to find words carrying innuendo, generally of the sexual variety, and italicise them in a nudge-wink kind of way. After experimenting with a few ways of obtaining the words (chiefly using sense2vec to find words used in similar contexts to actual swear words), I settled on searching Urban Dictionary for words whose primary (meaning most upvoted, I think) dictionary entry contained the word 'sex'. In addition, I replaced some perfectly innocent words with grawlixes for giggles.
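The selection and marking steps described above could be sketched roughly as follows. This is a stdlib-only illustration, not the code actually used for the novel: the shape of `entries` (a list of dicts with `definition` and `upvotes` keys), the whole-word match on 'sex', the Markdown-style `*...*` italics, and all function names are assumptions made for the example.

```python
import random
import re

# Symbols to draw grawlixes from (assumed set, based on the sample output).
GRAWLIX_CHARS = list("@#$%&*")


def top_definition_mentions(entries, keyword="sex"):
    """Return True if the most upvoted entry's definition contains `keyword`.

    `entries` is a hypothetical list of dicts such as
    {"definition": "...", "upvotes": 12}; a real Urban Dictionary client
    may return a different structure.
    """
    if not entries:
        return False
    top = max(entries, key=lambda e: e["upvotes"])
    # Assumption: match 'sex' as a whole word, case-insensitively.
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, top["definition"], re.IGNORECASE) is not None


def grawlix(length=4, rng=random):
    """Build a grawlix of distinct symbols, e.g. for replacing an innocent word."""
    k = min(length, len(GRAWLIX_CHARS))
    return "".join(rng.sample(GRAWLIX_CHARS, k))


def italicise(word):
    """Wrap a word in Markdown-style emphasis markers (assumed output format)."""
    return f"*{word}*"
```

A word would then be italicised only when `top_definition_mentions(...)` is true for its entries, while a separately chosen set of innocent words would be swapped for `grawlix()`.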

Complete novel

Sample output:

IT is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a &@$%.

"It will be no use to us if twenty such should come, since you will not %&@# them."

"Depend upon it, my dear, that when there are twenty I will %@#& them all."

"Indeed, Sir, I have not the least intention of &$@*ing. — I entreat you not to suppose that I moved this way in order to beg for a partner."

He was most highly esteemed by Mr. Darcy, a most intimate, confidential friend.

I do not pretend to regret any thing I shall leave in Hertfordshire, except your society, my dearest friend; but we will hope at some future period, to enjoy many returns of the delightful intercourse we have known...

Tools used/lessons learned

  • @#&% is called a grawlix.
  • spaCy (Python)
    • Chiefly for part-of-speech tagging and (very little) dependency parsing.
    • Its token.text_with_ws attribute is especially useful for maintaining good spacing.
    • There's still room for a Python library to do intelligent text replacement (e.g. handling a/an, conjugation, plurals, phrasal verbs, etc.) though.
  • Urban Dictionary and py-urbandict
    • There are a lot of very common and completely innocuous words (and innocuous definitions) in UD, which I didn't expect.
    • I would have liked to use its word combinations but I wound up just using solitary words.
    • Urban Dictionary really could use part-of-speech information.
  • (earlier versions): sense2vec word embeddings
    • Runs word2vec over (word, part-of-speech) combinations.
    • Trained on Reddit comments, which I was hoping would know swear words well.
    • It was still very hard to triangulate words with multiple meanings, like 'ball', which wasn't close to 'dance' or a bunch of other likely words I tried. Further word sense disambiguation would still be useful.
  • Identifying words with innuendo is really hard and people are doing actual research on this.
    • Expanding the list of words searched for beyond 'sex' would be a good next step.
    • Maybe training a classifier on Urban Dictionary entries would work even better, incorporating other information like whether a word is used in sex-related subreddits versus other subreddits...
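To illustrate the spacing and a/an points from the spaCy notes above, here is a small stdlib-only sketch. It does not use spaCy: the regex tokenizer only loosely imitates the idea behind token.text_with_ws (each token carries its own trailing whitespace), and the article fix is a naive spelling-based heuristic. All names here are hypothetical.

```python
import re

# A word, or a single punctuation character, followed by its trailing whitespace.
TOKEN_RE = re.compile(r"(\w+|[^\w\s])(\s*)")


def tokenize(text):
    """Split text into (token, trailing_whitespace) pairs."""
    return [(m.group(1), m.group(2)) for m in TOKEN_RE.finditer(text)]


def fix_article(article, next_word):
    """Pick 'a' or 'an' from the next word's first letter (naive: letters, not sounds)."""
    wants_an = bool(next_word) and next_word[0].lower() in "aeiou"
    fixed = "an" if wants_an else "a"
    return fixed.capitalize() if article[0].isupper() else fixed


def replace_words(text, replacements):
    """Swap words via a lowercase lookup table while keeping the original spacing."""
    tokens = tokenize(text)
    words = [replacements.get(tok.lower(), tok) for tok, _ in tokens]
    for i in range(len(words) - 1):
        # Only touch an article when the word after it was actually replaced.
        replaced_next = words[i + 1] != tokens[i + 1][0]
        if words[i].lower() in ("a", "an") and replaced_next:
            words[i] = fix_article(words[i], words[i + 1])
    # The regex skips leading whitespace, so restore it explicitly.
    leading = text[: len(text) - len(text.lstrip())]
    return leading + "".join(w + ws for w, (_, ws) in zip(words, tokens))
```

For example, `replace_words("She wanted a partner, not an apple.", {"partner": "escort", "apple": "banana"})` returns `"She wanted an escort, not a banana."`. Conjugation, plurals, and phrasal verbs are left untouched, which is exactly the gap noted in the list.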

🎉 🍆