Pride and Prejudice and Innuendo
michelleful opened this issue
Time having run out for my other grander ideas, I am reduced to (once again) taking Jane Austen's great work and injecting puerile humour into it.
This time I attempted to find words containing innuendo, generally of the sexual variety, and italicise them in a nudge-wink kind of way. After experimenting with a few ways of obtaining the words (chiefly using sense2vec to find words used in similar contexts to actual swear words), I settled on searching Urban Dictionary for words whose primary (meaning most upvoted, I think) dictionary entry contained the word 'sex'. In addition, I replaced some perfectly innocent words with grawlixes for giggles.
Sample output:
IT is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a &@$%.
"It will be no use to us if twenty such should come, since you will not %&@# them."
"Depend upon it, my dear, that when there are twenty I will %@#& them all."
"Indeed, Sir, I have not the least intention of &$@*ing. — I entreat you not to suppose that I moved this way in order to beg for a partner."
He was most highly esteemed by Mr. Darcy, a most intimate, confidential friend.
I do not pretend to regret any thing I shall leave in Hertfordshire, except your society, my dearest friend; but we will hope at some future period, to enjoy many returns of the delightful intercourse we have known...
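For anyone curious, here is a minimal sketch of the two steps described above: flagging a word when its top definition mentions 'sex', and generating a grawlix. It is an illustration under assumptions, not the actual project code. `TOP_DEFINITIONS` is a hypothetical stand-in for real Urban Dictionary lookups (its entries are made up for the example), the substring match on 'sex' is a guess at how the check might work, and the function names are my own.

```python
import random

GRAWLIX_SYMBOLS = "@#$%&*"

# Hypothetical stand-in for "most upvoted Urban Dictionary entry" lookups.
# These definitions are invented placeholders, not real UD data.
TOP_DEFINITIONS = {
    "intercourse": "Another word for sex.",
    "fortune": "A large amount of money.",
}


def is_innuendo(word, top_definitions, keyword="sex"):
    """Return True if the word's top definition contains `keyword`.

    Assumption: a plain case-insensitive substring match, so 'sexy'
    would also count. Unknown words are treated as innocent.
    """
    definition = top_definitions.get(word.lower(), "")
    return keyword in definition.lower()


def grawlix(length=4, rng=random):
    """Return `length` distinct symbols in random order, e.g. '&@$%'."""
    length = min(length, len(GRAWLIX_SYMBOLS))
    return "".join(rng.sample(GRAWLIX_SYMBOLS, length))
```

Passing a seeded `random.Random` as `rng` makes the grawlixes reproducible between runs, which is handy when regenerating a whole novel.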
Tools used/lessons learned
- @#&% is called a grawlix.
- spaCy (Python)
- Chiefly for part-of-speech tagging and (very little) dependency parsing.
- Its `token.text_with_ws` attribute is especially useful for maintaining good spacing.
- There's still room for a Python library to do intelligent text replacement (e.g. handling a/an, conjugation, plurals, phrasal verbs, etc.) though.
- Urban Dictionary and py-urbandict
- There are a lot of very common and completely innocuous words (and innocuous definitions) in UD, which I didn't expect.
- I would have liked to use its word combinations but I wound up just using solitary words.
- Urban Dictionary really could use part-of-speech information.
- (earlier versions): sense2vec word embeddings
- Does word2vec on (word, part-of-speech) combinations.
- Trained on Reddit comments, which I was hoping would know swear words well.
- Still very hard to triangulate words with multiple meanings like ball, which wasn't close to dance and a bunch of other likely words I tried. Further word sense disambiguation would still be useful.
- Identifying words with innuendo is really hard and people are doing actual research on this.
- Expanding the list of words searched for beyond 'sex' would be a good next step.
- Maybe training a classifier on Urban Dictionary entries would work even better, incorporating other information like whether a word is used in sex-related subreddits versus other subreddits...
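To illustrate the spacing point from the spaCy notes above, here is a rough sketch of word replacement that re-attaches each token's original trailing whitespace, which is what `token.text_with_ws` gives you (it is the token's `text` plus its `whitespace_`). To keep the example dependency-free it uses a hypothetical `SimpleToken` and a naive regex tokenizer of my own rather than spaCy itself; the tokenizer ignores leading whitespace and splits on apostrophes, so treat it as a toy.

```python
import re
from collections import namedtuple

# Stand-in for a spaCy Token: the token text plus its trailing whitespace.
SimpleToken = namedtuple("SimpleToken", ["text", "whitespace_"])


def simple_tokenize(text):
    """Split text into word/punctuation tokens, remembering trailing spaces."""
    return [
        SimpleToken(match.group(1), match.group(2))
        for match in re.finditer(r"(\w+|[^\w\s])(\s*)", text)
    ]


def replace_words(tokens, replacements):
    """Swap selected tokens, keeping each token's original trailing whitespace.

    `replacements` maps lowercased token text to its substitute.
    """
    pieces = []
    for tok in tokens:
        new_text = replacements.get(tok.text.lower(), tok.text)
        pieces.append(new_text + tok.whitespace_)
    return "".join(pieces)


sentence = "Depend upon it, my dear, that I will visit them all."
print(replace_words(simple_tokenize(sentence), {"visit": "%@#&"}))
```

Because `replace_words` only reads `.text` and `.whitespace_`, it should also accept a spaCy `Doc` directly (e.g. `replace_words(nlp(sentence), {...})`), assuming a loaded pipeline named `nlp`. It does nothing about a/an, conjugation or plurals, which is exactly the gap mentioned in the list.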
🎉 🍆