matthewmcvickar / oblique-questions-bot

A bot that posts questions without context.

Home Page: https://botsin.space/@obliquestions

Oblique Questions Bot

A bot that posts questions without context. The questions are drawn from Project Gutenberg texts.

Currently posting several times a day to Mastodon. (It used to post to Twitter, but I don't use Twitter anymore and neither do my bots.)

📚 ❓ 🤖 → @obliquestions@botsin.space on Mastodon

How I Built It

Taking inspiration from Hugo van Kemenade's gutengrep project, I derived the initial corpus from the books on the Project Gutenberg 'August 2003' CD. To start with a cleaner dataset, I manually removed almost 200 books from the collection before building my corpus. These included non-English texts, poetry and dramatic texts, texts heavy with dialect, and religious, mathematical, encyclopedic, and political texts.

This left me with about 400 books. I used gutengrep to tokenize the texts into sentences.

Once tokenized, I cleaned up the corpus a bit:

  • deleted duplicate lines (with Sublime Text's Edit → Permute Lines → Unique command)
  • deleted empty lines (found \n\n and replaced it with \n)
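
The cleanup above was done by hand in Sublime Text. For anyone who would rather script it, here is a minimal sketch of the same two steps. The `cleanLines` helper is hypothetical and not part of this project, and treating whitespace-only lines as empty is an assumption.

```javascript
// Hypothetical helper, not part of the original project.
// Removes empty lines and duplicate lines, keeping the first occurrence of each.
function cleanLines(text) {
  const seen = new Set();
  const result = [];
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue; // drop empty (or whitespace-only) lines
    if (seen.has(line)) continue;     // drop exact duplicates
    seen.add(line);
    result.push(line);
  }
  return result;
}

const sample = 'What is it?\n\nWho goes there?\nWhat is it?\n';
console.log(cleanLines(sample)); // [ 'What is it?', 'Who goes there?' ]
```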

Then I wrote a script (build-corpus.js) to format and filter the sentences into a set of postable questions. In order:

  • Removed leading and trailing quotation marks, so that questions that were quotations in the original text would be posted as though they were prose.

  • Capitalized the first letter of the question, in case it wasn't already capitalized.

  • Filtered out any question longer than 140 characters.

  • Filtered out any question that included a proper noun. (I felt this would provide too much context.) I did this with a regular expression that searched for words preceded by a space and starting with a capital letter. This doesn't capture proper nouns at the beginning of sentences, but that's fine.

  • Filtered out any question that contained non-letter characters (excluding apostrophes), as they often indicated weird formatting and non-questions:

    1 2 3 4 5 6 7 8 9 0 : ; . " “ ” ‘ ’ < > [ ] ( ) { } ` ~ # $ % ^ & _ + - = \ / |
  • Filtered out any question that contained archaic language (like thine and dost and prithee).

  • Filtered out any question that contained religious language (like moses and buddha and clergy).

  • Filtered out any question that would relate to the text itself or Project Gutenberg itself (like gutenberg and donate and chapter and section).

  • Filtered out the bad words listed in Darius Kazemi's wordfilter.

  • Filtered out any question that contained additional oppressive language not covered by wordfilter, or words that tend to appear in problematic sentences.
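
The steps above can be sketched roughly as follows. This is not the code from build-corpus.js: every name here (`stripQuotes`, `capitalize`, `toPostableQuestion`, `BLOCKED_WORDS`) is hypothetical, the word list is a tiny stand-in built from the examples mentioned above, and the regular expressions are my guess at one way to express each rule. The wordfilter check is left out so the sketch runs without dependencies.

```javascript
// Hypothetical stand-in; the real word lists are much longer.
const BLOCKED_WORDS = ['thine', 'dost', 'prithee', 'moses', 'buddha', 'clergy',
                       'gutenberg', 'donate', 'chapter', 'section'];
const BLOCKED = new RegExp('\\b(' + BLOCKED_WORDS.join('|') + ')\\b', 'i');

// A word preceded by a space and starting with a capital letter.
// Note: written this way, it also rejects the pronoun "I";
// the original script may handle that differently.
const PROPER_NOUN = / [A-Z]/;

// The non-letter characters listed above (apostrophes are allowed).
const DISALLOWED_CHARS = /[0-9:;."“”‘’<>\[\](){}`~#$%^&_+\-=\\\/|]/;

// Strip leading and trailing double quotation marks (straight or curly).
function stripQuotes(sentence) {
  return sentence.replace(/^["“”]+|["“”]+$/g, '');
}

// Capitalize the first letter, leaving the rest untouched.
function capitalize(sentence) {
  return sentence.charAt(0).toUpperCase() + sentence.slice(1);
}

// Returns the formatted question, or null if any filter rejects it.
function toPostableQuestion(sentence) {
  const question = capitalize(stripQuotes(sentence.trim()));
  if (question.length > 140) return null;            // too long
  if (PROPER_NOUN.test(question)) return null;       // probable proper noun
  if (DISALLOWED_CHARS.test(question)) return null;  // odd formatting
  if (BLOCKED.test(question)) return null;           // word-list filters
  return question;
}

console.log(toPostableQuestion('"what do you mean?"')); // What do you mean?
console.log(toPostableQuestion('Where is Mr. Darcy?')); // null
console.log(toPostableQuestion('Dost thou know?'));     // null
```

Returning null for a rejected sentence keeps the format and filter steps in one function, so the caller only has to keep the non-null results.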

If a sentence passed all the filters, I added it to a giant JSON file.

After refining the script, I ended up with a JSON file of about 66K questions.

I then wrote a script (bot.js) that reads the JSON file, chooses a question from it at random, and posts the question.
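
A minimal sketch of that read-and-pick step, under some assumptions: the JSON file is taken to be a flat array of strings, the `post` function is a hypothetical stand-in that only logs (the real bot posts to Mastodon), and a small temporary file takes the place of the real corpus. None of these names come from bot.js.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Load the question list (assumed to be a JSON array of strings).
function loadQuestions(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

// Pick one element uniformly at random.
function pickRandom(items) {
  return items[Math.floor(Math.random() * items.length)];
}

// Hypothetical stand-in for the posting step.
function post(question) {
  console.log(question);
}

// Demo: a two-question temporary file stands in for the real corpus.
const demoPath = path.join(os.tmpdir(), 'oblique-questions-demo.json');
fs.writeFileSync(demoPath, JSON.stringify(['What is it?', 'Who goes there?']));

post(pickRandom(loadQuestions(demoPath)));
```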

Acknowledgements

I couldn't have created this bot without the help of the following:

Afterword

This is my first bot. The idea for Oblique Questions came to me while walking the dog on Saturday, October 31, 2015. I started working on it the next day and launched the working bot on the morning of Wednesday, November 4, 2015.

License: MIT License