aperrien / epub2vec

Convert a folder of .epubs into a word2vec model and clusters of "similar paragraphs".

EPUB2Vec

Goals of this script

  • Accept a directory of .epub books as input
  • Generate a word2vec model using the corpus of words available in those books as a domain-specific training set.
  • Analyze each paragraph from those books to compute its "average word", aka "paragraph vector" (the average of the word vectors in that paragraph).
  • Using k-means clustering on the "paragraph vectors", find the most similar paragraphs across the library of books.
  • Output lists of related/relevant passages that can be linked to each other, presented side by side, or otherwise used for something interesting.
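
As a rough illustration of the "paragraph vector" idea in the goals above, here is a minimal sketch that averages word vectors for one paragraph. The tiny TOY_VECTORS table and the paragraph_vector helper are made up for this example; in the real script the vectors would come from the trained word2vec model, and its tokenization may differ.

```python
from typing import Dict, List, Optional

# Hypothetical 3-dimensional embeddings standing in for a trained word2vec model.
TOY_VECTORS: Dict[str, List[float]] = {
    "bond": [1.0, 0.0, 0.0],
    "government": [0.0, 1.0, 0.0],
    "issued": [0.0, 0.0, 1.0],
}


def paragraph_vector(
    paragraph: str, vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of all known words; return None if no word is known."""
    # Naive tokenization (assumption): split on whitespace, strip punctuation.
    words = [w.strip('.,;:!?"()').lower() for w in paragraph.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dims = len(known[0])
    # Component-wise mean across every known word vector.
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


vec = paragraph_vector("A bond issued by a national government.", TOY_VECTORS)
```

Words missing from the vocabulary are simply skipped here, which is one possible design choice; a paragraph with no known words has no vector and would be left out of clustering.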

To run

  • I highly recommend installing the Anaconda Python distribution and using conda to install any further modules as needed: https://www.continuum.io/downloads
  • Install every module the script imports, plus Cython for a dramatic speed increase (included in conda distributions by default).
  • Place your .epub files in the root directory alongside this script, then run it. Go do something else for a long time: the clustering step will take quite a while.
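
For readers curious how a folder of .epub files can be turned into paragraphs, here is a minimal sketch using only the standard library. It relies on the fact that an .epub is a ZIP archive containing XHTML files. The names ParagraphCollector, extract_paragraphs, and extract_library are hypothetical and not taken from the script, which may handle this differently.

```python
import zipfile
from html.parser import HTMLParser
from pathlib import Path
from typing import Dict, List


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._depth = 0          # > 0 while inside a <p> element
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the text pieces and collapse runs of whitespace.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.paragraphs.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_paragraphs(epub_path) -> List[str]:
    """Read every (X)HTML file inside one .epub and return its paragraphs."""
    paragraphs: List[str] = []
    with zipfile.ZipFile(epub_path) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = ParagraphCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                paragraphs.extend(parser.paragraphs)
    return paragraphs


def extract_library(folder) -> Dict[str, List[str]]:
    """Map each .epub file name in a folder to its list of paragraphs."""
    return {p.name: extract_paragraphs(p) for p in sorted(Path(folder).glob("*.epub"))}
```

Calling extract_library(".") from the project root would then return the paragraphs of every .epub placed there. Files are read in archive-name order, which is an assumption; a more careful version would follow the reading order declared in the book's package file.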

To explore results

  • Start a local web server in the project root directory using 'python -m SimpleHTTPServer 8001' (Python 2) or 'python -m http.server 8001' (Python 3)
  • Open http://localhost:8001/www/index.html in your web browser
  • Choose a cluster by number to load that cluster in the results table
  • Click a link to open the chapter of the cluster results in the book viewer

Example Output

The result explorer is hosted here, including results and ebook content for public domain books (from Project Gutenberg): https://quanticle.co/projects/epub2vec

Some example clusters as output from early trial runs (each bullet is a paragraph included in the cluster):

The best clusters are topic-centric. This one is loosely "Descriptions/definitions of types of bonds or other assets":
  • Monies issued by national monetary authorities.
  • Bonds secured by specific types of equipment or physical assets.
  • Holding by the central bank of non-domestic currency deposits and non-domestic bonds.
  • Bonds issued by the UK government.
  • A type of non-sovereign bond issued by a state or local government in the United States. It very often (but not always) offers income tax exemptions.
  • A type of non-sovereign bond issued by a state or local government in the United States. It very often (but not always) offers income tax exemptions.
  • In the United States, securities issued by private entities that are not guaranteed by a federal agency or a GSE.
  • A bond issued by a government below the national level, such as a province, region, state, or city.
  • A bond issued by a government below the national level, such as a province, region, state, or city.
  • The most recently issued and most actively traded sovereign securities.
  • The purchase or sale of bonds by the national central bank to implement monetary policy. The bonds traded are usually sovereign bonds issued by the national government.
  • A bond issued by an entity that is either owned or sponsored by a national government. Also called agency bond.
  • A type of central bank policy rate.
  • A bond issued by a national government.
  • A bond issued by a national government.
  • A bond issued by a supranational agency such as the World Bank.
  • Taxes that a government levies on imported goods.
  • Shares that were issued and subsequently repurchased by the company.
  • Shares that were issued and subsequently repurchased by the company.
  • The governing legal credit agreement, typically incorporated by reference in the prospectus. Also called bond indenture.
Some are based on sentence structure. This one collects instances of "See __________":
  • See common shares.
  • See mortgage rate.
  • See direct format.
  • See alternative trading systems.
  • See indirect format.
  • See alternative trading systems.
  • See mortgage rate.
  • See special purpose entity.
Some are based on type of word. This one is loosely "Bayesian equations about orders":
  • P(Order 1 executes | Order 2 executes) = 1
  • P(Order 1 executes and Order 2 executes) = P(Order 1 executes | Order 2 executes)P(Order 2 executes) = 1(0.25) = 0.25
  • P(Order 1 executes or Order 2 executes) = P(Order 1 executes) + P(Order 2 executes) − P(Order 1 executes and Order 2 executes) = 0.35 + 0.25 − 0.25 = 0.35
  • P(Order 1 executes and Order 2 executes) = 0.25 = P(Order 2 executes | Order 1 executes)P(Order 1 executes)
  • 0.25 = P(Order 2 executes | Order 1 executes)(0.35)

And some clusters don't make sense at all ¯\_(ツ)_/¯

Background info

Step                                                  Time
Explode .epubs into folders                           < 30 secs
Extract text from .xhtml files                        1 min
Tokenize sentences from full text                     < 30 secs
Create word2vec model                                 2 min
Extract paragraphs and calculate paragraph vectors    1 min
Run k-means clustering                                9 hours
Export data                                           5 min
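
The clustering step dominates the run time above. To give a feel for what that step involves, here is a minimal pure-Python sketch of standard k-means (Lloyd's algorithm). It is not the script's implementation: the kmeans function, its parameters, and the choice to seed centroids from the first k points are all assumptions made to keep the example small and deterministic.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: List[Vector], k: int, max_iterations: int = 100
) -> Tuple[List[int], List[Vector]]:
    """Cluster points into k groups; return (labels, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: seed centroids with the first k points (deterministic).
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no assignment changed.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]], k=2)
```

Each iteration compares every paragraph vector with every centroid, so the cost grows with the number of paragraphs, the number of clusters, and the vector dimension together.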
