Showcase for 13-class scientific statement classification

Method

LaTeXML converts the source into an HTML5 document
llamapun tokenizes the first paragraph into a plain-text representation with sub-formula lexemes
TensorFlow executes a pre-trained BiLSTM model with 13 classification targets
Rocket serves the classifier as a web service

Details

For the scientific work behind this showcase, please read our paper.

The currently deployed model is a Keras BiLSTM(128)→BiLSTM(64)→LSTM(64), with a Dense(13) softmax output. The model file 13_class_statement_classification_bilstm.pb can be downloaded from this repository via git-lfs. It is compatible with the Rust wrapper for TensorFlow and is compiled to use a CPU implementation of LSTM, as our demo server has no dedicated GPU.

Input words are embedded with the arXMLiv 08.2018 GloVe embeddings, and each paragraph is padded or truncated to a fixed length of 480 words. A paragraph is hence a fixed (480, 300) matrix when it is passed into the first BiLSTM layer.
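The shaping step can be sketched as follows. This is a minimal illustration, not the code used by the demo: the function name paragraph_matrix, the zero vector for unknown words, padding at the end, and truncation that keeps the first 480 tokens are all assumptions.

```python
from typing import Dict, List, Sequence

MAX_WORDS = 480  # fixed paragraph length stated above
EMBED_DIM = 300  # embedding width stated above


def paragraph_matrix(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
) -> List[List[float]]:
    """Map tokens to a fixed MAX_WORDS x EMBED_DIM matrix.

    Assumptions (not confirmed by the source): unknown words and padding
    rows are zero vectors, padding is appended at the end, and truncation
    keeps the first MAX_WORDS tokens.
    """
    zero = [0.0] * EMBED_DIM
    # Truncate first, then look up each remaining token.
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:MAX_WORDS]]
    # Pad with zero rows until the matrix has exactly MAX_WORDS rows.
    rows.extend([0.0] * EMBED_DIM for _ in range(MAX_WORDS - len(rows)))
    return rows
```

With a real embedding table, the dictionary would hold one 300-dimensional vector per vocabulary entry; here any mapping from word to vector works.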

The specific model in this demo was trained on 8.3 million paragraphs from the arXMLiv 08.2018 dataset and tested on a further 2.1 million paragraphs, obtaining a 0.91 F1 score over the 13 target classes. The base rate baseline was 0.38, the frequency of the "proposition" class.

For more experimental details, please see the main experiment repository.

For practical evaluation, a likelihood threshold can be applied: predictions whose highest likelihood falls below the threshold (e.g. <0.3) can be treated as an "other" label.
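A minimal sketch of such a threshold, assuming the model output is a list of 13 softmax likelihoods aligned with a list of label names. The function name label_with_threshold and the placeholder label names are hypothetical; only "proposition" is a class name taken from this README.

```python
from typing import List, Sequence, Tuple


def label_with_threshold(
    likelihoods: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.3,
) -> Tuple[str, float]:
    """Return the top label, or "other" if its likelihood is below threshold."""
    if len(likelihoods) != len(labels):
        raise ValueError("likelihoods and labels must have the same length")
    # Index of the most likely class.
    best = max(range(len(likelihoods)), key=lambda i: likelihoods[i])
    score = likelihoods[best]
    if score < threshold:
        return "other", score
    return labels[best], score


# Placeholder label names: only "proposition" appears in this README.
LABELS: List[str] = ["proposition"] + [f"class_{i}" for i in range(2, 14)]

confident = [0.88] + [0.01] * 12  # clear winner, label is kept
uncertain = [1.0 / 13] * 13       # flat distribution, falls back to "other"
print(label_with_threshold(confident, LABELS))
print(label_with_threshold(uncertain, LABELS))
```

The 0.3 default mirrors the example value above; a suitable threshold for a given application would need to be chosen on held-out data.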

About

A Web Showcase for Scientific Statement Classification

https://corpora.mathweb.org/classify_paragraph

The Unlicense
