maks5507 / cognitive-complexity

Quantile-based approach to estimating cognitive text complexity

Home Page: http://textcomplexity.net


Cognitive Complexity Estimation Framework

Author: Maksim Eremeev (me@maksimeremeev.com)

Research: Konstantin Vorontsov, Maksim Eremeev

Papers:

RANLP paper, Overview in Russian

Interactive Demo: TextComplexity.net

This is a framework for testing and experimenting with text complexity measures, and for building and saving models fitted on various reference collections. The library provides efficient parallel processing of reference collections.
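The approach is quantile-based: roughly, a token counts as complex when its complexity value exceeds the gamma-quantile of the empirical distribution built from the reference collection (see the gamma parameter of predict below). A minimal illustrative sketch of this idea, not the library's implementation (all names here are hypothetical):

import numpy as np

def quantile_text_complexity(token_scores, reference_scores, gamma=0.95):
    # Threshold is the gamma-quantile of the empirical reference distribution.
    threshold = np.quantile(reference_scores, gamma)
    # Score the text by the fraction of tokens exceeding the threshold.
    return float(np.mean(np.asarray(token_scores) > threshold))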

Requirements

  1. python >= 3.6
  2. numpy
  3. nltk
  4. pymorphy2
  5. multiprocessing (part of the Python standard library)

Installation

The framework supports the standard setuptools installation:

python setup.py build
python setup.py install

Progress

  1. Implement the basic ComplexityModel class
  2. Parallelization of fit method
  3. Letter, Syllable, and Word Tokenizers for Russian
  4. Distance-based ComplexityFunction
  5. Morphological and Lexical complexity models
  6. Counter-based ComplexityFunction
  7. Counter-based models
  8. Adaptation of morphological models for English
  9. Syntax models based on UDPipe
  10. Making preprocessing more flexible
  11. setup.py and testing on Ubuntu, OSX
  12. Publishing the Open-Source framework

==== You are here ====

  1. Publishing the ComplexityPipeline implementation to fit the aggregated complexity model
  2. Publishing distributions for all proposed models and validation data
  3. Enhancement of model weights ... (TBD)

Structure

  1. complexity - the main module to import
  2. tokenizers - implementations of the most common tokenizers
  3. functions - implementations of the most common complexity functions
  4. data - all data used for the experiments

Reference Collection Format

ComplexityModel uses a reference collection to build empirical distributions. The reference collection has to be provided in a strictly fixed format (a sample layout is sketched below):

  1. Each document of the collection must be saved in a separate .txt file. The name of the file does not matter.
  2. All files containing documents of the reference collection must be stored in a single directory.
  3. There should be no empty .txt files.
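For example, a collection of three documents might be laid out as follows (the directory and file names are arbitrary):

wikipedia/
├── doc-001.txt
├── doc-002.txt
└── doc-003.txt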

Adjusting the model

A complexity model is a combination of two entities: a Tokenizer and a ComplexityFunction.

Both the Tokenizer and the ComplexityFunction are passed into the constructor of the model.

A Tokenizer is an instance of a class that is required to have a tokenize method.

tokenize(text) takes a single argument, text: a string corresponding to a single document. It returns the list of tokens in the order they appear in the given text. If the text should be preprocessed in some way, the preprocessing steps have to be implemented in the tokenize method.

Example:

class Tokenizer:
    def tokenize(self, text):
        return text.split()

A ComplexityFunction is an instance of an abstract class with a single required method: complexity.

complexity(tokens) takes the output of the tokenize method, i.e. the list of tokens in the order they appear in the source text. The method returns a list of complexity scores, one per token, in the same order.

Example:

class ComplexityFunction:
    def complexity(self, tokens):
        return [len(token) for token in tokens]

Signatures and arguments

Init

ComplexityModel init options:

  1. tokenizer - a Tokenizer instance
  2. complexity_function - a ComplexityFunction instance
  3. alphabet - 'full' if the alphabet consists of more than one token, 'reduced' otherwise. Default: 'full'

Returns: a model instance

Example:

tokenizer = Tokenizer()
complexity_function = ComplexityFunction()
cm = ComplexityModel(tokenizer, complexity_function, alphabet='reduced')

Fit

fit(reference_corpus, n_jobs=4, use_preproc=True, use_stem=True, use_lemm=False, check_length=True, check_stopwords=True)

  1. reference_corpus - path to the directory with the documents of the reference collection. Each document must be stored in a separate *.txt file.
  2. n_jobs - number of processes used to process the collection. Default: 4
  3. use_preproc - flag indicating whether to preprocess the reference collection documents before tokenizing. Default: True
  4. use_stem - flag indicating whether to use stemming when preprocessing the reference collection documents. Default: True
  5. use_lemm - flag indicating whether to use lemmatization when preprocessing the reference collection documents. Default: False
  6. check_length - flag indicating whether to filter out all words shorter than 3 symbols when preprocessing the reference collection documents. Default: True
  7. check_stopwords - flag indicating whether to filter out stopwords when preprocessing the reference collection documents. Default: True

Returns nothing

fit uses multiprocessing to process documents of the reference collection in parallel.
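Conceptually, the parallel pass over the collection resembles the following sketch (illustrative only; the actual implementation may differ):

import os
from multiprocessing import Pool

def _read_document(path):
    # Read a single reference document from disk.
    with open(path, encoding='utf-8') as f:
        return f.read()

def process_collection(reference_corpus, n_jobs=4):
    # Distribute the .txt files of the reference collection across workers.
    paths = [os.path.join(reference_corpus, name)
             for name in os.listdir(reference_corpus)
             if name.endswith('.txt')]
    with Pool(n_jobs) as pool:
        return pool.map(_read_document, paths)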

Example:

cm.fit('/wikipedia', n_jobs=10, use_preproc=False, use_stem=False, use_lemm=False, check_length=False, check_stopwords=False)

Predict

predict(texts, gamma=0.95, weights='mean', p=1, use_preproc=True, use_stem=True, use_lemm=False, check_length=True, check_stopwords=True, exp_weights=False, weights_min_shift=False, normalize=False, return_token_complexities=False)

  1. texts - texts to estimate complexity scores for
  2. gamma - quantile level used as the complexity threshold. Default: 0.95
  3. weights - type of weights to use when computing the score. One of the following options: 'mean', 'total', 'excessive', 'excessive_mean'. Default: 'mean'
  4. p - power of the weights. Default: 1
  5. use_preproc - flag indicating whether to preprocess the text before tokenizing. Must align with the same parameter value used for fitting. Default: True
  6. use_stem - flag indicating whether to use stemming when preprocessing the text. Must align with the same parameter value used for fitting. Default: True
  7. use_lemm - flag indicating whether to use lemmatization when preprocessing the text. Must align with the same parameter value used for fitting. Default: False
  8. check_length - flag indicating whether to filter out words shorter than 3 symbols when preprocessing the text. Must align with the same parameter value used for fitting. Default: True
  9. check_stopwords - flag indicating whether to filter out stopwords when preprocessing the text. Must align with the same parameter value used for fitting. Default: True
  10. exp_weights - flag indicating whether to apply an exponential transformation to the weights. Default: False
  11. weights_min_shift - flag indicating whether to subtract the minimum value from the weights. Default: False
  12. normalize - flag indicating whether to normalize the weights. Default: False
  13. return_token_complexities - flag indicating whether to return per-token complexity scores along with the overall text complexity score. Default: False

Returns a list of scores for the provided texts.
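Example (the texts and parameter values below are arbitrary):

texts = ['First document to score.', 'Second document to score.']
scores = cm.predict(texts, gamma=0.95, weights='mean')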

Accessible examples

All of the following models are described in the papers listed above:

  1. models/letters - distance-based morphological model
    • tokens: letters
    • complexity: distance
  2. models/lexical-distance - distance-based lexical model
    • tokens: words
    • complexity: distance
  3. models/lexical-counter - counter-based lexical model
    • tokens: words
    • complexity: number of occurrences in the reference collection
  4. models/lexical-length - counter-based lexical model
    • tokens: words
    • complexity: length of the word
  5. models/ru-syllab - distance-based morphological model for Russian
    • tokens: syllables
    • complexity: distance
  6. models/ru-syllab-sorted - distance-based morphological model for Russian
    • tokens: sorted syllables
    • complexity: distance
  7. models/en-syllab - distance-based morphological model for English
    • tokens: syllables
    • complexity: distance
  8. models/en-syllab-sorted - distance-based morphological model for English
    • tokens: sorted syllables
    • complexity: distance
  9. models/syntax-length - counter-based syntactic model
    • tokens: sentences
    • complexity: maximum length of the syntactic dependency
  10. models/syntax-pos - distance-based syntactic model
    • tokens: syntagmas
    • complexity: distance

BibTex

@inproceedings{eremeev19ranlp,
    title={Lexical Quantile-Based Text Complexity Measure},
    author={M. A. Eremeev and Konstantin Vorontsov},
    booktitle={RANLP},
    year={2019}
}


License: GNU General Public License v3.0

