Tool for modelling change and persistence in newspaper content. For an exposition of the underlying method, see Persistent News: The Information Dynamics of Nordic Newspapers; for the design, see the News-fluxus design specification.


To run in a virtual environment (recommended), assuming Python 3.7+ is installed:

$ sudo pip3 install virtualenv
$ virtualenv -p /usr/bin/python3.7 venv
$ source venv/bin/activate
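
To confirm that the environment is active before installing anything, a minimal check along these lines can be run inside it. The function names `in_virtualenv` and `python_is_supported` are illustrative and not part of the repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside an environment sys.prefix differs from sys.base_prefix;
    # older releases of the virtualenv tool set sys.real_prefix instead.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def python_is_supported(minimum=(3, 7)) -> bool:
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= tuple(minimum)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("Python 3.7+:", python_is_supported())
```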


Clone repository and install requirements

$ git clone https://github.com/centre-for-humanities-computing/newsFluxus.git
$ cd newsFluxus
$ pip3 install -r requirements.txt

GPU acceleration

Currently the requirements file installs torch and torchvision without GPU support. To use your accelerator(s), comment out torch and torchvision in the requirements file, uninstall them with pip (if already installed), and install a CUDA build instead. The example below targets CUDA 11.0 (cu110); adjust the build tags to your CUDA version.

$ pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
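
After reinstalling, it can be useful to check whether PyTorch actually sees a GPU. The sketch below uses only `torch.cuda.is_available()` and `torch.cuda.get_device_name()`; the helper name `describe_torch_device` is hypothetical and not part of the repository.

```python
def describe_torch_device() -> str:
    """Report whether PyTorch can see a CUDA device, without failing if torch is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Index 0 is the first visible GPU.
        return "CUDA available: " + torch.cuda.get_device_name(0)
    return "CUDA not available, running on CPU"


if __name__ == "__main__":
    print(describe_torch_device())
```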

Install Mallet

Clone and install Mallet (plus dependencies)

$ sudo apt-get install default-jdk
$ sudo apt-get install ant
$ git clone git@github.com:mimno/Mallet.git
$ cd Mallet/
$ ant

Change the path to the local Mallet installation in src/tekisuto/models/latentsemantics.py
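
Before editing the path, a quick check that it points at an executable file can save a confusing failure later. This is a minimal sketch; `check_mallet_binary` is a hypothetical helper, and the location of the launcher inside your Mallet checkout (often `bin/mallet`) is an assumption about your layout.

```python
import os


def check_mallet_binary(path: str) -> str:
    """Return the absolute path to the Mallet launcher, or raise if it is unusable."""
    full = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full):
        raise FileNotFoundError(f"no file at {full}")
    if not os.access(full, os.X_OK):
        raise PermissionError(f"{full} is not executable")
    return full
```

Calling `check_mallet_binary("/path/to/mallet/binary")` with your own path either returns the resolved path or raises an error that names the problem.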

Test Mallet wrapper

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> # gensim.models.wrappers was removed in gensim 4.0, so this import needs gensim < 4.0
>>> from gensim.models.wrappers import LdaMallet

>>> path_to_mallet_binary = "/path/to/mallet/binary"
>>> model = LdaMallet(path_to_mallet_binary, corpus=common_corpus, num_topics=20, id2word=common_dictionary)

Download language resources

$ python downloader.py --language <language-code>
# e.g. for Danish language resources
$ python downloader.py --language da

You will be prompted for a location to store the data; the default is fine. To find language codes, see Stanza.

Test Stanza Installation

>>> import stanza

>>> nlp = stanza.Pipeline(lang="da")
>>> doc = nlp("Rap! rap! sagde hun, og så rappede de sig alt hvad de kunne, og så til alle sider under de grønne blade, og moderen lod dem se så meget de ville, for det grønne er godt for øjnene.")
>>> doc.sentences[0].print_dependencies()

Train model and extract signal

$ bash main.sh

Or run each step individually

$ python src/bow_mdl.py --dataset <path-to-dataset> --language <language-code> --bytestore <frequency-of-backup> --sourcename <name-of-dataset> --estimate "<start stop step>" --verbose <frequency-of-log>
$ python src/signal_extraction.py --model <path-to-serialized-model>
# ex. for Danish sample
$ python bow_mdl.py --dataset ../dat/sample.ndjson --language da --bytestore 100 --estimate "20 50 10" --sourcename sample --verbose 100
$ python src/signal_extraction.py --model mdl/da_sample_model.pcl
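
The --estimate option takes three integers in one quoted string. The README does not say how they are interpreted. Assuming they describe a grid of candidate values with an exclusive stop, as in Python's `range`, the example value "20 50 10" would expand as sketched below. The helper `parse_estimate` is hypothetical and is not taken from the repository.

```python
def parse_estimate(spec: str) -> list:
    """Expand a "<start stop step>" string into a list of candidate values.

    Assumption: stop is exclusive, mirroring range(); the actual script may differ.
    """
    parts = spec.split()
    if len(parts) != 3:
        raise ValueError("expected three integers: start stop step")
    start, stop, step = (int(x) for x in parts)
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, stop, step))


if __name__ == "__main__":
    print(parse_estimate("20 50 10"))  # [20, 30, 40]
```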

Research use-case

Requires matplotlib

$ python src/news_uncertainty.py --dataset mdl/da_sample_signal.json --window 7 --figure "fig"

The resulting visualizations are saved in fig/
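
The README does not describe what --window controls. One common way to measure change and persistence over a sequence of topic distributions is windowed novelty and resonance based on relative entropy. The sketch below illustrates that general formulation only; whether src/news_uncertainty.py computes exactly this is an assumption, and the function names are hypothetical.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def novelty_resonance(dists, window):
    """Return (novelty, resonance) for positions with a full window on both sides.

    novelty:   mean divergence of document j from the `window` documents before it
    resonance: novelty minus the mean divergence from the `window` documents after it
    """
    novelty, resonance = [], []
    for j in range(window, len(dists) - window):
        past = sum(kld(dists[j], dists[j - d]) for d in range(1, window + 1)) / window
        future = sum(kld(dists[j], dists[j + d]) for d in range(1, window + 1)) / window
        novelty.append(past)
        resonance.append(past - future)
    return novelty, resonance
```

Under this formulation, a document that departs from its past and is followed by similar documents scores high on both novelty and resonance; a larger window smooths both signals over more neighbours.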


Edition  Date             Comment
v1.0     June 04 2020     Launch
v1.1     January 14 2020  New NLP pipeline


Kristoffer L. Nielbo


This project is licensed under the MIT License - see the LICENSE.md file for details


Thanks to Stopwords ISO for their multilingual collection of stopwords.


