StatsChat

Prototype search engine for ONS bulletins

Code state

Please be aware that, for development purposes, these experiments use experimental Large Language Models (LLMs) not intended for production. They can produce inaccurate information, hallucinated statements, and offensive text, either by chance or in response to malicious prompts.

Tested on macOS only

Peer-reviewed

Depends on external APIs

Under development

Experimental

Introduction

This is an experimental application for semantic search of ONS statistical publications. It uses LangChain to implement a fairly simple embedding search and QA information retrieval process. Upon receiving a query, documents are returned as search results, with embedding similarity used to score relevance. Additionally, the relevant text is passed to a locally hosted Large Language Model (LLM), which is prompted to write an answer to the original question, if it can, using only the information contained within the documents.
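To illustrate the retrieve-then-read flow described above, here is a minimal sketch; it is not the project's actual pipeline code, the store path and model name are assumptions, and LangChain's import paths vary between versions:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load the locally built embedding store (path and embedding model are assumptions).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
docstore = FAISS.load_local("data/db_langchain", embeddings)

# Retrieve candidate documents, scored by embedding (cosine) distance.
query = "How many people were in employment last quarter?"
results = docstore.similarity_search_with_score(query, k=10)

# Build a grounded prompt from the most relevant passages for the local LLM.
context = "\n\n".join(doc.page_content for doc, _score in results[:3])
prompt = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you cannot answer.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)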

For this prototype, the program runs entirely locally: relevant web pages are scraped and the data stored in data/bulletins; the docstore / embedding store that is created likewise lives in local folders and files; and the LLM and all other code run in memory on your desktop or laptop.

The search program should be able to run on a system with 16GB of RAM. The LLM is set up to run on CPU at this research stage. Different models from the Hugging Face repository can be specified for the search and QA functions.
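For example, the embedding model used for search and the generative model used for QA could be pointed at different Hugging Face identifiers, with generation kept on CPU. The sketch below is illustrative only; the model names are placeholders, not the ones configured in this project:

from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Placeholder Hugging Face identifiers; different models can be chosen for search vs. QA.
SEARCH_EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
QA_GENERATIVE_MODEL = "google/flan-t5-base"

# Embedding model used to score document relevance for search.
embedder = SentenceTransformer(SEARCH_EMBEDDING_MODEL)
query_vector = embedder.encode("employment statistics")

# device=-1 keeps the generative QA model on CPU, matching the research-stage setup described above.
generator = pipeline("text2text-generation", model=QA_GENERATIVE_MODEL, device=-1)
print(generator("Question: what does ONS stand for? Answer:", max_new_tokens=50)[0]["generated_text"])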

Installation

The project requires specific versions of some packages, so it is recommended to set up a virtual environment. Using venv and pip:

python3.10 -m venv env
source env/bin/activate

python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Pre-commit actions

This repository contains a configuration of pre-commit hooks. These are language-agnostic and focussed on repository security (such as detection of passwords and API keys). If approaching this project as a developer, you are encouraged to install and enable the pre-commit hooks by running the following in your shell:

  1. Install pre-commit:

    pip install pre-commit
  2. Enable pre-commit:

    pre-commit install

Once pre-commits are activated, whenever you commit to this repository a series of checks will be executed. The pre-commits include checking for security keys, large files and unresolved merge conflict headers. The use of active pre-commits is highly encouraged, and the given hooks can be expanded with Python- or R-specific hooks that automate code style and linting. For example, the flake8 and black hooks are useful for maintaining consistent Python code formatting.

NOTE: Pre-commit hooks execute Python, so they expect a working Python installation.

Usage

By default, flask will look for a file called app.py; you can also name a specific Python program to run. With --debug in play, flask will restart every time it detects a saved change in the underlying Python files. The first time you run the app, any ML models specified in the code will be downloaded to your machine; this will use a few GB of data and take a few minutes. App and search pipeline parameters are stored in app_config.toml and can be updated by editing that file.

We have included three EXAMPLE scraped data files in data/bulletins so that the preprocessing and app can be run as a small example system without waiting on webscraping.

To webscrape the source documents from ONS

We have removed this script and, for the sake of demonstration, included some example scrape results so that the process can be continued from the next step below.

# python statschat/webscraping/main.py

To create a local document store

python statschat/preprocess.py

To run the interactive app

flask --debug run

or

python app.py

The flask app is set to respond to HTTP requests on port 5000. To use the UI, navigate in your browser to http://localhost:5000.

The default API URL is http://localhost:5000/api. See the API endpoint documentation for more details (note: this is a work in progress).
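A hypothetical call to the API might look like the sketch below; the endpoint path and query parameter are illustrative placeholders, so check the API endpoint documentation for the real routes:

import requests

# Hypothetical endpoint and parameter names; the real routes are in the API documentation.
response = requests.get(
    "http://localhost:5000/api/search",
    params={"q": "What is the latest employment rate?"},
)
response.raise_for_status()
print(response.json())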

Search engine parameters

There are some key parameters in app_config.toml that we're experimenting with to improve the search results, and the generated text answer. The current values are initial guesses:

Parameter            | Current value | Function
k_docs               | 10            | Maximum number of search results to return
similarity_threshold | 1.0           | Cosine distance threshold; a document is returned only if its distance is equal to or lower than this value
k_contexts           | 3             | Number of top documents to pass to the generative QA LLM
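As a sketch of how these parameters might be read and applied (using Python's tomllib here; the actual key layout inside app_config.toml may differ, so treat the lookups below as assumptions):

import tomllib  # Python 3.11+; on 3.10, the third-party tomli package offers the same interface

with open("app_config.toml", "rb") as f:
    config = tomllib.load(f)

# Fall back to the documented defaults if the keys sit elsewhere in the file.
k_docs = config.get("k_docs", 10)
similarity_threshold = config.get("similarity_threshold", 1.0)
k_contexts = config.get("k_contexts", 3)

# Illustrative scored results as (document, cosine distance) pairs.
scored_results = [("doc_a", 0.35), ("doc_b", 0.80), ("doc_c", 1.20)]

# Keep at most k_docs results whose distance is at or below the threshold,
# then pass only the top k_contexts of those to the generative QA LLM.
kept = [doc for doc, dist in scored_results if dist <= similarity_threshold][:k_docs]
contexts = kept[:k_contexts]
print(contexts)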

Alternatively, to run the search evaluation pipeline

The StatsChat pipeline is currently evaluated on a small number of test questions. The main app_config.toml determines the pipeline settings used in evaluation, and results are written to the data/model_evaluation folder. The evaluation script requires the project root (the assumed working directory) to be added to PYTHONPATH; this is handled through direnv and the .envrc file.

python statschat/model_evaluation/evaluation.py

Testing

The preferred unit testing framework is pytest:

pytest
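For instance, a minimal test in this style could check that the search parameters described above are present somewhere in app_config.toml (an illustrative sketch only, not one of the repository's existing tests):

import tomllib  # Python 3.11+; on 3.10, use the tomli package instead


def _collect_keys(table):
    # Recursively gather all key names from a parsed TOML table.
    keys = set()
    for key, value in table.items():
        keys.add(key)
        if isinstance(value, dict):
            keys |= _collect_keys(value)
    return keys


def test_app_config_defines_search_parameters():
    with open("app_config.toml", "rb") as f:
        config = tomllib.load(f)
    assert {"k_docs", "similarity_threshold", "k_contexts"} <= _collect_keys(config)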

Data Science Campus

At the Data Science Campus we apply data science, and build skills, for public good across the UK and internationally. Get in touch with the Campus at datasciencecampus@ons.gov.uk.

License

The code, unless otherwise stated, is released under the MIT License.

The documentation for this work is subject to © Crown copyright and is available under the terms of the Open Government Licence v3.0.
