vkz / cohelm

Try it

  1. Navigate to https://cohelm.fullmeta.co.uk/
  2. Upload a sample medical record
  3. Wait … wait … then wait some more.

I opted to play with the OpenAI Assistants API, which is very much BETA and it shows: it is half-baked and excruciatingly slow.

See **Notes** below for more details.

Install

If I were you, I’d just try it at https://cohelm.fullmeta.co.uk/ and not bother running locally.

Well, if you insist …

Install direnv

# on Mac
$ brew install direnv

# on Debian or Ubuntu
$ apt install direnv
$ apt install python3.11-venv

Enable direnv in your .zshrc or .bashrc

Add the following to your .zshrc or .bashrc accordingly:

# ~/.bashrc
# ---------
eval "$(direnv hook bash)"

# bash expands $(...) in PS1 by default; PROMPT_SUBST is only needed in zsh

show_virtual_env() {
  if [[ -n "$VIRTUAL_ENV" && -n "$DIRENV_DIR" ]]; then
    echo "($(basename $VIRTUAL_ENV))"
  fi
}
PS1='$(show_virtual_env)'$PS1


# ~/.zshrc
# --------
eval "$(direnv hook zsh)"

setopt PROMPT_SUBST

show_virtual_env() {
  if [[ -n "$VIRTUAL_ENV" && -n "$DIRENV_DIR" ]]; then
    echo "($(basename $VIRTUAL_ENV))"
  fi
}
PS1='$(show_virtual_env)'$PS1

Clone repo, install deps, activate venv

Clone the repo, activate the virtual environment, and install dependencies:

$ git clone git@github.com:vkz/cohelm.git
$ cd cohelm
# you only need to allow once
$ direnv allow
$ python -m ensurepip --upgrade
# assuming direnv has been set up correctly, the virtual env will activate
$ pip3 install -r requirements.txt

OpenAI token

Add your OpenAI API token to .env

OPENAI_API_KEY=sk-...

Start server

Start the server

$ python manage.py runserver 0.0.0.0:8000

Browse

http://localhost:8000

Notes

Where’s the meat of it?

app/ai.py, app/views.py, prompts/

Have you accomplished all tasks?

No. I only managed to squeeze a few hours here and there, but it is mostly done.

Other than polish and making it a bit more robust, the missing bit is the final verdict, which amounts to:

  • parse the guidelines response, where each “line” is a JSON object,
  • combine them into the correct boolean expression, i.e. a disjunction of conjunctions as per the colonoscopy guidelines (see the sketch below).
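
A minimal sketch of that final step, assuming each parsed “line” is a JSON object with hypothetical group and met fields (the real field names live in prompts/):

import json

def final_verdict(response_lines):
    # Combine per-criterion answers into a disjunction of conjunctions:
    # the guideline is satisfied if every criterion in at least one group is met.
    # The "group" / "met" field names are hypothetical stand-ins for the
    # templates in prompts/.
    groups = {}
    for line in response_lines:
        obj = json.loads(line)
        groups.setdefault(obj["group"], []).append(bool(obj["met"]))
    return any(all(flags) for flags in groups.values())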

Why hasn’t this been done?

Mostly lack of time, but it also requires the LLM to reliably and robustly return valid JSON as per our templates. The brand-new gpt-3.5-turbo-1106 is a bit janky there.

Approach taken and LLM robustness

What would a typical approach look like?

  1. ingest and generate embeddings from medical records
  2. use OpenAI API chat completions to get answers

Medical records are so small that honestly vectorizing them is overkill. All prompts, guidelines, and a medical record safely fit into the context window of gpt-3.5 and gpt-4, so a single chat completion with everything in context would do (see the sketch below).
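
A minimal sketch of that simpler route, assuming the OpenAI Python SDK (v1) and placeholder strings for the guideline and record text:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

guidelines_text = "..."  # placeholder: full colonoscopy guidelines text
record_text = "..."      # placeholder: full medical record text

completion = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[
        {"role": "system",
         "content": "Answer strictly based on the supplied guidelines and medical record."},
        {"role": "user",
         "content": f"Guidelines:\n{guidelines_text}\n\nMedical record:\n{record_text}\n\n"
                    "Is a colonoscopy indicated?"},
    ],
)
print(completion.choices[0].message.content)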

What approach did we use?

I wanted to try something new, so I opted to play with the shiny new OpenAI Assistants API.

Mostly it saves us from doing data extraction and embeddings ourselves, because it takes care of that part “automagically”.

It turned out to be a mixed bag. Very much BETA:

  • some endpoints are missing from the official SDK,
  • runs are excruciatingly slow,
  • runs require polling (see the sketch below).
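
A minimal sketch of the polling dance, assuming the OpenAI Python SDK (v1) beta Assistants endpoints and a placeholder assistant ID created ahead of time:

import time
from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder: an assistant created ahead of time

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Does the attached medical record satisfy the colonoscopy guidelines?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=ASSISTANT_ID)

# no push notifications in the beta: poll until the run settles
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)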

Prompts and robust results

prompts/ will show you the approach I took, namely forcing the LLM to return well-formed JSON objects. In my experience this is a hit-and-miss approach, but it can be made to work somewhat reliably. Essentially, we want the LLM to always respond with valid JSON based on the JSON template we provide. To make it more reliable we’d want to introduce basic validation followed by possible retries:

  1. try to parse and validate the LLM response as JSON. My code does that.
  2. if the response is invalid, trigger a retry. I’ve not had time to do that bit, but it is easy enough (see the sketch below).
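
A minimal sketch of that validate-and-retry loop, assuming a hypothetical ask_llm(prompt) helper that wraps whichever completion call is in use, and hypothetical required_keys matching the templates in prompts/:

import json

def ask_json(prompt, ask_llm, max_retries=2,
             required_keys=("question", "answer", "evidence")):
    # ask_llm and required_keys are hypothetical stand-ins for the real
    # helper and the fields defined by the templates in prompts/
    for attempt in range(max_retries + 1):
        reply = ask_llm(prompt)
        try:
            obj = json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        if all(key in obj for key in required_keys):
            return obj
        # valid JSON but missing required fields: retry as well
    raise ValueError("LLM failed to return valid JSON after retries")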

Instead of JSON we could have used Pydantic models or some other structured format. That wouldn’t significantly improve the situation, if at all.

Alternative approach to robust results

For lack of time, and since I was already playing with an API that’s new to me, I opted not to take the approach I know from experience to be more reliable. That is, instead of forcing the reply to be a valid structure like JSON, we supply a handful of functions that the LLM must call to provide results, e.g.:

  • result(true | false) would unambiguously answer whatever question we asked,
  • evidence("" | "quote from the medical record when available"); see the sketch below.
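
A minimal sketch of what this could look like with OpenAI function calling via the chat completions tools parameter; the tool definitions here are illustrative, not the ones used in this repo:

from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "result",
            "description": "Report whether the guideline criterion is met.",
            "parameters": {
                "type": "object",
                "properties": {"met": {"type": "boolean"}},
                "required": ["met"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "evidence",
            "description": "Quote supporting text from the medical record, or an empty string.",
            "parameters": {
                "type": "object",
                "properties": {"quote": {"type": "string"}},
                "required": ["quote"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Does the record document a prior colonoscopy?"}],
    tools=tools,
    tool_choice="auto",
)
# the model answers by "calling" result/evidence with structured arguments
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)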

Hallucinations

That one is tricky. By asking for evidence as quotes or citations we address it to some degree, but certainly not 100%. Other than improving prompts, one possible approach could be to lean more on vector “distance” of embeddings, e.g. whenever the LLM provides “evidence”, we search for embeddings that are “very close” and only accept such “evidence”. Alternatively, we could supply a few embeddings that are close to the question being asked and only accept “evidence” that matches one of the supplied embeddings verbatim. A sketch of the first idea follows.
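
A minimal sketch of that first idea, assuming the OpenAI embeddings endpoint, a pre-chunked medical record, and a made-up similarity threshold that would need tuning:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    # embed a piece of text with the OpenAI embeddings endpoint
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def evidence_is_grounded(evidence, record_chunks, threshold=0.90):
    # accept LLM "evidence" only if it sits very close to some chunk of the
    # medical record; threshold is a hypothetical cosine-similarity cutoff
    ev = embed(evidence)
    for chunk in record_chunks:
        ch = embed(chunk)
        cosine = float(ev @ ch / (np.linalg.norm(ev) * np.linalg.norm(ch)))
        if cosine >= threshold:
            return True
    return False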

Whatever the approach, it’ll require lots of experiments.
