Exploration of the Simulacrum Artificial Patient Cancer Data

Introduction

For context on the work in this repo, see our report, projectreport.pdf, and the slides from our presentation, presentation.pptx (both above).

Getting Started

Setting up the repository

Create a directory which will hold your simulacrum project and navigate to it.

mkdir simulacrum
cd simulacrum

Clone the EDA repository inside your project directory:

git clone git@github.com:UCL-simulacrum/EDA.git

Getting the data

Download and unzip the Simulacrum dataset in your project directory (e.g. ./simulacrum/). On macOS, if wget is not installed, you can install it with brew install wget, which uses the Homebrew package manager.

Note that the -d option tells unzip to unpack the contents into the specified directory.

wget https://simulacrum.healthdatainsight.org.uk/releases/simulacrum_release_v1.1.0.zip
unzip simulacrum_release_v1.1.0.zip -d simulacrum_release_v1.1.0

Alternatively you can download the data here.

NB: The data should be unzipped into the parent directory of EDA, so that EDA and simulacrum_release_v1.1.0 sit side by side. Your directory structure should look something like this:

simulacrum
│
├───EDA
│   │   cancerdata_EDA.ipynb
│   │   patientpathways.ipynb
│   │   ...
│
└───simulacrum_release_v1.1.0
    │   create_table_sim_av_patient.sql
    │   create_table_sim_av_tumour.sql
    │   ...
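
The layout above can be checked before going further. The following is a minimal sketch, assuming a POSIX-style shell; check_layout is a hypothetical helper name and is not part of the EDA repository. It only tests that the two directories exist side by side under the given project directory.

```shell
# Sketch: confirm that EDA and the unzipped release are siblings.
# check_layout is a hypothetical helper, not something shipped with the repo.
check_layout() {
    local project_dir="${1:-.}"   # defaults to the current directory
    local missing=0
    local dir
    for dir in EDA simulacrum_release_v1.1.0; do
        if [ -d "$project_dir/$dir" ]; then
            echo "ok:      $dir"
        else
            echo "missing: $dir"
            missing=1
        fi
    done
    # Non-zero exit status if either directory is absent.
    return "$missing"
}
```

Run from inside the simulacrum directory, check_layout . should report both directories as ok; a missing line usually means the data was unzipped somewhere other than the parent directory of EDA.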

Creating a conda environment

The steps below create a conda development environment with the project's dependencies installed.

Update conda and append the conda-forge channel (missingno is not available through the default channels):

conda update conda
conda config --append channels conda-forge

Now create a conda environment called simulacrum using the environment yaml file for this project. Note that the environment name "simulacrum" is defined inside the .yml file.

conda env create -f environment.yml
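
Because the environment name comes from the .yml file rather than from the command line, it can help to confirm it before activating. This is a small sketch, assuming the file has a top-level name: key on its own line; env_name_from_yaml is a hypothetical helper name.

```shell
# Sketch: print the environment name declared in a conda environment file.
# env_name_from_yaml is a hypothetical helper; it assumes a top-level
# "name:" key and does not handle quoted values or trailing comments.
env_name_from_yaml() {
    sed -n 's/^name:[[:space:]]*//p' "${1:-environment.yml}" | head -n 1
}
```

After the environment has been created, the same name should appear in the output of conda env list.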

Then activate the environment (with conda 4.4 or later, conda activate simulacrum is the preferred form of the command below):

source activate simulacrum
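
Before installing anything further, it may be worth confirming that the shell really is inside the new environment, so that later pip installs do not land in the base environment. The sketch below relies on the CONDA_DEFAULT_ENV variable, which conda sets on activation; in_simulacrum_env is a hypothetical helper name.

```shell
# Sketch: check that the expected conda environment is active.
# in_simulacrum_env is a hypothetical helper name.
# CONDA_DEFAULT_ENV is set by conda when an environment is activated.
in_simulacrum_env() {
    local expected="${1:-simulacrum}"
    if [ "${CONDA_DEFAULT_ENV:-}" = "$expected" ]; then
        echo "active environment: $expected"
    else
        echo "expected '$expected', found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}
```

A non-zero exit status means activation did not take effect in the current shell.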

Once the environment is active, install the additional packages that one of the notebooks needs:

pip install keras tensorflow

Start Jupyter

jupyter notebook

Viewing the Jupyter Notebooks

Suggested viewing order of the notebooks (notebooks with Plotly graphs do not render directly on GitHub, so view them via nbviewer.jupyter.org instead):

example_sql_queries.ipynb
cancerdata_EDA.ipynb (via nbviewer)
sact_regimen_study.ipynb
patientpathways.ipynb (via nbviewer)
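
For the two notebooks marked above, an nbviewer address can be built from the GitHub repository path. The sketch below is illustrative only: nbviewer_url is a hypothetical helper, the repository slug is taken from the clone command earlier in this README, and the branch name master and the notebooks' location at the repository root are assumptions.

```shell
# Sketch: build nbviewer links for notebooks hosted on GitHub.
# nbviewer_url is a hypothetical helper. The slug UCL-simulacrum/EDA comes
# from the clone command; the default branch "master" is an assumption.
nbviewer_url() {
    local notebook="$1"
    local branch="${2:-master}"
    echo "https://nbviewer.jupyter.org/github/UCL-simulacrum/EDA/blob/${branch}/${notebook}"
}

# Print links for the notebooks that contain Plotly graphs.
for nb in cancerdata_EDA.ipynb patientpathways.ipynb; do
    nbviewer_url "$nb"
done
```

If the repository uses a different default branch, pass it as the second argument.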
