Detailed documentation is available here.
IMPORTANT: If you used or are using a version of this repo from before 19.02.2021, an error in date processing may have caused wrongly assigned dates for health outcomes. Please re-download and re-process any data processed with `addNewHES.py`.
This is the in-development version and major changes and corrections may be made - use at your own risk! Please share comments, suggestions and errors/bugs found, either directly on the GitHub page or by emailing rosemary.walmsley@gtc.ox.ac.uk.
This usage tutorial assumes you have downloaded and extracted a `.csv` file containing participant data and a `hesin_all.csv` file with health record data from UK Biobank. The download folder contains guidance on how to download these.
To use this repo, run:
$ git clone git@github.com:activityMonitoring/ukb_download_and_prep_template
This repo requires `pandas` and `nltk`. If you are using an Anaconda installation of Python, these are included. Otherwise, run:
$ pip install pandas
$ pip install nltk
Navigate to the repo:
$ cd ukb_download_and_prep_template
You should have a `ukb12345.csv` participant data file which looks something like this:
eid | 31-0.0 | 34-0.0 | 54-0.0 | ... |
---|---|---|---|---|
4987419 | 0 | 1944 | 11016 | ... |
2898413 | 0 | 1956 | 11009 | ... |
1049655 | 1 | 1947 | 11010 | ... |
1892589 | 1 | 1941 | 11011 | ... |
2449164 | 1 | 1958 | 11010 | ... |
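
The column headers in the table above follow UK Biobank's `field-instance.array` naming, so `31-0.0` is field 31, instance 0, array index 0. As a minimal sketch of how such a header can be split (the helper name `parse_ukb_column` is hypothetical and is not part of this repo):

```python
import re

# Hypothetical helper, not part of this repo: splits a UK Biobank
# column header of the form "<field>-<instance>.<array>".
_UKB_COLUMN = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def parse_ukb_column(name):
    """Return (field_id, instance, array_index), or None for non-field columns such as 'eid'."""
    match = _UKB_COLUMN.match(name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


for header in ["eid", "31-0.0", "34-0.0", "54-0.0"]:
    print(header, parse_ukb_column(header))
```

Headers that do not match the pattern, such as the participant identifier `eid`, are returned as `None` so they can be handled separately.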
The next step towards having ready-to-use data is to filter out some columns and parse the field IDs and categorical codes.
- Auto-generate a `columns.json` file from the text file of field IDs (in the format used in download_participant_data):
$ python writeColumnsFile.py --columnsFile analysisCols.txt
- Run:
$ python filterUKB.py ukb12345.csv -o outputFilename.csv
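
To give a feel for what column filtering means here, the following is a minimal pandas sketch. It is not the implementation of `filterUKB.py`; the toy data and the `wanted` column list are assumptions chosen purely for illustration.

```python
import io

import pandas as pd

# Toy stand-in for ukb12345.csv, using two rows from the example table.
raw_csv = io.StringIO(
    "eid,31-0.0,34-0.0,54-0.0\n"
    "4987419,0,1944,11016\n"
    "2898413,0,1956,11009\n"
)

# Assumed, illustrative selection of columns to keep.
wanted = ["eid", "31-0.0", "34-0.0"]

# usecols reads only the requested columns, which keeps memory use low on wide files.
filtered = pd.read_csv(raw_csv, usecols=wanted)
print(filtered)
```

Reading only the needed columns matters in practice because UK Biobank participant files can contain thousands of columns.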
We now add columns on disease diagnoses in hospital. You will need:
- `hesin_all.csv`: a file containing Hospital Episode Statistics (HES) data for all participants.
- `icdGroups.json`: a JSON file containing descriptions of the required HES codes.
- An existing dataset `input.csv` (which might be `outputFilename.csv` from the last section).
- If you want to define prevalent and incident disease, `input.csv` should also contain a date column which will be used to define this.
Then run:
$ python3 addNewHES.py input.csv hesin_all.csv output.csv icdGroups.json --incident_prevalent True --date_column 'name_of_date_column'
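
The date column acts as a reference point for splitting diagnoses into prevalent and incident. The sketch below assumes a common convention (a first diagnosis on or before the reference date is prevalent, a later one is incident); the exact rule used by `addNewHES.py` may differ, and the function name `classify_diagnosis` is hypothetical.

```python
from datetime import date


def classify_diagnosis(first_diagnosis, reference_date):
    """Assumed convention: on or before the reference date is prevalent, after it is incident."""
    if first_diagnosis is None:
        # No hospital record of the disease for this participant.
        return "none"
    return "prevalent" if first_diagnosis <= reference_date else "incident"


baseline = date(2010, 6, 1)  # illustrative value of the date column
print(classify_diagnosis(date(2008, 3, 15), baseline))
print(classify_diagnosis(date(2014, 9, 2), baseline))
print(classify_diagnosis(None, baseline))
```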