hetio / het.io

Source code for https://het.io website

header info in input files

jcbarret opened this issue · comments

I'm looking at files at http://het.io/disease-genes/downloads/ and am wondering if there's a key to the headers of the different input files? For example, https://raw.githubusercontent.com/dhimmel/het.io-dag-data/d8028c8820322ae4ad7642998bccc3ee7318ff16/downloads/diseases.txt has columns HC-P, HC-S, LC-P, LC-S but I'm not sure what they are. Sorry if this is obvious somewhere, but I couldn't find it after some searching.

The S6 Data caption from the associated PLOS Computational Biology paper is slightly more helpful:

An extended version of Table 3 including all diseases with at least one GWAS-Catalog-extracted association. The manual pathophysiology classification is included.

The caption for Table 3 is:

Diseases. Associations were predicted for 29 diseases with at least 10 positives. For these diseases, the number of high-confidence primary (HC-P), high-confidence secondary (HC-S), low-confidence primary (LC-P), and low-confidence secondary associations (LC-S) that were extracted from the GWAS Catalog is indicated.

So hopefully that answers your question regarding diseases.txt. See the Associations Method section of the paper for more on how disease-gene associations were extracted from the GWAS Catalog and what HC-P, HC-S, LC-P, and LC-S mean.
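As a rough illustration of the column key above, here is a minimal Python sketch that maps the four abbreviations to the meanings given in the Table 3 caption. It assumes the file is tab-delimited and that each of the four columns holds an integer count; the `disease_name` column and the sample row are hypothetical placeholders, not real contents of diseases.txt.

```python
import csv
import io

# Column key taken from the Table 3 caption quoted above.
COLUMN_KEY = {
    "HC-P": "high-confidence primary",
    "HC-S": "high-confidence secondary",
    "LC-P": "low-confidence primary",
    "LC-S": "low-confidence secondary",
}


def describe_columns(header):
    """Map each column name to its meaning, or 'undocumented' if unknown."""
    return {name: COLUMN_KEY.get(name, "undocumented") for name in header}


def total_associations(row):
    """Sum the four association counts for one disease row."""
    return sum(int(row[column]) for column in COLUMN_KEY)


# Hypothetical sample: the 'disease_name' column and all values are made up
# for illustration. Tab delimiting is an assumption about the real file.
SAMPLE = "disease_name\tHC-P\tHC-S\tLC-P\tLC-S\nexample disease\t12\t3\t5\t1\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
described = describe_columns(rows[0].keys())
total = total_associations(rows[0])

print(described)
print(total)
```

To try this on the real file, replace `io.StringIO(SAMPLE)` with an open file handle for a downloaded copy of diseases.txt, and check the actual header first, since columns other than the four listed here are not documented in this thread.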

Note that the files available at http://het.io/disease-genes/downloads/ are from our 2015 study to predict disease-associated genes. Most users will be more interested in Hetionet v1.0, which is available at https://neo4j.het.io (down right now, I will fix it) and at https://github.com/dhimmel/hetionet. This hetnet is described in our 2017 eLife study, Project Rephetio. That project has much more detailed supplementary methods, since we discussed all code and data on Thinklab while carrying it out. For example, see this discussion for how we processed the GWAS Catalog to get gene-disease associations in Project Rephetio. The method was very similar to the one we used in the predecessor study, which created the diseases.txt file mentioned above.

More generally, @jcbarret correctly points out that the table columns are not well documented for the files at http://het.io/disease-genes/downloads/. I don't have any immediate plans to fix this, but I encourage users to post GitHub issues with any questions. At some point in the future, I'd like to revamp the het.io website and may address some of these issues then.

We're moving the downloads page for the disease-genes study to GitHub from https://het.io/disease-genes/downloads/.

The README (pinned version) now shows the first two rows of each table for convenience. While the columns are still not fully documented, I will close this for now. Happy to elaborate on column meanings as requested. As I noted above, most users will probably be interested in the newer Hetionet data instead.
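For anyone who wants a similar preview of a downloaded table locally, here is a minimal sketch. It is not the code used to build the README; it assumes tab-delimited text with a header line, and the sample content and the `preview_rows` helper name are hypothetical.

```python
import io
from itertools import islice


def preview_rows(lines, n_rows=2):
    """Return the header plus the first n_rows data rows, split on tabs."""
    head = islice(lines, n_rows + 1)  # +1 keeps the header line
    return [line.rstrip("\n").split("\t") for line in head]


# Hypothetical sample content; swap in an open file handle for a real table.
SAMPLE = "col_a\tcol_b\n1\t2\n3\t4\n5\t6\n"

preview = preview_rows(io.StringIO(SAMPLE))
for fields in preview:
    print(fields)
```

Because `islice` reads only the lines it needs, this works on large tables without loading the whole file.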