kevinluolk / formatbook

A collection of commonly used formats for quick lookup.

FormatBook

This is a companion repo for gwaslab.

A collection of commonly used formats for GWAS summary statistics.

All the formats are stored as json files.

Each format consists of the following info (manually curated):

  1. meta_data: metadata, including the software name, source URLs, version and so on.
  2. format_dict: a dictionary that maps column names in the target format to gwaslab column names.

For example, the format for the METAL software:

{
"meta_data":{"format_name":"metal",
            "format_source":"https://genome.sph.umich.edu/wiki/METAL_Documentation",
            "format_version":"20220726"
            },
"format_dict":{
            "MarkerName":"SNPID",
            "Allele1":"EA",
            "Allele2":"NEA",
            "Freq1":"EAF",
            "Effect":"BETA",
            "StdErr":"SE",
            "P-value":"P",
            "Direction": "DIRECTION"
            }
}
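
As a minimal sketch of how such a file might be used, the snippet below parses the METAL example with Python's standard json module and applies format_dict to a header row. The helper names rename_columns and invert_format_dict are hypothetical and are not part of gwaslab, and passing unknown columns through unchanged is an assumption rather than documented behaviour.

```python
import json

# The METAL example from above, inlined so the sketch is self-contained.
# In practice the text would be read from one of the JSON files in this repo.
METAL_FORMAT = json.loads("""
{
"meta_data": {"format_name": "metal",
              "format_source": "https://genome.sph.umich.edu/wiki/METAL_Documentation",
              "format_version": "20220726"},
"format_dict": {"MarkerName": "SNPID", "Allele1": "EA", "Allele2": "NEA",
                "Freq1": "EAF", "Effect": "BETA", "StdErr": "SE",
                "P-value": "P", "Direction": "DIRECTION"}
}
""")


def rename_columns(header, format_dict):
    """Map source column names to gwaslab names (hypothetical helper).

    Columns missing from format_dict are passed through unchanged;
    this is an assumption of the sketch.
    """
    return [format_dict.get(col, col) for col in header]


def invert_format_dict(format_dict):
    """Build the reverse mapping (gwaslab name -> target-format name)."""
    return {gwaslab: source for source, gwaslab in format_dict.items()}


metal_header = ["MarkerName", "Allele1", "Allele2", "Freq1",
                "Effect", "StdErr", "P-value", "Direction"]
gwaslab_header = rename_columns(metal_header, METAL_FORMAT["format_dict"])
print(gwaslab_header)
```

The inverted dictionary could serve the opposite direction, for instance when writing gwaslab columns back out under METAL column names.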

Supported formats:

  1. ssf: GWAS-SSF format
  2. gwascatalog: GWAS Catalog format
  3. pgscatalog: PGS Catalog format
  4. plink: PLINK output format
  5. plink2: PLINK2 output format
  6. saige: SAIGE output format
  7. regenie: REGENIE output format
  8. fastgwa: fastGWA output format
  9. metal: METAL output format
  10. mrmega: MR-MEGA output format
  11. fuma: FUMA input format
  12. ldsc: LDSC input format
  13. locuszoom: LocusZoom input format
  14. vcf: GWAS-VCF format
  15. bolt_lmm: BOLT-LMM output format

Citations and sources

  1. GWAS-SSF
    • CITATION: Hayhurst, J., Buniello, A., Harris, L., Mosaku, A., Chang, C., Gignoux, C. R., ... & Barroso, I. (2022). A community driven GWAS summary statistics standard. bioRxiv.
  2. GWAS Catalog
    • SOURCE: https://www.ebi.ac.uk/gwas/docs/summary-statistics-format
    • CITATION: Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone, C., ... & Parkinson, H. (2019). The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic acids research, 47(D1), D1005-D1012.
  3. metal
  4. pgscatalog
    • SOURCE: https://www.pgscatalog.org/downloads/#dl_ftp_scoring
    • CITATION: Lambert, S. A., Gil, L., Jupp, S., Ritchie, S. C., Xu, Y., Buniello, A., ... & Inouye, M. (2021). The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics, 53(4), 420-425.
  5. saige
  6. regenie
    • SOURCE: https://rgcgithub.github.io/regenie/options/#output
    • CITATION: Mbatchou, J., Barnard, L., Backman, J., Marcketta, A., Kosmicki, J. A., Ziyatdinov, A., ... & Marchini, J. (2021). Computationally efficient whole-genome regression for quantitative and binary traits. Nature genetics, 53(7), 1097-1103.
  7. plink
    • SOURCE: https://www.cog-genomics.org/plink/1.9/formats
    • CITATION: Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575.
  8. plink2
    • SOURCE: https://www.cog-genomics.org/plink/2.0/formats
    • CITATION: Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4(1), s13742-015.
  9. fastgwa
    • SOURCE: https://yanglab.westlake.edu.cn/software/gcta/#fastGWA
    • CITATION: Jiang, L., Zheng, Z., Qi, T., Kemper, K. E., Wray, N. R., Visscher, P. M., & Yang, J. (2019). A resource-efficient tool for mixed model association analysis of large-scale data. Nature genetics, 51(12), 1749-1755.
  10. mrmega
    • SOURCE: https://genomics.ut.ee/en/tools
    • CITATION: Mägi, R., Horikoshi, M., Sofer, T., Mahajan, A., Kitajima, H., Franceschini, N., ... & Morris, A. P. (2017). Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Human molecular genetics, 26(18), 3639-3650.
  11. fuma
    • SOURCE: https://fuma.ctglab.nl/tutorial#snp2gene
    • CITATION: Watanabe, K., Taskesen, E., Van Bochoven, A., & Posthuma, D. (2017). Functional mapping and annotation of genetic associations with FUMA. Nature communications, 8(1), 1-11.
  12. ldsc
  13. locuszoom
    • SOURCE: https://my.locuszoom.org/about/
    • CITATION: Pruim, R. J., Welch, R. P., Sanna, S., Teslovich, T. M., Chines, P. S., Gliedt, T. P., ... & Willer, C. J. (2010). LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics, 26(18), 2336-2337.
  14. vcf
    • SOURCE: https://github.com/MRCIEU/gwas-vcf-specification
    • CITATION: Lyon, M. S., Andrews, S. J., Elsworth, B., Gaunt, T. R., Hemani, G., & Marcora, E. (2021). The variant call format provides efficient and robust storage of GWAS summary statistics. Genome biology, 22(1), 1-10.
  15. bolt_lmm

Future update: add the following fields to meta_data:

  1. format_cite_name : formal name of the format, e.g. GWAS-SSF v0.1
  2. format_separator : separator used in the format, e.g. \t
  3. format_na : NA notation in the format, e.g. #NA
  4. format_comment : comment line, e.g. #
  5. format_col_order: column order
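
To illustrate how the proposed fields could be used once they exist, here is a rough sketch that reads a small in-memory file with the standard csv module. Everything in it is hypothetical: the format definition, its values, the sample data and the read_rows helper are illustrative only and do not describe any file currently in this repo.

```python
import csv

# Hypothetical format definition using the proposed meta_data fields.
# All values are illustrative and not taken from an existing format file.
example_format = {
    "meta_data": {
        "format_name": "example",
        "format_cite_name": "Example format v0.1",
        "format_separator": "\t",
        "format_na": "NA",
        "format_comment": "#",
        "format_col_order": ["MarkerName", "Effect", "P-value"],
    },
    "format_dict": {"MarkerName": "SNPID", "Effect": "BETA", "P-value": "P"},
}


def read_rows(text, fmt):
    """Parse text according to the proposed meta_data fields (hypothetical helper)."""
    meta = fmt["meta_data"]
    # Drop empty lines and lines starting with the comment prefix.
    lines = [ln for ln in text.splitlines()
             if ln and not ln.startswith(meta["format_comment"])]
    reader = csv.reader(lines, delimiter=meta["format_separator"])
    header = next(reader)
    # Check the header against the declared column order.
    if header != meta["format_col_order"]:
        raise ValueError(f"unexpected column order: {header}")
    names = [fmt["format_dict"].get(col, col) for col in header]
    # Convert the NA notation to None and rename columns to gwaslab names.
    return [
        {name: (None if value == meta["format_na"] else value)
         for name, value in zip(names, record)}
        for record in reader
    ]


sample_text = (
    "# produced by a hypothetical tool\n"
    "MarkerName\tEffect\tP-value\n"
    "rs1\t0.12\t0.003\n"
    "rs2\tNA\t0.5\n"
)
rows = read_rows(sample_text, example_format)
print(rows)
```

Values are kept as strings in this sketch; type conversion is left out because the proposed fields do not describe column types.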
