chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.

Home Page: https://www.cog-genomics.org/plink/2.0/

pgenlib: when reading a .pgen file, must each row in that table be an individual? Or a SNP?

richelbilderbeek opened this issue

Hi @chrchang and other PLINK maintainers.

TL;DR: when reading a .pgen file using pgenlib, am I correct that each row holds an individual?

I start the story with a simple PLINK binary .bed file. When read with the genio R package, it looks like the table below, with 4 SNPs (snp_1 through snp_4) and 3 individuals (eloquently named 1, 2 and 3):

      1 2 3
snp_1 1 0 1
snp_2 0 1 1
snp_3 2 0 0
snp_4 2 0 0

When I convert that PLINK binary .bed file to a PLINK2 binary .pgen file using PLINK2, the .pgen table, when read with pgenlib, looks like this:

  snp_1 snp_2 snp_3 snp_4
1     1     0     2     2
2     0     1     0     0
3     1     1     0     0

Here too, we can see that there are 4 SNPs and 3 individuals.

The unexpected thing is that the table is transposed.
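
To make the relationship concrete, here is a minimal base-R sketch. The two matrices are typed in by hand from the tables above (they are not output captured from genio or pgenlib), and the names `bed_table` and `pgen_table` are only illustrative:

```r
# Table as shown for the .bed file: rows are SNPs, columns are individuals
bed_table <- matrix(
  c(1, 0, 1,
    0, 1, 1,
    2, 0, 0,
    2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("snp_", 1:4), c("1", "2", "3"))
)

# Table as shown for the .pgen file: rows are individuals, columns are SNPs
pgen_table <- matrix(
  c(1, 0, 2, 2,
    0, 1, 0, 0,
    1, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), paste0("snp_", 1:4))
)

# t() swaps rows and columns, including the row and column names
tables_match <- identical(t(bed_table), pgen_table)
print(tables_match)
```

So both tables carry the same genotypes; only the orientation differs.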

Now, the surprise could come from either genio or pgenlib. Since https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf holds the official .pgen specification, I feel the table should match the layout of the .pgen file itself. Sadly, I could not determine from that spec how a .pgen file is ordered: does each line (which becomes a row in a table) hold a SNP (as in genio) or an individual (as in pgenlib now)?

So, does pgenlib follow the .pgen file format, with each row being an individual?

If needed, there is a reprex below. Also attached are the files in PLINK1 text format, PLINK1 binary format and PLINK2 binary format. I converted these using PLINK (text -> binary) and PLINK2 (PLINK binary -> PLINK2 binary).

Thanks and cheers, Richel Bilderbeek

# Uses the plinkr R package
# First do:
#
#  remotes::install_github("richelbilderbeek/plinkr")
#
# Then do:
#
#  library(plinkr)

test_that("demonstrate .pgen row/column ordering differs from genio's .bed", {

  # Convert data
  # 1. Create an asymmetrical PLINK1 text data set
  # 2. Convert to an asymmetrical PLINK1 binary data set
  # 3. Convert to PLINK2 binary

  # Convert using files only
  # 4. Save PLINK1 text data files
  # 5. Convert to PLINK1 binary data files
  # 6. Convert to PLINK2 binary data files

  # Convert data
  # 1. Create an asymmetrical PLINK1 text data set
  assoc_qt_params <- create_demo_assoc_qt_params()
  plink_text_data <- assoc_qt_params$data
  n_individuals <- nrow(assoc_qt_params$data$ped_table)
  expect_equal(n_individuals, 3)
  n_snps <- nrow(assoc_qt_params$data$map_table)
  expect_equal(n_snps, 4)
  expect_true(n_snps != n_individuals) # must be asymmetric

  # 2. Convert to an asymmetrical PLINK1 binary data set
  plink_bin_data <- convert_plink_text_data_to_plink_bin_data(plink_text_data)
  expect_equal(nrow(plink_bin_data$fam_table), n_individuals)
  expect_equal(nrow(plink_bin_data$bim_table), n_snps)
  #' @param bed_table a table that maps the SNPs to the individuals,
  #' of which the column names are the names of the individuals,
  #' the row names are the names of the SNPs,
  #' and the values are the SNP variant.
  expect_equal(ncol(plink_bin_data$bed_table), n_individuals)
  expect_equal(nrow(plink_bin_data$bed_table), n_snps)

  # 3. Convert to an asymmetrical PLINK2 binary data set
  plink2_bin_data <- convert_plink_bin_data_to_plink2_bin_data(plink_bin_data)
  expect_equal(nrow(plink2_bin_data$psam_table), n_individuals)
  expect_equal(nrow(plink2_bin_data$pvar_table), n_snps)
  #' @param pgen_table an \link{array} that maps the individuals
  #'   to their SNPs, with as many rows as individuals and as many
  #'   columns as SNPs. Optionally, the row names are the individuals' IDs
  #'   and the column names are the SNP IDs
  expect_equal(nrow(plink2_bin_data$pgen_table), n_individuals)
  expect_equal(ncol(plink2_bin_data$pgen_table), n_snps)

  # Convert using files only
  # 4. Save PLINK1 text data files
  # 5. Convert to PLINK1 binary data files
  # 6. Convert to PLINK2 binary data files
  folder <- get_plinkr_tempfilename()
  plink_text_base_input_filename <- file.path(folder, "plink_text")
  plink_bin_base_input_filename <- file.path(folder, "plink_bin")
  plink2_bin_base_input_filename <- file.path(folder, "plink2_bin")
  save_plink_text_data(
    plink_text_data,
    base_input_filename = plink_text_base_input_filename
  )
  convert_plink_text_files_to_plink_bin_files(
    base_input_filename = plink_text_base_input_filename,
    base_output_filename = plink_bin_base_input_filename
  )
  convert_plink_bin_files_to_plink2_bin_files(
    base_input_filename = plink_bin_base_input_filename,
    base_output_filename = plink2_bin_base_input_filename
  )

  # Each row holds a SNP
  #
  #       1 2 3
  # snp_1 1 0 1
  # snp_2 0 1 1
  # snp_3 2 0 0
  # snp_4 2 0 0

  read_plink_bed_file_from_files(
    bed_filename = paste0(plink_bin_base_input_filename, ".bed"),
    bim_filename = paste0(plink_bin_base_input_filename, ".bim"),
    fam_filename = paste0(plink_bin_base_input_filename, ".fam")
  )

  # Each row holds an individual
  #
  #   snp_1 snp_2 snp_3 snp_4
  # 1     1     0     2     2
  # 2     0     1     0     0
  # 3     1     1     0     0

  read_plink2_pgen_file_from_files(
    pgen_filename = paste0(plink2_bin_base_input_filename, ".pgen"),
    psam_filename = paste0(plink2_bin_base_input_filename, ".psam"),
    pvar_filename = paste0(plink2_bin_base_input_filename, ".pvar")
  )
})

The underlying file is variant-major. Individual library functions may perform transposition; you need to look at their documentation.
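
As an illustration of what variant-major means for the example in this thread, here is a base-R sketch. It assumes a plain, uncompressed vector of genotype values, which is a simplification for illustration only and not the actual .pgen encoding, and it is not a description of how pgenlib or genio are implemented. All variable names are hypothetical.

```r
# Variant-major order: all individuals for snp_1, then all for snp_2, and so on.
# (Illustrative flat vector only; a real .pgen file is a binary format.)
genotypes <- c(1, 0, 1,   # snp_1 for individuals 1, 2, 3
               0, 1, 1,   # snp_2
               2, 0, 0,   # snp_3
               2, 0, 0)   # snp_4
n_snps <- 4
n_individuals <- 3

# Filling row by row gives one row per variant
snp_rows <- matrix(genotypes, nrow = n_snps, byrow = TRUE)

# R fills matrices column by column by default, so placing one variant
# per column gives one row per individual
individual_rows <- matrix(genotypes, nrow = n_individuals)

print(snp_rows)
print(individual_rows)

# The two views are transposes of each other
same_data <- identical(t(snp_rows), individual_rows)
print(same_data)
```

Either orientation is a valid view of the same variant-major data; which one a reader function returns is a choice made by that function, so its documentation is the place to check.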

Thanks, your reply will help me do that and/or improve other libraries :-)