pgenlib: when reading a .pgen file, must each row in that table be an individual? Or a SNP?
richelbilderbeek opened this issue
Hi @chrchang and other PLINK maintainers.
TL;DR: when reading a .pgen file using pgenlib, am I correct that each row holds an individual?
Here, I start the story with a simple PLINK binary .bed file that, upon reading with the genio R package, looks like this:
We can see that there are 4 SNPs (snp_1 up to and including snp_4) and 3 individuals (eloquently named 1, 2 and 3):
1 2 3
snp_1 1 0 1
snp_2 0 1 1
snp_3 2 0 0
snp_4 2 0 0
When I convert that PLINK binary .bed file to a PLINK2 binary .pgen file using PLINK2, the .pgen table, upon reading with pgenlib, looks like this:
snp_1 snp_2 snp_3 snp_4
1 1 0 2 2
2 0 1 0 0
3 1 1 0 0
Also here, we can see that there are 4 SNPs and 3 individuals.
The unexpected thing is that the table is transposed.
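To make the transposition concrete, here is a minimal base R sketch that rebuilds both tables by hand from the values shown above and checks that one is the transpose of the other. The variable names `bed_table` and `pgen_table` are only illustrative; no files are read and neither genio nor pgenlib is called.

```r
# Table as shown for the .bed file: rows are SNPs, columns are individuals
bed_table <- matrix(
  c(1, 0, 1,
    0, 1, 1,
    2, 0, 0,
    2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("snp_", 1:4), c("1", "2", "3"))
)

# Table as shown for the .pgen file: rows are individuals, columns are SNPs
pgen_table <- matrix(
  c(1, 0, 2, 2,
    0, 1, 0, 0,
    1, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), paste0("snp_", 1:4))
)

# The genotypes are the same; only the orientation differs
same_genotypes <- identical(t(bed_table), pgen_table)
```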
Now, the unexpectedness can come from genio or pgenlib, but as https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf has the official .pgen specification, I feel the table should match the .pgen file's layout. Sadly, from that specification, I could not determine how a .pgen file is ordered: does each line (which becomes a row in a table) hold a SNP (as in genio) or an individual (as in pgenlib now)?
So, does pgenlib follow the .pgen file format, with each row being an individual?
If needed, there is a reprex below. I have also attached the files in PLINK1 text format, PLINK1 binary format and PLINK2 binary format. I converted these using PLINK (text -> binary) and PLINK2 (PLINK binary -> PLINK2 binary).
Thanks and cheers, Richel Bilderbeek
# Uses the plinkr R package
# First do:
#
# remotes::install_github("richelbilderbeek/plinkr")
#
# Then do:
#
# library(plinkr)
test_that("demonstrate .pgen row/column ordering differs from genio's .bed", {
# Convert data
# 1. Create an asymmetrical PLINK1 text data set
# 2. Convert to an asymmetrical PLINK1 binary data set
# 3. Convert to PLINK2 binary
# Convert using files only
# 4. Save PLINK1 text data files
# 5. Convert to PLINK1 binary data files
# 6. Convert to PLINK2 binary data files
# Convert data
# 1. Create an asymmetrical PLINK1 text data set
assoc_qt_params <- create_demo_assoc_qt_params()
plink_text_data <- assoc_qt_params$data
n_individuals <- nrow(assoc_qt_params$data$ped_table)
expect_equal(3, n_individuals)
n_snps <- nrow(assoc_qt_params$data$map_table)
expect_equal(4, n_snps)
expect_true(n_snps != n_individuals) # must be asymmetric
# 2. Convert to an asymmetrical PLINK1 binary data set
plink_bin_data <- convert_plink_text_data_to_plink_bin_data(plink_text_data)
expect_equal(nrow(plink_bin_data$fam_table), n_individuals)
expect_equal(nrow(plink_bin_data$bim_table), n_snps)
#' @param bed_table a table that maps the SNPs to the individuals,
#' of which the column names are the names of the individuals,
#' the row names are the names of the SNPs,
#' and the values are the SNP variant.
expect_equal(ncol(plink_bin_data$bed_table), n_individuals)
expect_equal(nrow(plink_bin_data$bed_table), n_snps)
# 3. Convert to an asymmetrical PLINK2 binary data set
plink2_bin_data <- convert_plink_bin_data_to_plink2_bin_data(plink_bin_data)
expect_equal(nrow(plink2_bin_data$psam_table), n_individuals)
expect_equal(nrow(plink2_bin_data$pvar_table), n_snps)
#' @param pgen_table an \link{array} that maps the individuals
#' to their SNPs, with as many rows as individuals and as many
#' columns as SNPs. Optionally, the row names are the individuals' IDs
#' and the column names are the SNP IDs
expect_equal(nrow(plink2_bin_data$pgen_table), n_individuals)
expect_equal(ncol(plink2_bin_data$pgen_table), n_snps)
# Convert using files only
# 4. Save PLINK1 text data files
# 5. Convert to PLINK1 binary data files
# 6. Convert to PLINK2 binary data files
folder <- get_plinkr_tempfilename()
plink_text_base_input_filename <- file.path(folder, "plink_text")
plink_bin_base_input_filename <- file.path(folder, "plink_bin")
plink2_bin_base_input_filename <- file.path(folder, "plink2_bin")
save_plink_text_data(
plink_text_data,
base_input_filename = plink_text_base_input_filename
)
convert_plink_text_files_to_plink_bin_files(
base_input_filename = plink_text_base_input_filename,
base_output_filename = plink_bin_base_input_filename
)
convert_plink_bin_files_to_plink2_bin_files(
base_input_filename = plink_bin_base_input_filename,
base_output_filename = plink2_bin_base_input_filename
)
# Each row holds a SNP
#
# 1 2 3
# snp_1 1 0 1
# snp_2 0 1 1
# snp_3 2 0 0
# snp_4 2 0 0
read_plink_bed_file_from_files(
bed_filename = paste0(plink_bin_base_input_filename, ".bed"),
bim_filename = paste0(plink_bin_base_input_filename, ".bim"),
fam_filename = paste0(plink_bin_base_input_filename, ".fam")
)
# Each row holds an individual
#
# snp_1 snp_2 snp_3 snp_4
# 1 1 0 2 2
# 2 0 1 0 0
# 3 1 1 0 0
read_plink2_pgen_file_from_files(
pgen_filename = paste0(plink2_bin_base_input_filename, ".pgen"),
psam_filename = paste0(plink2_bin_base_input_filename, ".psam"),
pvar_filename = paste0(plink2_bin_base_input_filename, ".pvar")
)
})
The underlying file is variant-major. Individual library functions may perform transposition; you need to look at their documentation.
Thanks, your reply will help me do that and/or improve other libraries :-)
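Since the reply above notes that a reader function may transpose the variant-major file, one way to guard against the difference is to normalise the orientation after reading. The sketch below is a hypothetical base R helper (`as_variant_major` is not part of plinkr, genio or pgenlib) and assumes the genotype matrix carries SNP IDs as either its row names or its column names, for example IDs taken from a .bim or .pvar table.

```r
# Hypothetical helper: return a genotype matrix with SNPs as rows.
# 'snp_ids' are the expected SNP names, in file order.
as_variant_major <- function(genotypes, snp_ids) {
  if (identical(rownames(genotypes), snp_ids)) {
    return(genotypes) # already variant-major
  }
  if (identical(colnames(genotypes), snp_ids)) {
    return(t(genotypes)) # sample-major: transpose
  }
  stop("Neither the row names nor the column names match 'snp_ids'")
}

snp_ids <- paste0("snp_", 1:4)

# Sample-major table, as shown above for the .pgen read
sample_major <- matrix(
  c(1, 0, 2, 2,
    0, 1, 0, 0,
    1, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), snp_ids)
)

variant_major <- as_variant_major(sample_major, snp_ids)
```

Matching on names rather than on dimensions is a deliberate choice here: with a square data set (as many SNPs as individuals) the dimensions alone cannot reveal the orientation.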