morinlab / GAMBLR.data

Collection of Curated Data for Genomic Analysis of Mature B-cell Lymphomas in R

Home Page: https://morinlab.github.io/GAMBLR.data/


Discrepancies in how the two objects grch37_gene_coordinates and hg38_gene_coordinates are compiled.

mattssca opened this issue · comments

It was discovered that these two data sets are compiled differently.

grch37_gene_coordinates
This object is compiled from a bundled .tsv file containing gene coordinates in the stated projection. This file can be found here. In DATASET.R under data-raw, the bundled R object is created by simply reading in the .tsv file and saving the .Rda to the data folder.

hg38_gene_coordinates
For this object, there is no .tsv file available in the repo. Instead, for this projection, the package creates the bundled data object by first "curling" the gtf.gz file from Ensembl's FTP, then using rtracklayer to import it into R, performing some data wrangling, and lastly saving the .Rda object to the data folder. See these lines in DATASET.R for more info on how this is done.

Suggested Solution
I think the latter approach should be used to compile the grch37_gene_coordinates object as well. This adds reproducibility and traceability, and it minimizes the local footprint of the repo (by avoiding bundling big .tsv files with gene coordinates). Currently, there is no information on where grch37_gene_coordinates.tsv comes from, what version it is, or how it was compiled.
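
To make the suggestion concrete, here is a rough sketch of what the GTF-based approach might look like for grch37. It is not the code in DATASET.R: the Ensembl URL and release number, the function names, and the output column names are all assumptions for illustration, and the real wrangling steps for hg38_gene_coordinates may differ.

```r
# Sketch only: the URL, release number, function names, and output columns
# are assumptions, not taken from DATASET.R.
gtf_url <- paste0(
  "https://ftp.ensembl.org/pub/grch37/release-87/gtf/homo_sapiens/",
  "Homo_sapiens.GRCh37.87.gtf.gz"
)

# Keep one row per gene and a small set of coordinate columns.
# Expects the data frame form of an imported GTF (seqnames, start, end,
# type, gene_id, gene_name).
wrangle_gene_coordinates <- function(gtf_df) {
  genes <- gtf_df[gtf_df$type == "gene", , drop = FALSE]
  out <- data.frame(
    ensembl_gene_id = genes$gene_id,
    chromosome = as.character(genes$seqnames),
    start = genes$start,
    end = genes$end,
    gene_name = genes$gene_name,
    stringsAsFactors = FALSE
  )
  out[!duplicated(out$ensembl_gene_id), , drop = FALSE]
}

# Download the GTF, import it with rtracklayer, and wrangle it.
# Requires network access and the rtracklayer package.
build_gene_coordinates <- function(url = gtf_url) {
  gtf_file <- tempfile(fileext = ".gtf.gz")
  download.file(url, gtf_file, mode = "wb")
  gtf_df <- as.data.frame(rtracklayer::import(gtf_file))
  wrangle_gene_coordinates(gtf_df)
}

# In data-raw/DATASET.R one could then run, for example:
# grch37_gene_coordinates <- build_gene_coordinates()
# usethis::use_data(grch37_gene_coordinates, overwrite = TRUE)
```

Recording the URL and release in DATASET.R would document the source and version of the coordinates, which is the information currently missing for the .tsv file.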