LTLA / scRNAseq

Clone of the Bioconductor repository for the scRNAseq package.

Home Page:http://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

extreme value detection event with new PaulHSCData, not seen in legacy call

vjcitn opened this issue · comments

> z = PaulHSCData(legacy=FALSE)
The value -2^31 was detected in the dataset.
This has been converted to NA within R.
The value -2^31 was detected in the dataset.
This has been converted to NA within R.

seen on windows and linux

> sessionInfo()
R Under development (unstable) (2024-01-10 r85797)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/vincent/R-4-4-dist/lib/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] scRNAseq_2.17.1             SingleCellExperiment_1.25.0
 [3] SummarizedExperiment_1.33.3 Biobase_2.63.0             
 [5] GenomicRanges_1.55.3        GenomeInfoDb_1.39.6        
 [7] IRanges_2.37.1              S4Vectors_0.41.3           
 [9] BiocGenerics_0.49.1         MatrixGenerics_1.15.0      
[11] matrixStats_1.2.0           rmarkdown_2.25             

loaded via a namespace (and not attached):
 [1] DBI_1.2.2                bitops_1.0-7             httr2_1.0.0             
 [4] biomaRt_2.59.1           rlang_1.1.3              magrittr_2.0.3          
 [7] gypsum_0.99.9            compiler_4.4.0           RSQLite_2.3.5           
[10] GenomicFeatures_1.55.3   png_0.1-8                vctrs_0.6.5             
[13] stringr_1.5.1            ProtGenerics_1.35.2      pkgconfig_2.0.3         
[16] crayon_1.5.2             fastmap_1.1.1            dbplyr_2.4.0            
[19] XVector_0.43.1           utf8_1.2.4               Rsamtools_2.19.3        
[22] bit_4.0.5                xfun_0.42                aws.s3_0.3.21           
[25] zlibbioc_1.49.0          cachem_1.0.8             jsonlite_1.8.8          
[28] progress_1.2.3           blob_1.2.4               rhdf5filters_1.15.2     
[31] DelayedArray_0.29.7      Rhdf5lib_1.25.1          BiocParallel_1.37.0     
[34] parallel_4.4.0           prettyunits_1.2.0        R6_2.5.1                
[37] stringi_1.8.3            rtracklayer_1.63.0       Rcpp_1.0.12             
[40] knitr_1.45               base64enc_0.1-3          Matrix_1.6-4            
[43] tidyselect_1.2.0         abind_1.4-5              yaml_2.3.8              
[46] codetools_0.2-19         curl_5.2.0               lattice_0.22-5          
[49] alabaster.sce_1.3.3      tibble_3.2.1             KEGGREST_1.43.0         
[52] evaluate_0.23            BiocFileCache_2.11.1     alabaster.schemas_1.3.1 
[55] xml2_1.3.6               ExperimentHub_2.11.1     Biostrings_2.71.2       
[58] pillar_1.9.0             BiocManager_1.30.22      filelock_1.0.3          
[61] generics_0.1.3           startup_0.21.0           RCurl_1.98-1.14         
[64] BiocVersion_3.19.1       ensembldb_2.27.1         hms_1.1.3               
[67] alabaster.base_1.3.20    alabaster.ranges_1.3.3   glue_1.7.0              
[70] alabaster.matrix_1.3.12  lazyeval_0.2.2           tools_4.4.0             
[73] AnnotationHub_3.11.1     BiocIO_1.13.0            GenomicAlignments_1.39.4
[76] XML_3.99-0.16.1          rhdf5_2.47.4             grid_4.4.0              
[79] AnnotationDbi_1.65.2     GenomeInfoDbData_1.2.11  HDF5Array_1.31.5        
[82] restfulr_0.0.15          cli_3.6.2                rappdirs_0.3.3          
[85] fansi_1.0.6              S4Arrays_1.3.4           dplyr_1.1.4             
[88] AnnotationFilter_1.27.0  alabaster.se_1.3.4       digest_0.6.34           
[91] SparseArray_1.3.4        rjson_0.2.21             memoise_2.0.1           
[94] htmltools_0.5.7          lifecycle_1.0.4          httr_1.4.7              
[97] aws.signature_0.6.0      bit64_4.0.5             
> 

This isn't a problem in and of itself, rhdf5 is just producing unnecessary messages here.

The cause is our decision to store NA_integer_ (usually from the colData) as -2^31 within the HDF5 file. rhdf5 complains because it thinks that we actually wanted to store an integer -2^31, and that is impossible to represent within an integer vector when reading the file back into R.

The solution is just to silence rhdf5, possibly by providing an option to not emit these messages if the caller knows what they're doing with respect to these integer NAs. @grimbough?

I think that the message can be silenced by setting the rhdf5-NA.OK on the dataset. This is something that high-level function rhdf5::h5write() does automatically for you:

library(rhdf5)

m <- matrix(c(1:11,NA), ncol=3)

h5write(m, "test.h5", "m")

h5readAttributes("test.h5", "m")
# $`rhdf5-NA.OK`
# [1] 1

h5read("test.h5", "m")
#      [,1] [,2] [,3]
# [1,]    1    5    9
# [2,]    2    6   10
# [3,]    3    7   11
# [4,]    4    8   NA

However, lower-level function rhdf5::H5Dwrite() does not set this attribute:

h5createDataset("test.h5", "m2", dim(m), storage.mode="integer")
fid <- H5Fopen("test.h5")
did <- H5Dopen(fid, name="m2")
H5Dwrite(did, m)
H5Dclose(did)
H5Fclose(fid)

h5readAttributes("test.h5", "m2")
# list()

h5read("test.h5", "m2")
# The value -2^31 was detected in the dataset.
# This has been converted to NA within R.
#      [,1] [,2] [,3]
# [1,]    1    5    9
# [2,]    2    6   10
# [3,]    3    7   11
# [4,]    4    8   NA

So it looks like the fix is to simply add the rhdf5-NA.OK attribute to these datasets.

Yes, if the dataset was both generated and read by rhdf5 only. However, there are clients in other frameworks (e.g., Python, Javascript) that are writing these files, and it is a bit odd to ask them to insert an rhdf5-NA.ok attribute just to silence rhdf5.

Using -2^31 to represent a missing value must be conveyed one way or another, and that "special meaning" should be part of the data itself. Otherwise how are your Python or Java clients going to know about the real meaning of -2^31 here?

You could argue that the name of the attrribute (rhdf5-NA.OK) is a little bit unfortunate though, as it suggests that this is an rhdf5 thing. But it's not. This is essential language-agnostic metadata.

We have a separate placeholder attribute that provides more flexibility for encoding missingness, see here. I don't expect rhdf5 to respect this, I just want it to be quiet while I handle things in alabaster.base.

How about using suppressMessages(h5read(...)) or suppressMessages(H5Dread(...)) in alabaster.base then? Since ArtifactDB datasets follow a strict policy, the risk of hiding other critical messages from the end user seems very very low.

Yeah, I guess. Absent any other solution, suppressMessages would be the way. I don't like wholesale suppression but currently there isn't another solution.

... which is a shame. If H5Dread is intended to be a low-level function, it should just shut up and give me the data without complaints. For example, H5Dwrite doesn't try to do anything extra with NAs; for symmetry purposes if nothing else, H5Dread should just be similarly NA-unaware.

NA checks should be shifted to a higher level function - h5read, perhaps, given that h5write is also responsible for adding the rhdf5-NA.OK attribute.

Agreed. Symmetry is important.

I don't think I have much more to add to the context here. @hpages summarise the intent behind that message well; it's to inform someone reading non-R generated data that INT_MIN will be represented as NA in R. It's just a message, not even a warning, and suppresMessages() would be my immediate suggestion.

You could argue that the name of the attrribute (rhdf5-NA.OK) is a little bit unfortunate though, as it suggests that this is an rhdf5 thing. But it's not. This is essential language-agnostic metadata.

I think I choose the name to indicate the specific software that added the attribute, in the case that someone recieved data written with rhdf5 and wanted to know what the attriubte meant. Hopefully a google would find something relevant. Perhaps it could have been R-NA.OK or similar, but given tools like hdf5r don't do this I wanted to make the source somewhat explicit. I had the vision that the may be additional metadata added over time and the could all be stored with the rhdf5- naming convention, but that hasn't come to pass.

NA checks should be shifted to a higher level function - h5read, perhaps, given that h5write is also responsible for adding the rhdf5-NA.OK attribute.

This I can get behind. I agree that the symmetry between the various h5 functions and (likewise H5<Xfoo>) is desireable. I'm pretty sure the check for printing that message is already in a seperate function, so I can probably just move where it is called.

The printing of these messages has now been moved h5read as of rhdf5 2.47.6.

Thanks @grimbough

Also no more messages when doing PaulHSCData(legacy=FALSE) with scRNAseq 2.17.5. Thanks @LTLA!

@vjcitn This can be closed.