extreme value detection event with new PaulHSCData, not seen in legacy call

Question

extreme value detection event with new PaulHSCData, not seen in legacy call

vjcitn opened this issue 4 months ago · comments

> z = PaulHSCData(legacy=FALSE)
The value -2^31 was detected in the dataset.
This has been converted to NA within R.
The value -2^31 was detected in the dataset.
This has been converted to NA within R.

seen on windows and linux

> sessionInfo()
R Under development (unstable) (2024-01-10 r85797)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/vincent/R-4-4-dist/lib/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] scRNAseq_2.17.1             SingleCellExperiment_1.25.0
 [3] SummarizedExperiment_1.33.3 Biobase_2.63.0             
 [5] GenomicRanges_1.55.3        GenomeInfoDb_1.39.6        
 [7] IRanges_2.37.1              S4Vectors_0.41.3           
 [9] BiocGenerics_0.49.1         MatrixGenerics_1.15.0      
[11] matrixStats_1.2.0           rmarkdown_2.25             

loaded via a namespace (and not attached):
 [1] DBI_1.2.2                bitops_1.0-7             httr2_1.0.0             
 [4] biomaRt_2.59.1           rlang_1.1.3              magrittr_2.0.3          
 [7] gypsum_0.99.9            compiler_4.4.0           RSQLite_2.3.5           
[10] GenomicFeatures_1.55.3   png_0.1-8                vctrs_0.6.5             
[13] stringr_1.5.1            ProtGenerics_1.35.2      pkgconfig_2.0.3         
[16] crayon_1.5.2             fastmap_1.1.1            dbplyr_2.4.0            
[19] XVector_0.43.1           utf8_1.2.4               Rsamtools_2.19.3        
[22] bit_4.0.5                xfun_0.42                aws.s3_0.3.21           
[25] zlibbioc_1.49.0          cachem_1.0.8             jsonlite_1.8.8          
[28] progress_1.2.3           blob_1.2.4               rhdf5filters_1.15.2     
[31] DelayedArray_0.29.7      Rhdf5lib_1.25.1          BiocParallel_1.37.0     
[34] parallel_4.4.0           prettyunits_1.2.0        R6_2.5.1                
[37] stringi_1.8.3            rtracklayer_1.63.0       Rcpp_1.0.12             
[40] knitr_1.45               base64enc_0.1-3          Matrix_1.6-4            
[43] tidyselect_1.2.0         abind_1.4-5              yaml_2.3.8              
[46] codetools_0.2-19         curl_5.2.0               lattice_0.22-5          
[49] alabaster.sce_1.3.3      tibble_3.2.1             KEGGREST_1.43.0         
[52] evaluate_0.23            BiocFileCache_2.11.1     alabaster.schemas_1.3.1 
[55] xml2_1.3.6               ExperimentHub_2.11.1     Biostrings_2.71.2       
[58] pillar_1.9.0             BiocManager_1.30.22      filelock_1.0.3          
[61] generics_0.1.3           startup_0.21.0           RCurl_1.98-1.14         
[64] BiocVersion_3.19.1       ensembldb_2.27.1         hms_1.1.3               
[67] alabaster.base_1.3.20    alabaster.ranges_1.3.3   glue_1.7.0              
[70] alabaster.matrix_1.3.12  lazyeval_0.2.2           tools_4.4.0             
[73] AnnotationHub_3.11.1     BiocIO_1.13.0            GenomicAlignments_1.39.4
[76] XML_3.99-0.16.1          rhdf5_2.47.4             grid_4.4.0              
[79] AnnotationDbi_1.65.2     GenomeInfoDbData_1.2.11  HDF5Array_1.31.5        
[82] restfulr_0.0.15          cli_3.6.2                rappdirs_0.3.3          
[85] fansi_1.0.6              S4Arrays_1.3.4           dplyr_1.1.4             
[88] AnnotationFilter_1.27.0  alabaster.se_1.3.4       digest_0.6.34           
[91] SparseArray_1.3.4        rjson_0.2.21             memoise_2.0.1           
[94] htmltools_0.5.7          lifecycle_1.0.4          httr_1.4.7              
[97] aws.signature_0.6.0      bit64_4.0.5             
>

Aaron Lun · Answer 1 · Wed Mar 06 2024 05:46:33 GMT+0800 (China Standard Time)

This isn't a problem in and of itself, rhdf5 is just producing unnecessary messages here.

The cause is our decision to store NA_integer_ (usually from the colData) as -2^31 within the HDF5 file. rhdf5 complains because it thinks that we actually wanted to store an integer -2^31, and that is impossible to represent within an integer vector when reading the file back into R.

The solution is just to silence rhdf5, possibly by providing an option to not emit these messages if the caller knows what they're doing with respect to these integer NAs. @grimbough?

Hervé Pagès · Answer 2 · Wed Mar 06 2024 06:43:12 GMT+0800 (China Standard Time)

I think that the message can be silenced by setting the rhdf5-NA.OK on the dataset. This is something that high-level function rhdf5::h5write() does automatically for you:

library(rhdf5)

m <- matrix(c(1:11,NA), ncol=3)

h5write(m, "test.h5", "m")

h5readAttributes("test.h5", "m")
# $`rhdf5-NA.OK`
# [1] 1

h5read("test.h5", "m")
#      [,1] [,2] [,3]
# [1,]    1    5    9
# [2,]    2    6   10
# [3,]    3    7   11
# [4,]    4    8   NA

However, lower-level function rhdf5::H5Dwrite() does not set this attribute:

h5createDataset("test.h5", "m2", dim(m), storage.mode="integer")
fid <- H5Fopen("test.h5")
did <- H5Dopen(fid, name="m2")
H5Dwrite(did, m)
H5Dclose(did)
H5Fclose(fid)

h5readAttributes("test.h5", "m2")
# list()

h5read("test.h5", "m2")
# The value -2^31 was detected in the dataset.
# This has been converted to NA within R.
#      [,1] [,2] [,3]
# [1,]    1    5    9
# [2,]    2    6   10
# [3,]    3    7   11
# [4,]    4    8   NA

So it looks like the fix is to simply add the rhdf5-NA.OK attribute to these datasets.

Aaron Lun · Answer 3 · Wed Mar 06 2024 06:48:50 GMT+0800 (China Standard Time)

Yes, if the dataset was both generated and read by rhdf5 only. However, there are clients in other frameworks (e.g., Python, Javascript) that are writing these files, and it is a bit odd to ask them to insert an rhdf5-NA.ok attribute just to silence rhdf5.

Hervé Pagès · Answer 4 · Wed Mar 06 2024 07:04:05 GMT+0800 (China Standard Time)

Using -2^31 to represent a missing value must be conveyed one way or another, and that "special meaning" should be part of the data itself. Otherwise how are your Python or Java clients going to know about the real meaning of -2^31 here?

You could argue that the name of the attrribute (rhdf5-NA.OK) is a little bit unfortunate though, as it suggests that this is an rhdf5 thing. But it's not. This is essential language-agnostic metadata.

Aaron Lun · Answer 5 · Wed Mar 06 2024 07:06:30 GMT+0800 (China Standard Time)

We have a separate placeholder attribute that provides more flexibility for encoding missingness, see here. I don't expect rhdf5 to respect this, I just want it to be quiet while I handle things in alabaster.base.

Hervé Pagès · Answer 6 · Wed Mar 06 2024 07:51:55 GMT+0800 (China Standard Time)

How about using suppressMessages(h5read(...)) or suppressMessages(H5Dread(...)) in alabaster.base then? Since ArtifactDB datasets follow a strict policy, the risk of hiding other critical messages from the end user seems very very low.

Aaron Lun · Answer 7 · Wed Mar 06 2024 08:13:54 GMT+0800 (China Standard Time)

Yeah, I guess. Absent any other solution, suppressMessages would be the way. I don't like wholesale suppression but currently there isn't another solution.

... which is a shame. If H5Dread is intended to be a low-level function, it should just shut up and give me the data without complaints. For example, H5Dwrite doesn't try to do anything extra with NAs; for symmetry purposes if nothing else, H5Dread should just be similarly NA-unaware.

NA checks should be shifted to a higher level function - h5read, perhaps, given that h5write is also responsible for adding the rhdf5-NA.OK attribute.

Hervé Pagès · Answer 8 · Wed Mar 06 2024 08:26:53 GMT+0800 (China Standard Time)

Agreed. Symmetry is important.

Mike Smith · Answer 9 · Wed Mar 06 2024 16:33:21 GMT+0800 (China Standard Time)

I don't think I have much more to add to the context here. @hpages summarise the intent behind that message well; it's to inform someone reading non-R generated data that INT_MIN will be represented as NA in R. It's just a message, not even a warning, and suppresMessages() would be my immediate suggestion.

You could argue that the name of the attrribute (rhdf5-NA.OK) is a little bit unfortunate though, as it suggests that this is an rhdf5 thing. But it's not. This is essential language-agnostic metadata.

I think I choose the name to indicate the specific software that added the attribute, in the case that someone recieved data written with rhdf5 and wanted to know what the attriubte meant. Hopefully a google would find something relevant. Perhaps it could have been R-NA.OK or similar, but given tools like hdf5r don't do this I wanted to make the source somewhat explicit. I had the vision that the may be additional metadata added over time and the could all be stored with the rhdf5- naming convention, but that hasn't come to pass.

NA checks should be shifted to a higher level function - h5read, perhaps, given that h5write is also responsible for adding the rhdf5-NA.OK attribute.

This I can get behind. I agree that the symmetry between the various h5 functions and (likewise H5<Xfoo>) is desireable. I'm pretty sure the check for printing that message is already in a seperate function, so I can probably just move where it is called.

Mike Smith · Answer 10 · Mon Mar 11 2024 23:30:15 GMT+0800 (China Standard Time)

The printing of these messages has now been moved h5read as of rhdf5 2.47.6.

Hervé Pagès · Answer 11 · Wed Mar 13 2024 10:50:07 GMT+0800 (China Standard Time)

Thanks @grimbough

Also no more messages when doing PaulHSCData(legacy=FALSE) with scRNAseq 2.17.5. Thanks @LTLA!

@vjcitn This can be closed.