leeper / UNF

Tools for Creating Universal Numeric Fingerprints for Data

Home Page: https://cloud.r-project.org/package=UNF

Inconsistent UNF values

dhicks opened this issue

This morning I'm working with some data that hasn't been touched since November (over 7 months ago). I maintain this data, it lives on my personal machine, and I use UNF to check which version of the dataset I'm working with. Today I'm getting UNF values that are inconsistent with the values calculated last November. I'm getting similar inconsistencies for some of the examples in ?unf (shown below), in particular for unf(longley, ver=4, digits=3) and for unf(cbind.data.frame(x1,x2),ver=3) and its equivalents. The UNFs for my own data were calculated using UNF algorithm version 6.

Both calculations were done using version 2.0.6 of the UNF package on the same machine. One potential difference is that last November I was using R 3.5.1 and today I'm using R 4.0.0.
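Since the R version is the one known difference between the two runs, it may help to store the environment next to each fingerprint. The following is a minimal sketch, not part of the UNF package: `fingerprint_record()` is a hypothetical helper, and the usage at the end assumes that printing a UNF object yields the single `UNF6:...` line seen in the reprex output.

```r
# Hypothetical helper (not part of the UNF package): bundle a fingerprint
# string with the environment that produced it, so that a later mismatch
# can be traced to an R or package upgrade.
fingerprint_record <- function(fingerprint, pkg = "UNF") {
  pkg_version <- if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_  # package not installed on this machine
  }
  list(
    fingerprint = fingerprint,
    r_version   = R.version.string,
    pkg_version = pkg_version,
    platform    = R.version$platform,
    recorded_on = as.character(Sys.Date())
  )
}

# Usage, run only if UNF is installed. The printed line is captured rather
# than reaching into the object's internal structure.
if (requireNamespace("UNF", quietly = TRUE)) {
  printed <- trimws(utils::capture.output(print(UNF::unf(datasets::longley))))
  str(fingerprint_record(printed))
}
```

Saving a record like this alongside the data would show whether a changed UNF coincides with a changed R or package version.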

Put your code here:

library(UNF)

# Version 6 #

### FORTHCOMING ###

# Version 5 #
## vectors

### just numerics
unf5(1:20) # UNF:5:/FIOZM/29oC3TK/IE52m2A==
#> UNF5:/FIOZM/29oC3TK/IE52m2A==
unf5(-3:3, dvn_zero = TRUE) # UNF:5:pwzm1tdPaqypPWRWDeW6Jw==
#> UNF5:pwzm1tdPaqypPWRWDeW6Jw==

### characters and factors
unf5(c('test','1','2','3')) # UNF:5:fH4NJMYkaAJ16OWMEE+zpQ==
#> UNF5:fH4NJMYkaAJ16OWMEE+zpQ==
unf5(as.factor(c('test','1','2','3'))) # UNF:5:fH4NJMYkaAJ16OWMEE+zpQ==
#> UNF5:fH4NJMYkaAJ16OWMEE+zpQ==

### logicals
unf5(c(TRUE,TRUE,FALSE), dvn_zero=TRUE) # UNF:5:DedhGlU7W6o2CBelrIZ3iw==
#> UNF5:DedhGlU7W6o2CBelrIZ3iw==

### missing values
unf5(c(1:5,NA)) # UNF:5:Msnz4m7QVvqBUWxxrE7kNQ==
#> UNF5:Msnz4m7QVvqBUWxxrE7kNQ==

## variable order and object structure is irrelevant
unf(data.frame(1:3,4:6,7:9)) # UNF:5:ukDZSJXck7fn4SlPJMPFTQ==
#> UNF6:ukDZSJXck7fn4SlPJMPFTQ==
unf(data.frame(7:9,1:3,4:6))
#> UNF6:ukDZSJXck7fn4SlPJMPFTQ==
unf(list(1:3,4:6,7:9))
#> UNF6:ukDZSJXck7fn4SlPJMPFTQ==

# Version 4 #
# version 4
data(longley)
unf(longley, ver=4, digits=3) # PjAV6/R6Kdg0urKrDVDzfMPWJrsBn5FfOdZVr9W8Ybg=
#> UNF4:3,128:KjRoxvNqv+Gkbso2DZ5N3lztfFYA02PPy8KlAByze9s=

# version 4.1
unf(longley, ver=4.1, digits=3) # 8nzEDWbNacXlv5Zypp+3YCQgMao/eNusOv/u5GmBj9I=
#> UNF4.1:3,128:8nzEDWbNacXlv5Zypp+3YCQgMao/eNusOv/u5GmBj9I=

# Version 3 #
x1 <- 1:20
x2 <- x1 + .00001

unf3(x1) # HRSmPi9QZzlIA+KwmDNP8w==
#> UNF3:M+FD+2bN2GJGqHJmhZeWig==
unf3(x2) # OhFpUw1lrpTE+csF30Ut4Q==
#> UNF3:cN+0PxPJHvbQQd5I+pLKpg==

# UNFs are identical at specified level of rounding
identical(unf3(x1), unf3(x2))
#> [1] FALSE
identical(unf3(x1, digits=5),unf3(x2, digits=5))
#> [1] TRUE

# dataframes, matrices, and lists are all treated identically:
unf(cbind.data.frame(x1,x2),ver=3) # E8+DS5SG4CSoM7j8KAkC9A==
#> UNF3:eIjrbuHf+6rWU/XD+4F7+g==
unf(list(x1,x2), ver=3)
#> UNF3:eIjrbuHf+6rWU/XD+4F7+g==
unf(cbind(x1,x2), ver=3)
#> UNF3:eIjrbuHf+6rWU/XD+4F7+g==

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] UNF_2.0.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.0.0  magrittr_1.5    tools_4.0.0     htmltools_0.4.0
#>  [5] base64enc_0.1-3 yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6  
#>  [9] rmarkdown_2.1   highr_0.8       knitr_1.28      stringr_1.4.0  
#> [13] xfun_0.13       digest_0.6.25   rlang_0.4.6     evaluate_0.14

Created on 2020-06-27 by the reprex package (v0.3.0)
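The mismatches in the reprex above can be tabulated mechanically. This sketch uses only base R and the strings already shown: the documented hashes come from the trailing comments of the ?unf examples, and the printed values come from the `#>` lines. `matches_documented()` is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: TRUE when a printed UNF line ends with the hash
# given in the corresponding documentation comment.
matches_documented <- function(printed, documented) {
  endsWith(trimws(printed), documented)
}

# Documented hash and printed output, both copied from the reprex.
checks <- data.frame(
  call = c(
    "unf(longley, ver=4, digits=3)",
    "unf(longley, ver=4.1, digits=3)",
    "unf3(x1)",
    "unf3(x2)",
    "unf(cbind.data.frame(x1,x2),ver=3)"
  ),
  documented = c(
    "PjAV6/R6Kdg0urKrDVDzfMPWJrsBn5FfOdZVr9W8Ybg=",
    "8nzEDWbNacXlv5Zypp+3YCQgMao/eNusOv/u5GmBj9I=",
    "HRSmPi9QZzlIA+KwmDNP8w==",
    "OhFpUw1lrpTE+csF30Ut4Q==",
    "E8+DS5SG4CSoM7j8KAkC9A=="
  ),
  printed = c(
    "UNF4:3,128:KjRoxvNqv+Gkbso2DZ5N3lztfFYA02PPy8KlAByze9s=",
    "UNF4.1:3,128:8nzEDWbNacXlv5Zypp+3YCQgMao/eNusOv/u5GmBj9I=",
    "UNF3:M+FD+2bN2GJGqHJmhZeWig==",
    "UNF3:cN+0PxPJHvbQQd5I+pLKpg==",
    "UNF3:eIjrbuHf+6rWU/XD+4F7+g=="
  ),
  stringsAsFactors = FALSE
)

checks$match <- matches_documented(checks$printed, checks$documented)
print(checks[, c("call", "match")])
```

On the values reported here, only the version 4.1 example agrees with its documented hash; the version 3 and version 4 examples differ. The version 5 examples in the same reprex agree with their comments.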

Thanks for this report. This is definitely concerning, but I'm wondering if it's unique to R 4.0.0. I'm not seeing these differences in 4.0.2, nor any issues on CRAN.

It's been a long time since I've looked at this code, so it's definitely possible there's a problem. But there's an intentionally thorough test suite to catch these kinds of things, so I'm hopeful it's an upstream problem that has since been resolved.

Just a note that I'm still seeing this in R 4.1.2 and UNF 2.0.8: the hashes I'm getting are the same as the ones I reported in the reprex, not what's in the docs. Here's an updated sessionInfo():

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.8

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UNF_2.0.8

loaded via a namespace (and not attached):
[1] compiler_4.1.2  tools_4.1.2     base64enc_0.1-3 digest_0.6.29