cc_outl method='distance' always returns TRUE
CyanBC opened this issue · comments
When running cc_outl with method='distance', the result is (usually) TRUE regardless of the tdi value:
xy <- data.table(longitude=c(0,90,10,-120),latitude=c(0,90,10,-45),species=c('t1','t1','t1','t1'))
cc_outl(xy, method='distance', tdi=10, value = 'flagged')
cc_outl(xy, method='distance', tdi=1000000, value = 'flagged')
Hi,
this is related to the minimum number of geographically unique records needed to estimate whether a record is an outlier; species with fewer records cannot be tested. The default is seven, so this works:
library(CoordinateCleaner)
library(data.table)
xy <- data.table(decimallongitude=c(0,1,2,3,2,1,0,-120),
                 decimallatitude=c(3,2,1,3,2,1,1,-45),
                 species='t1')
cc_outl(xy, method='distance', tdi=10, value = 'flagged')
cc_outl(xy, method='distance', tdi=1000000, value = 'flagged')
This limitation was a bit unclear, so the latest version (2.0-7) now lets you control the minimum number of unique points needed via the min_occs argument, and it produces a warning when the dataset contains species with fewer than min_occs records. For method='distance' it might be OK to set it to four.
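To check in advance which species would fall below the threshold, one can count the geographically unique records per species. The following is a minimal base R sketch using the data from the original report; the variable names (unique_xy, n_unique, too_few) are illustrative, and whether cc_outl counts unique records in exactly this way is an assumption.

```r
# Data from the original report: four records of one species
xy <- data.frame(
  longitude = c(0, 90, 10, -120),
  latitude  = c(0, 90, 10, -45),
  species   = "t1"
)

min_occs <- 7  # the default threshold mentioned above

# Keep one row per distinct species/coordinate combination
unique_xy <- unique(xy[, c("species", "longitude", "latitude")])

# Number of geographically unique records per species
n_unique <- table(unique_xy$species)

# Species with too few unique records to be tested
too_few <- names(n_unique)[n_unique < min_occs]

print(n_unique)
print(too_few)
```

Here t1 has only four unique records, so it falls below the default of seven and would presumably be left untested, which matches the behaviour reported above.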
Say I wanted to mark as outliers points that are more than 200km from the next nearest record.
Here is my real world example:
test = data.frame(latitude=c(-23.85,-23.51667,-0.983333,-25.71667,-0.7,-0.7,-0.7,-0.4,-27.08333,-23.36667,-21.1,-21.1,-21.1,-20.86667,-20.86667,-16.03333,-16.03333,-28.9,-31.4,-10.16667),
longitude=c(-46.15,-46.18333,-77.81667,-54.41667,-76.35,-76.35,-76.35,-76.61667,-53.13333,-46.18333,-44.5,-44.5,-44.5,-45.51667,-45.51667,-41.21667,-41.21667,-50.38333,-53.93333,-67.83333),
species="species.name")
dist_test = cc_outl(test, lon='longitude', lat='latitude', species='species', method = 'distance', tdi=200, value='flagged', verbose=FALSE)
print(dist_test)
We can test this another way using the same test data:
library(geosphere)
library(matrixStats)
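# pairwise geodesic distances; distGeo returns meters, so convert to km
dist_matrix = distm( test[,c('longitude', 'latitude')], fun=distGeo)/1000
# ignore zero distances (each point to itself, and exact duplicates)
dist_matrix[dist_matrix == 0] = NA
# TRUE if the nearest other record is less than 200 km away
near_pt_dist = colMins(dist_matrix, na.rm=TRUE) < 200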
print(near_pt_dist)
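If geosphere and matrixStats are not available, the same nearest-record check can be sketched in base R. This is only an approximation of the code above: it uses the haversine formula on a sphere of radius 6371 km instead of distGeo's ellipsoidal distance, so results can differ slightly. The function names haversine_km and nearest_km and the four example points are made up for illustration.

```r
# Great-circle distance in km (haversine formula, spherical Earth)
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Distance from each record to its nearest non-identical record
nearest_km <- function(lon, lat) {
  idx <- seq_along(lon)
  d <- outer(idx, idx, function(i, j) {
    haversine_km(lon[i], lat[i], lon[j], lat[j])
  })
  d[d == 0] <- NA  # drop self-distances and exact duplicates
  apply(d, 2, min, na.rm = TRUE)
}

# Illustrative points: three close together, one far away
lon <- c(0, 1, 0, -120)
lat <- c(0, 0, 1, -45)

nearest <- nearest_km(lon, lat)
within_200 <- nearest < 200  # TRUE = has a neighbour within 200 km
print(within_200)
```

As in the geosphere version, TRUE means the record has another record within 200 km, so FALSE marks the candidate outliers.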
Yes, good idea. Thanks for the alternative. I will close the issue now.