ropensci / CoordinateCleaner

Automated flagging of common spatial and temporal errors in biological and palaeontological collection data, for the use in conservation, ecology and palaeontology.

Home Page: https://docs.ropensci.org/CoordinateCleaner/

cc_outl method='distance' always returns TRUE

CyanBC opened this issue · comments

When running cc_outl with method='distance', the result is (usually) TRUE regardless of the tdi value:

xy <- data.table(longitude=c(0,90,10,-120),latitude=c(0,90,10,-45),species=c('t1','t1','t1','t1'))
cc_outl(xy, method='distance', tdi=10, value = 'flagged')
cc_outl(xy, method='distance', tdi=1000000, value = 'flagged')

Hi,

this is related to the minimum number of geographically unique records needed to estimate whether a record is an outlier; species with fewer unique records cannot be tested. The default is seven, so this works:

library(CoordinateCleaner)
library(data.table)
# eight geographically unique records, one of them far away from the rest
xy <- data.table(decimallongitude=c(0,1,2,3,2,1,0,-120), 
                 decimallatitude=c(3,2,1,3,2,1,1,-45),
                 species='t1')
cc_outl(xy, method='distance', tdi=10, value = 'flagged')
cc_outl(xy, method='distance', tdi=1000000, value = 'flagged')

This limitation was a bit unclear, so the latest version (2.0-7) lets you control the minimum number of unique points needed via the min_occs argument, and produces a warning when the dataset contains species with fewer than min_occs records. For method = 'distance' it might be OK to set it to four.
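To make the effect of the minimum-records rule concrete, here is a minimal base R sketch. It is not CoordinateCleaner's implementation: `haversine_km` and `flag_nearest` are hypothetical helpers, the distance is a spherical (haversine) approximation, and the choice to return TRUE for every record of an untested species is an assumption based on the behaviour reported above.

```r
# Great-circle distance in km on a sphere (approximation, assumed radius 6371 km)
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: TRUE = record passes, FALSE = flagged as outlier.
# A record is flagged when its nearest distinct record is more than tdi_km away.
flag_nearest <- function(lon, lat, tdi_km, min_occs = 7) {
  key <- paste(lon, lat)
  first <- !duplicated(key)
  ulon <- lon[first]
  ulat <- lat[first]

  # Assumption: too few unique records means nothing is tested,
  # so every record is returned as TRUE
  if (length(ulon) < min_occs) {
    warning("fewer than min_occs unique records; species not tested")
    return(rep(TRUE, length(lon)))
  }

  # Pairwise distances between unique coordinates
  n <- length(ulon)
  d <- outer(seq_len(n), seq_len(n),
             function(i, j) haversine_km(ulon[i], ulat[i], ulon[j], ulat[j]))
  diag(d) <- NA
  nearest <- apply(d, 1, min, na.rm = TRUE)

  # Map the per-coordinate result back to the original rows
  (nearest <= tdi_km)[match(key, key[first])]
}

lon <- c(0, 90, 10, -120)
lat <- c(0, 90, 10, -45)
untested <- suppressWarnings(flag_nearest(lon, lat, tdi_km = 10))
strict   <- flag_nearest(lon, lat, tdi_km = 10,  min_occs = 4)
lenient  <- flag_nearest(lon, lat, tdi_km = 1e6, min_occs = 4)
```

With the four points from the original report, `untested` is all TRUE because only four unique records are present and the default threshold is seven. Lowering `min_occs` to four lets the test run, and the result then depends on `tdi_km` as expected.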

Say I wanted to flag as outliers any points that are more than 200 km from their nearest neighbouring record.

Here is my real world example:

test = data.frame(latitude=c(-23.85,-23.51667,-0.983333,-25.71667,-0.7,-0.7,-0.7,-0.4,-27.08333,-23.36667,-21.1,-21.1,-21.1,-20.86667,-20.86667,-16.03333,-16.03333,-28.9,-31.4,-10.16667),
                  longitude=c(-46.15,-46.18333,-77.81667,-54.41667,-76.35,-76.35,-76.35,-76.61667,-53.13333,-46.18333,-44.5,-44.5,-44.5,-45.51667,-45.51667,-41.21667,-41.21667,-50.38333,-53.93333,-67.83333),
                  species=rep("species.name", 20))

dist_test = cc_outl(test, lon='longitude', lat='latitude', species='species', method='distance', tdi=200, value='flagged', verbose=FALSE)

print(dist_test)

We can test this another way using the same test data:

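library(geosphere)
library(matrixStats)
# pairwise geodesic distances, converted from meters to km
dist_matrix = distm(test[, c('longitude', 'latitude')], fun = distGeo) / 1000
# ignore zero distances (each point to itself, and exact duplicates)
dist_matrix[dist_matrix == 0] = NA
# TRUE where the nearest distinct record is less than 200 km away
near_pt_dist = colMins(dist_matrix, na.rm = TRUE) < 200
print(near_pt_dist)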

Yes, good idea. Thanks for the alternative. I will close the issue now.