ropensci / CoordinateCleaner

Automated flagging of common spatial and temporal errors in biological and palaeontological collection data, for the use in conservation, ecology and palaeontology.

Home Page: https://docs.ropensci.org/CoordinateCleaner/

cc_outl looping over many species

HMB3 opened this issue · comments

Hi,

My name is Hugh and I've got a big data set of occurrences from GBIF and ALA.
There are about 3.8k species, so lots of points to clean!

I'm using the CleanCoordinates function with these settings:

```r
library(CoordinateCleaner)
library(dplyr)  # provides the %>% pipe used below

## simulated example data
minages <- runif(250, 0, 65)
exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
                    decimallongitude = runif(250, min = 42, max = 51),
                    decimallatitude = runif(250, min = -26, max = -11),
                    min_ma = minages,
                    max_ma = minages + runif(250, 0.1, 65),
                    dataset = "clean")

## convert to a tibble
exmpl <- exmpl %>%
  timetk::tk_tbl()

FLAGS <- CleanCoordinates(exmpl,
                          capitals.rad = 0.12,
                          countrycheck = TRUE,
                          duplicates = TRUE,
                          seas = FALSE,
                          verbose = FALSE)
```

However, running the spatial outlier detection here is a bit slow, because there
are too many records.

So I'm running the outlier detection separately, one species at a time, in this form:

```r
SPAT.OUT <- as.character(unique(exmpl$species)) %>%
  lapply(function(x) {

    ## subset the records for one species
    f <- subset(exmpl, species == x)

    message("Running spatial outlier detection for ", x)
    message(dim(f)[1], " records for ", x)

    sp.flag <- cc_outl(f,
                       lon     = "decimallongitude",
                       lat     = "decimallatitude",
                       species = "species",
                       method  = "distance",
                       tdi     = 300,  ## get points 300km from other points?
                       value   = "flags",
                       verbose = FALSE)  ## logical, not the string "FALSE"

    ## keep only the taxon name and the outlier flag for each record
    d <- cbind(searchTaxon = x,
               SPAT_OUT = sp.flag, f)[c("searchTaxon", "SPAT_OUT")]
    return(d)

  }) %>%
  bind_rows
```

I can run this check for about 1000 species at a time, but it uses a lot of RAM (>60 GB) in some
cases. It also doesn't seem to flag many records for some species.
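
A minimal sketch of the batching described above, in base R only. `flag_species` and `run_in_batches` are hypothetical names, not part of CoordinateCleaner. `flag_species` stands in for the `cc_outl` call in the loop above and here simply marks every record as unflagged, so the sketch shows only how the species are split and recombined, not the outlier test itself.

```r
# Hypothetical stand-in for the per-species cc_outl() call;
# returns one logical value per record (TRUE = not flagged).
flag_species <- function(records) {
  rep(TRUE, nrow(records))
}

# Split the species into fixed-size batches and process one batch at a time.
run_in_batches <- function(occ, batch_size = 1000) {
  taxa    <- as.character(unique(occ$species))
  batches <- split(taxa, ceiling(seq_along(taxa) / batch_size))

  results <- lapply(batches, function(batch) {
    per_taxon <- lapply(batch, function(x) {
      f <- occ[occ$species == x, , drop = FALSE]
      data.frame(searchTaxon = x,
                 SPAT_OUT    = flag_species(f),
                 stringsAsFactors = FALSE)
    })
    do.call(rbind, per_taxon)
  })

  do.call(rbind, results)
}

# Toy data with the same column names as the example above
set.seed(1)
toy <- data.frame(species          = sample(letters, 250, replace = TRUE),
                  decimallongitude = runif(250, 42, 51),
                  decimallatitude  = runif(250, -26, -11))

out <- run_in_batches(toy, batch_size = 10)
```

As written, all batch results are still held in memory before being combined; saving each batch to disk before moving on would be one possible variation, though whether that helps with the RAM use reported here is untested.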

Are there better settings I could use? I need to keep the format of looping over all the species.

Thanks very much - any advice is greatly appreciated!

h

Sorry I can't quite figure out the formatting :]