cc_outl: looping over many species
HMB3 opened this issue · comments
Hi,
My name is Hugh and I've got a big data set of occurrences from GBIF and ALA.
There are about 3.8k species, so lots of points to clean!
I'm using the `CleanCoordinates` function with these settings:
```r
library(CoordinateCleaner)
library(dplyr)  # provides %>% and bind_rows()

# Simulated example data
minages <- runif(250, 0, 65)
exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
                    decimallongitude = runif(250, min = 42, max = 51),
                    decimallatitude = runif(250, min = -26, max = -11),
                    min_ma = minages,
                    max_ma = minages + runif(250, 0.1, 65),
                    dataset = "clean")

exmpl <- exmpl %>%
  timetk::tk_tbl()

# Run the standard tests
FLAGS <- CleanCoordinates(exmpl,
                          capitals.rad = 0.12,
                          countrycheck = TRUE,
                          duplicates = TRUE,
                          seas = FALSE,
                          verbose = FALSE)
```
However, running the spatial outlier detection inside this call is slow because
there are so many records.
So I'm running the outlier detection separately, one species at a time, like this:
```r
SPAT.OUT <- as.character(unique(exmpl$species)) %>%
  lapply(function(x) {
    # Records for the current species only
    f <- subset(exmpl, species == x)
    message("Running spatial outlier detection for ", x)
    message(nrow(f), " records for ", x)

    sp.flag <- cc_outl(f,
                       lon     = "decimallongitude",
                       lat     = "decimallatitude",
                       species = "species",
                       method  = "distance",
                       tdi     = 300,  # get points 300 km from other points?
                       value   = "flags",
                       verbose = FALSE)

    # Keep only the species name and the outlier flag for each record
    cbind(searchTaxon = x, SPAT_OUT = sp.flag, f)[c("searchTaxon", "SPAT_OUT")]
  }) %>%
  bind_rows()
```
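To make clear what I expect `method = "distance"` with `tdi = 300` to do, here is a minimal base R sketch of my understanding: a record counts as an outlier only if its nearest neighbour of the same species is more than `tdi` km away. This is an assumption on my part, not the package's actual code, and `haversine_km` and `is_isolated` are names I made up for illustration.

```r
# Great-circle distance in km between lon/lat points (haversine formula).
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper (not part of CoordinateCleaner):
# TRUE where a record is farther than `tdi` km from every other record.
is_isolated <- function(lon, lat, tdi = 300) {
  n <- length(lon)
  if (n < 2) return(rep(FALSE, n))
  # Full n x n distance matrix; its size grows with the square of n.
  d <- outer(seq_len(n), seq_len(n),
             function(i, j) haversine_km(lon[i], lat[i], lon[j], lat[j]))
  diag(d) <- NA  # ignore each record's distance to itself
  apply(d, 1, min, na.rm = TRUE) > tdi
}

# Three records close together and one far away from the rest.
lon <- c(45.0, 45.1, 45.2, 50.9)
lat <- c(-20.0, -20.1, -20.2, -11.5)
is_isolated(lon, lat, tdi = 300)
```

Under this reading only the last record is isolated. If that is how the test works, a species with two or more distant clusters of records would have nothing flagged, since every record still has a close neighbour.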
I can run this check for about 1000 species at a time, but it uses a lot of RAM (>60 GB) in some
cases. It also doesn't seem to flag many records for some species.
Are there better settings I could use? I need to keep the format of looping over all the species.
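My guess at where the RAM goes, in case it helps: if the check builds a full pairwise distance matrix per species (an assumption, I have not confirmed this in the source), memory grows with the square of the record count. The sketch below estimates that size and shows how I split the species into batches. The record counts and the helper names `dist_matrix_gb` and `make_batches` are made up for illustration.

```r
# Approximate size in GB of a dense n x n matrix of doubles (8 bytes per cell).
dist_matrix_gb <- function(n_records) {
  n_records^2 * 8 / 1024^3
}

# Hypothetical record counts per species, for illustration only.
n_records <- c(sp_a = 500, sp_b = 20000, sp_c = 90000)
round(dist_matrix_gb(n_records), 2)

# Split a vector of species names into batches of at most `batch_size`.
make_batches <- function(species, batch_size = 1000) {
  split(species, ceiling(seq_along(species) / batch_size))
}

# About 3.8k species in batches of 1000.
batches <- make_batches(paste0("sp_", 1:3800), batch_size = 1000)
lengths(batches)
```

On this estimate a single species with around 90,000 records would need roughly 60 GB for the matrix alone, so the peak would depend on the largest species in a batch and not on how many species the batch contains.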
Thanks very much - any advice is greatly appreciated!
h
Sorry I can't quite figure out the formatting :]