ropensci / CoordinateCleaner

Automated flagging of common spatial and temporal errors in biological and palaeontological collection data, for the use in conservation, ecology and palaeontology.

Home Page: https://docs.ropensci.org/CoordinateCleaner/

Bug: wrong record flagged in clean_coordinates using data.frame

mtalluto opened this issue

I'm running into a weird issue when running clean_coordinates; a reproducible example is below. I have a data set with an obvious outlier. The basic workflow is: (1) get data from GBIF, (1a) convert the tibble to a data.frame, (2) remove NA coordinates using subset, (3) run clean_coordinates.

If I skip step 1a (i.e., send a tibble to clean_coordinates instead of a data.frame), the outlier is correctly flagged. If I do step 1a, a record is flagged, but it is the wrong row. If I convert to data.frame AFTER subset, everything is fine. If I use square brackets instead of subset.data.frame, everything is fine. I have verified that the data in all columns are identical after subsetting, regardless of whether I subset the tibble or the data.frame.

R 3.6.3, macOS 10.15.7, rgbif 3.5.2, CoordinateCleaner 2.0.18, tibble 3.0.4.

library(rgbif)
library(CoordinateCleaner)

dat = occ_search(scientificName = "Sorex alpinus", limit=250)$data
dat_df = as.data.frame(dat)
dat_df_no_subset = as.data.frame(dat)

dat = subset(dat, !is.na(decimalLatitude))
dat_df = subset(dat_df, !is.na(decimalLatitude))
dat_df_no_subset = dat_df_no_subset[!is.na(dat_df_no_subset$decimalLatitude),]

## all of the data in all 3 tables is identical, as expected
all(mapply(identical, dat, dat_df))
all(mapply(identical, dat, dat_df_no_subset))

cl = clean_coordinates(dat, lon="decimalLongitude", lat="decimalLatitude", tests="outliers")
cl_df = clean_coordinates(dat_df, lon="decimalLongitude", lat="decimalLatitude", tests="outliers")
cl_conv_late = clean_coordinates(as.data.frame(dat), lon="decimalLongitude", 
	lat="decimalLatitude", tests="outliers")
cl_no_subset = clean_coordinates(dat_df_no_subset, lon="decimalLongitude", 
	lat="decimalLatitude", tests="outliers")

## different records are flagged, but only if converting to data frame before using subset
which(! cl$.summary)
which(! cl_df$.summary)
which(! cl_conv_late$.summary)
which(! cl_no_subset$.summary)
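
One thing the column-wise `identical` check above cannot see is row names: `mapply(identical, ...)` compares the columns one by one and ignores attributes of the data frame itself. The sketch below uses only base R and a small made-up data frame (hypothetical values, not the GBIF data) to show that `subset` on a plain data.frame keeps the original row names, so two tables can have identical columns and still differ as objects. Whether row names are what leads clean_coordinates to flag the wrong row here is an assumption, not something confirmed in this report.

```r
# Toy data frame with one missing latitude (hypothetical values, not GBIF data)
df <- data.frame(
  decimalLongitude = c(10.1, 10.2, 10.3, 10.4),
  decimalLatitude  = c(46.1, NA, 46.3, 46.4)
)

# subset() on a plain data.frame keeps the original row names
by_subset <- subset(df, !is.na(decimalLatitude))

# Same rows, but with row names reset to 1..n
renumbered <- by_subset
rownames(renumbered) <- NULL

# Column-wise comparison only looks at the column vectors
cols_identical <- all(mapply(identical, by_subset, renumbered))

# Whole-object comparison also checks attributes, including row names
objects_identical <- identical(by_subset, renumbered)

print(cols_identical)        # TRUE
print(objects_identical)     # FALSE
print(rownames(by_subset))   # "1" "3" "4"
print(rownames(renumbered))  # "1" "2" "3"
```

Comparing `rownames(dat_df)` with `rownames(as.data.frame(dat))` after the subsetting step in the script above would show whether the two inputs differ in this way.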