AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

Improve geospatial_kosher filter

nickdos opened this issue · comments

Placeholder for Lee's comments on improving geospatial_kosher filter

As per Email: Definition of “Spatially valid”

We don't have an easily accessible definition of "Spatially valid", so I figured this needed to be addressed and amended. The definitions of the tests that the ALA runs can be found at https://biocache-ws.ala.org.au/ws/assertions/codes. The current formal definition of "Spatially valid" is isFatal: true and code < 10000, that is, the following tests/flags:

  1. Supplied coordinates are zero
  2. Suspected outlier
  3. Unable to convert UTM coordinates
  4. Zero latitude
  5. Unparseable verbatim coordinates
  6. Coordinates centre of country
  7. Geospatial issue ?? No idea what this means (from @charvolant - this is a geospatial user annotation)
  8. Coordinates are out of range for species ?? No idea what this means
  9. Outside expert range for species
  10. Supplied coordinates centre of state
  11. Decimal latitude/longitude conversion failed
  12. Zero longitude
  13. Habitat incorrect for species (presume marine species on land or vice-versa)

I would not have included 2, 6, 8, 9 or 10 as FATAL.

Also:
geospatial_kosher:false is applied when there is any failed assertion with code < 10000 (spatial assertions) and isFatal = true.
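The rule as stated can be sketched in a few lines. This is an illustrative sketch only: the dict keys and the example code numbers are assumptions, not biocache-store's actual data model.

```python
# Sketch of the geospatial_kosher rule described above: a record is flagged
# geospatial_kosher:false when any failed spatial assertion (code < 10000)
# has isFatal = true. Keys and code numbers are illustrative assumptions.

def geospatial_kosher(assertions):
    """True unless some fatal spatial assertion (code < 10000) failed."""
    return not any(
        a["failed"] and a["fatal"] and a["code"] < 10000
        for a in assertions
    )

# A single fatal spatial failure makes the record non-kosher; a failed
# assertion with code >= 10000 (non-spatial) does not.
flagged = [{"code": 5, "failed": True, "fatal": True}]
clean = [{"code": 5, "failed": False, "fatal": True},
         {"code": 20000, "failed": True, "fatal": True}]
print(geospatial_kosher(flagged), geospatial_kosher(clean))  # False True
```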

For 2. Suspected outlier – this is flagged if the record is marked as an outlier for more than 3 environmental layers.

For 8. We have an expert distribution polygon layer for the species, and the record falls outside this polygon.

Except that for 8 and 9 we only have expert distributions for birds and fish - correct? So there is no expert distribution to test against for the vast majority of species. I'd be happy to hear I'm wrong about that and that we have a lot more expert distribution polygons than I know of.

13 is a flawed test as well because I believe it relies on comparison to IRMNG. IRMNG doesn't allow a species to be both terrestrial and marine so you can get an 'error' for any species that can legitimately be found in both places - seals, penguins, migratory birds, shore birds, crocodiles, turtles...

And harping on my pet peeve - if we are going to nominate records as failing geospatial_kosher then we should be much clearer about whether the data provider can do something to fix the records or not.

I'd suggest we add a check for user annotations if the assertion type is geospatial_issue.

That way when a user flags a record with dodgy geospatial data, the record is removed from the results when geospatial_kosher=true.

For 2. Suspected outlier – this is flagged if the record is marked as an outlier for more than 3 environmental layers.

@M-Nicholls is already using this as one of the default filters for the UI changes in the DQ project, which suggests feedback from expert users has recommended this check should stay in.

@nickdos I wouldn't make that assumption - @M-Nicholls hasn't received detailed feedback from users yet about what individual tests should be in or out of the DQ filters.

For those wanting to review the correlation between the default filters currently in use on the data quality testing site and the list above, the URL to go to is:

https://biocache-dq-test.ala.org.au/occurrences/search?taxa=

Then you hover over the "i" to see what the tests are for each category.

A general comment: is there some confusion here about who DQ filters are supposed to assist?
The @elywallis comment that "we should be much clearer about whether the data provider can do something to fix the records or not" is the first mention I've heard in years that ALA might possibly and very politely let providers know their data has problems and they should do something about it.
What's missing from that is the "or" at the end, like "or we're going to exclude all your data". ALA isn't prepared to penalise delinquent data providers, just as ALA itself isn't penalised for publishing bad data.
Unless something's changed recently, DQ tests are strictly for the benefit of users, who can fix or not fix problematic records as they see fit. In this way ALA is no different from the appalling NHM in the UK - see https://iphylo.blogspot.com/2020/03/darwin-core-million-promo-best-and-worst.html

@elywallis @ansell Newer versions of IRMNG, e.g. https://www.irmng.org/export/2018/, have a different marine/terrestrial/freshwater model that could be used. The downside is that there are many fewer entries. According to Tony, 2014 was the last one to include species, since the information at the species level has now become stale.

There are two significant reasons why I raised this problem in my January report to Andre and Hamish (as posted above by @peggynewman).

The first is that users will not even know what rules are applied in different parts of the ALA, e.g., in the Spatial Portal, by default, 'Spatially valid' is set, so displayed records are filtered. In the Lists tool, viewing the associated records will not filter any records. User feedback suggests that this isn't obvious, and I agree. This same situation applies EVEN if you can find the rules behind the filter. It was not obvious even to me what (7) and (8) meant. If I am uncertain, general users will find it darn hard. I remember asking @nickdos at a meeting, and it took him a while to chase the rules down. Sorry Nick. This lack of easily accessed documentation is to me unacceptable, and provides good reason alone to question our 'DQ'.

The second issue is how we define a composite issue like 'spatially valid'. Right now, I am not happy (as noted in my email above seeding this issue) with including a number of these tests. This is more of a grey area. My point here is that IF we have some form of composite filter, we want to avoid eliminating potentially valid records. If we set a bar, then I suggest it has to be set at a point that minimizes the potential to eliminate 'valid' records. Where that point sits requires a discussion among informed users. This will be application dependent, as we all should know well by now. We should have a well-documented and well-publicized canned suite of potential filters, as we have started on for the CSDM project.

Personally, as I have mentioned to @M-Nicholls, I would prefer a BASIC filter that flagged the absence of any of the basic NAME-SPACE-TIME fields, as we have called them in the TDWG DQ project. Without at least a valid (dwc:taxonID or dwc:scientificName) and a valid (dwc:decimalLatitude and dwc:decimalLongitude) and a valid (dwc:eventDate or dwc:year) - you have serious problems. You have almost nothing to 'hang your hat on'.
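The BASIC NAME-SPACE-TIME filter described above can be sketched as a simple completeness predicate. The keys are the Darwin Core terms named in the comment; the function itself is a hypothetical illustration, not an existing ALA filter.

```python
# Sketch of a BASIC NAME-SPACE-TIME completeness check: a record needs at
# least a name, coordinates, and a date to be minimally usable. The keys
# are Darwin Core terms; the predicate itself is an assumption.

def has_name_space_time(rec):
    name = rec.get("taxonID") or rec.get("scientificName")
    space = (rec.get("decimalLatitude") is not None
             and rec.get("decimalLongitude") is not None)
    time = rec.get("eventDate") or rec.get("year")
    return bool(name and space and time)

ok = {"scientificName": "Macropus giganteus",
      "decimalLatitude": -35.3, "decimalLongitude": 149.1, "year": 1997}
print(has_name_space_time(ok))                       # True
print(has_name_space_time({"scientificName": "x"}))  # False: no space/time
```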

I have to thank @adam-collins for his wonderful work in getting the Spatial Portal and CSDM filtering tool done to testing. This not only makes all the records terms available as potential filters, but also starts to use a vocabulary manager to display associated definitions. This will make life a lot easier, and decisions a lot more explicit.

A short term fix could be to have a persistent button/link in the biocache UI that takes users to a starting page for documentation on searching the biocache.

On this page should be links to various material including:

I also think we could really improve the SOLR field reference pages to include links to more detailed explanations. In particular, the geospatial_kosher field should link off to a page on the wiki explaining how and what it does (the crux of this issue).

@elywallis and @charvolant - TDWG TG2 did retain a terrestrial vs marine test: tdwg/bdq#51. We understood, as with similar tests (e.g., coordinates in 'country') that buffers were required.

The group (John Wieczorek, Paul Morris, Paula Zermoglio, Arthur Chapman and I) were careful to include only what we thought were basic/fundamental/core tests. So, value was seen in this type of test.

It is another matter to ensure when a test flags a record, that the definition of that test is clear and well documented, and it is understood in the context of the results of all the other tests.

I have been dreaming of a time where we have a comprehensive set of 'expert distributions' that can be applied to help identify environmental outliers. They are STILL just potential outliers even so, due to environmental changes and ecological processes.

My 4c worth on this one.

@Tasilee If we want to maintain a data quality test based on terrestrial vs marine, we need a fairly comprehensive datasource for species and their habitats. If the datasource is not comprehensive, it may do more harm than good (even with buffers implemented).

As @charvolant says above, the IRMNG habitat datasource that we have relied on so far has lost significant species-level coverage since 2014. We are still using the 2014 version for now, but we will need to move away from that at some point soonish and work with the users of the species_habitats field (mostly AODN that we are aware of) to work on alternative queries or datasources.

I think test 13 (marine, terrestrial habitats) should be removed. It is a constant source of support tickets, with people indicating records being flagged as marine when on land and vice versa. As @Tasilee said, it needs far more smarts to be useful (like buffers), and our source of data (IRMNG) has removed data, making it even less useful than it was.

I agree that #13 should be removed - ALA actually adds the marine/terrestrial flag on the basis of species ID with the value coming from IRMNG (I presume). The value is not usually supplied in the data. So we certainly get things happening like ALA processing adds a value according to the species ID, then judges the habitat (that we've added) to be incorrect for the species and the test is failed. Nothing a data provider can do about it but now is judged to have a 'poorer' record according to the DQ test. Example I found recently was an albatross (supposedly terrestrial) found dead on a pier (judged to be a marine environment). I agree that a buffer would have helped in that case but the point still stands that the test isn't useful or informative.

@elywallis, I think it is more accurate to say that the current ALA implementation of test 13 has potential issues, but I think the principle of the test is a valid one (as TDWG TG2 has concluded).

I raised the 'spatially valid'/'geospatial kosher' issue as it is both poorly documented (to put it mildly) and not properly informed. The first needs an urgent fix (as in 'yesterday') in some form, and the second an informed discussion. My recommendations for dropping tests 2,6,8,9 and 10 from this flag were done from an ecological perspective and uncommon sense. There are others. Anyway, to me, eliminating dubious tests under this flag is a darn good thing!

Here's what I'm getting to as a series of actions:

  • Put documentation for what geospatial_kosher means in the knowledgebase.
    • As a side-note, do we have the facilities to link to the knowledge base with a URL? Still pushing something like https://nectar-vocab-test.ala.org.au/ as a long-term solution for this kind of stuff
  • Include 14 = zero latitude and zero longitude, since it's entirely possible that we can have something found on the equator but 0, 0 is almost certainly bogus
  • Implement geospatial_kosher as 1, 3, 5, 6, 7?, 10, 11, 14
  • As an option, implement a new term geospatial_suspect as not geospatial_kosher and 2, 4, 8, 9, 12
  • Leave 13 off, since it's too blunt an instrument
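The proposed test 14 (zero latitude AND zero longitude) is simple enough to sketch directly. Note the deliberate contrast with tests 4 and 12, which each flag a single zero coordinate:

```python
# Sketch of the proposed test 14: flag records at exactly (0, 0), the
# "null island" point in the Atlantic. Exact float comparison is used
# because a supplied "0" parses exactly; whether a tolerance should
# apply is an open assumption, not part of the proposal.

def zero_coordinates(lat, lon):
    """True only when BOTH latitude and longitude are zero."""
    return lat == 0.0 and lon == 0.0

print(zero_coordinates(0.0, 0.0))    # True
print(zero_coordinates(0.0, 145.2))  # False: zero latitude alone is test 4
```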

Yes to including 14 = zero lat and zero long partly because it's likely to be bogus but also because that point falls in the Atlantic ocean near Africa so is outside where you'd usually expect Australian organisations to be out collecting.

Test 7? Geospatial issue ?? No idea what this means - if we can't work out what that is or what it's testing, it has no place in our list

I wouldn't support introducing a "geospatial_suspect" test of 2, 4, 8, 9,12 as I think some of those tests are too poorly supported by reference data / reference library to test against to be worth including.
@Tasilee related to the above - yes, noting and agreeing with your previous comment that it's not the test concept that is poor but the current implementation. If we can improve the implementation then we could look at re-including the test.

Yes agree with leaving off 13 unless / until we can improve the test itself

TDWG TG2 does have a validation test for lat=lon=0: tdwg/bdq#87, so I think this is a worthwhile addition. This is a very rare event (even across GBIF records), so rejecting a record on this basis is a pretty safe bet.

As I pointed out in my original post (now in bold), I didn't know what test 7 was testing, 8 was defined by @djtfmartin as what I would have called 9, so we need some clear definitions of the unclear tests relating to this issue (so it is all in one place) before we make any recommendations on changes.

I would not use "kosher" at all, period, full stop! Please. If you want to use a term for pass or fail for this GROUP, then please use "spatially valid" or "spatially suspect" depending on polarity. "Geo" is redundant here.

As a general rule, TDWG TG2 did conclude that users related to 'issues' more than 'passes', even though it is symmetric. Key reason: One hopes to report far fewer issues than non-issues.

@Tasilee I'm extremely reluctant to get rid of geospatial_kosher as a term in this pass, although I recognise it's a poor term. I can find references to it in the BIE, biocache store, hub and service, spatial and regions service, downloads, install configurations, schemas. And those are the repositories I happen to know about.

If we change the term, each one of these repositories will need to be inspected for indirect references, modified, tested and redeployed.

Since I expect that this will all change with the new DQ and processing architecture, I think that would be a better time to normalise vocabularies, since we'll have everything in bits on the shop floor anyway.

An alternative to removing the field is to hardcode geospatial_kosher to true on all records temporarily so that user interfaces will not malfunction while a new field named geospatial_suspect that contains the new set of tested negative cases is created.

My role is to promote meaningful/useful principles, not how they happen or even if they can happen (but I get aggro every now and again if we look foolish for extended periods). The vocab manager is a wonderful start that we need to leverage wherever possible. @ansell's idea is interesting.

@ansell I don't think that would work, as the web services clients (ALA4R, etc.) would expect the field to be doing something, and hard-coding to true would change (break) the API so that it had no effect.

We could simply implement a new field geospatial_suspect and keep the old one around (as is) until we migrate a new web service version number.

We are expecting that when the infrastructure upgrade goes live, we'll use a new WS version and mark old fields as deprecated. We'll likely implement an API key requirement at the same time. May as well get the painful changes out in one go versus "many cuts" over time.

@nickdos My impression was that there would not be any API or data deprecations as part of the Stage 1 Upgrade project deployment. Has Stage 1 scope changed to the point where we are implementing data and API level changes with it? The "Upgrade" project has been split up into three stages so there will need to be three deployments, rather than one:

  1. biocache-store->la-pipelines where all data quality checks are implemented, and everything generated by la-pipelines is implemented in the same way that biocache-store functions to enable us to switch internally and verify that we are not losing functionality or data in the process.
  2. collections->registry
  3. biocache-service+everything else->??? with any deprecations necessary, after migrating all of the ALA products to whatever the new web services are and the data that GBIF have implemented

I don't agree that we would be "breaking" anything by hard-coding the geospatial_kosher field to true. The current test is scientifically unusable at this point based on our Scientific Advisor's advice, so there is no scientific merit in maintaining it in its current form for any period of time. By hardcoding it to true we maintain all UI operation functionality without effectively applying our unscientific tests to the data views that users receive.

Thanks @ansell - I'll tackle the breaking API bit first. I think we can tweak the current geospatial_kosher to make it "good enough" for @Tasilee et al., and still provide the intended functionality to keep the API valid - in effect think of it as a bug fix, where we improve the functionality. This would be the short-term fix. The longer term fix is to implement it from first principles based on the working group Lee was on - I see this being done by the DQ project with input from this issue.

There won't be any API deletions/modifications, but there will be some API additions, that we probably won't make a big deal of and mostly use internally. This is a short-term thing until stage 3 where the API will change completely. So the additions will (likely) get a version prefix but not be advertised. We will likely start communicating that the old API will be deprecated in preparation for the GBIF API changes in stage 3. Ideally the additions we're needing to add in stage 1 will be compatible with stage 3 API, as they will be mostly field name changes to use camelCase Darwin Core terms - instead of the underscore terms we currently use in SOLR.

I've put a draft KB article up in confluence under "What does geospatial_kosher mean?"

@nickdos Biocache-service does not have enough regression testing to enable simple low-cost changes to be made to any part of its query functionality. Any changes to its query functionality will be high risk and high cost.

Another reason to drop #13: Look at "habitat" in this record
https://biocache.ala.org.au/occurrences/c3fa5cb5-5945-42ee-81e6-c352e7b19125
and then check out Lawson Plain on Google Maps or Earth (-41.1 145.18333)
This passed the "Habitat incorrect for species" test.

I've been doing some experiments to see what including/excluding various filters gives us, particularly the centre of country/state tests, which are now the iffy ones.

To start with, I've been looking at Macropus, since it's got a reasonably wide distribution.

With no filter, there are 204254 records. https://biocache.ala.org.au/occurrences/search?taxa=Macropus&fq=#tab_mapView

Applying -geospatial_kosher:true gives 2401 records, largely scattered around the coast many of which are failing the habitat test. There's a lone sea-going kangaroo. https://biocache.ala.org.au/occurrences/search?taxa=Macropus&fq=-geospatial_kosher:true#tab_mapView and https://biocache.ala.org.au/occurrences/search?taxa=Macropus&fq=-geospatial_kosher:true&fq=biome:Marine

Setting geospatial_kosher:false gives 759 records, 730 of which fail the habitat test. https://biocache.ala.org.au/occurrences/search?taxa=Macropus&fq=geospatial_kosher:false&fq=biome:Marine

There are no centre of country records. There is only one centre of state record. https://biocache.ala.org.au/occurrences/cb37aeb6-72a8-4497-ac18-b9c465220c6b This one is almost certainly incorrect, since it's a Western Grey Kangaroo in Victoria.

Moving on to plants. I've tried Acacia https://biocache.ala.org.au/occurrences/search?taxa=Acacia#tab_mapView which gives 728,965 records

33,501 are not geospatially kosher https://biocache.ala.org.au/occurrences/search?taxa=Acacia&fq=-geospatial_kosher:true#tab_mapView Again, scattered around the coast with a few sea-going varieties. There are 62 with zero lat/long, 21 at the centre of the state and none centre of country.

The centre of state ones all seem to come from the National Herbarium of NSW https://biocache.ala.org.au/occurrences/search?q=lsid%3Ahttps%3A%2F%2Fid.biodiversity.org.au%2Ftaxon%2Fapni%2F51311124&fq=-geospatial_kosher%3Atrue&fq=assertions%3A%22coordinatesCentreOfStateProvince%22#tab_recordsView I've had a look at the first instances and they're collected in 1917, have a coordinate uncertainty of 10km, and generally look as if someone just said "NSW" and some processing plonked it in the middle.

Now for the general case. For the entire Atlas, there are 10,101,459 (!) not geospatially kosher records https://biocache.ala.org.au/occurrences/search?q=-geospatial_kosher%3Atrue (warning: this can upset the biocache service). 39,260 of these are incorrect habitat.

About 1.5 million of these are records with no geospatial_kosher status. https://biocache.ala.org.au/occurrences/search?q=-geospatial_kosher%3A*
A look at the first few suggests that's because there's no location data, eg. https://biocache.ala.org.au/occurrences/3b4a2a6a-686d-406d-a848-d9d9f64e174c

There are 483 centre of country records https://biocache.ala.org.au/occurrences/search?q=-geospatial_kosher%3Atrue&fq=assertions%3A%22coordinatesCentreOfCountry%22 All of these records are from outside Australia.

There are 681 centre of state records
https://biocache.ala.org.au/occurrences/search?q=-geospatial_kosher%3Atrue&fq=assertions%3A%22coordinatesCentreOfStateProvince%22#tab_recordsView about two thirds of which come from collections and mostly look like historical records.

Conclusions

In the big picture, it really doesn't matter whether we include or exclude the centre of country/state tests, since they have flagged a tiny number of records. However, the tests do seem to be catching records where someone (or something) has basically been pretty slack and just plonked something down.

The centre tests are very sensitive to exactly how a provider has placed something at the centre, since they expect a bounding box calculation rather than something more sophisticated like a centroid or Chebyshev centre. So I suspect that there are a lot of "centre" coordinates that aren't being caught. For example, I'm pretty sure these guys https://biocache.ala.org.au/occurrences/search?q=lsid%3Aurn%3Alsid%3Abiodiversity.org.au%3Aafd.taxon%3Ae9d6fbbd-1505-4073-990a-dc66c930dad6&qc=&wkt=MULTIPOLYGON(((131.85791015625+-24.319568189281284,131.85791015625%20-26.620452177301207,135.63720703125%20-26.620452177301207,135.63720703125%20-24.319568189281284,131.85791015625%20-24.319568189281284)))#tab_mapView are not lost, they've just been placed at some provider's idea of the centre of Australia.

Bottom line, it will do no harm to include the test and actually improve things slightly but we might as well leave them off, since they're probably missing a lot of stuff.
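The comparisons above can be scripted against the same occurrence-search endpoint used in the links. The sketch below only constructs the query URLs; the base URL is taken from this thread, and everything else is illustrative.

```python
from urllib.parse import urlencode

# Build occurrence-search URLs like the ones used in the experiments
# above. The endpoint comes from the links in this thread; the helper
# name and structure are illustrative assumptions.
BASE = "https://biocache.ala.org.au/occurrences/search"

def search_url(taxa, *filters):
    """Build a search URL with one fq parameter per filter."""
    params = [("taxa", taxa)] + [("fq", f) for f in filters]
    return BASE + "?" + urlencode(params)

print(search_url("Macropus", "-geospatial_kosher:true", "biome:Marine"))
```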

Thanks @charvolant - loving having some evidence to back up decisions. I wonder if you could run the same set of tests above with our favourite species - magpies? Given that the ALA holds so many records for magpies, I'd be interested to see how the data for that species stacks up.

Comparing -geospatial_kosher:true with geospatial_kosher:false, there is a difference of 1642 records. These 1642 records have no geospatial data (link) and thus are irrelevant to the discussion - I'd suggest keeping to geospatial_kosher:false for such comparison purposes.

I think the documentation should make this distinction clear as well. When you are applying geospatial_kosher:true you are excluding records that don't have any coordinate fields. But many of those records (with no coordinates) DO have either state & territory and country set, so are not completely void of spatial information. Are they invalid because they don't provide coordinates?

The rules above might need an additional entry of:
0. Record does not provide coordinate values (either latitude, longitude or decimalLatitude, decimalLongitude)
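The 1642-record difference comes from SOLR's negation semantics: a negated filter also matches records where the field was never set (no coordinates at all), whereas an explicit `false` filter does not. A tiny in-memory analogue, with made-up records:

```python
# Why -geospatial_kosher:true and geospatial_kosher:false give different
# counts: negation also matches records where the field is absent.
records = [
    {"id": 1, "geospatial_kosher": True},
    {"id": 2, "geospatial_kosher": False},
    {"id": 3},  # no coordinates, so the field was never set
]

# Analogue of -geospatial_kosher:true (not-true, including missing):
not_true = [r["id"] for r in records if r.get("geospatial_kosher") is not True]
# Analogue of geospatial_kosher:false (explicitly false only):
explicitly_false = [r["id"] for r in records if r.get("geospatial_kosher") is False]
print(not_true, explicitly_false)  # [2, 3] [2]
```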

@elywallis Ask and ye shall receive

All magpies 1,165,704 https://biocache.ala.org.au/occurrences/search?q=lsid:urn:lsid:biodiversity.org.au:afd.taxon:4e01d6fd-18c9-4169-92ab-7e6d6d9f023f#tab_recordsView

Of which 24,511 are not geospatially kosher https://biocache.ala.org.au/occurrences/search?q=lsid:urn:lsid:biodiversity.org.au:afd.taxon:4e01d6fd-18c9-4169-92ab-7e6d6d9f023f&fq=-geospatial_kosher:true#tab_recordsView or if you prefer 24,090 are treif https://biocache.ala.org.au/occurrences/search?q=lsid:urn:lsid:biodiversity.org.au:afd.taxon:4e01d6fd-18c9-4169-92ab-7e6d6d9f023f&fq=geospatial_kosher:false#tab_recordsView

This is a much lower percentage of non-kosher records than either Macropus or Acacia, which I suspect comes from the large number of GPSed citizen science sightings overwhelming the old stuff in collections.

Of the treif records, 24,086 have a dud habitat https://biocache.ala.org.au/occurrences/search?q=lsid:urn:lsid:biodiversity.org.au:afd.taxon:4e01d6fd-18c9-4169-92ab-7e6d6d9f023f&fq=geospatial_kosher:false&fq=biome:Marine#tab_mapView which is practically all of them.

Of the four that aren't habitat problems, they're all centre of state https://biocache.ala.org.au/occurrences/search?q=lsid:urn:lsid:biodiversity.org.au:afd.taxon:4e01d6fd-18c9-4169-92ab-7e6d6d9f023f&fq=geospatial_kosher:false&fq=-biome:Marine#tab_mapView These all look like upstream systems putting in a "whatever" coordinate. At least iNaturalist gives their record an uncertainty of 881km.

Incidentally, there's an odd location off the East coast which seems to pick up records for some reason. https://biocache.ala.org.au/occurrences/search?q=*:*&qc=&wkt=MULTIPOLYGON(((156.64306640625+-30.75835871256449,156.64306640625%20-34.31394984163214,162.61962890625%20-34.31394984163214,162.61962890625%20-30.75835871256449,156.64306640625%20-30.75835871256449)))#tab_mapView

Is there a "too uncertain" flag in the upcoming DQ stuff?

Thanks @charvolant. As @elywallis says, useful stats. I also agree with @nickdos that the kosher subtleties need to be made very clear - and to say that the TG2 group also has a test that in effect says 'no spatial information at all' (and one for taxa as well).

The fine tuning of 'kosher' (and I still hate typing that word!) is less important than ensuring that the implications/outcomes are in users' faces (i.e., that they are very hard to avoid). In short: "We have done this, because of this, for example". The last bit of that (examples), @charvolant, should be a fundamental part of the KB article. I had not noted that when I edited it.

And BTW, TG2 decided not to discriminate between "warning" and "error" as it was too subtle/grey in many places. We just do, in effect, flags. As we know from the CSDM project, and also @M-Nicholls work on DQ case studies (TDWG DQ TG3), the combination of 'flags' that communities will use is use-dependent. The ALA (and TG2) need to provide the tests that will provide the data for users to select from. Until recently, we have not exposed all the records fields, nor provided a simple method for users to see the data and the 'flags' as raw material on which to evaluate records.

There's an updated version of the KB article, with lots of examples and instructions on how to get back the old ways.

Given that the centre tests do almost nothing, I've dropped them off the standard geospatial_kosher test. I've also added the taxonomic issue flag, based on Nick's comments about people confusing geospatial and taxonomic issues.

We also have a taxonomic_kosher flag. This is, at present, always true. We could use this to identify things like detected outliers, etc. if we really wanted to.

Very nicely done @charvolant !

Sunday (yes I know) - and I had some niggling feeling about this text. Mea culpa: we still have a number of significant issues:

Coordinates are out of range for species. If the latitude or longitude is out of range (-90 to 90 for latitude, -180 to 180 for longitude), the record fails.

From Arthur Chapman and me: This has nothing to do with species!

Decimal latitude/longitude conversion failed. If we cannot convert the supplied decimal latitude and longitude to the WGS84 datum (EPSG 4326) that the ALA uses. 

From John Wieczorek: Does this mean that the supplied geodeticDatum can not be interpreted unambiguously to define a geodetic coordinate reference system?

Unparseable verbatim coordinates. If there isn't a decimal latitude and longitude supplied and if we cannot convert the text (verbatim) latitude and longitude to WGS84, then the record fails. Example

From John Wieczorek: This is mixing issues. The WGS84 part is independent, and taken care of by the above. Shouldn't this be, "Unparseable verbatim coordinates. If there isn't a decimal latitude and longitude supplied and if we cannot convert the text (verbatim) latitude and longitude to decimal degrees, then the record fails." ? 
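The conversion John describes - verbatim text coordinates to decimal degrees - can be sketched with a small parser. Real verbatim data is far messier; the regex, the supported formats, and the function name are all illustrative assumptions, not the biocache-store implementation.

```python
import re

# Minimal sketch of parsing verbatim DMS coordinates to decimal degrees.
# Only a degrees/minutes/seconds-plus-hemisphere pattern is handled here.
_DMS = re.compile(
    r"""(?P<deg>\d+)[°d]\s*
        (?:(?P<min>\d+)['m]\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)["s]\s*)?
        (?P<hemi>[NSEW])""",
    re.VERBOSE,
)

def parse_verbatim(text):
    """Return decimal degrees, or None if the text is unparseable."""
    m = _DMS.fullmatch(text.strip())
    if not m:
        return None
    value = (int(m["deg"])
             + int(m["min"] or 0) / 60
             + float(m["sec"] or 0) / 3600)
    return -value if m["hemi"] in "SW" else value

print(parse_verbatim("41°06'S"))  # southern hemisphere -> negative
```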

I've added these as comments to the doco. At least this example is making me feel good about how much time (as in at least 1-2 person-years!) has been put into the definitions and other parameters of the TG2 tests!

Also , we need to be more explicit about

"centre of the country" is defined in the ALA as the centre point of the bounding box of the country"

  • Does the bounding box include a buffer, or Tasmania, or ...?
  • Is the 'centre' defined by a rectangular or circular buffer and if so, of what spatial extent?

Ditto centre of state/territory
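One way to pin down the bounding-box-centre test that these questions are probing: make the centre calculation and the tolerance explicit. Both the bounding-box definition of 'centre' and the tolerance value here are assumptions that the documentation would need to state.

```python
# Sketch of a 'centre of bounding box' test with an explicit tolerance.
# bbox = (min_lat, min_lon, max_lat, max_lon); tol is in degrees.
# The tolerance and the bbox-centre definition are assumptions.

def is_bbox_centre(lat, lon, bbox, tol=0.001):
    min_lat, min_lon, max_lat, max_lon = bbox
    c_lat = (min_lat + max_lat) / 2
    c_lon = (min_lon + max_lon) / 2
    return abs(lat - c_lat) <= tol and abs(lon - c_lon) <= tol

# Rough Australia bounding box (illustrative numbers only).
aus = (-43.6, 113.3, -10.7, 153.6)
print(is_bbox_centre(-27.15, 133.45, aus))  # True: the bbox centre point
print(is_bbox_centre(-33.87, 151.21, aus))  # False: Sydney
```

Note this would not catch a provider who used a centroid or some other idea of "centre", which is exactly the sensitivity raised in the Macropus/Acacia experiments above.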

Just in case anyone thought dealing with spatial buffers was easy, here is a recent phrasing of the Expected response from one of the TDWG TG2 core tests (and similar will apply to the terrestrial-marine test)-

AMENDMENT_COUNTRYCODE_FROM_COORDINATES

"EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority service was not available; INTERNAL_PREREQUISITES_NOT_MET if a) dwc:decimalLatitude or dwc:decimalLongitude is EMPTY, or b) the dwc_decimalLatitude and dwc:decimalLongitude can not be unambiguously transformed to the coordinate reference system of the bdq:sourceAuthority and the location given by dwc_decimalLatitude and dwc:decimalLongitude in the coordinate reference system of the bdq:sourceAuthority lies within the boundaries of a country code or EEZ feature, but within a distance to the nearest border less than the maximum possible datum shift between any geodetic coordinate reference system and the coordinate reference system of the bdq:sourceAuthority, or c) the location given by dwc:decimalLatitude, dwc:decimalLongitude, and dwc:geodeticDatum lies further outside the boundaries of any country code or EEZ feature than the dwc:coordinateUncertaintyInMeters plus the bdq:spatialBufferInMeters, or d) the location given by dwc:decimalLatitude, dwc:decimalLongitude, and dwc:geodeticDatum lies outside the boundaries of any country code or EEZ feature, but equally close to more than one country code feature or to more than one EEZ feature; FILLED_IN if the value of dwc:countryCode was EMPTY and was unambiguously inferred from the values of dwc:decimalLatitude, dwc:decimalLongitude, dwc:geodeticDatum, dwc:coordinateUncertaintyInMeters, and bdq_spatialBufferInMeters against a country code or EEZ feature defined by bdq:sourceAuthority; otherwise NOT_CHANGED."

This also gives a fair idea on why TG2 has been a serious lot of work. As in at least 2 person years so far.

Pull request in #376