Convert Bird Tracking dataset to sample event format

kbraak opened this issue 8 years ago · comments

One of the recommendations identified during the EU Nodes Workshop in Lisbon 18-19 April 2016, was to try and investigate how to represent live collections as a kind of monitoring event dataset, whereby a single event record would exist for each individual being tracked. That event record would then have all the associated occurrences showing where that individual was tracked through time.

INBO’s Bird Tracking dataset instantly came to my mind. I wanted to pitch the idea of trying to convert this occurrence dataset to sample event format.

Here’s how I would represent a sampling event for Harry for example:

eventID=L907322
eventDate=Date range Harry was tracked
sampleSizeValue/sampleSizeUnit & samplingEffort=Number of days/years Harry was tracked
samplingProtocol=doi:10.1007/s10336-012-0908-1

The Occurrences would be like you have them, except they would all relate to a sampling event via the eventID, allowing you to look up more information about Harry’s tracking, such as:

How long Harry has been tracked?
A map showing all of Harry’s occurrences where he has been tracked

In the occurrence format these kinds of inquiries are more difficult to answer without going through all Harry’s records.
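A minimal sketch of how such an event record could be derived from the occurrence records of one individual. The occurrence rows, their timestamps, and the helper names `parse_timestamp` and `build_event` are assumptions made for illustration; only the eventID, the Darwin Core term names, and the protocol DOI come from the proposal above.

```python
from datetime import datetime

# Hypothetical occurrence records for one bird; the values are illustrative only.
occurrences = [
    {"occurrenceID": "occ-1", "organismID": "L907322", "eventDate": "2013-06-01T10:00:00Z"},
    {"occurrenceID": "occ-2", "organismID": "L907322", "eventDate": "2013-06-01T10:30:00Z"},
    {"occurrenceID": "occ-3", "organismID": "L907322", "eventDate": "2013-06-04T08:15:00Z"},
]


def parse_timestamp(value):
    """Parse an ISO 8601 UTC timestamp that ends in 'Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def build_event(event_id, records, protocol):
    """Summarise all records of one individual as a single event record."""
    times = sorted(parse_timestamp(r["eventDate"]) for r in records)
    first, last = times[0].date(), times[-1].date()
    return {
        "eventID": event_id,
        # Date range expressed as an ISO 8601 interval (start/end).
        "eventDate": f"{first.isoformat()}/{last.isoformat()}",
        # Number of calendar days covered, counting both end points.
        "sampleSizeValue": (last - first).days + 1,
        "sampleSizeUnit": "day",
        "samplingProtocol": protocol,
    }


event = build_event("L907322", occurrences, "doi:10.1007/s10336-012-0908-1")
# Each occurrence points back to the event through eventID.
linked = [dict(r, eventID=event["eventID"]) for r in occurrences]
print(event)
```

With the event record in place, "how long has Harry been tracked?" becomes a lookup of one row instead of a scan over every occurrence.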

I’m afraid I haven’t gone through all of Harry’s tracking records, nor read the sampling protocol in depth. Presumably, however, the sampling protocol dictates how long sampling/tracking should take place, such as a single migration season. Therefore assuming Harry’s migrations have been observed once a year, every year, between 2013 and 2015 using the same sampling protocol, it would probably be appropriate to have a single sampling event for each year:

parentEventID=L907322
eventID=L907322-2013 & parentEventID=L907322
eventID=L907322-2014 & parentEventID=L907322
eventID=L907322-2015 & parentEventID=L907322

Notice each sampling event relates to an (abstract) parent event with eventID L907322 representing Harry’s collective tracking.
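The parent/child layout can be sketched as follows, assuming one child event per calendar year. The records and the function name `assign_yearly_events` are hypothetical; the identifier pattern follows the example above.

```python
from collections import defaultdict

PARENT_EVENT_ID = "L907322"

# Hypothetical occurrence records; only the year of eventDate matters here.
records = [
    {"occurrenceID": "occ-1", "eventDate": "2013-06-01T10:00:00Z"},
    {"occurrenceID": "occ-2", "eventDate": "2014-05-20T09:00:00Z"},
    {"occurrenceID": "occ-3", "eventDate": "2014-07-02T12:00:00Z"},
    {"occurrenceID": "occ-4", "eventDate": "2015-04-11T06:30:00Z"},
]


def assign_yearly_events(records, parent_event_id):
    """Create one child event per year and link each occurrence to its child."""
    by_year = defaultdict(list)
    for record in records:
        by_year[record["eventDate"][:4]].append(record["occurrenceID"])

    # The abstract parent event has no parent of its own.
    events = [{"eventID": parent_event_id, "parentEventID": None}]
    links = {}
    for year in sorted(by_year):
        event_id = f"{parent_event_id}-{year}"
        events.append({"eventID": event_id, "parentEventID": parent_event_id})
        for occurrence_id in by_year[year]:
            links[occurrence_id] = event_id
    return events, links


events, links = assign_yearly_events(records, PARENT_EVENT_ID)
for row in events:
    print(row)
```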

Looking at the parent event, you could then see a list of all Harry’s sampling events over time, and maybe even compare his tracking by toggling layers on a map - one for each year/sampling event. Of course GBIF.org can’t show this page yet, but we’re busy trying to figure out how to visualise sample event data and your dataset provides a lot of inspiration.

During the workshop, we also discussed what should happen when an existing occurrence dataset gets republished as a sample event dataset. According to the IPT Versioning Policy based on DataCite’s recommendations, the dataset should only be assigned a new DOI if it has undergone scientifically significant changes. In my opinion, if you were to convert your bird tracking dataset into sample event format, we’d just be adding information, not changing existing information about each occurrence record, so I don't think it should be assigned a new DOI. Of course, it could be cleaner just to assign it a new DOI, and this would give you the freedom to update the dataset any which way you like.

Anyways, I’d love to hear your feedback on the idea of converting this to sample event format. It would definitely be a valuable exemplar sample event dataset, and I’d be happy to work together with the INBO team to make it happen.

Best regards,
Kyle

PS: Some exemplar sample event datasets are available at the following URLs:

Peter Desmet · Answer 1 · Fri Apr 29 2016 23:14:45 GMT+0800 (China Standard Time)

Hi @kbraak,

Some quick thoughts:

I immediately thought about the DOI too. I would keep it, so the data paper can still refer to the same dataset. As we will probably have to make some weird jumps to update an occurrence dataset to a sample dataset on an IPT, I was wondering if we could manually force the version to change from 5.5 to 6.0 in the process?
Are you sure you want to "misuse" event for an individual? Because there is more information you could add to that Event/Individual (weight, lifeStage, etc.) rather than repeating it for every occurrence, but it might be confusing to mix those concepts.
The sampling protocol is not time based: we just keep those trackers running until they die, so it doesn't really make sense to split them per year. What does change is the tracking effort (every 30 minutes, every 5 minutes, etc., currently indicated in samplingEffort: {"secondsSinceLastOccurrence":1816}), but the protocol for this can be complicated (e.g. change this when the bird gets near its nest), which would potentially create a whole lot of different Events, which is not that problematic if they are all linked to a parent id. More importantly, I don't think we have those tracking settings in our database (would have to ask), so a single tracking Event is maybe more appropriate.

Dag Endresen · Answer 2 · Sat Apr 30 2016 01:00:53 GMT+0800 (China Standard Time)

One quick reflection: "L907322" (with more) looks more like a dwc:fieldNumber type of value than a dwc:eventID. I know Darwin Core does not demand globally unique persistent identifiers for the ID terms (such as eventID) but for examples I think it might be a good practice to follow.

Peter Desmet · Answer 3 · Tue May 03 2016 15:22:51 GMT+0800 (China Standard Time)

@dagendresen L907322 is actually the organismID: it's the code on the metal ring around the leg of the bird.

@kbraak let me know what you think about my quick thoughts. If it wasn't clear: I'm certainly open to this exercise... we'll just have to do it in a way that makes the most sense. :-)

Dag Endresen · Answer 4 · Tue May 03 2016 15:32:44 GMT+0800 (China Standard Time)

@peterdesmet Would it not be cool if we had persistent identifiers for organismID - and perhaps "L907322" could be the organismName?

Dag Endresen · Answer 5 · Tue May 03 2016 15:42:23 GMT+0800 (China Standard Time)

I am not convinced that each position for the bird here is a dwc:Occurrence. Remember that dwc:Occurrence is not the same as an organism occurring somewhere, but rather a type of evidence or token of that organism. The log-file for one "flight"-trip of the bird could rather more appropriately be the dwc:Occurrence (and/or dwc:Event ?) - and the dwc:footprintWKT (for the flight) be the location rather than the longitude and latitude coordinates?
I also agree with @peterdesmet that using dwc:Event for that bird - and thus for ALL flights of the bird - would be abusing this term. The bird and the event are not the same thing.

Peter Desmet · Answer 6 · Tue May 03 2016 15:42:24 GMT+0800 (China Standard Time)

@dagendresen, I would argue that is a persistent identifier: it's carved in metal, registered and used internationally.

There were two other codes I could have used: the GPS tracker serial number and the code on the plastic blue ring, but their use is more specific. Also, we give the birds names, so organismName is reserved. 😄

Dag Endresen · Answer 7 · Tue May 03 2016 15:46:36 GMT+0800 (China Standard Time)

@peterdesmet with "persistent identifier" I mean a globally unique and preferably resolvable identifier. Just because the ring number is a controlled number within the bird ringing community does not make it a "persistent identifier". In the same way, a catalog number that is unique and permanent within a natural history museum is not a "persistent identifier". Catalog numbers are also often carved in metal and used internationally ;-)

Dag Endresen · Answer 8 · Tue May 03 2016 15:52:42 GMT+0800 (China Standard Time)

@peterdesmet Could perhaps the names you assign to the birds go in the dwc:organismRemarks and the "L907322" numbers still go in the dwc:organismName? Only because the dwc:organismName includes the English word "name" in the term name, does not mean that it is strictly meant for such "names"...? I will argue that Darwin Core would be more consistent if the term instead was e.g. "dwc:organismCode" - however there are many more inconsistently named Darwin Core terms...

Peter Desmet · Answer 9 · Tue May 03 2016 16:22:20 GMT+0800 (China Standard Time)

Regarding the identifier

It seems odd to me to bump this identifier to organismName if we actually have a name for the organism.
I know you want to keep organismID reserved for the ultimate identifiers, but on a pragmatic level, L907322 is the most logical one, and it conforms to the definition:

An identifier for the Organism instance (as opposed to a particular digital record of the Organism). May be a globally unique identifier or an identifier specific to the data set.

In fact, it's one that is not only used in the digital world, which is why I think it's a good one: chances are high that if someone out there wants to say something else about this bird, they are going to use this identifier.

There are to my knowledge no better alternative identifiers out there for this bird.

Occurrence

I am not convinced that each position for the bird here is a dwc:Occurrence. Remember that dwc:Occurrence is not the same as an organism occurring somewhere, but rather a type of evidence or token of that organism. The log-file for one "flight"-trip of the bird could rather more appropriately be the dwc:Occurrence - and the dwc:footprintWKT be the location rather than the longitude and latitude coordinates?

Each record in the log file is evidence that the bird occurred at that place: that's definitely one way we use the data, e.g. to calculate how much time they spend at certain places, much like regular observations.
It's much easier in the data to identify an occurrence (i.e. one record) than to identify a trip: where does it start and end? The trackers are on constantly.
If you express the data as "flights", you will need to aggregate information in the WKT, such as the altitude and date time, otherwise you lose it. In addition, it becomes less rapidly available: for example, it won't be shown on the GBIF maps and needs more processing before you can actually use it (for which you likely want to deaggregate it again).
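A small sketch of the information loss described here, assuming three hypothetical GPS fixes. The key `altitudeInMeters` and the function `to_linestring` are illustrative names, not Darwin Core terms or part of the published dataset.

```python
# Hypothetical GPS fixes for one "flight"; coordinates and times are made up.
fixes = [
    {"decimalLongitude": 3.18, "decimalLatitude": 51.34,
     "eventDate": "2013-06-01T10:00:00Z", "altitudeInMeters": 12},
    {"decimalLongitude": 3.2, "decimalLatitude": 51.35,
     "eventDate": "2013-06-01T10:30:00Z", "altitudeInMeters": 85},
    {"decimalLongitude": 3.25, "decimalLatitude": 51.4,
     "eventDate": "2013-06-01T11:00:00Z", "altitudeInMeters": 40},
]


def to_linestring(fixes):
    """Aggregate point fixes into a single 2D WKT LINESTRING (longitude latitude)."""
    coords = ", ".join(
        f"{f['decimalLongitude']} {f['decimalLatitude']}" for f in fixes
    )
    return f"LINESTRING ({coords})"


footprint_wkt = to_linestring(fixes)
print(footprint_wkt)
# The per-fix timestamp and altitude are no longer present in the aggregate:
# recovering them would require keeping the original point records as well.
```

The footprint keeps the path but not when the bird was at each vertex, which is the attribute needed for questions such as time spent at a place.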

Dimi Brosens · Answer 10 · Tue May 03 2016 17:58:48 GMT+0800 (China Standard Time)

My2cents...

I do feel relatively comfortable when it comes to sample based data.... I was really reserved in the beginning, but now it looks like an intelligent and more sophisticated way to present Biodiversity data. We actually have a lot of datasets here at INBO, which you would undoubtedly publish as sample based data.
For me this means: in the eventCore you give the information related to an event (we went there and did this or that), in the occurrenceExtension you provide info on the occurrences you recorded, and maybe there are one or more extensions (measurements or facts mostly) where you can place some more information.

Right, looking at the gull tracking dataset, I do not have the gut feeling that we should present this data as a sample based dataset. What was done was very straightforward: put a tracker on a bird and receive the information from the tracker... for me this data looks like real occurrence data (machine observation).

It is not that we go to a certain plot or place to record the bird data... (which would be the samplingEvent). No, the bird flies and we do track it.

Also, all the data that is obtained is published. So why would we want to make it difficult for ourselves, when the data is already published in an understandable, standardized way? Also, this data is not monitoring data, where you, again, would have a standardized protocol to monitor birds.

If we would transform a bird_observation dataset in a sample based dataset, I would think in the first place of this one: https://github.com/inbo/data-publication/blob/master/datasets/watervogels-occurrences/metadata.md

I would maybe focus on unpublished data instead of trying to transfer published occurrence data in sample based data... but that is just an opinion! And loads of interesting work lying ahead of us.

Kyle Braak · Answer 11 · Wed May 04 2016 20:05:51 GMT+0800 (China Standard Time)

@peterdesmet please see my answers below.

1. I immediately thought about the DOI too. I would keep it, so the data paper can still refer to the same dataset. As we will probably have to make some weird jumps to update an occurrence dataset to a sample dataset on an IPT, I was wondering if we could manually force the version to change from 5.5 to 6.0 in the process?

Ideally your IPT is configured with a DataCite or EZID account. This will enable the resource manager to assign the dataset a new DOI forcing a major version change, which is recommended practice when the dataset undergoes scientifically significant changes. I wouldn't recommend manually forcing the version change.

2. Are you sure you want to "misuse" event for an individual? Because there is more information you could add to that Event/Individual (weight, lifeStage, etc.) rather than repeating it for every occurrence, but it might be confusing to mix those concepts.

The event wouldn't represent the individual, it would represent how the individual was being monitored, describing the sampling protocol used. The sampling protocol would consist of two parts: i) the type of bio-logger or tracking device used and ii) the measurement scheme (interval) that the device was set to use.

According to the data paper:

[m]easurement intervals can be set for different times of the day (e.g. day and night), different geographical areas (e.g. inside or outside a breeding colony), status of memory (e.g. accelerometer is switched-off if the memory is filled to a certain threshold) or battery voltage (e.g. shorter measurement intervals if the battery is fully charged). Thus, the measurement scheme can be dynamically tailored to the specific research questions, the behaviour of the bird species, environmental conditions or tag performance, all of which may change during the course of the season or study.

A new occurrence record would be created for each measurement made. Researchers interested in answering specific research questions could filter occurrence records derived from sampling events having the specific protocol(s) that they are interested in.

3. The sampling protocol is not time based: we just keep those trackers running until they die, so it doesn't really make sense to split them per year. What does change is the tracking effort (every 30 minutes, every 5 minutes, etc., currently indicated in samplingEffort: {"secondsSinceLastOccurrence":1816}), but the protocol for this can be complicated (e.g. change this when the bird gets near its nest), which would potentially create a whole lot of different Events, which is not that problematic if they are all linked to a parent id. More importantly, I don't think we have those tracking settings in our database (would have to ask), so a single tracking Event is maybe more appropriate.

Maybe not split per year, but I would argue based on the information above, it makes sense to explain how the measurement scheme is changing over time. They could all be related to a parent event, but like the paper explains, it would be valuable to track changes to the measurement scheme over time in order to assist researchers trying to answer different questions with the data. Hopefully you have that information in your database. In the absence of this information, the sampling protocol could just indicate whether the measurement scheme was being changed or not - an extra helpful bit of information at record-level, for understanding how the individual was being tracked.
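One way to approximate changes in the measurement scheme when the tracker settings themselves are unavailable is to segment records by the observed gap in samplingEffort. This is a rough sketch under stated assumptions: the nominal intervals of 300 and 1800 seconds, the sample values other than 1816, and the function names are hypothetical, and an observed gap can differ from the configured interval (for example after a missed fix).

```python
import json

# samplingEffort values in record order; only the first value appears in the thread.
efforts = [
    '{"secondsSinceLastOccurrence":1816}',
    '{"secondsSinceLastOccurrence":1795}',
    '{"secondsSinceLastOccurrence":302}',
    '{"secondsSinceLastOccurrence":298}',
    '{"secondsSinceLastOccurrence":1803}',
]


def nearest_scheme(seconds, schemes=(300, 1800)):
    """Return the assumed nominal interval closest to the observed gap."""
    return min(schemes, key=lambda nominal: abs(nominal - seconds))


def segment_by_scheme(efforts, parent_event_id):
    """Group consecutive records sharing a nominal interval into child events."""
    segments = []
    for index, raw in enumerate(efforts):
        seconds = json.loads(raw)["secondsSinceLastOccurrence"]
        scheme = nearest_scheme(seconds)
        if segments and segments[-1]["intervalSeconds"] == scheme:
            # Same scheme as the previous record: extend the current segment.
            segments[-1]["recordIndexes"].append(index)
        else:
            # Scheme changed: start a new child event under the same parent.
            segments.append({
                "eventID": f"{parent_event_id}-{len(segments) + 1}",
                "parentEventID": parent_event_id,
                "intervalSeconds": scheme,
                "recordIndexes": [index],
            })
    return segments


segments = segment_by_scheme(efforts, "L907322")
for segment in segments:
    print(segment)
```

Every change of interval opens a new event, so a bird whose settings switch near the nest would produce many short segments, all linked to the same parent.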

Peter Desmet · Answer 12 · Tue May 10 2016 20:59:52 GMT+0800 (China Standard Time)

@kbraak, I still find that sampling events are not intuitively distilled from this dataset. This is in contrast with the datasets that @DimEvil mentions, where repeated sampling is defined as part of the monitoring setup, i.e. returning to the same place at regular intervals.

I think the main advantages of the eventCore are that one can 1) group otherwise repeated information, i.e. location, time and protocol information, 2) add measurements or facts about the sample and 3) develop tools that visualize this grouped information.

For this dataset, location and time are specific for each occurrence, so that information cannot be grouped (in contrast with typical sampling datasets). We can group samplingProtocol, as it is the same for all records (the DOI of the paper describing the system) and maybe samplingEffort (currently derived from the data, but potentially from settings metadata), but that will create thousands of events. I'd argue that normalizing the data as such doesn't make it more user friendly than keeping it with denormalized occurrences.
Other than the samplingProtocol and samplingEffort, we don't have measurements or facts about the sample. We have lots of measurements and facts about the individual: the currently published lifeStage, sex, but also bill length, bill depth, tarsus length, wing length, and body mass. That would be a logical way to denormalize the data (which we also do in our database).
I assume GBIF has plans to have sample pages, just like occurrence pages? It would be really nice if we could have pages for individuals in tracking datasets too. We could (mis)use event pages for those, by having one event = one individual (and not splitting any further), but it's still a bit of a hack (but one I don't necessarily dismiss). A more robust approach is to make organisms a core (it's already a class in Darwin Core), which would have all the advantages listed in 2 and a much better fit for the tracking community.
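The organism-centred grouping can be sketched like this, assuming a flat table where organism attributes repeat on every row. The second individual, all coordinates and timestamps, and the function `split_by_organism` are hypothetical; an "organism core" is the idea floated above, not an existing Darwin Core Archive core.

```python
# Fields that describe the individual rather than a single position fix.
ORGANISM_FIELDS = ("organismID", "organismName", "sex", "lifeStage")

# Hypothetical denormalized rows: organism attributes repeat on every occurrence.
rows = [
    {"occurrenceID": "occ-1", "organismID": "L907322", "organismName": "Harry",
     "sex": "male", "lifeStage": "adult", "eventDate": "2013-06-01T10:00:00Z",
     "decimalLatitude": 51.34, "decimalLongitude": 3.18},
    {"occurrenceID": "occ-2", "organismID": "L907322", "organismName": "Harry",
     "sex": "male", "lifeStage": "adult", "eventDate": "2013-06-01T10:30:00Z",
     "decimalLatitude": 51.35, "decimalLongitude": 3.2},
    {"occurrenceID": "occ-3", "organismID": "X000001", "organismName": "Example",
     "sex": "female", "lifeStage": "adult", "eventDate": "2013-06-02T07:00:00Z",
     "decimalLatitude": 51.3, "decimalLongitude": 3.1},
]


def split_by_organism(rows):
    """Split flat rows into one record per organism plus slimmer occurrences."""
    organisms = {}
    occurrences = []
    for row in rows:
        # Keep the first description seen for each individual.
        organisms.setdefault(
            row["organismID"], {field: row[field] for field in ORGANISM_FIELDS}
        )
        # Occurrences keep organismID as the link and drop the repeated attributes.
        occurrences.append({
            key: value for key, value in row.items()
            if key == "organismID" or key not in ORGANISM_FIELDS
        })
    return list(organisms.values()), occurrences


organisms, occurrences = split_by_organism(rows)
print(organisms)
print(occurrences[0])
```

Measurements such as bill length or body mass would then attach to the organism record once, instead of to an event or to every occurrence.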

Also, I think we should get in touch with http://movebank.org. It's a huge (not always open) repository of tracking data + tools to analyze these. They have a much larger scope and knowledge about what is the best model for all tracking data and I think there is potential for a great collaboration between GBIF and them.

Dag Endresen · Answer 13 · Tue May 10 2016 21:28:40 GMT+0800 (China Standard Time)

The new event core in a Darwin Core archive does not really allow you to provide measurements or facts about samples (dwc:MaterialSample). You can declare one event for each sample - but you still describe measurement or fact for the event, not the sample ;-)
New cores for samples and organisms (dwc:Organism) would be nice - however, the GBIF portal could infer attributes for these Darwin Core classes from a denormalized simple Darwin Core data record... to display in the portal... The information is sort of there even if it is denormalized (and somewhat loosely/unspecifically connected to class).
I think that adding a subject or resource identifier and perhaps a type (basis of record-like) in the MeasurementOrFact would be a more useful approach. One can connect a measurement or fact to the exact subject (of the accurate Darwin Core class) that it should be connected to.
In addition, I think that making MeasurementOrFact a new core would be useful - linking out to subjects with IDs and described outside the dataset, e.g. in other datasets (mandates globally unique persistent identifiers).

Peter Desmet · Answer 14 · Wed May 11 2016 17:33:48 GMT+0800 (China Standard Time)

@kbraak I discovered that data deposited in the Movebank is also denormalized, with multiple individuals together, just like our data. E.g. doi.org/10.5441/001/1.hn1bd23k

Peter Desmet · Answer 15 · Thu Aug 18 2016 19:48:25 GMT+0800 (China Standard Time)

I'm closing this issue, reopen if you think it's worthwhile to continue this discussion.