inbo / data-publication

🔓 Open biodiversity data publication by the INBO

Home Page:https://ipt.inbo.be

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Derive positions for observations

peterdesmet opened this issue · comments

There are two ways in which observations in SAS are linked to positions:

  1. A direct one-to-one link with calculatedmeasurements: these calculated measurements are imported historical data, where original ship log files are not available. These make up the bulk of the data, but these won't grow with time.
  2. An indirect link via datetime, in which an observation falls between ship log positions (stored in measurements). This is the case for all new data, in which ship logs are uploaded and the link between observations and these logs are made at export time. This type of data will grow with time.

1. Direct link

Retrieving positions from the calculated measurements is easy:

SELECT
    obs.id,
    obs.starttime,
    obs.endtime,
    trip.id,
    trip.dateofsurvey,
    calc_meas.latitude,
    calc_meas.longitude
FROM observation AS obs
    LEFT JOIN surveyevent AS trip
        ON obs.surveyevent = trip.id
    LEFT JOIN calculatedmeasurement AS calc_meas
        ON obs.calculatedmeasurement = calc_meas.id
WHERE
    calc_meas.id IS NOT NULL

This results in 289,643 observations (on UAT).

2. Indirect link

This should get the positions for the 8,506 other observations (on UAT). The join between observation (obs) and measurement (meas) should be based on:

  1. Same trip: obs.surveyevent = meas.surveyevent
  2. Same date (but that should already be guaranteed by the trip, which never goes beyond a single date): surveyevent.dateofsurvey = meas.dateofmeasurement
  3. Same (!) starttime: obs.starttime = meas. timeofmeasurement
  4. Same (!) endtime: obs.endtime = meas.timeofmeasurement

This should result in 0, 1 or 2 measurements:

  • If 0: no coordinates
  • If 1: take those coordinates (doesn't seem to be done like this in the application, see next comment)
  • If 2: take the average of those coordinates

Some documentation on how measurements are chosen for an observation:

  • SAS wiki explaining how observations (or rather their unique positions) are linked to conditions and measurements
  • Java code where getLongitude() is used for a start time and end time, after which the average is used. In contrast with the wiki, it seems that 2 coordinates are always required.
  • Cucumber test, which also indicates that the times of the observations and measurements should be the same.

So, the query in pseudo code is:

for each observation
get surveyevent
get calculatedmeasurement
if
   lat/long in calculatedmeasurements NOT NULL => use those
else
   get measurement for starttime
   get measurement for endtime
   take average of measurement.lat/long => use those

query should sound like :

SELECT
    obs.id,
    obs.starttime,
    obs.endtime,
    trip.id,
    trip.dateofsurvey,
    calc_meas.latitude,
    calc_meas.longitude,
    meas.avglatitude,
    meas.avglongitude,
    Coalesce(calc_meas.latitude,meas.avglatitude) as Compl_latitude,
    Coalesce(calc_meas.longitude,meas.avglongitude) as Compl_longitude
FROM observation AS obs
    LEFT JOIN surveyevent AS trip
        ON obs.surveyevent = trip.id
    LEFT JOIN calculatedmeasurement AS calc_meas
        ON obs.calculatedmeasurement = calc_meas.id

    LEFT OUTER JOIN ( SELECT
			    obs.id,
			    AVG(meas.latitude) as avgLatitude,
			    AVG(meas.longitude) as avgLongitude,
			    Count(*) as NbrMeas
			FROM observation AS obs
			    INNER JOIN surveyevent AS trip
				ON obs.surveyevent = trip.id
			    INNER JOIN measurement meas ON meas.surveyevent = obs.surveyevent 
						AND trip.dateofsurvey  = meas.dateofmeasurement 
						AND (obs.starttime = meas.timeofmeasurement 
						OR obs.endtime = meas.timeofmeasurement)
			WHERE 1=1
			AND obs.calculatedmeasurement IS NULL
			GROUP BY obs.id
			HAVING Count(*) <= 2
			) meas ON meas.id = obs.id
WHERE  1=1
LIMIT 100

The added outer join selects per obs.id the average long. and lat. , while making sure that only 2 or less measurements are used (the inner join within the query assures there is at least 1 measurement ).
The select statement is still a mess, all the calculated measurements and other measurements are shown, including the complemented version off them.

Update on an old issue: SAS are now included in https://esas.ices.dk and published as Darwin Core to EMODNet/OBIS/GBIF via https://github.com/EMODnet/esas2obis.