eblondel / ows4R

R Interface for OGC Web-Services (OWS)

Home Page:https://eblondel.github.io/ows4R/

WFS paging and parallelization support

salvafern opened this issue

Hi @eblondel ,

I have been trying out ows4R to query biological occurrence data from EMODnet-Biology.

In the example below, I requested the CPR dataset, restricted to the North Sea and Calanus finmarchicus.

I got the WFS request from the EMODnet-Biology download toolbox (at the end of the selection, you can copy the WFS request under "Get webservice url").

The good news is that viewParams via vendor params works like a charm! (Although I have to watch out for the encoding, see lifewatch/eurobis#15 (comment).)

I am having trouble, however, with the paging and parallel options. After some debugging, I think the issue might be that ows4R relies on an attribute named numberMatched when using resultType = "hits" at: https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L240

This attribute is not being returned by geo.vliz.be (it should happen around: https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L291)

Could you have a look and see what is happening?

Thanks a lot!

# Example get CPR dataset, North Sea and Calanus finmarchicus

library(ows4R)
library(parallel)

# URL as provided by download toolbox
url_download_toolbox <- "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal%3Aeurobis-obisenv_basic&resultType=results&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&propertyName=datasetid%2Cdatecollected%2Cdecimallatitude%2Cdecimallongitude%2Ccoordinateuncertaintyinmeters%2Cscientificname%2Caphiaid%2Cscientificnameaccepted&outputFormat=csv"
URLdecode(url_download_toolbox)
#> [1] "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal:eurobis-obisenv_basic&resultType=results&viewParams=where:((up.geoobjectsids+&&+ARRAY[2350]))+AND+datasetid+IN+(216);context:0100;aphiaid:104464&propertyName=datasetid,datecollected,decimallatitude,decimallongitude,coordinateuncertaintyinmeters,scientificname,aphiaid,scientificnameaccepted&outputFormat=csv"

# Only params
params <- "where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464"
URLdecode(params)
#> [1] "where:((up.geoobjectsids+&&+ARRAY[2350]))+AND+datasetid+IN+(216);context:0100;aphiaid:104464"

# Create wfs client and find feature
wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "1.1.0", logger = "INFO")$
  getCapabilities()$
  findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")
#> [ows4R][INFO] OWSGetCapabilities - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&request=GetCapabilities

# Create cluster
cl <- makeCluster(detectCores() - 1)

# Perform tests: around 20K rows
system.time(feature_only_viewparams <- wfs$getFeatures(viewParams = params, resultType="results"))
#> [ows4R][INFO] WFSDescribeFeatureType - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&request=DescribeFeatureType 
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=results&request=GetFeature
#>    user  system elapsed 
#>   0.990   0.100   3.712

system.time(feature_pagination <- wfs$getFeatures(viewParams = params, paging = TRUE, paging_length = 1000))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resulttype=hits&request=GetFeature
#> Error in seq.default(from = 0, to = numberMatched, by = paging_length): 'to' must be of length 1
#> Timing stopped at: 0.09 0.001 0.678

system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=results&request=GetFeature
#>    user  system elapsed 
#>   0.986   0.088   3.429

# Debugging pagination
nft <- wfs$getFeatures(viewParams = params, resultType="hits")
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=hits&request=GetFeature
names(nft)
#> [1] "numberOfFeatures" "timeStamp"

"numberMatched" %in% names(nft)
#> [1] FALSE

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.6 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>   [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#> [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#> [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#>   [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>   [1] httr_1.4.2    reprex_2.0.1  ows4R_0.2-1   keyring_1.3.0 geometa_0.6-6
#> 
#> loaded via a namespace (and not attached):
#>   [1] tinytex_0.35       tidyselect_1.1.1   xfun_0.28          purrr_0.3.4       
#> [5] sf_0.9-4           lattice_0.20-41    vctrs_0.3.8        generics_0.1.0    
#> [9] htmltools_0.5.0    yaml_2.2.1         utf8_1.2.2         XML_3.99-0.3      
#> [13] rlang_0.4.11       e1071_1.7-3        pillar_1.6.3       glue_1.4.2        
#> [17] withr_2.4.2        DBI_1.1.1          bit64_4.0.5        sp_1.4-6          
#> [21] lifecycle_1.0.1    evaluate_0.14      knitr_1.29         tzdb_0.1.2        
#> [25] callr_3.7.0        ps_1.6.0           curl_4.3           class_7.3-17      
#> [29] fansi_0.5.0        highr_0.8          Rcpp_1.0.7         readr_2.0.2       
#> [33] KernSmooth_2.23-17 openssl_1.4.2      classInt_0.4-3     vroom_1.5.5       
#> [37] jsonlite_1.7.0     bit_4.0.4          fs_1.5.0           hms_1.1.1         
#> [41] askpass_1.1        digest_0.6.25      processx_3.5.2     dplyr_1.0.7       
#> [45] grid_3.6.3         rgdal_1.5-12       cli_3.0.1          tools_3.6.3       
#> [49] magrittr_2.0.1     tibble_3.1.5       crayon_1.4.1       pkgconfig_2.0.3   
#> [53] ellipsis_0.3.2     assertthat_0.2.1   rmarkdown_2.11     rstudioapi_0.13   
#> [57] R6_2.5.1           units_0.6-7        compiler_3.6.3   

Created on 2022-03-29 by the reprex package (v2.0.1)

This issue partly follows up #29
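
For readers following along, the failure above can be reproduced without a server. The sketch below is a hypothetical illustration in base R, not ows4R's actual code: `get_hit_count()` and `page_starts()` are made-up helper names, and the count of 2500 is an invented value. It shows why a missing numberMatched makes seq() fail, and one way a client could fall back to numberOfFeatures, the attribute that the WFS 1.1.0 'hits' response in this report does return.

```r
# Hypothetical helpers for illustration only; not part of ows4R.

# Pick the feature count from a parsed 'hits' response:
# numberMatched if present, otherwise numberOfFeatures.
get_hit_count <- function(hits) {
  hits <- as.list(hits)
  count <- hits[["numberMatched"]]
  if (is.null(count)) count <- hits[["numberOfFeatures"]]
  if (is.null(count)) stop("no feature count found in 'hits' response")
  as.integer(count)
}

# Zero-based start index of each page.
page_starts <- function(n, paging_length) {
  if (n <= 0) return(numeric(0))
  seq(from = 0, to = n - 1, by = paging_length)
}

# Invented stand-in for a 'hits' response that lacks numberMatched.
hits_v11 <- list(numberOfFeatures = "2500", timeStamp = "made-up")

# numberMatched is NULL here, so seq() receives a zero-length 'to' and fails.
err <- tryCatch(
  seq(from = 0, to = hits_v11[["numberMatched"]], by = 1000),
  error = function(e) conditionMessage(e)
)
print(err)

# With the fallback, the page offsets can be computed.
n <- get_hit_count(hits_v11)
print(page_starts(n, 1000))
```

Whether such a fallback is appropriate depends on whether the server supports paging at all for that WFS version, so treat this as a diagnostic aid rather than a fix.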

@salvafern make sure to use WFS version 2.0. AFAIK pagination in WFS is only supported in WFS 2.0, and I see you used 1.1.0.

Try setting the version to 2.0.0 like this:

wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "2.0.0", logger = "INFO")$
  getCapabilities()$
  findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")

params <- "where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464"

# With pagination
system.time(feature_pagination <- wfs$getFeatures(viewParams = params, paging = TRUE, paging_length = 1000))

I just tested the pagination and it worked.
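
As a rough illustration of what paging means at the protocol level: WFS 2.0.0 GetFeature accepts startIndex and count parameters, so a large result can be requested in chunks. The sketch below only builds the request URLs in base R. `page_urls()` is a hypothetical helper, the total of 2500 features is invented, and ows4R's own request construction may differ.

```r
# Hypothetical helper for illustration only; not part of ows4R.
# Builds one WFS 2.0.0 GetFeature URL per page using startIndex and count.
page_urls <- function(base_url, type_name, total, paging_length) {
  stopifnot(total > 0, paging_length > 0)
  starts <- seq(from = 0, to = total - 1, by = paging_length)
  paste0(
    base_url,
    "?service=WFS&version=2.0.0&request=GetFeature",
    "&typeNames=", type_name,
    # format() avoids scientific notation such as 1e+05 in the URL
    "&startIndex=", format(starts, scientific = FALSE, trim = TRUE),
    "&count=", format(paging_length, scientific = FALSE, trim = TRUE)
  )
}

urls <- page_urls(
  "https://geo.vliz.be/geoserver/Dataportal/wfs",
  "Dataportal:eurobis-obisenv_basic",
  total = 2500,          # invented total, for illustration
  paging_length = 1000
)
print(urls)
```

Each URL is an independent request, which is what makes paged downloads a natural unit of work for parallel handlers.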

Indeed now it works, thanks a lot!
I was using v1.1.0 to copy what the download toolbox did, but I guess there's no harm in using v2.0.0

I have now also tried using the parallel options:

Using parallelization and pagination together

Probably I'm doing something wrong. I expected that multiple requests would be done for each chunk, but I just ran into an error.

library(ows4R)
library(parallel)

wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "2.0.0", logger = "INFO")$
  getCapabilities()$
  findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")

# Querying dataset: https://www.emodnet-biology.eu/data-catalog?module=dataset&dasid=8020
# ~500K rows
params <- "where%3Adatasetid+IN+%288020%29"

# With pagination and parallelization
cl <- makeCluster(detectCores() - 1)
cl
#> socket cluster with 15 nodes on host ‘localhost’

debug(wfs$getFeatures)
system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                paging = TRUE, paging_length = 10000,
                                                parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  : 
#>   No layers in datasource.
#> Timing stopped at: 0.023 0 11.45

Via debug() I can see that at some point the response to a request of type 'hits' is read with sf::st_read(), which of course fails. This happens at https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L328

The response in destfile looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<wfs:FeatureCollection
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xmlns:fes="http://www.opengis.net/fes/2.0"
	xmlns:wfs="http://www.opengis.net/wfs/2.0"
	xmlns:gml="http://www.opengis.net/gml/3.2"
	xmlns:ows="http://www.opengis.net/ows/1.1"
	xmlns:xlink="http://www.w3.org/1999/xlink"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" numberMatched="408603" numberReturned="0" timeStamp="2022-03-31T07:57:57.251Z" xsi:schemaLocation="http://www.opengis.net/wfs/2.0 http://schemas.opengis.net/wfs/2.0/wfs.xsd"/>
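
A 'hits' response like the one above contains attributes but no feature members, which is consistent with the "No layers in datasource" error. The sketch below is a hypothetical base R check, not ows4R code: `xml_attr_value()` and `has_no_features()` are made-up names, and the regex-based attribute lookup is a shortcut that assumes a simple, well-formed root element. A real implementation would more likely use an XML parser. It shows how a response could be inspected before it is handed to sf::st_read().

```r
# Shortened copy of the response shown above (most namespace declarations trimmed).
hits_xml <- paste0(
  '<wfs:FeatureCollection xmlns:wfs="http://www.opengis.net/wfs/2.0" ',
  'numberMatched="408603" numberReturned="0" ',
  'timeStamp="2022-03-31T07:57:57.251Z"/>'
)

# Hypothetical helper: extract the value of one attribute with a regex.
# Returns NA if the attribute is absent.
xml_attr_value <- function(xml, attr) {
  pattern <- paste0("[[:space:]]", attr, "=\"([^\"]*)\"")
  m <- regmatches(xml, regexec(pattern, xml))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Hypothetical helper: TRUE when the response reports zero returned features,
# in which case there is nothing for a spatial reader to load.
has_no_features <- function(xml) {
  identical(xml_attr_value(xml, "numberReturned"), "0")
}

print(xml_attr_value(hits_xml, "numberMatched"))
print(has_no_features(hits_xml))
```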

Using only parallelization

I compared no parallelization against parallelization with mclapply and parLapply, but I'm not seeing any improvement in performance. Does it perhaps need pagination as well?

# No pagination or parallelization
system.time(feature <- wfs$getFeatures(viewParams = params, resultType="results"))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 26.718   2.080  67.476

# Parallelization parLapply
system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 27.457   2.477  65.883

# Parallelization mclapply
system.time(feature_parallel2 <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                 parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 26.226   2.274  63.895 

Many thanks again for the help! Let me know if there is anything I can do.

Yes, it sounds like there are issues with the parallelization. I will have a look asap.

If you want to use the cluster approach, you can use the parallel::parLapply handler, which works with a cluster. mclapply apparently can't work because I didn't allow specifying the extra arguments this handler needs.
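
The two handlers mentioned here have different calling conventions: parallel::parLapply() takes the cluster as its first argument, while parallel::mclapply() takes the data first and is tuned through extra arguments such as mc.cores. The sketch below is a hypothetical dispatcher in base R, not ows4R's implementation. `run_chunks()` and `describe_chunk()` are made-up names, and the demonstration uses plain lapply() so that it runs without a cluster.

```r
# Hypothetical dispatcher for illustration only; not part of ows4R.
# Calls the handler with the argument order it expects and forwards
# any extra arguments (for example mc.cores for mclapply) through '...'.
run_chunks <- function(chunks, fun, handler = lapply, cl = NULL, ...) {
  if (is.null(cl)) {
    handler(chunks, fun, ...)      # lapply / mclapply style: data first
  } else {
    handler(cl, chunks, fun, ...)  # parLapply style: cluster first
  }
}

# Stand-in for "download one page"; it only labels the chunk.
describe_chunk <- function(start) paste0("startIndex=", start)

starts <- c(0L, 10000L, 20000L)
res <- run_chunks(starts, describe_chunk)
print(unlist(res))
```

With a socket cluster the same call would read `run_chunks(starts, describe_chunk, handler = parallel::parLapply, cl = cl)`, and with forking `run_chunks(starts, describe_chunk, handler = parallel::mclapply, mc.cores = 2)` (forking is not available on Windows). Either way, there must be several chunks to distribute. The un-paged timings earlier in this thread each log a single request, which may be why they show no speed-up, although that is not confirmed here.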

I got the same error :(

feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                    paging = TRUE, paging_length = 10000,
                                    parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl)
#> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  : 
#>   No layers in datasource.

@salvafern I haven't forgotten this. I have started working on it, but I am still looking into the best way to fix the parallel handlers.