BigelowLab / thredds

THREDDS Crawler for R using xml2 package

Help needed in listing datasets and grep specific datasets

abigmo opened this issue · comments

Dear BigelowLab,
I am a very new user of thredds, and I am finding it a little tricky to list all available datasets in an atmospheric observations THREDDS database named EBAS.
There are 13294 datasets in this database, so the ability to use grep or something similar would be useful for selecting specific datasets.

library(ncdf4)
library(thredds)

> url <- 'https://thredds.nilu.no/thredds/catalog/ebas/catalog.xml'
> ctlg <- thredds::CatalogNode$new(url)

> ctlg
CatalogNode (R6): 
  verbose: FALSE    tries: 3    namespace prefix: d1
  url: https://thredds.nilu.no/thredds/catalog/ebas/catalog.xml
  services [7]: Compound OPENDAP DAP4 HTTPServer NCML UDDC ISO
  catalogRefs [0]: none
  datasets [13294]: ZA0001G.20150101000000.20220206000000.uv_abs.ozone.air.7y.1h.ZA02L_thermo_4m.ZA02L_primary_standard_49i_PS.lev2.nc ZA0001G.20140417090500.20210831153724.glass_flask..air.3y.10d.US13L_glass_flask_ZA0001G.US13L_GC_FID_v1.lev2.nc ... AM0001R.20090101030000.20181210133000.filter_3pack...1y.1d.AM01L_A_Teflonfilter_02.AM01L_uv_abs_Nessler.lev2.nc AM0001R.20081231200000.20140501000000.uv_abs.ozone.air.4y.1h.AM01L_uv_abs_02.AM01L_uv_abs..nc

> ctlg$list_datasets("*")
$dap
       name serviceType        base 
      "dap"  "Compound"          "" 

$EBAS
  name     ID 
"EBAS" "EBAS" 

> ctlg$get_datasets("EBAS")
NULL

base.url <- "https://thredds.nilu.no/thredds/dodsC"

Any hint?

Hello,

Every catalog maker brings a different twist to organizing, so learning thredds can be a bumpy road... In this case there really is no depth or hierarchy. So, once you open the catalog you have in hand the complete dataset listing. The trick is to convert it to a more useful form, like a data.frame, so it's easier to find your goodies. Below I show how to transform the listing into a tibble, and then filter based upon a naming pattern.

Does this help?

# This [dataset](https://thredds.nilu.no/thredds/catalog/ebas/catalog.html) has 
# what I call a flat database structure - all resources are served from the top directory.  
# So, instead of needing to navigate through a hierarchy, an easy shortcut is to 
# transform the dataset list to a tibble and then use filter/slicing to find your 
# desired data.

suppressPackageStartupMessages({
  library(thredds)
  library(dplyr)
})
url <- 'https://thredds.nilu.no/thredds/catalog/ebas/catalog.xml'
(ctlg <- thredds::CatalogNode$new(url))
#> CatalogNode (R6): 
#>   verbose: FALSE    tries: 3    namespace prefix: d1
#>   url: https://thredds.nilu.no/thredds/catalog/ebas/catalog.xml
#>   services [7]: Compound OPENDAP DAP4 HTTPServer NCML UDDC ISO
#>   catalogRefs [0]: none
#>   datasets [13294]: ZA0001G.20150101000000.20220206000000.uv_abs.ozone.air.7y.1h.ZA02L_thermo_4m.ZA02L_primary_standard_49i_PS.lev2.nc ZA0001G.20140417090500.20210831153724.glass_flask..air.3y.10d.US13L_glass_flask_ZA0001G.US13L_GC_FID_v1.lev2.nc ... AM0001R.20090101030000.20181210133000.filter_3pack...1y.1d.AM01L_A_Teflonfilter_02.AM01L_uv_abs_Nessler.lev2.nc AM0001R.20081231200000.20140501000000.uv_abs.ozone.air.4y.1h.AM01L_uv_abs_02.AM01L_uv_abs..nc

# List the datasets and transform into a tibble to ease the pain of having 13k records!
  
dd <- ctlg$list_datasets(form = 'table') |> 
  dplyr::as_tibble() |>
  dplyr::glimpse()
#> Rows: 13,294
#> Columns: 3
#> $ name    <chr> "ZA0001G.20150101000000.20220206000000.uv_abs.ozone.air.7y.1h.…
#> $ ID      <chr> "EBAS/ZA0001G.20150101000000.20220206000000.uv_abs.ozone.air.7…
#> $ urlPath <chr> "ebas/ZA0001G.20150101000000.20220206000000.uv_abs.ozone.air.7…

# Now you can filter (or slice) by looking for a pattern in the name. You have a lot of options, but here I just search for a fixed name pattern.

x <- dd |>
  dplyr::filter(grepl("ZA0001G.20060101000000.20150328092423", name, fixed = TRUE)) |>
  dplyr::glimpse()
#> Rows: 1
#> Columns: 3
#> $ name    <chr> "ZA0001G.20060101000000.20150328092423.filter_absorption_photo…
#> $ ID      <chr> "EBAS/ZA0001G.20060101000000.20150328092423.filter_absorption_…
#> $ urlPath <chr> "ebas/ZA0001G.20060101000000.20150328092423.filter_absorption_…
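
As a sketch of some of those other options, a regular expression or a split of the dot-separated name can be used in place of the fixed pattern. The block below uses base R only and runs offline on four names copied from the catalog printout above; `dd_demo` is a hypothetical stand-in for `dd`. The column labels are my own reading of the file's `comment` attribute (station code, start date, revision date, instrument type, component, period code, resolution code, method ref); `matrix` and `instrument` in particular are guesses, and the split assumes no field contains a dot itself.

```r
# Four dataset names copied from the catalog listing shown above.
nm <- c(
  "ZA0001G.20150101000000.20220206000000.uv_abs.ozone.air.7y.1h.ZA02L_thermo_4m.ZA02L_primary_standard_49i_PS.lev2.nc",
  "ZA0001G.20140417090500.20210831153724.glass_flask..air.3y.10d.US13L_glass_flask_ZA0001G.US13L_GC_FID_v1.lev2.nc",
  "AM0001R.20090101030000.20181210133000.filter_3pack...1y.1d.AM01L_A_Teflonfilter_02.AM01L_uv_abs_Nessler.lev2.nc",
  "AM0001R.20081231200000.20140501000000.uv_abs.ozone.air.4y.1h.AM01L_uv_abs_02.AM01L_uv_abs..nc"
)
dd_demo <- data.frame(name = nm, stringsAsFactors = FALSE)

# Option 1: a regular expression. Dots are escaped so they match literally.
# Here: names that start with station ZA0001G and contain an "ozone" field.
ozone_za <- dd_demo[grepl("^ZA0001G\\..*\\.ozone\\.", dd_demo$name), , drop = FALSE]

# Option 2: split each name on literal dots into one column per field.
parts <- strsplit(dd_demo$name, ".", fixed = TRUE)
stopifnot(all(lengths(parts) == 12))  # guard against names with extra dots

fields <- do.call(rbind, parts)
# Assumed labels, not an official EBAS naming specification.
colnames(fields) <- c("station", "start", "revision", "instrument_type",
                      "component", "matrix", "period", "resolution",
                      "instrument", "method", "level", "ext")
dd_demo <- cbind(dd_demo, as.data.frame(fields, stringsAsFactors = FALSE))

# Filter on the parsed columns instead of on the raw name.
hourly_ozone <- subset(dd_demo, component == "ozone" & resolution == "1h")
print(hourly_ozone[, c("station", "start", "component", "resolution")])
```

Once the fields are columns, the same filter can be written with `dplyr::filter()` on the real `dd` tibble. Empty fields (the consecutive dots in some names) come through as empty strings.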

# Now open the resource with your favorite OPeNDAP reader: ncdf4, RNetCDF, etc.

library(ncdf4)
base.url <- "https://thredds.nilu.no/thredds/dodsC"
(X <- ncdf4::nc_open(file.path(base.url, x$urlPath[1])))
#> File https://thredds.nilu.no/thredds/dodsC/ebas/ZA0001G.20060101000000.20150328092423.filter_absorption_photometer.aerosol_absorption_coefficient..3y.1h.ZA02L_Radiance-Research_PSAP-3W_CPT.ZA02L_abs_coef.lev2.nc (NC_FORMAT_CLASSIC):
#> 
#>      32 variables (excluding dimension variables):
#>         double time_bnds[tbnds,time]   
#>             standard_name: time
#>             long_name: time bounds for measurement
#>             units: days since 1900-01-01 00:00:00 UTC
#>         double metadata_time_bnds[tbnds,metadata_time]   
#>             standard_name: time
#>             long_name: time bounds for ebas metadata intervals
#>             units: days since 1900-01-01 00:00:00 UTC
#>             calendar: gregorian
#>         int temperature_0_qc[time,temperature_0_qc_flags,Location]   
#>             _FillValue: 0
#>             standard_name: status_flag
#>             missing_value: 0
#>             units: 1
#>         int aerosol_absorption_coefficient_perc8413_0_qc[time,aerosol_absorption_coefficient_perc8413_0_qc_flags,Wavelength]   
#>             _FillValue: 0
#>             missing_value: 0
#>             units: 1
#>         char temperature_1_ebasmetadata[maxStrlen64,metadata_time,Location]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         char pressure_1_ebasmetadata[maxStrlen64,metadata_time,Location]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         int aerosol_absorption_coefficient_prec1587_0_qc[time,aerosol_absorption_coefficient_prec1587_0_qc_flags,Wavelength]   
#>             _FillValue: 0
#>             missing_value: 0
#>             units: 1
#>         char aerosol_absorption_coefficient_perc8413_1_ebasmetadata[maxStrlen64,metadata_time,Wavelength]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         int pressure_0_qc[time,pressure_0_qc_flags,Location]   
#>             _FillValue: 0
#>             standard_name: status_flag
#>             missing_value: 0
#>             units: 1
#>         char aerosol_absorption_coefficient_prec1587_1_ebasmetadata[maxStrlen64,metadata_time,Wavelength]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         int temperature_1_qc[time,temperature_1_qc_flags,Location]   
#>             _FillValue: 0
#>             standard_name: status_flag
#>             missing_value: 0
#>             units: 1
#>         int aerosol_absorption_coefficient_prec1587_1_qc[time,aerosol_absorption_coefficient_prec1587_1_qc_flags,Wavelength]   
#>             _FillValue: 0
#>             missing_value: 0
#>             units: 1
#>         char pressure_0_ebasmetadata[maxStrlen64,metadata_time,Location]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         int aerosol_absorption_coefficient_amean_1_qc[time,aerosol_absorption_coefficient_amean_1_qc_flags,Wavelength]   
#>             _FillValue: 0
#>             missing_value: 0
#>             units: 1
#>         char aerosol_absorption_coefficient_prec1587_0_ebasmetadata[maxStrlen64,metadata_time,Wavelength]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         char aerosol_absorption_coefficient_amean_0_ebasmetadata[maxStrlen64,metadata_time,Wavelength]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         char temperature_0_ebasmetadata[maxStrlen64,metadata_time,Location]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         int pressure_1_qc[time,pressure_1_qc_flags,Location]   
#>             _FillValue: 0
#>             standard_name: status_flag
#>             missing_value: 0
#>             units: 1
#>         char aerosol_absorption_coefficient_perc8413_0_ebasmetadata[maxStrlen64,metadata_time,Wavelength]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         int aerosol_absorption_coefficient_amean_0_qc[time,aerosol_absorption_coefficient_amean_0_qc_flags,Wavelength]   
#>             _FillValue: 0
#>             missing_value: 0
#>             units: 1
#>         char aerosol_absorption_coefficient_amean_1_ebasmetadata[maxStrlen64,metadata_time,Wavelength]   
#>             long_name: ebas metadata for different time intervals; json encoded
#>         int aerosol_absorption_coefficient_perc8413_1_qc[time,aerosol_absorption_coefficient_perc8413_1_qc_flags,Wavelength]   
#>             _FillValue: 0
#>             missing_value: 0
#>             units: 1
#>         double temperature_0[time,Location]   
#>             _FillValue: NaN
#>             standard_name: air_temperature
#>             missing_value: NaN
#>             units: K
#>             ancillary_variables: temperature_0_qc temperature_0_ebasmetadata
#>             cell_methods: time: mean
#>         double temperature_1[time,Location]   
#>             _FillValue: NaN
#>             standard_name: air_temperature
#>             missing_value: NaN
#>             units: K
#>             ancillary_variables: temperature_1_qc temperature_1_ebasmetadata
#>             cell_methods: time: mean
#>         double pressure_1[time,Location]   
#>             _FillValue: NaN
#>             standard_name: air_pressure
#>             missing_value: NaN
#>             units: hPa
#>             ancillary_variables: pressure_1_qc pressure_1_ebasmetadata
#>             cell_methods: time: mean
#>         double pressure_0[time,Location]   
#>             ancillary_variables: pressure_0_qc pressure_0_ebasmetadata
#>             missing_value: NaN
#>             units: hPa
#>             _FillValue: NaN
#>             standard_name: air_pressure
#>             cell_methods: time: mean
#>         double aerosol_absorption_coefficient_amean_1[time,Wavelength]   
#>             _FillValue: NaN
#>             missing_value: NaN
#>             units: 1/Mm
#>             ancillary_variables: aerosol_absorption_coefficient_amean_1_qc aerosol_absorption_coefficient_amean_1_ebasmetadata
#>             cell_methods: time: mean
#>         double aerosol_absorption_coefficient_amean_0[time,Wavelength]   
#>             _FillValue: NaN
#>             missing_value: NaN
#>             units: 1/Mm
#>             ancillary_variables: aerosol_absorption_coefficient_amean_0_qc aerosol_absorption_coefficient_amean_0_ebasmetadata
#>             cell_methods: time: mean
#>         double aerosol_absorption_coefficient_prec1587_1[time,Wavelength]   
#>             _FillValue: NaN
#>             missing_value: NaN
#>             units: 1/Mm
#>             ancillary_variables: aerosol_absorption_coefficient_prec1587_1_qc aerosol_absorption_coefficient_prec1587_1_ebasmetadata
#>             cell_methods: time: percentile:15.87
#>         double aerosol_absorption_coefficient_prec1587_0[time,Wavelength]   
#>             _FillValue: NaN
#>             missing_value: NaN
#>             units: 1/Mm
#>             ancillary_variables: aerosol_absorption_coefficient_prec1587_0_qc aerosol_absorption_coefficient_prec1587_0_ebasmetadata
#>             cell_methods: time: percentile:15.87
#>         double aerosol_absorption_coefficient_perc8413_0[time,Wavelength]   
#>             ancillary_variables: aerosol_absorption_coefficient_perc8413_0_qc aerosol_absorption_coefficient_perc8413_0_ebasmetadata
#>             cell_methods: time: percentile:84.13
#>             units: 1/Mm
#>             _FillValue: NaN
#>             missing_value: NaN
#>         double aerosol_absorption_coefficient_perc8413_1[time,Wavelength]   
#>             ancillary_variables: aerosol_absorption_coefficient_perc8413_1_qc aerosol_absorption_coefficient_perc8413_1_ebasmetadata
#>             cell_methods: time: percentile:84.13
#>             units: 1/Mm
#>             _FillValue: NaN
#>             missing_value: NaN
#> 
#>      16 dimensions:
#>         Location  Size:1 
#>         Wavelength  Size:3 
#>         aerosol_absorption_coefficient_amean_0_qc_flags  Size:2 (no dimvar)
#>         aerosol_absorption_coefficient_amean_1_qc_flags  Size:2 (no dimvar)
#>         aerosol_absorption_coefficient_perc8413_0_qc_flags  Size:2 (no dimvar)
#>         aerosol_absorption_coefficient_perc8413_1_qc_flags  Size:2 (no dimvar)
#>         aerosol_absorption_coefficient_prec1587_0_qc_flags  Size:2 (no dimvar)
#>         aerosol_absorption_coefficient_prec1587_1_qc_flags  Size:2 (no dimvar)
#>         maxStrlen64  Size:64 (no dimvar)
#>         metadata_time  Size:3 
#>             standard_name: time
#>             long_name: time of ebas metadata intervals
#>             units: days since 1900-01-01 00:00:00 UTC
#>             axis: T
#>             calendar: gregorian
#>             bounds: metadata_time_bnds
#>             cell_methods: mean
#>         pressure_0_qc_flags  Size:2 (no dimvar)
#>         pressure_1_qc_flags  Size:2 (no dimvar)
#>         tbnds  Size:2 (no dimvar)
#>         temperature_0_qc_flags  Size:2 (no dimvar)
#>         temperature_1_qc_flags  Size:2 (no dimvar)
#>         time  Size:26304 
#>             standard_name: time
#>             long_name: time of measurement
#>             units: days since 1900-01-01 00:00:00 UTC
#>             axis: T
#>             calendar: gregorian
#>             bounds: time_bnds
#>             cell_methods: mean
#> 
#>     47 global attributes:
#>         Conventions: CF-1.7, ACDD-1.3
#>         featureType: timeSeries
#>         title: Ground based in situ observations of aerosol_absorption_coefficient at Cape Point (ZA0001G) using filter_absorption_photometer
#>         keywords: NOAA-ESRL, pm1, pm10, aerosol_absorption_coefficient, Cape Point, GAW-WDCA, ZA0001G
#>         id: ZA0001G.20060101000000.20150328092423.filter_absorption_photometer.aerosol_absorption_coefficient..3y.1h.ZA02L_Radiance-Research_PSAP-3W_CPT.ZA02L_abs_coef.lev2.nc
#>         naming_authority: EBAS
#>         project: NOAA-ESRL, GAW-WDCA
#>         acknowledgement: Request acknowledgement details from data originator
#>         license: NOAA-ESRL: , GAW-WDCA: 
#>         summary: Ground based in situ observations of aerosol_absorption_coefficient at Cape Point (ZA0001G) using filter_absorption_photometer. These measurements are gathered as a part of the following projects NOAA-ESRL, GAW-WDCA and they are stored in the EBAS database (http://ebas.nilu.no/). Parameters measured are: aerosol_absorption_coefficient in pm1, aerosol_absorption_coefficient in pm10
#>         source: surface observation
#>         institution: ZA02L, South African Weather Service, Cape Point Global Atmosphere Watch, SAWS, c/o CSIR, P.O. Box 320, 7599, Stellenbosch, South Africa
#>         processing_level: processing_level_test
#>         date_created: 2015-03-28T09:24:23 UTC
#>         date_metadata_modified: 2015-03-28T09:24:23 UTC
#>         creator_name: Casper Labuschagne
#>         creator_type: person
#>         creator_email: 
#>         creator_institution: 
#>         contributor_name: Casper Labuschagne
#>         contributor_role: data submitter
#>         publisher_type: institution
#>         publisher_name: NILU - Norwegian Institute for Air Research, ATMOS, EBAS
#>         publisher_institution: NILU - Norwegian Institute for Air Research, ATMOS, EBAS
#>         publisher_email: ebas@nilu.no
#>         publisher_url: https://www.nilu.no/
#>         geospatial_bounds: POINT Z (-34.35348 18.48968 230.0)
#>         geospatial_bounds_crs: EPSG:4979
#>         geospatial_lat_min: -34.35348
#>         geospatial_lat_max: -34.35348
#>         geospatial_lon_min: 18.48968
#>         geospatial_lon_max: 18.48968
#>         geospatial_vertical_min: 230
#>         geospatial_vertical_max: 230
#>         geospatial_vertical_positive: up
#>         time_coverage_start: 2006-01-01T00:00:00 UTC
#>         time_coverage_end: 2009-01-01T00:00:00 UTC
#>         time_coverage_duration: P0003-00-00T00:00:00
#>         time_coverage_resolution: P0000-00-00T01:00:00
#>         timezone: UTC
#>         Metadata_Conventions: Unidata Dataset Discovery v1.0
#>         geospatial_lat_units: degrees_north
#>         geospatial_lon_units: degrees_east
#>         comment: {
#>     "Data definition": "EBAS_1.1", 
#>     "Set type code": "TU", 
#>     "Timezone": "UTC", 
#>     "File name": "ZA0001G.20060101000000.20150328092423.filter_absorption_photometer.aerosol_absorption_coefficient..3y.1h.ZA02L_Radiance-Research_PSAP-3W_CPT.ZA02L_abs_coef.lev2.nc", 
#>     "File creation": "20190514104410", 
#>     "Startdate": "20060101000000", 
#>     "Revision date": "20150328092423", 
#>     "Version": "1", 
#>     "Version description": "Version numbering not tracked, generated by data.aggregate.ebas r4678", 
#>     "Data level": "2", 
#>     "Period code": "3y", 
#>     "Resolution code": "1h", 
#>     "Sample duration": "1h", 
#>     "Orig. time res.": "1mn", 
#>     "Station code": "ZA0001G", 
#>     "Platform code": "ZA0001S", 
#>     "Station name": "Cape Point", 
#>     "Station GAW-ID": "CPT", 
#>     "Station latitude": "-34.35348", 
#>     "Station longitude": "18.48968", 
#>     "Station altitude": "230.0 m", 
#>     "Measurement height": "30.0 m", 
#>     "Regime": "IMG", 
#>     "Component": "aerosol_absorption_coefficient", 
#>     "Unit": "1/Mm", 
#>     "Laboratory code": "ZA02L", 
#>     "Instrument type": "filter_absorption_photometer", 
#>     "Instrument name": "Radiance-Research_PSAP-3W_CPT", 
#>     "Instrument manufacturer": "Radiance-Research", 
#>     "Instrument model": "PSAP-3W", 
#>     "Instrument serial number": "86", 
#>     "Method ref": "ZA02L_abs_coef", 
#>     "Standard method": "Single-angle_Correction=Bond1999", 
#>     "Inlet type": "Impactor--direct", 
#>     "Humidity/temperature control": "Heating to 40% RH, limit 40 deg. C", 
#>     "Volume std. temperature": "273.15 K", 
#>     "Volume std. pressure": "1013.25 hPa", 
#>     "Detection limit": "0.1 1/Mm", 
#>     "Detection limit expl.": "Determined by instrument noise characteristics, no detection limit flag used", 
#>     "Zero/negative values code": "Zero/negative possible", 
#>     "Zero/negative values": "Zero and neg. values may appear due to statistical variations at very low concentrations", 
#>     "Organization": "ZA02L, \"South African Weather Service, Cape Point Global Atmosphere Watch\", SAWS, , \"c/o CSIR, P.O. Box 320, 7599, Stellenbosch, South Africa\", , , , ", 
#>     "Frameworks": "GAW-WDCA NOAA-ESRL", 
#>     "Originator": "Labuschagne, Casper, , , , , , , , , ", 
#>     "Submitter": "Labuschagne, Casper, , , , , , , , , ", 
#>     "Acknowledgement": "Request acknowledgement details from data originator", 
#>     "Comment": "Bond1999-K1=0.02"
#> }
#>         standard_name_vocabulary: CF-1.7, ACDD-1.3
#>         history: None
#>         creator_url: ebas.nilu.no

# And then close the resource.

ncdf4::nc_close(X)

Created on 2022-04-25 by the reprex package (v2.0.1)
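
If you go on to pull values out of a file, something along these lines might be a starting point. It is only a sketch: `read_opendap_var()` and `days_to_time()` are hypothetical helper names, not part of thredds or ncdf4. The first wraps `nc_open()`, `ncvar_get()` and `nc_close()` so the connection is closed even if the read fails. The second converts the time axis, assuming the units "days since 1900-01-01 00:00:00 UTC" shown in the printout above.

```r
# Open a resource, read one variable, and always close the connection.
read_opendap_var <- function(url, varname) {
  nc <- ncdf4::nc_open(url)
  on.exit(ncdf4::nc_close(nc), add = TRUE)  # runs on normal exit and on error
  ncdf4::ncvar_get(nc, varname)
}

# Convert "days since <origin>" (UTC) to POSIXct.
# The default origin matches the units printed for this file; check other files.
days_to_time <- function(days, origin = "1900-01-01") {
  as.POSIXct(days * 86400, origin = origin, tz = "UTC")
}

# Usage (needs network access and the ncdf4 package), not run here:
# url  <- file.path(base.url, x$urlPath[1])
# absc <- read_opendap_var(url, "aerosol_absorption_coefficient_amean_0")
# tm   <- days_to_time(read_opendap_var(url, "time"))
```

Each call opens and closes the resource, which is simple but makes a round trip per variable. For many variables from the same file it may be better to open once and read them all before closing.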

Wonderful and amazing. 👍
Thanks a lot!