This folder contains configuration files needed to run POET scripts. It is currently used by the packages `dataflow` and `dbc-influxdb`.
Note that the database configuration is not stored in `configs`, but in a separate folder that has the same name as the `configs` folder but with the suffix `_secret`. If the `configs` folder is given e.g. in the script `dataflow` as `C:\configs`, then the database configuration is assumed to be stored in the folder `C:\configs_secret`. The `_secret` folder contains one single file `dbconf.yaml` with the following info:
```yaml
url: <URL AND PORT OF DATABASE>
token: <TOKEN FROM INFLUXDB>
org: <ORG>
```
- Basedirs network addresses, from Windows: server locations for source data dirs, used to search for source data when `dataflow` is run locally on-demand from a Windows computer; locations are therefore given as SMB addresses
- Basedirs from gl-calcs: mounts for source data dirs on the `gl-calcs` Linux computer, used to search for source data when `dataflow` is run automatically on `gl-calcs`; locations are therefore given as folder paths
- Output dirs: dirs for writing log files and other info output (CSV files)
- Sites subfolders: names of the site-specific subfolders
- Maps found units to a standardized name
- For example, `°C` will be changed to `degC`, `W/meter²` is changed to `W m-2`, etc.
- Generally, the key on the left is changed to the value on the right
- Some entries have `false` as value, which means the respective units are not changed but kept as is. Example: `degC`
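A minimal sketch of such a mapping, using the conversions named above (the entries shown are illustrative, following the key-to-value convention described here):

```yaml
# keys on the left are replaced by the values on the right
'°C': degC
'W/meter²': W m-2
# entries with value false are kept as is
degC: false
```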
- In general, the hierarchy is: `configs` > `filegroups` > `<datatype>` > `<site>` > `<filegroup>` > `<filetype>`
    - `filegroups` ... name of the folder in the repo
    - `<datatype>` ... `raw` or `processing`
    - `<site>` ... name of the site, e.g. `ch-aws`
    - `<filegroup>` ... `10_meteo`, `12_meteo_forestfloor`, etc. Filegroups correspond to the subfolders where we store the respective data files on the server and act as an additional identifier to group the various filetypes.
    - `<filetype>` ... file that defines the data structures of specific files
- To give an example, the filetype `FRU10-RAW-TBL1-201711201643-TOA5-DAT-1MIN.yaml` is defined in location: `configs` > `filegroups` > `raw` > `ch-fru` > `10_meteo` > `FRU10-RAW-TBL1-201711201643-TOA5-DAT-1MIN.yaml`

`filetypes` define how the respective raw data files are handled.
- The `yaml` files start with the filetype identifier (ID) at the top.
- The ID can be anything, there are no restrictions, but current IDs were named in a way that aims to give the most essential info for this filetype.
- For example, the ID `DAV13-RAW-NABEL-201901010000-CSV-1MIN` gives information about
    - the site (`DAV`),
    - the filegroup (`13`, which stands for `13_meteo_nabel`),
    - the data type (`RAW` for raw data),
    - the origin or data provider (`NABEL`),
    - the starting datetime from which files of this type are considered (`201901010000`). This starting datetime is set in relation to the date and time info found in the filename. In this example, with the starting datetime being `201901010000`, a file named `DAV_Meteo_NABEL_190101.CSV` would be valid for this filetype, but not a file `DAV_Meteo_NABEL_181231.CSV`. However, the starting datetime from the ID is not used to check the datetime validity of datafiles; this is done with the settings below,
    - the file delimiter or file extension (`CSV`), and finally
    - the time resolution (`1MIN`).
- The settings for this filetype are listed next.

`true` or `false`

- Defines whether the filetype is "seen" during the automatic execution of the `filescanner` script. Useful to exclude certain files during automatic uploads to the database. For example, final flux calculations are uploaded manually (on-demand) to the database; the results files are still stored on the server but should be ignored during the daily automatic data upload of other datafiles.
- datetime in the format `YYYY-MM-DD hh:mm:ss`
- The date/time info is read from the filename and then checked against this setting. It is assumed that the date/time info in the filename gives the starting date/time of the file. If the date/time from the filename is **later than or equal** to this setting, the file is valid for the respective filetype.
- Example: `filetype_valid_from: 2019-01-01 00:00:00`
    - `siteFile_20190419.CSV` is valid
    - `siteFile_20181231.CSV` is NOT valid
- datetime in the format `YYYY-MM-DD hh:mm:ss`
- The date/time info is read from the filename and then checked against this setting. It is assumed that the date/time info in the filename gives the starting date/time of the file. If the date/time from the filename is earlier than or equal to this setting, the file is valid for the respective filetype.
- Example: `filetype_valid_to: 2019-06-15 23:59:59`
    - `siteFile_20190419.CSV` is valid
    - `siteFile_20190727.CSV` is NOT valid
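Taken together, the two settings define an inclusive datetime window; a minimal sketch combining the examples above:

```yaml
# files whose filename datetime falls between these bounds (inclusive) are valid
filetype_valid_from: 2019-01-01 00:00:00
filetype_valid_to: 2019-06-15 23:59:59
```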
- string to identify files of this filetype
- Example: `FILE_*.dat` for the file `FILE_20231201-1450.dat`
- parsing string to parse datetime info from the filename
- Example: `FILE_%Y%m%d-%H%M.dat` for the file `FILE_20231201-1450.dat`
`true` or `false`

- select `true` to directly use `.gz` compressed files
`string`

- Example: `13_meteo_nabel`
- string that describes the (nominal) time resolution of the data files, e.g. `30T`
- can be a list of strings for `-ALTERNATING-` filetypes, e.g. `[ 30T, irregular ]`
- follows the convention of the `pandas` period aliases: `T` for 1MIN time resolution, `30T` for 30MIN time resolution, `1H` for hourly, etc.
`false` or `list` of `int` or empty `list`

- defines which rows should be ignored
- typically used to ignore rows at the start of the files
- important in connection with `data_headerrows`
- Example: `[ ]` to not ignore any row
- Example: `[ 0, 3 ]` to ignore the first and fourth rows in the file
`list` of `int`, can be `false`

- defines where to find the header of the file, i.e. the info about variables and units
- This is typically `[ 0, 1 ]` if the files contain variable names (first row) and units (second row), or `[ 0 ]` if the files contain only variable names (first row).
- Is `false` if the file does not contain any header row, which is the case especially for older files.
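A sketch combining the two settings for a file with variable names in the first row and units in the second row. Note that the key name for the row-skipping setting is not shown in this excerpt, so `data_skiprows` below is a hypothetical placeholder:

```yaml
data_skiprows: [ ]          # hypothetical key name; ignore no rows
data_headerrows: [ 0, 1 ]   # variable names in row 0, units in row 1
```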
`int` or `str` or `-9999`

- location of the timestamp column
- Example: `0` if the timestamp is found in the first column of the file
- Example: `TIMESTAMP_END` if the timestamp is found in the column with this name
- Example: `-9999` if there is no timestamp info in the file. In this case the timestamp has to be constructed from other available time/date info using a method defined in `data_build_timestamp`.
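A one-line sketch for the named-column case described above:

```yaml
data_timestamp_column: TIMESTAMP_END   # timestamp is read from this column
```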
`int`

- offset of the timestamp in relation to UTC
- important because the database stores all data in UTC
- Example: `1` sets the data timestamp index to timezone `UTC+01:00`, which corresponds to CET. Note that the timestamp per se is not altered, only the timezone info is added.
`string` or `false`

- parsing string to parse the datetime info
- Examples: `'%Y-%m-%d %H:%M:%S'`, `'%Y-%m-%d %H:%M:%S.%f'`
- `false` if there is no timestamp info in the file. In this case the timestamp has to be constructed from other available time/date info using a method defined in `data_build_timestamp`.
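A minimal sketch, parsing timestamps such as `2023-12-01 14:50:00`:

```yaml
data_timestamp_format: '%Y-%m-%d %H:%M:%S'
```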
`false` if there is a timestamp in the file and the timestamp column can be parsed

- If there is no timestamp column in the file, a timestamp can be constructed with:
    - `"YEAR0+MONTH1+DAY2+HOUR3+MINUTE4"` to build the timestamp from columns that give the year (first column, column index 0), month (second column), day (third column), hour (fourth column) and minutes (fifth column, column index 4).
    - `"YEAR+DOY+TIME"` to build the timestamp from the columns `YEAR`, `DOY` and `TIME`.
- In these cases `data_timestamp_column` must be `-9999` and `data_timestamp_format` must be `false`, because there is no index column in these data files.
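A sketch of the three settings working together when the timestamp must be constructed from separate date/time columns:

```yaml
data_timestamp_column: -9999   # no timestamp column in the file
data_timestamp_format: false   # nothing to parse directly
data_build_timestamp: "YEAR0+MONTH1+DAY2+HOUR3+MINUTE4"
```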
`list` of `int` or `false`

- some files have an identifier in the first column that identifies good data rows
- this setting was introduced because some data files store data from different data sources in the same file
- `[ 0, 104 ]` keeps all data rows where the data row starts with `104`, whereby `0` means that `104` is searched in the first column
- `[ 0, 102, 202 ]` keeps all data rows where the data row starts with `102` or `202`, whereby `0` means that `102` and `202` are searched in the first column. In this case the variables for ID `102` are described in `data_vars`, those for ID `202` in `data_vars2`. Files with this setting produce two dataframes, one for each ID.
- Different IDs can have different time resolutions, see setting `data_raw_freq`.
- Yes, this makes things quite a bit more confusing, doesn't it?
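A sketch of the two-ID case described above:

```yaml
# keep rows whose first column (index 0) contains ID 102 or 202;
# variables for ID 102 are described in data_vars, those for ID 202 in data_vars2
data_keep_good_rows: [ 0, 102, 202 ]
```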
`false` in almost all cases

- However, there were filetypes where this setting was necessary to ignore unconventional data rows.
- The setting `[ 0, "-999.9000-999.9000-999.9000-999.9000-999.9000" ]` was used for the filetypes `DAV17-RAW-P2-200001010000-NABEL-PRF-SSV-DAT-5MIN` and `DAV17-RAW-P2-200601010000-NABEL-PRF-SSV-DAT-5MIN` to ignore irregular data rows.
`list`

- defines which values to interpret as NAN (not a number, i.e. missing data)
- currently `[ -9999, nan, NaN, NAN, -6999, '-' ]` for all files
- during `dataflow` script execution there are some additional safeguards regarding NANs, e.g. some files contain the strings `inf` and `-inf` in their data, which are then removed during runtime. These two strings cannot be included in the `data_na_values` setting, presumably because listed here they would be interpreted as strings, while in the data Python parses them as numbers; they therefore have to be handled separately.
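The current setting as a snippet:

```yaml
data_na_values: [ -9999, nan, NaN, NAN, -6999, '-' ]
```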
`false` or `string`

- `"-ALTERNATING-"` identifies special formats that store data from multiple data sources, see also `data_keep_good_rows`
- `"-ICOSSEQ-"` identifies special formats that store data in the ICOS long-form format
- Data that have a special format are converted to a more regular format during the execution of `dataflow`.
- These data formats can also be identified from the filetype ID, e.g., `DAV10-RAW-PROFILE-200811211210-ALTERNATING-A-10MIN`.
- `utf-8` in almost all cases
- `cp1252` is used for `DAV13-RAW-NABEL-*` files, see here for an explanation about this encoding.
- `','` in most cases
- `';'` for some files
- `'\s+'` for NABEL files, e.g., `DAV17-RAW-P2-200001010000-NABEL-PRF-SSV-DAT-5MIN`
`false` in all cases so far

- means that the original datetime column(s) used to parse or construct the timestamp are removed
`string`, used to describe the version of the data

- `raw`: raw data
- `eddypro_level-0`: Level-0 (preliminary) flux data, see Flux Processing Chain
- `eddypro_level-1`: Level-1 flux data, see Flux Processing Chain
- `fluxnet_ww2020`: FLUXNET/ICOS Warm Winter 2020 ecosystem eddy covariance flux product release 2022-1
- `meteoscreening_mst`: quality-screened meteo data, using the old Python MeteoScreeningTool
- `meteoscreening_diive`: quality-screened meteo data, using `diive`, e.g. using its meteoscreening notebooks
- more will be added in the future
`true` for raw data variables that typically contain info about their location in the (standardized) variable name, e.g. `TA_T1_2_1` is the air temperature on the main tower (`T1`), at `2` m height above ground, replicate `1`. This location info is parsed and then stored as separate tags alongside the variable in the database: `T1` is stored as `hpos` (horizontal position), `2` as `vpos` (vertical position) and `1` as `repl` (replicate number).

`false` for data that do not have position indices, e.g., flux calculations simply output the calculated variable.

- This behavior can be overridden for specific variables for which position indices are available, by setting `parse_pos_indices: true` for the respective variables, as shown in the sketch below.
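A sketch of the override described above: the file-level setting is `false`, but position indices are still parsed for one variable (the `field` and `measurement` values are illustrative):

```yaml
data_vars_parse_pos_indices: false   # file-level default: do not parse position indices
data_vars:
  # per-variable override: hpos GF1, vpos 0.1, repl 2 are parsed from the field name
  SWC_GF1_0.1_2: { field: SWC_GF1_0.1_2, units: false, parse_pos_indices: true, measurement: SWC }
```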
`string`

- `free` means that the variables listed under `data_vars` are listed in no particular order; the variable names appear in the files.
- `strict` means the variables listed in `data_vars` are listed in sequence and the sequence must not be changed, because the files do not contain variable names. The variable names are taken directly from `data_vars`.
- Gives info about the variables found in the file with the format `<RAWVAR>: { field: <VAR>, units: <UNITS>, measurement: <MEASUREMENT> }`
    - `<RAWVAR>` ... name of the original raw data variable, e.g. `PT100_2_AVG`
    - `<VAR>` ... name of the renamed variable, following the naming convention, e.g. `T_RAD_T1_2_1`
    - `<UNITS>` ... `false` if units are given in the data file, otherwise a string, e.g. `degC`; units of `<VAR>` after applying `gain`, e.g. `degC`
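A minimal sketch assembled from the examples above (the `measurement` value `_RAW` is taken from the rawfunc examples below and is illustrative here):

```yaml
data_vars:
  PT100_2_AVG: { field: T_RAD_T1_2_1, units: degC, measurement: _RAW }
```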
- There are some optional parameters that can be used: `<RAWVAR>: { field: <VAR>, units: <UNITS>, gain: <GAIN>, offset: <OFFSET>, parse_pos_indices: <PARSE_POSITION_INDICES>, rawfunc: <RAWFUNC>, measurement: <MEASUREMENT> }`
    - `<GAIN>` ... OPTIONAL gain (`float`) that is applied to `<RAWVAR>` before upload to the database; `<UNITS>` describes the units of `<RAWVAR>` after the application of `<GAIN>`. Assumed `1.0` (`float`) if not given. Typically used to e.g. convert soil water content from `m3 m-3` to `%` by applying `gain: 100.0` (`float`).
    - `<OFFSET>` ... OPTIONAL offset (`float`) that is applied to `<RAWVAR>` before upload to the database; `<UNITS>` describes the units of `<RAWVAR>` after the application of `<OFFSET>`. Assumed `0.0` (`float`) if not given. Example: `offset: 14.0` (`float`).
    - `<PARSE_POSITION_INDICES>` ... OPTIONAL, `true` or `false`. If `true`, parses the position indices from the respective variable, using the name provided in `<VAR>`, even if the setting `data_vars_parse_pos_indices` (at the file level, see above) is set to `false`. For example, for the variable `SWC_GF1_0.1_2` the following position indices are parsed: horizontal position `GF1`, vertical position `0.1` and replicate `2`. This setting is useful for files where only some, but not all, variables contain position indices. Is assumed `false` if not given.
    - `<RAWFUNC>` ... OPTIONAL list; function executed on raw data to produce a new variable, e.g. for the calculation of `LW_IN_T1_2_1` from `PT100_2_AVG` and `LWin_2_AVG`, using the function `calc_lwin`. The relevant function is defined in the Python script `dataflow`. Important: `rawfunc: <RAWFUNC>` must not be given if no rawfunc is required; this means that `rawfunc: false` will not work. Currently there are some rawfuncs defined where they were needed, see the list of rawfuncs below. The currently implemented functions are shown in the `dataflow` repo here.
- Calculate long-wave incoming radiation (`LW_IN`):

  ```yaml
  LWin_1_AVG: { field: LW_IN_RAW_T1_2_1, units: false, rawfunc: [ calc_lw, PT100_1_AVG, LWin_1_AVG, LW_IN_T1_2_1 ], measurement: _RAW }
  ```

  The rawfunc `calc_lw` is used to calculate the new variable `LW_IN_T1_2_1` from the available raw data variables `PT100_1_AVG` (temperature of the radiation sensor in °C) and `LWin_1_AVG` (raw signal of LW_IN in mV).
(raw signal of LW_IN in mV). - Calculate soil water content (
SWC
) from SDP:
Theta_11_AVG: { field: SDP_GF1_0.05_1, units: mV, rawfunc: [ calc_swc ], measurement: SDP }
The rawfunccalc_swc
is used to calculate the new variableSWC
fromSDP
(soil dielectric permittivity,
unitless), whereby in this example the original raw name forSDP
is calledTheta_11_AVG
. The calculation is
site-specific,dataflow
checks the site and then applies the correct function to runcalc_swc
- Temperature correction for O2 measurements:

  ```yaml
  O2_GF4_0x1_1_Avg: { field: O2_GF4_0.1_1, units: false, rawfunc: [ correct_o2, O2_GF4_0x1_1_Avg, TO2_GF4_0x1_1_Avg ], measurement: O2 }
  ```

  The rawfunc `correct_o2` is used to calculate temperature-corrected soil O2 (in %) from the original O2 measurement `O2_GF4_0x1_1_Avg` (in %) and `TO2_GF4_0x1_1_Avg` (temperature of the O2 sensor in degC). The calculation is site-specific: `dataflow` checks the site and then applies the correct function to run `correct_o2`. The original O2 measurement is replaced by the corrected version.

- Apply gain between dates:

  ```yaml
  SHF_2_AVG: { field: G_GF1_0.03_2, units: W m-2, gain: 1.0, rawfunc: [ apply_gain_between_dates, "2010-03-31 10:30:00", "2010-07-28 09:30:00", 1.0115667782544568 ], measurement: G }
  ```

  The rawfunc `apply_gain_between_dates` is used to apply the gain `1.0115667782544568` to the variable `SHF_2_AVG`, but only between the provided dates; in this case all values between `2010-03-31 10:30:00` and `2010-07-28 09:30:00` are multiplied by the gain. The dates include the time and are inclusive (the gain is applied also to the provided start and stop dates). Important: for this rawfunc the regular gain of the respective time series also needs to be provided as float, here `gain: 1.0`. The original measurement `SHF_2_AVG` is replaced by the corrected values and is stored to the database as `G_GF1_0.03_2`.

- Add offset between dates:
  ```yaml
  TS_GF1_0x40_1: { field: TS_GF1_0.4_1, units: false, offset: 0.0, rawfunc: [ add_offset_between_dates, "2018-11-04 17:59:00", "2018-12-20 10:33:00", 52 ], measurement: TS }
  ```

  The rawfunc `add_offset_between_dates` is used to add the offset `52` to the variable `TS_GF1_0x40_1`, but only between the provided dates; in this case the offset is added to all values between `2018-11-04 17:59:00` and `2018-12-20 10:33:00`. The dates include the time and are inclusive (the offset is added also to the provided start and stop dates). Important: for this rawfunc the regular offset of the respective time series also needs to be provided as float, here `offset: 0.0`. The original measurement `TS_GF1_0x40_1` is replaced by the corrected values and is stored to the database as `TS_GF1_0.4_1`.
- same structure as `data_vars`
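A hypothetical sketch for an `-ALTERNATING-` file with a second data source ID (see `data_keep_good_rows` above); the placeholders follow the `data_vars` format:

```yaml
data_vars2:
  # variables for the second ID, same format as data_vars (entries are placeholders)
  <RAWVAR>: { field: <VAR>, units: <UNITS>, measurement: <MEASUREMENT> }
```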