Known issues

Question

vmikk opened this issue 2 years ago

Filtering data by occurrence issues is not implemented.
Currently, Apache Arrow does not support filtering on columns with array-type data (e.g., the issue column in the GBIF occurrence dump). See ARROW-16702 [apache/arrow issue 31991] and ARROW-16641 [apache/arrow issue 32045]. Filtering with DuckDB is possible, but the query consumes >100 GB of RAM.
Estimation of phylogenetic diversity (PD) for grid cells with a single species
Estimation of the memory required for each process is not implemented
(could be useful if the pipeline is launched on HPC or in the cloud)
Given the size of the input, it should be possible to estimate the amount of RAM required for a task.
But we first need to collect data for various use-case scenarios to make even a rough estimate.
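
Once such measurements exist, a rough estimate could be derived along these lines. This is only a sketch: it assumes peak RAM grows roughly linearly with input size, which has not been verified for this pipeline. The function names, the safety factor, and the numbers are hypothetical placeholders, not real measurements.

```python
# Hypothetical sketch: predict peak RAM for a task from its input size
# using an ordinary least-squares line fitted to past runs.

def fit_line(input_sizes_gb, peak_ram_gb):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(input_sizes_gb)
    if n < 2 or n != len(peak_ram_gb):
        raise ValueError("need at least two paired measurements")
    mean_x = sum(input_sizes_gb) / n
    mean_y = sum(peak_ram_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in input_sizes_gb)
    if sxx == 0:
        raise ValueError("input sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(input_sizes_gb, peak_ram_gb))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_ram_gb(input_size_gb, slope, intercept, safety_factor=1.5):
    """Predict RAM for a new input and pad it with an assumed safety margin."""
    predicted = slope * input_size_gb + intercept
    return max(0.0, predicted * safety_factor)


# Placeholder values for illustration only (NOT measured on the pipeline).
observed_input_gb = [1.0, 2.0, 4.0, 8.0]
observed_peak_ram_gb = [3.0, 5.0, 9.0, 17.0]

slope, intercept = fit_line(observed_input_gb, observed_peak_ram_gb)
request_gb = estimate_ram_gb(10.0, slope, intercept)
```

The padded value could then be used as the memory request for the corresponding process on HPC or in the cloud. If the relationship turns out not to be linear, a different model would be needed.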