vmikk / PhyloNext

A pipeline for phylogenetic diversity analysis of GBIF-mediated data

Home Page: https://phylonext.github.io

Known issues

vmikk opened this issue

  • Data filtering by occurrence issue flags is not implemented.
    Apache Arrow currently does not support filtering on columns with array-type data (e.g., the issue column in the GBIF occurrence dump); see ARROW-16702 [apache/arrow issue 31991] and ARROW-16641 [apache/arrow issue 32045]. Filtering with DuckDB is possible, but the query consumes >100 GB of RAM.

  • Estimation of PD (phylogenetic diversity) for grid cells with a single species

  • Estimation of the memory required for each process is not implemented
    (this could be useful if the pipeline is launched on an HPC cluster or in the cloud).
    Given the size of the input, it should be possible to guess the amount of RAM required for a task,
    but we first need to collect data for various use-case scenarios to make a rough estimate.
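
For the issue-flag filtering item above, one possible low-memory workaround is to filter records row by row instead of running a single large query. The sketch below is not part of the pipeline. It assumes the occurrences are available as tab-separated text with the issue flags joined by semicolons in an `issue` column; the function names, the excluded-flag set, and the sample records are hypothetical.

```python
import csv
import io

# Hypothetical choice of GBIF issue flags to exclude; adjust to the analysis at hand.
EXCLUDED_ISSUES = {"ZERO_COORDINATE", "COUNTRY_COORDINATE_MISMATCH"}


def read_occurrences(handle):
    """Lazily read tab-separated occurrence records as dictionaries."""
    return csv.DictReader(handle, delimiter="\t")


def filter_by_issue(rows, excluded=EXCLUDED_ISSUES, column="issue", sep=";"):
    """Yield only the records that carry none of the excluded issue flags."""
    for row in rows:
        raw = row.get(column) or ""
        flags = {flag.strip() for flag in raw.split(sep) if flag.strip()}
        if flags.isdisjoint(excluded):
            yield row


# Made-up sample records, for illustration only.
sample = io.StringIO(
    "gbifID\tspecies\tissue\n"
    "1\tQuercus robur\t\n"
    "2\tFagus sylvatica\tZERO_COORDINATE;COORDINATE_ROUNDED\n"
    "3\tPinus sylvestris\tCOORDINATE_ROUNDED\n"
)
kept = [row["gbifID"] for row in filter_by_issue(read_occurrences(sample))]
print(kept)  # ['1', '3']
```

Because both functions are generators, only one record is held in memory at a time, at the cost of a full sequential pass over the input.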
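
For the memory-estimation item above, a rough estimate could be derived from collected benchmark runs once they exist. The sketch below assumes that peak RAM grows roughly linearly with input size, which has not been verified for this pipeline. The function names, the safety factor, and all numbers are placeholders rather than measurements.

```python
def fit_linear(sizes_gb, peak_ram_gb):
    """Least-squares fit of: peak RAM = intercept + slope * input size."""
    n = len(sizes_gb)
    if n < 2 or n != len(peak_ram_gb):
        raise ValueError("need at least two paired observations")
    mean_x = sum(sizes_gb) / n
    mean_y = sum(peak_ram_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_gb)
    if sxx == 0:
        raise ValueError("input sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_gb, peak_ram_gb))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate_ram_gb(input_gb, intercept, slope, safety_factor=1.5):
    """Predict the RAM to request for a task, padded by a safety factor."""
    return max(0.0, intercept + slope * input_gb) * safety_factor


# Placeholder observations (input size in GB, peak RAM in GB); NOT real benchmarks.
sizes = [1.0, 2.0, 4.0]
peaks = [3.0, 5.0, 9.0]

intercept, slope = fit_linear(sizes, peaks)
print(round(estimate_ram_gb(10.0, intercept, slope), 1))  # 31.5 with these placeholder numbers
```

A separate fit per process is likely needed, since different steps may scale differently with the input; the safety factor trades wasted allocation against the risk of out-of-memory failures.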