AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

v2.4.1 vs v2.2 biocache load of GBIF DWcA - v2.4.1 sorting takes very long time (30x)

jloomisVCE opened this issue

I am working with biocache-store v2.4.1 within bioatlas/ala-docker. While building an alpha site, I attempted to load a GBIF download of ~4.3 million records to bootstrap the db. In v2.4.1, 'biocache load drxx' appeared to hang after retrieving the zip file from the collectory and unzipping it. Looking at /data/biocache-load/drxx, the pre-processing step that creates e.g. occurrence.txt-sorted was taking a long time: 103 minutes.

I reverted to biocache-store v2.2 within the same bioatlas/ala-docker system. In that case, the same call to 'biocache load drxx' completed the pre-process sorting in 3 minutes.

I believe that the configuration parameters are the same for both, so the difference appears to be the released version.

See attached file.
2.2-vs-2.4.1-biocache-load-dr7-gbif-download.txt

This is possibly a performance regression caused by an upstream fix that switched to safe CSV sorting. The previous method used the GNU coreutils sort program, which is unsafe here because it sorts line by line and so only works if the CSV files never contain quoted newline characters.
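
To illustrate why a line-oriented sort is unsafe for CSV, here is a minimal Python sketch. It is not biocache-store's implementation. The sample data, the column names, and the `line_sort` and `record_sort` helpers are all hypothetical, and the in-memory sort stands in for whatever the real loader does with multi-million-record files.

```python
import csv
import io

# Hypothetical sample: the "remarks" field of record 3 contains a quoted newline.
SAMPLE = 'id,remarks\n3,"seen at\n1 km marker"\n2,plain\n'


def line_sort(text):
    """Sort physical lines, as a line-oriented tool would (header kept first)."""
    header, *lines = text.splitlines()
    return "\n".join([header] + sorted(lines)) + "\n"


def record_sort(text):
    """Parse with a CSV reader so quoted newlines stay inside their field,
    then sort whole records (lexicographically, by the first column first)."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    records = sorted(reader)
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(records)
    return out.getvalue()


if __name__ == "__main__":
    print(line_sort(SAMPLE))    # record 3 is split into two unrelated lines
    print(record_sort(SAMPLE))  # record 3 stays in one piece
```

In the line-sorted output the second half of record 3 ends up ahead of record 2 while its first half lands at the end, so the file is no longer valid CSV for that record. The record-aware version has to parse every field, and that extra work is one plausible source of the slowdown reported here.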