datalad / datalad

Keep code, data, containers under control with git and git-annex

Home Page:http://datalad.org

`datalad save` slows down superlinearly with larger numbers of files (but not git annex add)

psadil opened this issue

What is the problem?

I'm working with rather large datasets (10k - 100K files) and datalad save is taking much longer than seems necessary. For example, in one dataset, the progress bar for datalad save estimates that the process will take around 36 hours. However, git/git-annex commands are able to finish in about 20 minutes.

I've been testing on my local laptop (Apple M1), where one issue seems to be that datalad slows down superlinearly with increasing numbers of files. The following plot shows the time it takes datalad vs git/git-annex alone to add/commit 1K, 10K, or 100K files, relative to working with 1K files.

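[Plot: "Time Relative to 1K Files" (y-axis) against "N Files" (x-axis, log scale), colored by command (datalad vs. annex), with line type indicating J]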

In particular, notice that the relative time for datalad at 100K files (rightmost point) is over 100, whereas it is below 100 for annex. Also, -J doesn't seem to help datalad save.

FWIW, the datasets live on a cluster that uses an NFS filesystem. On that cluster, jobs have a maximum runtime of 48 hours. To test that the jobs would be able to finish in time, I ran smaller batches (e.g., trying to save datasets that were 1/6 of the final size). The estimates based on those runs ended up being wildly wrong, and my current suspicion is that this is related to the superlinear slowdown.
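
To illustrate why estimates from smaller batches can undershoot: if runtime grows like n^a with a > 1, scaling a measured time by the size ratio alone underestimates the full run. A rough sketch, assuming a simple power law (the function name `extrapolate_seconds`, the 3600 s figure, and the exponent 1.2 are hypothetical, not measurements from the cluster):

```shell
#!/bin/bash
# extrapolate_seconds SECONDS RATIO EXPONENT (hypothetical helper)
# Predicts the full-run time as SECONDS * RATIO^EXPONENT, i.e. it assumes
# runtime follows a power law in the number of files.
extrapolate_seconds() {
    awk -v t="$1" -v k="$2" -v a="$3" 'BEGIN { printf "%.2f\n", t * k ^ a }'
}

# Made-up example: a batch 1/6 of the final size that took 3600 s.
extrapolate_seconds 3600 6 1.0   # linear assumption
extrapolate_seconds 3600 6 1.2   # assumed superlinear exponent
```

The gap between the two predictions widens as the size ratio or the exponent grows, so the exponent would need to be measured on the actual filesystem before trusting either number.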

What steps will reproduce the problem?

Here's a script to test the timing

#!/bin/bash

save() {
    local cmd=$1
    local J=$2
    case $cmd in
        datalad)
            datalad  -l critical save -J "${J}" .
            ;;
        annex)
            git annex add -q -J "${J}" . \
            && git commit -m "annexed"
            ;;
    esac
}
export -f save
TIMEFORMAT=%R 

cwd=$PWD
log=$cwd/times.csv
echo "n_files,bytes,cmd,J,seconds" > "${log}"

for bytes in 1; do
    echo "bytes: ${bytes}"
    for n_files in 1000 10000 100000; do
        echo "n_files: ${n_files}"
        for cmd in datalad annex; do
            echo "cmd: ${cmd}"
            for J in 1 4; do
                ds=/tmp/dataset
                mkdir ${ds} || exit \
                && cd $ds \
                && datalad create -d $ds
                for file in $(seq 1 $n_files); do
                    truncate -s "${bytes}" "${file}".txt
                done
                echo "J: ${J}"
                cmdlog="${cwd}/cmd=${cmd}_n=${n_files}_J=${J}.log"
                { time save "${cmd}" "${J}" ;} &> "${cmdlog}"
                seconds=$(grep -E "^[0-9]" "${cmdlog}")
                printf "%d,%d,%s,%d,%f \n" "${n_files}" "${bytes}" "${cmd}" "${J}" "${seconds}" >> "${log}"
                cd "${cwd}" \
                && chmod -R a+rw "${ds}" \
                && rm -rf "${ds}"
            done
        done
    done
done

This will produce a times.csv

n_files,bytes,cmd,J,seconds
1000,1,datalad,1,2.089000 
1000,1,datalad,4,2.124000 
1000,1,annex,1,1.735000 
1000,1,annex,4,0.979000 
10000,1,datalad,1,16.905000 
10000,1,datalad,4,16.626000 
10000,1,annex,1,15.120000 
10000,1,annex,4,7.176000 
100000,1,datalad,1,240.631000 
100000,1,datalad,4,243.603000 
100000,1,annex,1,153.590000 
100000,1,annex,4,72.167000 
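
One way to quantify the slowdown from this table is the log-log slope between consecutive dataset sizes: a slope of 1 means time grows linearly with the number of files, and anything above 1 is superlinear. The following is a minimal sketch, not part of datalad. The function name `scaling_summary` is hypothetical, and it assumes the rows for each cmd/J pair appear in increasing n_files order, as they do in times.csv above.

```shell
#!/bin/bash
# scaling_summary (hypothetical helper): for each cmd/J pair, print the time
# relative to the first (smallest) run and the log-log slope versus the
# previous run. Assumes rows per pair are ordered by increasing n_files.
scaling_summary() {
    awk -F, '
        NR == 1 { print "cmd,J,n_files,relative,slope"; next }  # replace header
        NF < 5  { next }                                        # skip blank or short lines
        {
            key = $3 "," $4        # group by command and J
            n = $1 + 0
            secs = $5 + 0          # numeric conversion ignores the trailing space
            if (key in base) {
                # slope of log(time) against log(n_files) since the previous row
                slope = sprintf("%.2f", log(secs / prev_s[key]) / log(n / prev_n[key]))
            } else {
                base[key] = secs   # first row of the group is the baseline
                slope = "NA"
            }
            printf "%s,%d,%.1f,%s\n", key, n, secs / base[key], slope
            prev_s[key] = secs
            prev_n[key] = n
        }
    ' "${1:-times.csv}"
}

# Usage: scaling_summary times.csv
```

On the numbers above, the slope between 10K and 100K files comes out at roughly 1.15 for datalad and roughly 1.0 for annex (J=1).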

If you want to make that plot (R script)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
read_csv("times.csv") |> 
  mutate(s1k = seconds[which(n_files==1000)], .by=c(J,cmd)) |> 
  mutate(J=factor(J), seconds=(seconds/s1k) ) |> 
  ggplot(aes(x=n_files, y=seconds, color=cmd)) + 
  geom_line(aes(linetype=J)) + 
  geom_point() + 
  scale_x_log10(name = "N Files") + 
  scale_y_continuous(
    breaks = seq(0,120, by=20), 
    labels = seq(0,120, by=20), 
    name = "Time Relative to 1K Files") + 
  scale_color_viridis_d(option = "turbo", begin = 0.2, end=0.8)
#> Rows: 12 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): cmd
#> dbl (4): n_files, bytes, J, seconds
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Created on 2023-10-09 with reprex v2.0.2

R Session info
sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] readr_2.1.4   ggplot2_3.4.3 dplyr_1.1.3  
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.2.3    pillar_1.9.0      highr_0.9         R.methodsS3_1.8.2
#>  [5] R.utils_2.12.2    tools_4.2.3       bit_4.0.5         digest_0.6.29    
#>  [9] viridisLite_0.4.2 evaluate_0.15     lifecycle_1.0.3   tibble_3.2.1     
#> [13] gtable_0.3.4      R.cache_0.16.0    pkgconfig_2.0.3   rlang_1.1.1      
#> [17] reprex_2.0.2      cli_3.6.1         rstudioapi_0.15.0 curl_5.1.0       
#> [21] parallel_4.2.3    yaml_2.3.5        xfun_0.39         fastmap_1.1.0    
#> [25] xml2_1.3.5        httr_1.4.7        withr_2.5.1       styler_1.10.2    
#> [29] stringr_1.5.0     knitr_1.39        hms_1.1.3         generics_0.1.3   
#> [33] fs_1.6.3          vctrs_0.6.3       bit64_4.0.5       grid_4.2.3       
#> [37] tidyselect_1.2.0  glue_1.6.2        R6_2.5.1          fansi_1.0.4      
#> [41] vroom_1.6.4       rmarkdown_2.14    farver_2.1.1      tzdb_0.4.0       
#> [45] purrr_1.0.2       magrittr_2.0.3    scales_1.2.1      htmltools_0.5.4  
#> [49] mime_0.12         colorspace_2.1-0  utf8_1.2.3        stringi_1.7.6    
#> [53] munsell_0.5.0     crayon_1.5.2      R.oo_1.25.0

DataLad information

> `datalad --version`
datalad 0.19.3

> git annex version
git-annex version: 10.20230926
build flags: Assistant Webapp Pairing FsEvents TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.33 DAV-1.3.4 feed-1.3.2.1 ghc-9.4.4 http-client-0.7.14 persistent-sqlite-2.13.1.1 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: darwin aarch64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
datalad wtf
> datalad wtf
# WTF
## configuration <SENSITIVE, report disabled by configuration>
## credentials 
  - keyring: 
    - active_backends: 
      - macOS Keyring
      - PlaintextKeyring with no encyption v.1.0 at /Users/psadil/.local/share/python_keyring/keyring_pass.cfg
    - config_file: /Users/psadil/.config/python_keyring/keyringrc.cfg
    - data_root: /Users/psadil/.local/share/python_keyring
## datalad 
  - version: 0.19.3
## dataset 
  - branches: 
    - enh/parallel-copy@39ad344
    - main@b22a00c
  - id: None
  - path: /Users/psadil/git/a2cps/release
  - repo: GitRepo
## dependencies 
  - annexremote: 1.2.1
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 10.20230926
  - cmd:bundled-git: UNKNOWN
  - cmd:git: 2.42.0
  - cmd:ssh: 9.0p1
  - cmd:system-git: 2.42.0
  - cmd:system-ssh: 9.0p1
  - humanize: 4.8.0
  - iso8601: 2.1.0
  - keyring: 24.2.0
  - keyrings.alt: 4.2.0
  - msgpack: 1.0.6
  - platformdirs: 3.11.0
  - requests: 2.31.0
## environment 
  - LANG: en_US.UTF-8
  - PATH: /Users/psadil/mambaforge/envs/snapshot-v-test/bin:/Users/psadil/mambaforge/condabin:/Users/psadil/fsl/share/fsl/bin:/Users/psadil/fsl/share/fsl/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Library/Apple/usr/bin:/Applications/quarto/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Users/psadil/Applications/quarto/bin:/Users/psadil/.local/bin:/Users/psadil/.local/bin
## extensions 
## git-annex 
  - build flags: 
    - Assistant
    - Webapp
    - Pairing
    - FsEvents
    - TorrentParser
    - MagicMime
    - Benchmark
    - Feeds
    - Testsuite
    - S3
    - WebDAV
  - dependency versions: 
    - aws-0.24.1
    - bloomfilter-2.0.1.2
    - crypton-0.33
    - DAV-1.3.4
    - feed-1.3.2.1
    - ghc-9.4.4
    - http-client-0.7.14
    - persistent-sqlite-2.13.1.1
    - torrent-10000.1.3
    - uuid-1.3.15
    - yesod-1.6.2.1
  - key/value backends: 
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
    - X*
  - operating system: darwin aarch64
  - remote types: 
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - httpalso
    - borg
    - hook
    - external
  - supported repository versions: 
    - 8
    - 9
    - 10
  - upgrade supported from repository versions: 
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
    - 7
    - 8
    - 9
    - 10
  - version: 10.20230926
## location 
  - path: /Users/psadil/git/a2cps/release
  - type: dataset
## metadata.extractors 
## metadata.filters 
## metadata.indexers 
## python 
  - implementation: CPython
  - version: 3.11.6
## system 
  - distribution: darwin/22.6.0 13.5.2/arm64
  - encoding: 
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - filesystem: 
    - CWD: 
      - max_pathlength: 1024
      - mount_opts: ro,local,rootfs,dovolfs,journaled,multilabel
      - path: /Users/psadil/git/a2cps/release/tests
      - type: apfs
    - HOME: 
      - max_pathlength: 1024
      - mount_opts: ro,local,rootfs,dovolfs,journaled,multilabel
      - path: /Users/psadil
      - type: apfs
    - TMP: 
      - max_pathlength: 1024
      - mount_opts: ro,local,rootfs,dovolfs,journaled,multilabel
      - path: /var/folders/v_/kcpb096s1m3_37ctfd2sp2xm0000gn/T
      - type: apfs
  - max_path_length: 293
  - name: Darwin
  - release: 22.6.0
  - type: posix
  - version: Darwin Kernel Version 22.6.0: Wed Jul  5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000
mamba env
name: snapshot-v-test
channels:
  - conda-forge
dependencies:
  - annexremote=1.2.1=py_0
  - boto=2.49.0=py_0
  - brotli-python=1.1.0=py311ha891d26_1
  - bzip2=1.0.8=h3422bc3_4
  - c-ares=1.20.1=h93a5062_0
  - ca-certificates=2023.7.22=hf0a4a13_0
  - certifi=2023.7.22=pyhd8ed1ab_0
  - chardet=5.2.0=py311h267d04e_1
  - charset-normalizer=3.3.0=pyhd8ed1ab_0
  - colorama=0.4.6=pyhd8ed1ab_0
  - curl=8.3.0=hc52a3a8_0
  - datalad=0.19.3=py311h267d04e_0
  - distro=1.8.0=pyhd8ed1ab_0
  - exifread=3.0.0=pyhd8ed1ab_0
  - fasteners=0.17.3=pyhd8ed1ab_0
  - freetype=2.12.1=hadb7bae_2
  - gettext=0.21.1=h0186832_0
  - git=2.42.0=pl5321h46e2b6d_0
  - humanize=4.8.0=pyhd8ed1ab_0
  - idna=3.4=pyhd8ed1ab_0
  - importlib-metadata=6.8.0=pyha770c72_0
  - importlib_metadata=6.8.0=hd8ed1ab_0
  - iso8601=2.1.0=pyhd8ed1ab_0
  - jaraco.classes=3.3.0=pyhd8ed1ab_0
  - keyring=24.2.0=py311h267d04e_1
  - keyrings.alt=4.2.0=pyhd8ed1ab_0
  - krb5=1.21.2=h92f50d5_0
  - lcms2=2.15=hf2736f0_3
  - lerc=4.0.0=h9a09cb3_0
  - libcurl=8.3.0=hc52a3a8_0
  - libcxx=16.0.6=h4653b0c_0
  - libdeflate=1.19=hb547adb_0
  - libedit=3.1.20191231=hc8eb9b7_2
  - libev=4.33=h642e427_1
  - libexpat=2.5.0=hb7217d7_1
  - libffi=3.4.2=h3422bc3_5
  - libiconv=1.17=he4db4b2_0
  - libjpeg-turbo=3.0.0=hb547adb_1
  - libnghttp2=1.52.0=hae82a92_0
  - libpng=1.6.39=h76d750c_0
  - libsqlite=3.43.0=hb31c410_0
  - libssh2=1.11.0=h7a5bd25_0
  - libtiff=4.6.0=ha8a6c65_2
  - libwebp-base=1.3.2=hb547adb_0
  - libxcb=1.15=hf346824_0
  - libzlib=1.2.13=h53f4e23_5
  - looseversion=1.3.0=pyhd8ed1ab_0
  - more-itertools=10.1.0=pyhd8ed1ab_0
  - msgpack-python=1.0.6=py311he4fd1f5_0
  - mutagen=1.47.0=pyhd8ed1ab_0
  - ncurses=6.4=h7ea286d_0
  - openjpeg=2.5.0=h4c1507b_3
  - openssl=3.1.3=h53f4e23_0
  - p7zip=16.02=hbdafb3b_1001
  - patool=1.12=py311h267d04e_1007
  - pcre2=10.40=hb34f9b4_0
  - perl=5.32.1=4_hf2054a2_perl5
  - pillow=10.0.1=py311h8dc27b9_2
  - pip=23.2.1=pyhd8ed1ab_0
  - platformdirs=3.11.0=pyhd8ed1ab_0
  - psutil=5.9.5=py311heffc1b2_1
  - pthread-stubs=0.4=h27ca646_1001
  - pyperclip=1.8.2=pyhd8ed1ab_2
  - pysocks=1.7.1=pyha2e5f31_6
  - python=3.11.6=h47c9636_0_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python-gitlab=3.15.0=pyhd8ed1ab_0
  - python_abi=3.11=4_cp311
  - readline=8.2=h92ec313_1
  - requests=2.31.0=pyhd8ed1ab_0
  - requests-ftp=0.3.1=py_1
  - requests-toolbelt=1.0.0=pyhd8ed1ab_0
  - setuptools=68.2.2=pyhd8ed1ab_0
  - simplejson=3.19.2=py311h05b510d_0
  - six=1.16.0=pyh6c4a22f_0
  - tk=8.6.13=hb31c410_0
  - tqdm=4.66.1=pyhd8ed1ab_0
  - typing-extensions=4.8.0=hd8ed1ab_0
  - typing_extensions=4.8.0=pyha770c72_0
  - tzdata=2023c=h71feb2d_0
  - urllib3=2.0.6=pyhd8ed1ab_0
  - wheel=0.41.2=pyhd8ed1ab_0
  - whoosh=2.7.4=py311h267d04e_8
  - xorg-libxau=1.0.11=hb547adb_0
  - xorg-libxdmcp=1.1.3=h27ca646_0
  - xz=5.2.6=h57fd34a_0
  - zipp=3.17.0=pyhd8ed1ab_0
  - zstd=1.5.5=h4f39d0f_0

Additional context

#4630 (and associated links). Perhaps the degree of superlinear slowdown is one way to motivate/target performance issues?

As I understand it, the point of #6977 is to allow datalad save to use git annex --batch. That addition seems likely to help.

Have you had any success using DataLad before?

yes