dvc exp run (or dvc repro) in monorepo: inefficient crawling
tibor-mach opened this issue · comments
Bug Report
Description
In a monorepo scenario with a .dvc directory at the root of the monorepo and multiple subdirectory projects (each with its own dvc.yaml file), dvc repro seems to check the entire monorepo even when explicitly given a dvc.yaml file from a subdirectory (and even when run from that subdirectory). I am not sure why it does that, but with a particularly large monorepo this can slow things down considerably. For example, with the example repo below set to 1000 projects, the time to run simple experiments increases from about 2 seconds to about 24 seconds (1000 projects is a lot, but they are very simple, and so is their directory structure).
Even if the other directories don't have a dvc.yaml file in them at all, dvc repro still tries to collect stages from there (whereas I would expect it not to even look outside of the PWD).
With dvc exp run the pattern is the same, only a bit more is going on there since the command does more than just dvc repro.
Reproduce
There is a testing repo here with instructions on how to test this and reproduce the issue in the README.
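For readers without access to the linked repo, a comparable layout could be generated with a short script. This is only a sketch under stated assumptions: the directory names, the make_projects helper, and the one-stage pipeline are hypothetical and are not taken from the testing repo. It also assumes git init and dvc init have been run separately at the monorepo root.

```python
from pathlib import Path
from typing import List

# Hypothetical minimal pipeline; the linked testing repo may define other stages.
DVC_YAML = """\
stages:
  hello:
    cmd: echo hello > out.txt
    outs:
      - out.txt
"""


def make_projects(root: Path, n_projects: int) -> List[Path]:
    """Create n_projects subdirectories under root, each with its own dvc.yaml."""
    created = []
    for i in range(n_projects):
        # Directory naming (project_0000, project_0001, ...) is an assumption.
        project = root / f"project_{i:04d}"
        project.mkdir(parents=True, exist_ok=True)
        (project / "dvc.yaml").write_text(DVC_YAML)
        created.append(project)
    return created
```

Calling make_projects(Path("."), 1000) from the monorepo root would produce a project count similar to the one described above.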
Expected
I would expect dvc repro to only scan the directory of the dvc.yaml file (and its subdirectories) and not go through the entire directory tree. The same goes for dvc exp run.
Additional Information (if any):
Here are some logs that I generated with verbose runs of dvc repro and dvc exp run. The first two are outputs when this is run from a single project in a monorepo with 5 projects in total (all of them with their own dvc.yaml). The last one is run in a monorepo with 2 projects, one of which does not contain any dvc.yaml file at all:
dvc_repro.log
dvc_exp_run.log
dvc_exp_run_projects_wo_dvc.log
As mentioned in Slack, the solution here is to use -s, --single-item, like dvc exp run -s dvc.yaml.
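To check whether -s actually helps in a given monorepo, the two invocations can be timed side by side. The helper below is a minimal sketch, not part of DVC: time_command is a hypothetical name, and the commented usage assumes dvc is on the PATH and that the script is run from inside one project directory.

```python
import subprocess
import sys
import time
from typing import List, Optional


def time_command(cmd: List[str], cwd: Optional[str] = None) -> float:
    """Run cmd to completion and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(cmd, cwd=cwd, check=True, capture_output=True)
    return time.perf_counter() - start


# Example usage (assumes dvc is installed; run from a project subdirectory):
#   full = time_command(["dvc", "exp", "run"])
#   single = time_command(["dvc", "exp", "run", "-s", "dvc.yaml"])
#   print(f"default: {full:.1f}s, with -s: {single:.1f}s")
```

Wall-clock timings like these include process startup and any caching effects between runs, so they are best read as a rough comparison rather than a precise benchmark.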