iterative / dvc

🦉 ML Experiments and Data Management with Git

Home Page:https://dvc.org

dvc exp run (or dvc repro) in monorepo: inefficient crawling

tibor-mach opened this issue

Bug Report

Description

In a monorepo scenario with a .dvc directory at the root of the monorepo and multiple subdirectory projects (each with its own dvc.yaml file), dvc repro seems to check the entire monorepo even when explicitly given a dvc.yaml file from a subdirectory (and even when run from that subdirectory). I am not sure why it does that, but with a particularly large monorepo this can slow things down considerably. For example, with the example repo below set to 1000 projects, the time to run simple experiments increases from about 2 seconds to about 24 seconds (1000 projects is a lot, but they are very simple and so is their directory structure).

Even if the other directories don't have a dvc.yaml file in them at all, dvc repro is still trying to collect stages from there (whereas I would expect it not to even look outside of the PWD).

With dvc exp run the pattern is the same, only a bit more is going on there, since the command does more than just dvc repro.

Reproduce

There is a testing repo here with instructions on how to test this and reproduce the issue in the README.
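For a rough idea of the layout being described, a small Python sketch like the following can generate it. This is not the script from the testing repo: the function name make_monorepo, the project_NNNN directory names and the trivial echo stage are all assumptions made for illustration. The root still needs git init and dvc init run separately so that the single .dvc directory sits at the top of the monorepo.

```python
from pathlib import Path

# Minimal dvc.yaml with one trivial stage.
# Illustrative only; not taken from the testing repo.
DVC_YAML = """\
stages:
  hello:
    cmd: echo hello
"""


def make_monorepo(root, n_projects, with_dvc_yaml=True):
    """Create n_projects subdirectories under root.

    Each subdirectory gets its own dvc.yaml unless with_dvc_yaml is False,
    which mimics the case of projects without any dvc.yaml file.
    Returns the list of created project directories.
    """
    root = Path(root)
    created = []
    for i in range(n_projects):
        project = root / f"project_{i:04d}"  # hypothetical naming scheme
        project.mkdir(parents=True, exist_ok=True)
        if with_dvc_yaml:
            (project / "dvc.yaml").write_text(DVC_YAML)
        created.append(project)
    return created
```

Calling make_monorepo("monorepo", 1000) would give the 1000-project case, and with_dvc_yaml=False gives sibling directories with no dvc.yaml at all.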

Expected

I would expect dvc repro to scan only the directory containing the given dvc.yaml file (and its subdirectories) and not go through the entire directory tree. The same goes for dvc exp run.

Additional Information (if any):

Here are some logs that I generated with verbose runs of dvc repro and dvc exp run. The first two are from runs inside a single project in a monorepo with 5 projects in total (all of them with their own dvc.yaml). The last one is from a monorepo with 2 projects, one of which does not contain any dvc.yaml file at all.

dvc_repro.log
dvc_exp_run.log
dvc_exp_run_projects_wo_dvc.log

As mentioned in Slack, the solution here is to use -s (--single-item), for example dvc exp run -s dvc.yaml.
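To check whether -s actually helps in a given monorepo, the two invocations can be timed side by side. The sketch below is only a measurement helper under stated assumptions: time_command, FULL and SINGLE are hypothetical names, and the commands assume dvc is on the PATH and that they are run from a project subdirectory containing dvc.yaml.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd and return (elapsed wall-clock seconds, exit code)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.returncode


# The two invocations to compare: without and with -s / --single-item.
FULL = ["dvc", "repro", "dvc.yaml"]
SINGLE = ["dvc", "repro", "-s", "dvc.yaml"]
```

For example, time_command(FULL, cwd="monorepo/project_0000") followed by time_command(SINGLE, cwd="monorepo/project_0000") gives the two timings, where the cwd path is a placeholder for one of the project subdirectories. A single run is noisy, so repeating each measurement a few times gives a fairer comparison.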