CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning

Automatic batch sizing

kylebgorman opened this issue

PyTorch Lightning supports a mode where it automatically computes the maximum batch size on your accelerator (by collecting gradients over a few batches and then using binary search to find a maximum that doesn't give OOM errors). Documentation is here. It appears straightforward to implement (see the "Note" in the docs).

Having this enabled would be very useful for computing the maximum batch size possible. From this one could factor the desired batch size and accumulate gradients across multiple batches per optimizer step, so this is related to #132 in the obvious way.
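
For reference, here is a minimal sketch of how the PTL tuner can be invoked, assuming the PTL 2-style Tuner API and a model or datamodule that exposes a batch_size attribute (the yoyodyne-specific wiring is omitted):

import lightning.pytorch as pl
from lightning.pytorch.tuner import Tuner


def find_max_batch_size(model: pl.LightningModule, datamodule: pl.LightningDataModule):
    # Throwaway trainer used only for tuning.
    trainer = pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False)
    tuner = Tuner(trainer)
    # Binary-searches the largest batch size that runs without OOM; the winning
    # value is also written back onto the object that owns `batch_size`.
    return tuner.scale_batch_size(model, datamodule=datamodule, mode="binsearch")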

This will also combine nicely with a move to LightningCLI, IMO, since that uses a subcommand interface and we could do something like yoyodyne scale_batch_size --arg1 --arg2 to enable this.

commented

I think I am on the hook to finally do this. Current thoughts after looking at previous discussions:

  • A user specifies a --batch_size
  • Optionally, a user can specify --max_batch_size, in which case we compute something like batch_size_prime and a number of gradient accumulation steps k so that we never put more than max_batch_size on the GPU at a time. With these plus PTL features, we can ask the trainer to accumulate gradients, thus automatically simulating an effective batch size of batch_size.
    - I guess we can use your implementation here: https://gist.github.com/kylebgorman/178f6fe9a7b83286d0d30c8136575c81 for that (I have not looked at it closely FYI).
  • Instead of --max_batch_size, a user can specify something like --find_max_batch_size. Then we set max_batch_size automatically using the PTL feature, and if batch_size > max_batch_size, we compute batch_size_prime and k to accumulate gradients automatically.

This means we add two new args, --max_batch_size and --find_max_batch_size, where --max_batch_size should not be set if --find_max_batch_size is requested. I think supporting setting --max_batch_size manually makes sense, since the finding step can be skipped if I have already found the value for my experiment and GPU and just want to enter it manually instead of spending compute.
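
A possible shape for the flag wiring (just an argparse illustration; the actual flag registration in yoyodyne may look different):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, help="Desired (effective) batch size.")
# The two new flags are mutually exclusive: set one or the other, not both.
group = parser.add_mutually_exclusive_group()
group.add_argument("--max_batch_size", type=int,
                   help="Largest batch size that fits on the accelerator, if already known.")
group.add_argument("--find_max_batch_size", action="store_true",
                   help="Ask the PTL tuner to find the largest batch size that fits.")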

Does this seem right? I think this could go in a combination of util.py and train.py, but we can make another module too if need be.

Let's define two things: "notional batch size" is the number of elements per training step, and "effective batch size" is the size of the minibatches of which there is at least one per step. Then:

(A) effective batch size x N = notional batch size

where N is a positive integer. Users need to be able to:

  1. train by specifying effective batch size and N
  2. find the largest batch size possible, and
  3. compute, for a given notional batch size, the largest effective batch size (and the smallest N) that satisfies (A).

Currently, effective batch size should be specified with --batch_size and N with --accumulate_grad_batches. That seems fine and satisfies (1).

In a perfect world, I think (2) would be done with a separate CLI tool, maybe called yoyodyne-tune. It would inherit all, or nearly all, of the flags from the trainer and just call the tuner instead. (If this is done I can also move the --auto_lr_find tuner into this CLI tool too.) One issue in the design of (2) is how it can be integrated into the sweeping code. It is my contention that we would want to sample --max_batch_size rather than batch size during sweeps, apply (2) and (3), and then train with that. And that needs to be feasible. Any thoughts on this issue?

As I understand it you're proposing putting (3) in (1) which makes it more general; i.e., you can use it without using (2), e.g. if you think you know what the answer to (2) is, and I think I agree. So then you'd just add --max_batch_size as a flag as you propose.

I think yes, you'd stash this all in train.py unless you're creating a new binary, in which case you'd import a lot of the functionality from train.py (and probably lightly refactor it).

My implementation of the solver for (A) appears to work. I don't know if the brute force or elegant solver is better---but I can do a timing experiment.

One last thing, you may want to look at the PTL 2 interface to make sure it's not radically different.

Timing results below.

Elegant solution

$ python -m timeit -r 10 -s 'import size; import random; random.seed(1985)' 'size.size_elegant(random.randrange(2 ** 7, 2 ** 12, 2 ** 7), random.randrange(16, 1000))'
50000 loops, best of 10: 6.37 usec per loop

Brute force solution

$ python -m timeit -r 10 -s 'import size; import random; random.seed(1985)' 'size.size_brute_force(random.randrange(2 ** 7, 2 ** 12, 2 ** 7), random.randrange(16, 1000))'
50000 loops, best of 10: 4.17 usec per loop

Conclusions

Brute force is always a little faster, even when I try different ranges of values, and that is without accounting for the time spent importing sympy. (I believe sympy is already an indirect dependency of our system, since I see it getting installed when I install Yoyodyne on a fresh system.) So I would be fine just using it if you want.

We can note that one can solve this elegantly by computing all the divisors of the notional batch size, but we find that a simple brute-force search from the top down is slightly faster.
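
For concreteness, the two solvers might look roughly like the sketch below (the gist's actual code may differ; I am assuming each function takes the notional batch size and a cap, and returns the largest effective batch size that divides the notional size and does not exceed the cap):

import sympy


def size_elegant(notional: int, max_size: int) -> int:
    # Enumerate every divisor of the notional batch size and keep the largest
    # one that fits under the cap.
    return max(d for d in sympy.divisors(notional) if d <= max_size)


def size_brute_force(notional: int, max_size: int) -> int:
    # Walk down from the cap and return the first divisor encountered.
    for candidate in range(min(notional, max_size), 0, -1):
        if notional % candidate == 0:
            return candidate
    return 1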

commented

Let's define two things: "notional batch size" is the number of elements per training step, and "effective batch size" is the size of the minibatches of which there is at least one per step. Then:

(A) effective batch size x N = notional batch size

I think for consistency we may want to define them the other way around. See the PTL docs (https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html): "Accumulated gradients run K small batches of size N before doing a backward pass. The effect is a large effective batch size of size KxN, where N is the batch size." I have seen this usage of "effective batch size" in HF argument names too, I think.

where N is a positive integer. Users need to be able to:

  1. train by specifying effective batch size and N
  2. find the largest batch size possible, and
  3. compute, for a given notional batch size, the largest effective batch size (and the smallest N) that satisfies (A).

Currently, effective batch size should be specified with --batch_size and N with --accumulate_grad_batches. That seems fine and satisfies (1).

In a perfect world, I think (2) would be done with a separate CLI tool, maybe called yoyodyne-tune. It would inherit all, or nearly all, of the flags from the trainer and just call the tuner instead. (If this is done I can also move the --auto_lr_find tuner into this CLI tool too.) One issue in the design of (2) is how it can be integrated into the sweeping code. It is my contention that we would want to sample --max_batch_size rather than batch size during sweeps, apply (2) and (3), and then train with that. And that needs to be feasible. Any thoughts on this issue?

Yeah, previously I had a script for running sweeps that does (3) when converting sweep hyperparameters to Yoyodyne training, but to do this I set --max_batch_size manually. To combine this with (2), don't we just need to call the Lightning batch_size_finder, which presumably returns max_batch_size for us? Basically we would need to init a trainer with the batch size finder callback, call fit on it, and then figure out how to parse the max batch size out of that interface (probably trivial, though I did not see it in the docs).
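
For what it's worth, the callback route might look something like this (PTL 2 style; the read-back at the end is an assumption on my part, since the finder overwrites the batch_size attribute rather than returning a value):

import lightning.pytorch as pl
from lightning.pytorch.callbacks import BatchSizeFinder


def tune_max_batch_size(model, datamodule):
    # The finder runs a short binary search at the start of fit() and writes
    # the winning value onto the model's or datamodule's `batch_size`.
    trainer = pl.Trainer(
        max_epochs=1,
        callbacks=[BatchSizeFinder(mode="binsearch")],
        logger=False,
        enable_checkpointing=False,
    )
    trainer.fit(model, datamodule=datamodule)
    return datamodule.batch_size  # assuming the datamodule owns `batch_size`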

As I understand it you're proposing putting (3) in (1) which makes it more general; i.e., you can use it without using (2), e.g. if you think you know what the answer to (2) is, and I think I agree. So then you'd just add --max_batch_size as a flag as you propose.

Yes!

I think yes, you'd stash this all in train.py unless you're creating a new binary, in which case you'd import a lot of the functionality from train.py (and probably lightly refactor it).

My implementation of the solver for (A) appears to work. I don't know if the brute force or elegant solver is better---but I can do a timing experiment.

One last thing, you may want to look at the PTL 2 interface to make sure it's not radically different.

For the batch size finder?

I think for consistency we may want to define them the other way around.

Yea, I realize these are not great terms. Redo them whatever way you think is best.

Yeah, previously I had a script for running sweeps that does (3) when converting sweep hyperparameters to Yoyodyne training, but to do this I set --max_batch_size manually. To combine this with (2), don't we just need to call the Lightning batch_size_finder, which presumably returns max_batch_size for us? Basically we would need to init a trainer with the batch size finder callback, call fit on it, and then figure out how to parse the max batch size out of that interface (probably trivial, though I did not see it in the docs).

Yeah, the question is whether you want a batch size finder and then parse its output, or pack both of them in the trainer.

One last thing, you may want to look at the PTL 2 interface to make sure it's not radically different.

For the batch size finder?

Yes.

commented

Yeah, previously I had a script for running sweeps that does (3) when converting sweep hyperparameters to Yoyodyne training, but to do this I set --max_batch_size manually. To combine this with (2), don't we just need to call the Lightning batch_size_finder, which presumably returns max_batch_size for us? Basically we would need to init a trainer with the batch size finder callback, call fit on it, and then figure out how to parse the max batch size out of that interface (probably trivial, though I did not see it in the docs).

Yeah, the question is whether you want a batch size finder and then parse its output, or pack both of them in the trainer.

Sorry, so you mean we should combine (2) and (3) by subclassing the PTL trainer? I guess this factors into your comments about a separate CLI, yoyodyne-tune. Is this coupled to a particular trainer class? I can look into that. I had been thinking of this as simply adding a Python function to something like train.py, which first calls the trainer to get max_batch_size (or skips that step if a max_batch_size has already been requested), and then sets args on the "real" trainer for accumulating gradients.

One last thing, you may want to look at the PTL 2 interface to make sure it's not radically different.

For the batch size finder?

Yes.

Ok will do.

Sorry, so you mean we should combine (2) and (3) by subclassing the PTL trainer? I guess this factors into your comments about a separate CLI, yoyodyne-tune. Is this coupled to a particular trainer class? I can look into that. I had been thinking of this as simply adding a Python function to something like train.py, which first calls the trainer to get max_batch_size (or skips that step if a max_batch_size has already been requested), and then sets args on the "real" trainer for accumulating gradients.

I don't think this needs to be done with subclassing. In PTL 2 it's now a callback passed to the trainer. In PTL 1.9 it is done with an argument to the trainer constructor. So I think either way you want to construct the trainer object before you find the size...

commented

Sorry, so you mean we should combine (2) and (3) by subclassing the PTL trainer? I guess this factors into your comments about a separate CLI, yoyodyne-tune. Is this coupled to a particular trainer class? I can look into that. I had been thinking of this as simply adding a Python function to something like train.py, which first calls the trainer to get max_batch_size (or skips that step if a max_batch_size has already been requested), and then sets args on the "real" trainer for accumulating gradients.

I don't think this needs to be done with subclassing. In PTL 2 it's now a callback passed to the trainer. In PTL 1.9 it is done with an argument to the trainer constructor. So I think either way you want to construct the trainer object before you find the size...

Right, I meant that I thought you were saying to combine (2) and (3) inside the trainer, which PTL does not already do as far as I can tell. I can just try to add this to train.py by first calling the trainer with the batch size finder callback/arg, then setting batch_size and the accumulation steps for the real trainer, and we can discuss more in a PR?

SGTM. I don't know what the right solution is yet.

Still catching up on the PR; useful code may be how we implement transducers for ASR. (We need to mini-batch for the grad accumulation since it's such a memory hog.) https://github.com/NVIDIA/NeMo/blob/9218c3aab7af7c2d7f3d6e45c0b027bafe25eba8/nemo/collections/asr/modules/rnnt.py#L1420

Summarizing an offline conversation I had with @bonham79, I think the following is true:

  • It would be nice to have a way to compute max batch size without also training.
  • However, it is essential to have an ability to compute batch size within the trainer. If this isn't inside the trainer, it won't be possible to do this (repeatedly, as is necessary) during a sweep as the size of the model and/or desired batch size changes on each iteration of the sweep.
  • One obvious synthesis is to put the automatic batch size computation into the trainer; it is computed and logged first. Then, if actual training isn't disabled (one might set --max_epochs 0 or something like that if you don't really want to train), this value is combined with the desired batch size flag to compute and set up the gradient accumulation, and training proceeds under those settings (a rough sketch follows below).
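
A rough sketch of that flow (all names are hypothetical, and the real wiring would live in train.py or a trainer wrapper):

import logging

import lightning.pytorch as pl
from lightning.pytorch.tuner import Tuner


def train_with_auto_batch_size(args, model, datamodule):
    # First, find and log the largest batch size that fits on the accelerator.
    probe = pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False)
    max_batch_size = Tuner(probe).scale_batch_size(
        model, datamodule=datamodule, mode="binsearch"
    )
    logging.info("Maximum batch size: %d", max_batch_size)
    if args.max_epochs == 0:
        return  # the user only wanted the number, not a training run
    # Largest per-step batch <= the max that divides the desired batch size.
    per_step = max(
        b
        for b in range(1, min(args.batch_size, max_batch_size) + 1)
        if args.batch_size % b == 0
    )
    # Train, accumulating gradients to simulate the desired batch size.
    datamodule.batch_size = per_step
    trainer = pl.Trainer(
        max_epochs=args.max_epochs,
        accumulate_grad_batches=args.batch_size // per_step,
    )
    trainer.fit(model, datamodule=datamodule)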

Hmm, thinking about this. I think this may be a case where we need to add a wrapper around the default PTL trainer instead of stashing this in train.py or subclassing. That should give us more control during the Wandb sweeps and allow us to alias stuff (e.g., we could pass a new flag --find-batch-size that just sets --max_epochs 0 internally). We'd probably have to do this anyway. (The libraries I've been seeing keep needing to rewrite PTL anyhow.)

What are the explicit advantages to doing this with a "wrapper" as opposed to just rolling it into our train methods?

Most PTL code interactions use the trainer as the interface, so it would give us more control over how downstream APIs and libraries engage with our models.