f-dangel / backpack

BackPACK - a backpropagation package built on top of PyTorch which efficiently computes quantities other than the gradient.

Home Page: https://backpack.pt/

[Not working] Individual Hessian-vector products with `BatchGrad`

haonan3 opened this issue

Hi,

Thanks for this great library. I am wondering how we can compute the gradient of a sum of gradients.

I am trying to implement a Hessian-vector product (HVP) with the following code:

def batch_hvp(self, model, loss, params_list, batch_grad_list):
    if len(params_list) != len(batch_grad_list):
        raise (ValueError("w and v must have the same length."))

    one_sample_grad_list = grad(loss, params_list, retain_graph=True, create_graph=True)

    elemwise_products = 0
    for grad_elem, v_elem in zip(one_sample_grad_list, batch_grad_list):
        sum_over_dims = []
        for i in range(len(v_elem.shape)):
            sum_over_dims.append(i)
        sum_over_dims = tuple(sum_over_dims[1:])
        elemwise_products += torch.sum(grad_elem.unsqueeze(0) * v_elem.detach(), sum_over_dims)

    with backpack(BatchGrad()):
        elemwise_products.backward()   # problem: has no attribute 'input0'
        return_grads = [p.grad_batch for p in model.parameters() if p.requires_grad]

    return return_grads

I encounter the "has no attribute 'input0'" error when I call backward(). Is it possible to get batch gradients for return_grads?

For now, I am using a for loop to compute the gradients:

def batch_hvp(self, model, loss, params_list, batch_grad_list):
    if len(params_list) != len(batch_grad_list):
        raise (ValueError("w and v must have the same length."))

    one_sample_grad_list = grad(loss, params_list, retain_graph=True, create_graph=True)

    elemwise_products = 0
    for grad_elem, v_elem in zip(one_sample_grad_list, batch_grad_list):
        sum_over_dims = []
        for i in range(len(v_elem.shape)):
            sum_over_dims.append(i)
        sum_over_dims = tuple(sum_over_dims[1:])
        elemwise_products += torch.sum(grad_elem.unsqueeze(0) * v_elem.detach(), sum_over_dims)
    
    # The for-loop version 
    grad_cache = []
    for i in range(elemwise_products.shape[0]):
        elemwise_products[i].backward(retain_graph=True)
        grad_cache.append([p.grad.clone() for p in model.parameters() if p.requires_grad])
    grad_cache = list(zip(*grad_cache))
    return_grads = []
    for l_id in range(len(grad_cache)):
        return_grads.append(torch.cat([g.unsqueeze(0) for g in grad_cache[l_id]], dim=0))

    return return_grads

Thanks in advance!

Hi,
thanks for providing a detailed description of the problem. I have been able to reproduce the issue.
The issue is that BackPACK ignores retain_graph. During the first backward pass, it deletes its internal variables input0 and output, so they are missing for the second backward pass.
For now, you can try the following workarounds:

  1. If the model is deterministic (no BatchNorm, Dropout, ..., or in evaluation mode), you can do a second forward pass (see the sketch after this list):

     one_sample_grad_list = grad(loss, params_list, retain_graph=True, create_graph=True)  # first backward
     _ = loss_fn(model(inputs), targets)  # second forward so BackPACK saves its internal variables again

  2. Manually prevent deletion of BackPACK's quantities. Change

     memory_cleanup(module)

     to

     pass  # memory_cleanup(module)

  3. Wait until BackPACK supports retain_graph.
     I already tried to implement it properly. My main issue is that I need to query the value of retain_graph from within the backward hook, but I did not find a way to do so in PyTorch. I was expecting something along the lines of torch.autograd.global_variables.retain_graph. If you have an idea, please let me know.
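
For concreteness, here is a minimal sketch of workaround 1. The model, loss, data, and the vector (here simply the detached gradient) are made-up placeholders, not code from this issue, and whether the resulting grad_batch values are meaningful batch HVPs is a separate question discussed further down in this thread:

import torch
from torch.autograd import grad
from backpack import backpack, extend
from backpack.extensions import BatchGrad

model = extend(torch.nn.Linear(10, 1))  # toy deterministic model
loss_fn = extend(torch.nn.MSELoss())
inputs, targets = torch.rand(8, 10), torch.rand(8, 1)
params_list = [p for p in model.parameters() if p.requires_grad]

loss = loss_fn(model(inputs), targets)

# First backward: differentiable gradients for the double backward.
one_sample_grad_list = grad(loss, params_list, retain_graph=True, create_graph=True)

# Second forward: lets BackPACK store input0/output again,
# since it deleted them during the first backward.
_ = loss_fn(model(inputs), targets)

# Second backward with BackPACK active should no longer raise the 'input0' error.
elemwise_products = sum((g * g.detach()).sum() for g in one_sample_grad_list)
with backpack(BatchGrad()):
    elemwise_products.backward()
grad_batch_list = [p.grad_batch for p in params_list]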

I hope this clarified things a bit.

Hi,

Thank you very much. I think the second workaround is better. After removing the memory_cleanup call, I just need to call it manually via model.apply(memory_cleanup) in the other places where I use BackPACK to get grad_batch.

Thanks!

Hi @haonan3,

did you verify that your results agree with the batch-wise HVPs obtained via a for-loop?

Hi again,

I think your approach for computing IHVPs does not work correctly whenever a parameter is used not only in the forward pass, but also in the gradient computation. Here is an example.

Best,
Felix

Hi @f-dangel,

Thanks for your comments. I didn't check whether the result obtained from the batch-wise HVPs equals the result from the for-loop. By manually manipulating memory_cleanup, I can get the correct shape (again, I didn't check the values). But I think I will try the method mentioned in the example.

Thanks again for your comments!

Best,
Haonan

Hi again,

just to make sure: your approach via BatchGrad will produce incorrect results (even though the shapes might be correct). Use Hessian-vector products via autograd to compute correct IHVPs.
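
A minimal sketch of such an autograd reference (a per-sample for-loop; all names here are placeholders, not BackPACK API) could look like:

import torch
from torch.autograd import grad

def individual_hvps(model, loss_fn, x, y, v_list):
    # Returns, per parameter, a tensor of shape (batch, *param.shape) holding H_i v,
    # where H_i is the Hessian of the loss on sample i and v is a fixed vector.
    params = [p for p in model.parameters() if p.requires_grad]
    per_param = [[] for _ in params]
    for i in range(x.shape[0]):
        loss_i = loss_fn(model(x[i : i + 1]), y[i : i + 1])
        grads_i = grad(loss_i, params, create_graph=True)  # differentiable gradient of sample i
        dot = sum((g * v.detach()).sum() for g, v in zip(grads_i, v_list))
        hvp_i = grad(dot, params)  # second backward: Hessian-vector product
        for buf, h in zip(per_param, hvp_i):
            buf.append(h.detach())
    return [torch.stack(buf) for buf in per_param]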

Best,
Felix

Hi Felix,

Thanks for the great example. Does this mean we cannot compute HVPs in batch mode for now? According to the example, if there is more than one layer, the batch HVP will differ from the for-loop HVP.

Best,
Haonan

Yes, that's correct. In general you cannot compute batch-HVPs with BackPACK.

Based on the example, I found a method (not fully batch-wise) that computes HVPs faster than the for-loop version. In addition, this method gives the same results as the for-loop one.

def batch_hvp_v2(model, loss_func, x, y, v_list):
    model.zero_grad()
    params_list = [p for p in model.parameters() if p.requires_grad]
    if len(params_list) != len(v_list):
        raise (ValueError("w and v must have the same length."))

    loss = loss_func(model(x), y)
    with backpack(BatchGrad(), retain_graph=True):
        loss.backward(retain_graph=True, create_graph=True)
        one_sample_grad_list = [p.grad_batch for p in params_list]
    model.zero_grad()

    elemwise_products = 0
    for grad_elem, v_elem in zip(one_sample_grad_list, v_list):
        v_elem = v_elem.unsqueeze(0)
        elemwise_products = elemwise_products + (grad_elem * v_elem).reshape(grad_elem.shape[0], -1).sum(dim=-1)

    return_grads_cache = []
    for i in elemwise_products:
        i.backward(retain_graph=True)
        temp_grads = [p.grad.clone() for p in model.parameters() if p.requires_grad]
        return_grads_cache.append(temp_grads)
        model.zero_grad()

    return_grads = []
    num_layer = len(params_list)
    for l in range(num_layer):
        temp_grad_list = []
        for temp_grads in return_grads_cache:
            temp_grad_list.append(temp_grads[l].unsqueeze(0))
        return_grads.append(torch.cat(temp_grad_list, dim=0))

    return return_grads

Hey, I did not look at your code in detail, but in principle it should be possible to perform the first backward pass for individual gradients with BackPACK, then use the individual gradients in a for-loop to obtain IHVPs via a second backward pass. Please make sure to double-check everything with autograd.
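
For instance, such a double-check could compare the candidate implementation against a per-sample autograd reference (e.g. the individual_hvps sketch further up; batch_hvp_v2 is the function from the previous comment, and the tolerances are arbitrary):

reference = individual_hvps(model, loss_func, x, y, v_list)
candidate = batch_hvp_v2(model, loss_func, x, y, v_list)
for ref, cand in zip(reference, candidate):
    assert torch.allclose(ref, cand, rtol=1e-4, atol=1e-6)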

Best,
Felix