davidtvs / pytorch-lr-finder

A learning rate range test implementation in PyTorch

I'm getting a blank graph

Shreeyak opened this issue · comments

I'm running a semantic segmentation model, DeepLabv3+, with a modified CrossEntropyLoss and either an SGD or Adam optimizer.
When I run the LRFinder, I get a blank graph with no losses shown, even though I printed the losses and the criterion is definitely returning valid values.

Sweeping across start_lr = 1e-07 and end_lr = 0.0001
  0%|                                                                                                                          | 0/10 [00:00<?, ?it/s]
loss:  tensor(89984., device='cuda:0', grad_fn=<DivBackward0>)
 10%|███████████▍                                                                                                      | 1/10 [00:06<00:54,  6.01s/it]
loss:  tensor(1588043.6250, device='cuda:0', grad_fn=<DivBackward0>)
 20%|██████████████████████▊                                                                                           | 2/10 [00:09<00:40,  5.12s/it]
loss:  tensor(420687.0938, device='cuda:0', grad_fn=<DivBackward0>)
 30%|██████████████████████████████████▏                                                                               | 3/10 [00:12<00:31,  4.50s/it]
loss:  tensor(653955.4375, device='cuda:0', grad_fn=<DivBackward0>)
 40%|█████████████████████████████████████████████▌                                                                    | 4/10 [00:15<00:24,  4.07s/it]
loss:  tensor(141592.6875, device='cuda:0', grad_fn=<DivBackward0>)
 50%|█████████████████████████████████████████████████████████                                                         | 5/10 [00:18<00:18,  3.76s/it]
loss:  tensor(97450.2891, device='cuda:0', grad_fn=<DivBackward0>)
 60%|████████████████████████████████████████████████████████████████████▍                                             | 6/10 [00:21<00:14,  3.55s/it]
loss:  tensor(160497.9375, device='cuda:0', grad_fn=<DivBackward0>)
 70%|███████████████████████████████████████████████████████████████████████████████▊                                  | 7/10 [00:24<00:10,  3.44s/it]
loss:  tensor(151121.3594, device='cuda:0', grad_fn=<DivBackward0>)
 80%|███████████████████████████████████████████████████████████████████████████████████████████▏                      | 8/10 [00:27<00:06,  3.38s/it]
loss:  tensor(123211.6484, device='cuda:0', grad_fn=<DivBackward0>)
 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▌           | 9/10 [00:31<00:03,  3.40s/it]
loss:  tensor(98576.7578, device='cuda:0', grad_fn=<DivBackward0>)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:34<00:00,  3.43s/it]
Learning rate search finished. See the graph with {finder_name}.plot()

Lemme know what other details I can attach.

My criterion:

import torch
import torch.nn as nn


def cross_entropy2d(logit, target, ignore_index=255, weight=None, batch_average=True):
    r"""
    Pixel-wise cross-entropy, summed over all pixels and (optionally)
    averaged over the batch:

    .. math::
        \text{loss} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i} \text{CE}(x_{n,i}, y_{n,i})

    where ``logit`` has shape `(minibatch, C, d_1, d_2, ..., d_K)`.

    Args:
        logit (Tensor): Output of the network.
        target (Tensor): Ground-truth labels.
        ignore_index (int, optional): Defaults to 255. Pixels with this label do not contribute to the loss.
        weight (List, optional): Defaults to None. Weight assigned to each class.
        batch_average (bool, optional): Defaults to True. Whether to divide the summed loss by the batch size.

    Returns:
        Tensor: The value of the loss.
    """

    n, c, h, w = logit.shape
    target = target.squeeze(1)

    if weight is None:
        criterion = nn.CrossEntropyLoss(ignore_index=ignore_index, reduction='sum')
    else:
        criterion = nn.CrossEntropyLoss(weight=torch.tensor(weight, dtype=torch.float32),
                                        ignore_index=ignore_index,
                                        reduction='sum')

    loss = criterion(logit, target.long())

    if batch_average:
        loss /= n

    return loss
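As an aside, the large loss values in the log (on the order of 1e5) are expected with reduction='sum': the loss is summed over every pixel and divided only by the batch size. A minimal check with toy shapes (a sketch assuming only torch is installed; the shapes and data here are made up, not from the actual model):

```python
import torch
import torch.nn as nn

# toy batch: 2 images, 3 classes, 8x8 pixels
logit = torch.randn(2, 3, 8, 8, requires_grad=True)
target = torch.randint(0, 3, (2, 1, 8, 8))

# summed-then-batch-averaged loss, as in cross_entropy2d above
sum_criterion = nn.CrossEntropyLoss(reduction='sum')
loss = sum_criterion(logit, target.squeeze(1).long()) / logit.shape[0]

# default per-pixel mean, for comparison
mean_loss = nn.CrossEntropyLoss()(logit, target.squeeze(1).long())

# the summed version is exactly h*w = 64 times the per-pixel mean here
# (no ignored pixels, no class weights)
assert abs(loss.item() / mean_loss.item() - 64.0) < 1e-3

loss.backward()  # gradients flow, so LRFinder can train against this loss
```

The loss magnitude itself doesn't cause the blank graph, but it explains the ~1e5 values in the log above.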

I even tried running with the default CrossEntropyLoss, which gives loss values < 1. Still a blank graph:

def cross_entropy2d_lrfinder(logit, target, ignore_index=255, weight=None, batch_average=True):
    criterion = nn.CrossEntropyLoss()
    loss = criterion(logit, target.long())

    print('loss: ', loss)
    return loss
Sweeping across start_lr = 1e-07 and end_lr = 0.0001
  0%|                                                                                                                           | 0/3 [00:00<?, ?it/s]
loss:  tensor(0.7474, device='cuda:0', grad_fn=<NllLoss2DBackward>)
 33%|██████████████████████████████████████▎                                                                            | 1/3 [00:05<00:11,  5.54s/it]
loss:  tensor(0.7463, device='cuda:0', grad_fn=<NllLoss2DBackward>)
 67%|████████████████████████████████████████████████████████████████████████████▋                                      | 2/3 [00:08<00:04,  4.79s/it]
loss:  tensor(0.7460, device='cuda:0', grad_fn=<NllLoss2DBackward>)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:11<00:00,  3.88s/it]
Learning rate search finished. See the graph with {finder_name}.plot()

Hi @Shreeyak

That's quite weird. Can you try printing out lr_finder.history to see whether any values were recorded?

Yes, thanks!
Here's the output of lr_finder.history from a short run:

{'lr': [9.999999999999997e-06, 0.0001, 0.0009999999999999996], 'loss': [47049.02734375, 47058.836328125, 47008.379277343745]}

I'm using a conda env, with pytorch 1.5

torch                     1.5.0                    pypi_0    pypi
torch-lr-finder           0.1.5                    pypi_0    pypi
torchvision               0.6.0                    pypi_0    pypi

It seems losses are recorded properly. 🤔

Back to the original post: is the num_iter argument in lr_finder.range_test() perhaps too small for anything to be plotted? lr_finder.plot() has two default arguments, skip_start=10 and skip_end=5. You can try setting them both to 0 and re-plotting.

Oh, I think I've figured it out.

As mentioned in my previous comment, lr_finder.plot() has two default arguments, skip_start=10 and skip_end=5, which means the num_iter used in lr_finder.range_test() should be at least 15 (skip_start + skip_end). Otherwise, no values are left to plot after the history is trimmed.
And in the original post, num_iter seems to be only 10 (judging from the progress bar), so that's probably why nothing appeared in the graph.
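In other words, plot() trims the history before drawing anything. A simplified sketch of that trimming (illustrative, not the library's exact code):

```python
def trim_history(lrs, losses, skip_start=10, skip_end=5):
    # drop the first skip_start and last skip_end entries,
    # mirroring what lr_finder.plot() does before drawing
    end = len(lrs) - skip_end if skip_end > 0 else len(lrs)
    return lrs[skip_start:end], losses[skip_start:end]

# with num_iter=10 and the defaults, nothing survives: a blank graph
assert trim_history(list(range(10)), list(range(10))) == ([], [])

# with num_iter=100, 85 points remain
assert len(trim_history(list(range(100)), list(range(100)))[0]) == 85

# skip_start=0, skip_end=0 keeps everything
assert trim_history([1, 2, 3], [4, 5, 6], skip_start=0, skip_end=0) == ([1, 2, 3], [4, 5, 6])
```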

That's odd. Your suggestion worked (thanks!), but there are a few issues:

  1. I'm sure I got a blank graph when I ran initially with num_iter=100. Lemme run again and get back to you on that - I just reverted some changes that I'd made when reporting this error.

  2. I passed in a range start_lr=1e-7 and end_lr=1e-4, but the graph is showing loss values for 1e-5, 1e-4 and 1e-3. Why doesn't it start at 1e-6?
    Figure_2

  3. How do I get the lr_finder to run multiple batches for each "iteration"? I'd assumed that num_iter would control the number of batches/iterations for each value of lr in the given range.
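On the lr values in point 2: range_test() spaces learning rates exponentially by default (step_mode="exp"), so consecutive points jump by a constant factor, which shows up as decades on a log axis. A rough sketch of that spacing (an illustrative approximation, not the library's exact schedule):

```python
def exp_lr_schedule(start_lr, end_lr, num_iter):
    # constant multiplicative step from start_lr to end_lr, inclusive
    ratio = (end_lr / start_lr) ** (1.0 / (num_iter - 1))
    return [start_lr * ratio ** i for i in range(num_iter)]

# e.g. 4 iterations from 1e-7 to 1e-4 gives approximately
# [1e-07, 1e-06, 1e-05, 1e-04]
lrs = exp_lr_schedule(1e-7, 1e-4, 4)
```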

Uh, now passing num_iter=100 is giving me a graph, with and without passing skip_start and skip_end to lr_finder.plot(). Dunno, maybe I'd messed something up in my prev run. Here are the figures for both runs:
Figure_3
Figure_4

Thanks for the fast response and resolution.

Okay, glad it's resolved. 😊
Feel free to reopen this issue if the same error occurs again.