klightz / Firefly

Official repo for Firefly Neural Architecture Descent: a General Approach for Growing Neural Networks. Accepted at NeurIPS 2020.


OutOfMemoryError in every run

vignesh99 opened this issue · comments

Hello, I am trying to run the code in this repository to replicate the results of your paper. I used a command similar to the one presented in the README, python main.py --method fireflyn --model vgg19 --grow_ratio 0.3. However, I consistently face this error (I tried different values of grow_ratio and n_elites):

Traceback (most recent call last):
  File "firefly20/main.py", line 222, in <module>
    run(trainset, trainloader, testloader, config)
  File "firefly20/main.py", line 173, in run
    n_neurons = model.split(config.method, trainset)
  File "irefly20/sp/net.py", line 116, in split
    split_fn[split_method](dataset, n_batches)
  File "firefly20/model.py", line 265, in spffn
    loss.backward()
  File "anaconda3/envs/torch/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "anaconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward 
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.24 GiB (GPU 0; 11.91 GiB total capacity; 3.50 GiB already allocated; 409.44 MiB free; 7.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF 
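For reference, the last line of the error points at a fragmentation workaround that can be applied through an environment variable. A minimal sketch (the 128 MiB threshold is an arbitrary illustrative value; this only changes how the caching allocator splits blocks and does not add memory):

    import os
    # Must be set before this process first initializes CUDA.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # import after setting the variable so the allocator sees it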

It appears that the network is growing indefinitely. Is there a stopping criterion that I am supposed to enable? Please let me know how to fix this issue so that I can converge to a solution.


Hi, the growing ratio controls the number of grown "neurons", but that does not necessarily translate into a fixed increase in FLOPs, which is why the memory footprint can keep climbing. I believe the general solution is to use a GPU with more memory, or you can try model parallelism or data parallelism across multiple GPUs. Hope that helps!
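For reference, a minimal sketch of the data-parallel option, assuming the repo's Classifier is a standard nn.Module (this is stock PyTorch, not code from this repo):

    import torch
    import torch.nn as nn

    net = Classifier(config).to(config.device)  # the repo's model class
    if torch.cuda.device_count() > 1:
        # Replicates the model on every visible GPU and splits each batch across them.
        net = nn.DataParallel(net)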

Thanks for your response. I tried a GPU with double the memory of my earlier one and am still facing the same issue. I have a couple of questions for you:

Why does it take so much GPU memory when the grown network is supposed to have fewer parameters than the original full-sized VGG-19? The original VGG-19 runs without any out-of-memory issues. Am I missing something here?

It is in round 9 of splitting and still running. I saw that there is a "load_round" parameter in the configs. Am I supposed to use it to specify how many rounds to run? Will the network not determine the number of rounds from the grow_ratio? Or does the network run until it is out of memory, after which we pick a particular round's network based on its parameter count?

Can you please address these two questions? I want to run the code and reproduce the results from your paper.

Ah interesting, thanks.

Another issue I am facing is loading the saved checkpoints; I want to do this so that I can evaluate the performance of the trained network.
If I do the following:

    net = Classifier(config).to(config.device)
    #Load network
    checkpoint = torch.load("checkpoint/roundfull_10_experiment_cifar10_fireflyn_initdim16_seed0_grow0.350000_gra3_alpha3_new.pt")
    #ckpt = torch.load("checkpoint/roundfull_%d_%s.pt" % (load_round, exp_name))
    net.load_state_dict(checkpoint)

Then I receive an error saying that I am loading into the standard VGG-19 network (64 to 512 filters per layer), while my checkpoint contains layers with the specific widths produced by growing (e.g., 309 or 30 channels instead of 64 or 512). An excerpt of the error is given below:

...size mismatch for net.7.bn.running_var: copying a param with shape torch.Size([489]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for net.7.module.weight: copying a param with shape torch.Size([489, 693, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3])...
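For completeness, the full set of mismatched layers can be dumped with plain PyTorch (using the net built in the snippet above):

    # List every parameter whose shape differs between the checkpoint and
    # the freshly constructed model, i.e. the layers that were grown.
    ckpt = torch.load("checkpoint/roundfull_10_experiment_cifar10_fireflyn_initdim16_seed0_grow0.350000_gra3_alpha3_new.pt")
    for name, param in net.state_dict().items():
        if name in ckpt and ckpt[name].shape != param.shape:
            print(name, tuple(ckpt[name].shape), "vs", tuple(param.shape))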

If instead I follow what is done in part of your main.py (loading a .npy file to obtain a stats dictionary), as shown below:

    stats = np.load("checkpoint/roundfull_10_experiment_cifar10_fireflyn_initdim16_seed0_grow0.350000_gra3_alpha3_new.npy", allow_pickle=True)
    #print(stats)
    stats = stats.tolist()
    #Get the model using the config details
    net = Classifier(config, stats['cfg']).to(config.device)

    #Load network
    checkpoint = torch.load("checkpoint/roundfull_10_experiment_cifar10_fireflyn_initdim16_seed0_grow0.350000_gra3_alpha3_new.pt")
    #ckpt = torch.load("checkpoint/roundfull_%d_%s.pt" % (load_round, exp_name))
    net.load_state_dict(checkpoint)

I get the following error:

net = Classifier(config,stats['cfg']).to(config.device)
TypeError: __init__() takes 2 positional arguments but 3 were given

This is because Classifier's __init__ is defined to take only config as input; it does not accept a stats argument anywhere.

Because of these errors I cannot reconstruct the finalized architecture, and in turn cannot load the trained weights and evaluate its performance. Please let me know how you load saved models for testing.

Hello @Cranial-XIX @klightz ,
There are 3 other issues I faced while reproducing your results. Please give your feedback on the three questions below (and the one above) by this weekend, so that I can use your work for comparisons.

  • The starting point for the VGG-19 backbone (all layers with 16 filters) is not less than 2% of the full model, contrary to Figure 4a in the paper. My computation for the convolutional layers is shown below,

    complexity of the initial starting point (16 conv layers, 16 filters each): $C_{init} = (16\times16)\times(3\times3)$
    complexity of the full-sized network (layers $\times$ width, summed per stage): $C_{full} = [(2\times64) + (2\times128) + (4\times256) + (8\times512)]\times(3\times3)$
    initial size as a percentage of the full model: $\displaystyle\frac{C_{init}}{C_{full}} = \frac{16\times16}{(2\times64) + (2\times128) + (4\times256) + (8\times512)} = \frac{256}{5504} \approx 4.7\%$

Thus the initial point is $4.7\%$, not less than $2\%$. Can you please explain how you get the complexity of your network as $<2\%$? If you are including the fully-connected layers in the estimate, please explain how, by pointing to a section of the code (it is not spelled out in the paper).

  • Accuracy comes out much lower and, unlike in the paper, never beats the full-scale network:
    I obtained the accuracy values for the VGG-19 backbone. Some of the (% size of model, % accuracy) pairs are:
    $(5.7, 74.76)$
    $(7.6, 81.72)$
    $(10.1, 84.87)$
    $(13.46, 87.19)$
    Looking at these points, the accuracy appears to saturate and does not reach the baseline accuracy of $92\%$. In the paper, however, the algorithm is said to achieve $92\%$ accuracy at around $4\%$ of the full model size.

I am using the following command to generate the output: python main.py --method fireflyn --model vgg19 --grow_ratio 0.1, with all other parameters at their defaults. Each network trains for only 10 epochs before growing again.
Please suggest a way to fix the accuracy so that I can reproduce the plot in Figure 4a of the paper.

  • The ordering of the layers is out of place in the grown networks:
    This may be a minor issue, but in the "Current cfg" output the max-pooling layers appear one convolutional layer too early. For example, the code prints:
    [3, 17, 'M', 19, 25, 'M', 17, 19, 19, 18, 'M', 17, 19, 16, 16, 'M', 16, 16, 16, 16, 'M', 16]
    instead of following the standard stage layout
    [1, 1, 'M', 2, 2, 'M', 4, 4, 4, 4, 'M', 8, 8, 8, 8, 'M', 8, 8, 8, 8, 'M']
    You can ignore the leading '3' since it is the input dimension. If the layers really are arranged as in the former case, that would be a problem, and it could also explain the loading issue in #1 (comment). Please clarify this as well.

Hello,

Regarding the first loading issue, it seems to be caused by a version misalignment between the cleaned-up code and the original code. We plan to update this part soon. In the meantime, as a quick fix you can replace Line 24 of model.py with the cfg from the stats file you are loading.
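The same fix can also be done programmatically, under the assumption that model.py builds its layers from a module-level cfg list (the Line-24 default); the attribute name cfg below is illustrative and should be checked against model.py:

    import numpy as np
    import torch
    import model as model_module  # the repo's model.py

    stats = np.load("checkpoint/roundfull_10_experiment_cifar10_fireflyn_initdim16_seed0_grow0.350000_gra3_alpha3_new.npy", allow_pickle=True).tolist()
    model_module.cfg = stats['cfg']  # hypothetical attribute: overwrite the Line-24 default with the grown widths
    net = Classifier(config).to(config.device)
    net.load_state_dict(torch.load("checkpoint/roundfull_10_experiment_cifar10_fireflyn_initdim16_seed0_grow0.350000_gra3_alpha3_new.pt"))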

For the remaining questions, please note that this work is a bit old, so I will do my best to answer based on the information available:

  1. If you look at Line 37 of model.py, the {1, 2, 4, 8} in the original VGG-19 definition are expansion ratios on the base channel size, so a full-size layer's weight tensor is something like $(4 \times 64) \times 256 \times 3 \times 3$, not $64 \times 256 \times 3 \times 3$; the parameter count scales with the product of the input and output widths, not with a single width. In our code, we calculate the parameter size directly with model.get_num_params() (see the parameter-count sketch after this list). If you still encounter a mismatch in the number of parameters, we can discuss it further.

  2. As stated in Appendix B.2, we increase the number of neurons by 30% each time and fine-tune the network for 160 epochs between consecutive increases. We did not observe any advantage from speeding up training, so we believe the number of fine-tuning epochs is the most critical factor affecting your results.

  3. Yes, the layer-width issue is the same version misalignment as the loading issue. You can ignore the leading '3' and shift the remaining layer widths back to their correct positions (a sketch of this follows below).
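To illustrate point 1, a self-contained parameter count: conv weights scale with the product of adjacent widths ($c_{in} \times c_{out} \times 3 \times 3$), which is why the starting ratio is far smaller than a linear sum of widths suggests. The cfg lists below follow the printed "Current cfg" format with 'M' marking max-pooling; model.get_num_params() remains the authoritative count:

    def conv_weights(cfg, in_ch=3, k=3):
        # Each conv layer contributes in_ch * out_ch * k * k weights.
        total = 0
        for v in cfg:
            if v == 'M':  # max-pooling has no parameters
                continue
            total += in_ch * v * k * k
            in_ch = v
        return total

    full_cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
                512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
    init_cfg = [16] * 16  # all 16 conv layers at width 16

    print(conv_weights(init_cfg) / conv_weights(full_cfg))
    # ~0.0017, i.e. about 0.17% -- far below the ~4.7% from summing widths linearly

And for point 3, a sketch of the realignment (one reading of the suggested fix, not code from the repo): drop the leading input dimension and lay the widths back onto the standard VGG-19 pooling layout:

    def realign(cfg):
        # Drop the leading input dim ('3'), ignore the shifted 'M' markers,
        # and re-place the widths against the standard VGG-19 stage layout.
        widths = iter(v for v in cfg[1:] if v != 'M')
        layout = [0, 0, 'M', 0, 0, 'M', 0, 0, 0, 0, 'M',
                  0, 0, 0, 0, 'M', 0, 0, 0, 0, 'M']
        return [v if v == 'M' else next(widths) for v in layout]

    print(realign([3, 17, 'M', 19, 25, 'M', 17, 19, 19, 18, 'M',
                   17, 19, 16, 16, 'M', 16, 16, 16, 16, 'M', 16]))
    # -> [17, 19, 'M', 25, 17, 'M', 19, 19, 18, 17, 'M', 19, 16, 16, 16, 'M', 16, 16, 16, 16, 'M']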

Thank you for your feedback, and we will address the issues with our code shortly.

Thank you very much for your prompt response. I was able to train the network as you suggested in Appendix B.2, and I did see the improvement in accuracy. However, there is one small issue remaining:

The complexity (% size of the model) is much smaller than what is shown in the paper. Using the get_num_params() function you suggested, I found that the initial starting point (all layers with 16 filters) is 0.18% of the full-model size, and the grown network that beats the clean VGG-19 accuracy ([66, 83, 'M', 69, 66, 'M', 84, 69, 84, 64, 'M', 78, 58, 48, 33, 'M', 28, 23, 28, 71, 'M']) is 2.56% of the full model size, instead of the roughly 4% indicated in the paper.

Code used to generate the results:
python main.py --method fireflyn --model vgg19 --grow_ratio 0.3

Complexity results obtained:

$$\frac{C_{init}}{C_{full}} = \frac{35674}{20035018} = 0.18\%$$

$$\frac{C_{growth}}{C_{full}} = \frac{511601}{20035018} = 2.56\%$$

Please let me know where I am going wrong; I am probably making a simple mistake somewhere, but I am not sure where.

EDIT: P.S. Can you also let me know whether you use any fully-connected layers after the convolutional layers (they are part of the standard VGG-19)? I could not find them in your code.

Hello! I believe the lower percentage might be due to a versioning issue with certain packages (such as PyTorch), since this project is from three years ago. In my current environment I re-ran the project and obtained a percentage close to 3%, which sits reasonably between your 2.56% and the paper's 4%.

Regarding our network setup, we use global pooling followed by a single linear layer after the convolutional layers; we believe this is a more efficient and now-standard design for classification networks.
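A minimal sketch of such a head, assuming a final convolutional width of 512 and 10 classes (both values illustrative; the actual ones live in the repo's model definition):

    import torch.nn as nn

    head = nn.Sequential(
        nn.AdaptiveAvgPool2d(1),  # global average pooling -> (N, C, 1, 1)
        nn.Flatten(),             # -> (N, C)
        nn.Linear(512, 10),       # single linear classifier layer
    )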

Thanks a lot for checking by re-running the simulations on your end as well.

So are you saying that with newer versions of PyTorch the layers are somehow more parameter-efficient? Since the architecture itself is fixed across versions, even my hand computation gives ~2.56%. I would like to get to the bottom of why the results improved relative to your paper.

So, going forward, should I report 2.56% as the size at which your method beats the full-size accuracy? I want to be sure I am reproducing your results correctly so that I can build on them.

EDIT: Lastly, is there any drawback you observe in the Firefly-grown network compared to the full-size model, apart from the increase in training time? Since you achieve a reduction in the number of parameters, is something being traded away elsewhere?

It's okay to report 2.56%. However, keep in mind that if the growth ratio is small and the growth step is large, the quality of growth may degrade over time, because each growth iteration introduces some error that accumulates throughout the process. To mitigate this, it is advisable to use a larger growth ratio and to allocate longer training time between growth steps.