hughperkins / clnn

OpenCL backend for Torch nn neural networks library

Trouble getting ClassNLLCriterion to work

PavleMiha opened this issue

Hey,

First of all thanks for the great work on this, this project has been really helpful.

I have been struggling to get clnn to run StochasticGradient with ClassNLLCriterion.

I have been following this guide https://github.com/soumith/cvpr2015/blob/master/Deep%20Learning%20with%20Torch.ipynb

My first question: regular nn's ClassNLLCriterion seems fine accepting 1D tensors, while clnn's needs 2D ones. I tried to adjust for this by adding net:add(nn.Reshape(10, 1)) as the last step of my neural network. Is this the correct approach?

Also, torch's ClassNLLCriterion accepts an integer class index as the target, while, if I understand correctly, clnn's requires a tensor with the correct label set to 1. I've converted the targets to 1D vectors of 0s with a 1 at the correct label.

With these changes I can get it to run but I'm getting nonsensical error numbers when training (sometimes nan, sometimes impossibly high values).

Here's the full code. It works when not running through OpenCL (it gets a training error of 1.432 in about 60 seconds), so I think I did something wrong, but I can't figure out what.

require('nn')
require('cltorch')
require('clnn')

-- os.execute('wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip')
-- os.execute('unzip cifar10torchsmall.zip')

net = nn.Sequential()

net:add(nn.SpatialConvolutionMM(3, 6, 5, 5))
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.SpatialConvolutionMM(6, 16, 5, 5))
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5))
net:add(nn.Linear(16*5*5, 120))
net:add(nn.Linear(120, 84))
net:add(nn.Linear(84, 10))
net:add(nn.LogSoftMax())
net:add(nn.Reshape(10, 1))

net = net:cl()


trainset = torch.load('cifar10-train.t7')
trainset.data = trainset.data:double()

testset = torch.load('cifar10-test.t7')
testset.data = testset.data:double()


mean = {} -- store the mean, to normalize the test set in the future
stdv  = {} -- store the standard-deviation for the future
for i=1,3 do -- over each image channel
    mean[i] = trainset.data[{ {}, {i}, {}, {}  }]:mean() -- mean estimation
    print('Channel ' .. i .. ', Mean: ' .. mean[i])
    trainset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction

    stdv[i] = trainset.data[{ {}, {i}, {}, {}  }]:std() -- std estimation
    print('Channel ' .. i .. ', Standard Deviation: ' .. stdv[i])
    trainset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end

for i=1,3 do -- over each image channel
    testset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction    
    testset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end


trainset.data = trainset.data:cl()

setmetatable(trainset, 
    {__index = function(t, i) 
        return {t.data[i], t.label[i]} 
    end}
);

function trainset:size() 
    return self.data:size(1) 
end

local labels = trainset.label

trainset.label = torch.Tensor(trainset.label:size(1), 10)

for i=1,trainset:size() do
    trainset.label[i] = torch.Tensor(10):fill(0)
    trainset.label[i][labels[i]] = 1
end

trainset.label = trainset.label:cl()

print(trainset)
--print(net:forward(trainset.data[4]))


criterion = nn.ClassNLLCriterion()
criterion = criterion:cl()


trainer = nn.StochasticGradient(net, criterion)
trainer.learningRate = 0.001
trainer.maxIteration = 2


trainer:train(trainset)

If the differences between clnn's ClassNLLCriterion and nn's version are unintentional I'd love to help consolidate the interfaces.

Hi PavleMiha,

So, basically, one would normally train in mini-batches for various reasons, and on GPUs it is significantly faster. Otherwise, the program spends all its time starting and stopping kernels, and the GPU never has enough data to do what it does well, which is crunching tens of thousands of numbers at a time.

To create a mini-batch, you add one more dimension as the first dimension of each example tensor. If you have a mini-batch with a batch size of 128, then the first dimension of each tensor would be 128. At the output of your network, just before the criterion, the tensors would have dimensions:

output: 128 x 10
target: 128 (ie a vector with length 128)

Now, if you train with a mini-batch size of 1 (which is possible, though slow), you would need the following tensor sizes:

output: 1 x 10
target: 1 (ie a vector with length 1)
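
To make those sizes concrete, here is a minimal shape check using plain nn's ClassNLLCriterion on random data (clnn expects the same sizes); purely illustrative:

require('nn')

local crit = nn.ClassNLLCriterion()

-- mini-batch of 1: a 1 x 10 tensor of log-probabilities, plus a length-1 vector holding the class index
local output1 = nn.LogSoftMax():forward(torch.randn(1, 10))
local target1 = torch.Tensor{3}
print(crit:forward(output1, target1))   -- a single loss value

-- mini-batch of 128: a 128 x 10 tensor, plus a length-128 vector of class indices
local output128 = nn.LogSoftMax():forward(torch.randn(128, 10))
local target128 = torch.Tensor(128):random(1, 10)
print(crit:forward(output128, target128))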

So, the reshape you did was along the right lines, but the dimensions need to be swapped, like this:

net:add(nn.Reshape(1, 10))
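
With that in place, forwarding a single 3x32x32 example through the net above, with this Reshape(1, 10) as its final layer and before the :cl() conversion, gives a 1 x 10 output, matching the batch-of-1 shape; a quick sanity check:

local out = net:forward(torch.randn(3, 32, 32))
print(out:size())   -- 1 x 10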

As for the targets: the labels should simply be the correct class for each example, e.g. 5 or 10, but they also need to be reshaped into a mini-batch, in this case a mini-batch of size 1. trainset.label is being indexed to give each 'mini-batch', i.e.:

setmetatable(trainset, 
    {__index = function(t, i) 
        return {t.data[i], t.label[i]} 
    end}
);

So, we need to reshape as follows:

trainset.label = trainset.label:reshape(trainset.label:size(1), 1)
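
Indexing the reshaped labels then gives exactly the length-1 target the criterion wants for a batch of 1, e.g.:

print(trainset.label[1])          -- a 1-element tensor holding the first example's class index
print(trainset.label[1]:size())   -- torch.LongStorage of size 1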

And then this runs ok for me.

Here is the test program, modified as above. I added a parameter to it, so you can run it with -backend nn or -backend clnn:

require('nn')
require('cltorch')
require('clnn')

local cmd = torch.CmdLine()
cmd:option('-backend', 'nn')
local params = cmd:parse(arg)

local backend = params.backend
if backend ~= 'nn' and backend ~= 'clnn' then
    error('backend should be nn or clnn')
end

-- os.execute('wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip')
-- os.execute('unzip cifar10torchsmall.zip')

net = nn.Sequential()

net:add(nn.SpatialConvolutionMM(3, 6, 5, 5))
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.SpatialConvolutionMM(6, 16, 5, 5))
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5))
net:add(nn.Linear(16*5*5, 120))
net:add(nn.Linear(120, 84))
net:add(nn.Linear(84, 10))
net:add(nn.LogSoftMax())
net:add(nn.Reshape(1, 10))

if params.backend == 'clnn' then
    net = net:cl()
end

trainset = torch.load('cifar10-train.t7')
trainset.data = trainset.data:double()

testset = torch.load('cifar10-test.t7')
testset.data = testset.data:double()


mean = {} -- store the mean, to normalize the test set in the future
stdv  = {} -- store the standard-deviation for the future
for i=1,3 do -- over each image channel
    mean[i] = trainset.data[{ {}, {i}, {}, {}  }]:mean() -- mean estimation
    print('Channel ' .. i .. ', Mean: ' .. mean[i])
    trainset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction

    stdv[i] = trainset.data[{ {}, {i}, {}, {}  }]:std() -- std estimation
    print('Channel ' .. i .. ', Standard Deviation: ' .. stdv[i])
    trainset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end

for i=1,3 do -- over each image channel
    testset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction    
    testset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end


if params.backend == 'clnn' then
    trainset.data = trainset.data:cl()
end

setmetatable(trainset, 
    {__index = function(t, i) 
        return {t.data[i], t.label[i]} 
    end}
);

function trainset:size() 
--    return self.data:size(1) 
  return 128
end

local labels = trainset.label

--trainset.label = torch.Tensor(1, trainset.label:size(1))
trainset.label = trainset.label:reshape(trainset.label:size(1), 1)

--for i=1,trainset:size() do
--    trainset.label[i] = labels[i]
--end

print('trainset.label:size()', trainset.label:size())
if params.backend == 'clnn' then
    trainset.label = trainset.label:cl()
end

print(trainset)
--print(net:forward(trainset.data[4]))


criterion = nn.ClassNLLCriterion()
if params.backend == 'clnn' then
    criterion = criterion:cl()
end


trainer = nn.StochasticGradient(net, criterion)
trainer.learningRate = 0.001
trainer.maxIteration = 2


trainer:train(trainset)

This runs for me, with both the nn and clnn backends. Note that you will generally get faster training times if you can arrange a mini-batch size of around 128 or 256, rather than 1. For example, VGG uses 256 http://www.robots.ox.ac.uk/~vgg/publications/2015/Simonyan15/simonyan15.pdf (third sentence, section 3.1). AlexNet uses 128 http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf (first sentence of section 5).
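
For reference, building an actual mini-batch is just a matter of narrowing the data and label tensors. Here is a rough sketch of one batched update; it assumes the final nn.Reshape(1, 10) is removed from the net (that layer is only needed for the batch-of-1 case) and that trainset.label is left as a plain length-N vector of class indices:

local batchSize = 128
local inputs  = trainset.data:narrow(1, 1, batchSize)    -- 128 x 3 x 32 x 32
local targets = trainset.label:narrow(1, 1, batchSize)   -- length-128 vector of classes 1..10

local outputs = net:forward(inputs)                      -- 128 x 10 log-probabilities
local loss = criterion:forward(outputs, targets)
net:zeroGradParameters()
net:backward(inputs, criterion:backward(outputs, targets))
net:updateParameters(0.001)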

Hey Hugh,

Thank you so much for the help, batching to run on the GPU makes perfect sense.

Your code worked perfectly, but I wanted to make it actually use batching, so I changed it to accept a batch size. I couldn't get StochasticGradient to accept batches (it might not support them at all), so I did it with optim, based on what I saw at https://github.com/torch/demos/blob/master/train-a-digit-classifier/train-on-mnist.lua

Here's the code, with batching, for reference: https://github.com/PavleMiha/torchdemos/blob/master/ir3.lua
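
For anyone reading along without opening the link, the general optim pattern looks roughly like this. This is a sketch of the approach from the MNIST demo rather than the exact contents of the file above; it assumes the net, criterion, and trainset from the earlier program, with the final nn.Reshape(1, 10) removed and trainset.label kept as a plain vector of class indices:

require('optim')

local batchSize = 128
local params, gradParams = net:getParameters()
local optimState = {learningRate = 0.001}

for epoch = 1, 2 do
    for i = 1, trainset.data:size(1) - batchSize + 1, batchSize do
        local inputs  = trainset.data:narrow(1, i, batchSize)
        local targets = trainset.label:narrow(1, i, batchSize)

        -- closure returning the loss and the gradient of the loss w.r.t. the flattened parameters
        local feval = function(x)
            if x ~= params then params:copy(x) end
            gradParams:zero()
            local outputs = net:forward(inputs)
            local loss = criterion:forward(outputs, targets)
            net:backward(inputs, criterion:backward(outputs, targets))
            return loss, gradParams
        end

        optim.sgd(feval, params, optimState)
    end
end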

The weird thing is that I'm finding the nn and clnn variants run at pretty much exactly the same speed, regardless of batch size or the size of the convolutional layers.

Is this just a matter of the neural net and my machine (I'm on a 2011 iMac, so it's an AMD Radeon HD 6770M) or did I mess something up?

Hmmm, yes, I get similar results actually. I think it's because the images are quite small; if you try larger images, you might see a larger difference. Note that on ImageNet-scale networks, most of the time goes into the first few layers, e.g. see the 'layerwise' results in https://github.com/soumith/convnet-benchmarks (cltorch isn't in the layerwise tables, but you can see that the earlier layers take a lot more time than the later ones).
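
For anyone who wants to check this on their own machine, here is a rough timing sketch along those lines. It times only the first convolution layer on random data with a small batch; the warm-up call matters because clnn compiles its kernels on first use, and the output is copied back to the host so that any queued OpenCL work has finished before the timer is read. Treat the numbers as very approximate:

require('nn')
require('cltorch')
require('clnn')

local function timeConv(imageSize, useCl)
    local conv = nn.SpatialConvolutionMM(3, 6, 5, 5)
    local input = torch.randn(16, 3, imageSize, imageSize)
    if useCl then
        conv = conv:cl()
        input = input:cl()
    end
    conv:forward(input)               -- warm-up (kernel compilation on the cl path)
    local timer = torch.Timer()
    local out
    for i = 1, 10 do
        out = conv:forward(input)
    end
    local _ = out:float()             -- copy back to the host before reading the timer
    return timer:time().real / 10
end

print('32x32    nn:   ' .. timeConv(32, false))
print('32x32    clnn: ' .. timeConv(32, true))
print('128x128  nn:   ' .. timeConv(128, false))
print('128x128  clnn: ' .. timeConv(128, true))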