Using Cuda with train-a-digit-classifier #43

Open
yyaddaden opened this issue May 2, 2016 · 7 comments
@yyaddaden

Hi,

I have used train-a-digit-classifier in CPU mode and it worked well, but now I want to test it in GPU mode. I have an NVIDIA Jetson TK1 on which I have installed CUDA 6.5 and all the other prerequisites. I have also installed Torch7 and the two packages cutorch and cunn.

Some tutorials say that to use the GPU with CUDA, you only need to add a couple of lines of code:

require 'cunn'
to enable CUDA, and

model:cuda()
to convert the nn model to CUDA.

But when I run qlua train-on-mnist.lua I get some errors. Can you help me?

Regards.
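For reference, the conversion such tutorials describe usually touches more than the model. A minimal sketch, assuming cutorch and cunn are installed (the tensor names are illustrative):

require 'cunn'               -- also loads cutorch

model = model:cuda()         -- move the network to the GPU
criterion = criterion:cuda() -- the loss module must be converted too

-- a CUDA model expects torch.CudaTensor inputs, so the data
-- has to be moved to the GPU as well:
inputs = inputs:cuda()
targets = targets:cuda()

Converting the model but not the criterion and the data is the source of the errors discussed below.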

@yyaddaden

yyaddaden commented May 3, 2016

The problem is resolved; the answer is here: http://stackoverflow.com/q/36992803/6091401

You can close the issue. Thank you.

@amiltonwong

For reference, here is the complete code:

----------------------------------------------------------------------
-- This script shows how to train different models on the MNIST 
-- dataset, using multiple optimization techniques (SGD, LBFGS)
--
-- This script demonstrates a classical example of training 
-- well-known models (convnet, MLP, logistic regression)
-- on a 10-class classification problem. 
--
-- It illustrates several points:
-- 1/ description of the model
-- 2/ choice of a loss function (criterion) to minimize
-- 3/ creation of a dataset as a simple Lua table
-- 4/ description of training and test procedures
--
-- Clement Farabet
----------------------------------------------------------------------

require 'torch'
require 'nn'
require 'nnx'
require 'optim'
require 'image'
require 'dataset-mnist'
require 'pl'
require 'paths'
require 'cunn'
require 'cutorch'
----------------------------------------------------------------------
-- parse command-line options
--
local opt = lapp[[
   -s,--save          (default "logs")      subdirectory to save logs
   -n,--network       (default "")          reload pretrained network
   -m,--model         (default "convnet")   type of model to train: convnet | mlp | linear
   -f,--full                                use the full dataset
   -p,--plot                                plot while training
   -o,--optimization  (default "SGD")       optimization: SGD | LBFGS 
   -r,--learningRate  (default 0.05)        learning rate, for SGD only
   -b,--batchSize     (default 10)          batch size
   -m,--momentum      (default 0)           momentum, for SGD only
   -i,--maxIter       (default 3)           maximum nb of iterations per batch, for LBFGS
   --coefL1           (default 0)           L1 penalty on the weights
   --coefL2           (default 0)           L2 penalty on the weights
   -t,--threads       (default 4)           number of threads
]]

-- fix seed
torch.manualSeed(1)

-- threads
torch.setnumthreads(opt.threads)
print('<torch> set nb of threads to ' .. torch.getnumthreads())

-- use floats, for SGD
if opt.optimization == 'SGD' then
   torch.setdefaulttensortype('torch.FloatTensor')
end

-- batch size?
if opt.optimization == 'LBFGS' and opt.batchSize < 100 then
   error('LBFGS should not be used with small mini-batches; 1000 is recommended')
end

----------------------------------------------------------------------
-- define model to train
-- on the 10-class classification problem
--
classes = {'1','2','3','4','5','6','7','8','9','10'}

-- geometry: width and height of input images
geometry = {32,32}

if opt.network == '' then
   -- define model to train
   model = nn.Sequential()

   if opt.model == 'convnet' then
      ------------------------------------------------------------
      -- convolutional network 
      ------------------------------------------------------------
      -- stage 1 : mean suppression -> filter bank -> squashing -> max pooling
      model:add(nn.SpatialConvolutionMM(1, 32, 5, 5))
      model:add(nn.Tanh())
      model:add(nn.SpatialMaxPooling(3, 3, 3, 3, 1, 1))
      -- stage 2 : mean suppression -> filter bank -> squashing -> max pooling
      model:add(nn.SpatialConvolutionMM(32, 64, 5, 5))
      model:add(nn.Tanh())
      model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
      -- stage 3 : standard 2-layer MLP:
      model:add(nn.Reshape(64*3*3))
      model:add(nn.Linear(64*3*3, 200))
      model:add(nn.Tanh())
      model:add(nn.Linear(200, #classes))
      ------------------------------------------------------------

   elseif opt.model == 'mlp' then
      ------------------------------------------------------------
      -- regular 2-layer MLP
      ------------------------------------------------------------
      model:add(nn.Reshape(1024))
      model:add(nn.Linear(1024, 2048))
      model:add(nn.Tanh())
      model:add(nn.Linear(2048,#classes))
      ------------------------------------------------------------

   elseif opt.model == 'linear' then
      ------------------------------------------------------------
      -- simple linear model: logistic regression
      ------------------------------------------------------------
      model:add(nn.Reshape(1024))
      model:add(nn.Linear(1024,#classes))
      ------------------------------------------------------------

   else
      print('Unknown model type')
      cmd:text()
      error()
   end
else
   print('<trainer> reloading previously trained network')
   model = torch.load(opt.network)
end

-- retrieve parameters and gradients
parameters,gradParameters = model:getParameters()

-- verbose
print('<mnist> using model:')
print(model)

----------------------------------------------------------------------
-- loss function: negative log-likelihood
--
model:add(nn.LogSoftMax())
criterion = nn.ClassNLLCriterion()

-- convert into cuda version
model = model:cuda()
-- convert into cuda version
criterion = criterion:cuda()
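-- (Note: getParameters() was called further up, before these :cuda()
-- conversions; as discussed at the end of this thread, the flattened
-- parameters must be retrieved *after* the model is moved to the GPU.)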

----------------------------------------------------------------------
-- get/create dataset
--
if opt.full then
   nbTrainingPatches = 60000
   nbTestingPatches = 10000
else
   nbTrainingPatches = 2000
   nbTestingPatches = 1000
   print('<warning> only using 2000 samples to train quickly (use flag -full to use 60000 samples)')
end

-- create training set and normalize
trainData = mnist.loadTrainSet(nbTrainingPatches, geometry)
trainData:normalizeGlobal(mean, std)
trainData.data = trainData.data:cuda()
trainData.labels = trainData.labels:cuda()

-- create test set and normalize
testData = mnist.loadTestSet(nbTestingPatches, geometry)
testData:normalizeGlobal(mean, std)
testData.labels = testData.labels:cuda()

----------------------------------------------------------------------
-- define training and testing functions
--

-- this matrix records the current confusion across classes
confusion = optim.ConfusionMatrix(classes)

-- log results to files
trainLogger = optim.Logger(paths.concat(opt.save, 'train.log'))
testLogger = optim.Logger(paths.concat(opt.save, 'test.log'))

-- training function
function train(dataset)
   -- epoch tracker
   epoch = epoch or 1

   -- local vars
   local time = sys.clock()

   -- do one epoch
   print('<trainer> on training set:')
   print("<trainer> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
   for t = 1,dataset:size(),opt.batchSize do
      -- create mini batch
      local inputs = torch.Tensor(opt.batchSize,1,geometry[1],geometry[2])
      local targets = torch.Tensor(opt.batchSize)
      local k = 1
      for i = t,math.min(t+opt.batchSize-1,dataset:size()) do
         -- load new sample
         local sample = dataset[i]
         local input = sample[1]:clone()
         local _,target = sample[2]:clone():max(1)
         target = target:squeeze()
         inputs[k] = input
         targets[k] = target
         k = k + 1
      end

      -- create closure to evaluate f(X) and df/dX
      local feval = function(x)
         -- just in case:
         collectgarbage()

         -- get new parameters
         if x ~= parameters then
            parameters:copy(x)
         end

         -- reset gradients
         gradParameters:zero()

         -- evaluate function for complete mini batch
         local outputs = model:forward(inputs)
         local f = criterion:forward(outputs, targets)

         -- estimate df/dW
         local df_do = criterion:backward(outputs, targets)
         model:backward(inputs, df_do)

         -- penalties (L1 and L2):
         if opt.coefL1 ~= 0 or opt.coefL2 ~= 0 then
            -- locals:
            local norm,sign= torch.norm,torch.sign

            -- Loss:
            f = f + opt.coefL1 * norm(parameters,1)
            f = f + opt.coefL2 * norm(parameters,2)^2/2

            -- Gradients:
            gradParameters:add( sign(parameters):mul(opt.coefL1) + parameters:clone():mul(opt.coefL2) )
         end

         -- update confusion
         for i = 1,opt.batchSize do
            confusion:add(outputs[i], targets[i])
         end

         -- return f and df/dX
         return f,gradParameters
      end

      -- optimize on current mini-batch
      if opt.optimization == 'LBFGS' then

         -- Perform LBFGS step:
         lbfgsState = lbfgsState or {
            maxIter = opt.maxIter,
            lineSearch = optim.lswolfe
         }
         optim.lbfgs(feval, parameters, lbfgsState)

         -- disp report:
         print('LBFGS step')
         print(' - progress in batch: ' .. t .. '/' .. dataset:size())
         print(' - nb of iterations: ' .. lbfgsState.nIter)
         print(' - nb of function evaluations: ' .. lbfgsState.funcEval)

      elseif opt.optimization == 'SGD' then

         -- Perform SGD step:
         sgdState = sgdState or {
            learningRate = opt.learningRate,
            momentum = opt.momentum,
            learningRateDecay = 5e-7
         }
         optim.sgd(feval, parameters, sgdState)

         -- disp progress
         xlua.progress(t, dataset:size())

      else
         error('unknown optimization method')
      end
   end

   -- time taken
   time = sys.clock() - time
   time = time / dataset:size()
   print("<trainer> time to learn 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix
   print(confusion)
   trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
   confusion:zero()

   -- save/log current net
   local filename = paths.concat(opt.save, 'mnist.net')
   os.execute('mkdir -p ' .. sys.dirname(filename))
   if paths.filep(filename) then
      os.execute('mv ' .. filename .. ' ' .. filename .. '.old')
   end
   print('<trainer> saving network to '..filename)
   -- torch.save(filename, model)

   -- next epoch
   epoch = epoch + 1
end

-- test function
function test(dataset)
   -- local vars
   local time = sys.clock()

   -- test over given dataset
   print('<trainer> on testing Set:')
   for t = 1,dataset:size(),opt.batchSize do
      -- disp progress
      xlua.progress(t, dataset:size())

      -- create mini batch
      local inputs = torch.Tensor(opt.batchSize,1,geometry[1],geometry[2])
      local targets = torch.Tensor(opt.batchSize)
      local k = 1
      for i = t,math.min(t+opt.batchSize-1,dataset:size()) do
         -- load new sample
         local sample = dataset[i]
         local input = sample[1]:clone()
         local _,target = sample[2]:clone():max(1)
         target = target:squeeze()
         inputs[k] = input
         targets[k] = target
         k = k + 1
      end

      -- test samples
      local preds = model:forward(inputs)

      -- confusion:
      for i = 1,opt.batchSize do
         confusion:add(preds[i], targets[i])
      end
   end

   -- timing
   time = sys.clock() - time
   time = time / dataset:size()
   print("<trainer> time to test 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix
   print(confusion)
   testLogger:add{['% mean class accuracy (test set)'] = confusion.totalValid * 100}
   confusion:zero()
end


----------------------------------------------------------------------
-- and train!
--
while true do
   -- train/test
   train(trainData)
   test(testData)

   -- plot errors
   if opt.plot then
      trainLogger:style{['% mean class accuracy (train set)'] = '-'}
      testLogger:style{['% mean class accuracy (test set)'] = '-'}
      trainLogger:plot()
      testLogger:plot()
   end
end
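Assuming the script is saved as train-on-mnist.lua next to dataset-mnist.lua in the demo directory, it can be run with the th (or qlua) interpreter, for example:

th train-on-mnist.lua          # quick run on 2000 training samples
th train-on-mnist.lua --full   # train on the full 60000-sample set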

@brisker

brisker commented Oct 20, 2016

@amiltonwong
I get the following error when running the code you provided for training MNIST. Why?
~/torch/install/bin/luajit: bad argument #3 to '?' (torch.*Tensor expected, got userdata)
stack traceback:
[C]: at 0x7fdfe528df90
[C]: in function '__newindex'
train-on-mnist.lua:201: in function 'train'
train-on-mnist.lua:358: in main chunk
[C]: in function 'dofile'
.../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

@yasudak

yasudak commented Feb 27, 2017

-      local inputs = torch.Tensor(opt.batchSize,1,geometry[1],geometry[2])
-      local targets = torch.Tensor(opt.batchSize)
+      local inputs = torch.CudaTensor(opt.batchSize,1,geometry[1],geometry[2])
+      local targets = torch.CudaTensor(opt.batchSize)

With this change it worked.
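This works because a model converted with :cuda() expects torch.CudaTensor inputs; writing the CUDA sample tensors into a plain torch.Tensor batch is what raises the __newindex error quoted above. An alternative sketch, assuming trainData.data and trainData.labels are left on the CPU: assemble each mini-batch in host memory and transfer it once per batch:

-- build the batch on the CPU as in the original code:
local inputs = torch.Tensor(opt.batchSize, 1, geometry[1], geometry[2])
local targets = torch.Tensor(opt.batchSize)
-- ... fill inputs and targets as before, then move the whole
-- batch to the GPU in a single host-to-device copy:
inputs = inputs:cuda()
targets = targets:cuda()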

@lebinhe

lebinhe commented Mar 1, 2017

When I run the code from @amiltonwong with the fix from @yasudak, an error occurs.

<torch> set nb of threads to 4
<mnist> using model:
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> output]
(1): nn.SpatialConvolutionMM(1 -> 32, 5x5)
(2): nn.Tanh
(3): nn.SpatialMaxPooling(3x3, 3,3, 1,1)
(4): nn.SpatialConvolutionMM(32 -> 64, 5x5)
(5): nn.Tanh
(6): nn.SpatialMaxPooling(2x2, 2,2)
(7): nn.Reshape(576)
(8): nn.Linear(576 -> 200)
(9): nn.Tanh
(10): nn.Linear(200 -> 10)
}
<warning> only using 2000 samples to train quickly (use flag -full to use 60000 samples)
loading only 2000 examples
done
loading only 1000 examples
done
<trainer> on training set:
<trainer> online epoch # 1 [batchSize = 10]
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-1543/cutorch/lib/THC/generic/THCTensorMath.cu line=15 error=8 : invalid device function
/data/offline/anaconda2/3rd/torch/install/bin/luajit: ...aconda2/3rd/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
...ne/anaconda2/3rd/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-1543/cutorch/lib/THC/generic/THCTensorMath.cu:15
stack traceback:
[C]: in function 'v'
...ne/anaconda2/3rd/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'SpatialConvolutionMM_updateOutput'
.../torch/install/share/lua/5.1/nn/SpatialConvolutionMM.lua:63: in function <.../torch/install/share/lua/5.1/nn/SpatialConvolutionMM.lua:53>
[C]: in function 'xpcall'
...aconda2/3rd/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...conda2/3rd/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
train-on-mnist-cuda2.lua:219: in function 'opfunc'
.../anaconda2/3rd/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
train-on-mnist-cuda2.lua:272: in function 'train'
train-on-mnist-cuda2.lua:357: in main chunk
[C]: in function 'dofile'
.../3rd/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
...aconda2/3rd/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...conda2/3rd/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
train-on-mnist-cuda2.lua:219: in function 'opfunc'
.../anaconda2/3rd/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
train-on-mnist-cuda2.lua:272: in function 'train'
train-on-mnist-cuda2.lua:357: in main chunk
[C]: in function 'dofile'
.../3rd/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0

@WEldred

WEldred commented Mar 15, 2017

This code doesn't work for me. I'm seeing the following output, which tells me that garbage is probably being loaded as the input:

<trainer> time to learn 1 sample = 1.0377799272537ms =================================>.] ETA: 6ms | Step: 0ms
ConfusionMatrix:
[[ 1 190 0 0 0 0 0 0 0 0] 0.524% [class: 1]
[ 0 220 0 0 0 0 0 0 0 0] 100.000% [class: 2]
[ 0 198 0 0 0 0 0 0 0 0] 0.000% [class: 3]
[ 0 191 0 0 0 0 0 0 0 0] 0.000% [class: 4]
[ 0 214 0 0 0 0 0 0 0 0] 0.000% [class: 5]
[ 0 180 0 0 0 0 0 0 0 0] 0.000% [class: 6]
[ 0 200 0 0 0 0 0 0 0 0] 0.000% [class: 7]
[ 0 224 0 0 0 0 0 0 0 0] 0.000% [class: 8]
[ 0 172 0 0 0 0 0 0 0 0] 0.000% [class: 9]
[ 0 210 0 0 0 0 0 0 0 0]] 0.000% [class: 10]

 + average row correct: 10.052356021479%

@ocanevet

ocanevet commented Mar 29, 2017

You need to put the following lines

-- retrieve parameters and gradients
parameters,gradParameters = model:getParameters()

after the model has been copied to the GPU:

-- convert into cuda version
model = model:cuda()
-- convert into cuda version
criterion = criterion:cuda()

Otherwise parameters and gradParameters keep pointing at the old CPU storage, the optimizer's updates never reach the CUDA model, and every epoch effectively restarts from the initialised weights (which is what the degenerate confusion matrix above shows).
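A sketch of the corrected ordering:

-- build the model and criterion first ...
model:add(nn.LogSoftMax())
criterion = nn.ClassNLLCriterion()

-- move both to the GPU:
model = model:cuda()
criterion = criterion:cuda()

-- only now flatten the parameters, so the returned tensors point at
-- the CUDA storage the optimizer will actually update:
parameters, gradParameters = model:getParameters()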
