cvjena / cn24

Convolutional (Patch) Networks for Semantic Segmentation

Bus error when testing

zygmuntz opened this issue

When I type test at the prompt, CPU usage goes up and nothing happens for about a minute; then a bus error is reported. Could this be related to the image size (1500x1500)?

INF [ Trainer::Epoch(156) ] Epoch: 99, it: 500, bsize: 16, current lr: 2.33894e-06
.....10%.....20%.....30%.....40%.....50%.....60%.....70%.....80%.....90%....
DBG [ Trainer::Epoch(221) ] Training, sps: 8628.59
DBG [ Trainer::Epoch(226) ] Training, tps: 115.894 us
DBG [ Trainer::Epoch(232) ] Training, GB/s   up: 0.000128576
DBG [ Trainer::Epoch(233) ] Training, GB/s down: 0.000128576
INF [ Trainer::Epoch(241) ] Training (Epoch 99, node 0) Square Loss Layer (Weight: 1) lps: 0.138364
INF [ Trainer::Epoch(243) ] Training (Epoch 99) aggregate lps: 0.138364
RESULT --- Training  - Epoch 99 - F1 : 12.6683% (t=-1)
RESULT --- Training  - Epoch 99 - ACC: 6.7625%
RESULT --- Training  - Epoch 99 - PRE: 6.7625%
RESULT --- Training  - Epoch 99 - REC: 100%
RESULT --- Training  - Epoch 99 - FPR: 100%
RESULT --- Training  - Epoch 99 - FNR: 0%
DBG [ Trainer::Reset(80) ] Resetting Trainer state
INF [ NetGraph&, Conv::NetGraph&, Conv::Trainer&, Conv::Trainer&, bool, std::string&)(296) ] Training complete.
 > test

DBG [ Trainer::Reset(80) ] Resetting Trainer state
DBG [ DatasetInputLayer::SetTestingMode(242) ] Enabled testing mode.
DBG [ Trainer::Test(90) ] ./trainnet.sh: line 3: 20806 Bus error               (core dumped) /home/ubuntu/cn24/build/trainNetwork -v /home/ubuntu/data/config.set /home/ubuntu/data/arch.net
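To put the 1500x1500 image size into perspective, here is a rough back-of-the-envelope estimate of how much raw memory dense image tensors of that size would need. It is only a sketch: the 3 channels, the 32-bit float storage, and the 100-image test set are assumptions for illustration, not values taken from this report, and `estimate_tensor_bytes` is a hypothetical helper, not part of CN24.

```python
def estimate_tensor_bytes(width, height, channels, samples=1, bytes_per_value=4):
    """Raw size in bytes of a dense samples x channels x height x width tensor.

    bytes_per_value=4 assumes 32-bit floats (an assumption, not confirmed here).
    """
    return width * height * channels * samples * bytes_per_value


# One 1500x1500 image with 3 channels (assumed) stored as 32-bit floats (assumed).
single = estimate_tensor_bytes(1500, 1500, channels=3)
print(f"one image:  {single / 1e6:.1f} MB")    # 27.0 MB

# A hypothetical test set of 100 such images.
test_set = estimate_tensor_bytes(1500, 1500, channels=3, samples=100)
print(f"100 images: {test_set / 1e9:.2f} GB")  # 2.70 GB
```

Under these assumptions a single image is small, so any memory pressure would come from the total number of test samples rather than from the size of one image.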

Hi,
it could. Could you attach a stack trace?

 > test

DBG [ Trainer::Reset(80) ] Resetting Trainer state
DBG [ DatasetInputLayer::SetTestingMode(242) ] Enabled testing mode.
DBG [ Trainer::Test(90) ]
Program received signal SIGBUS, Bus error.
0x00007ffff7b7be43 in Conv::Tensor::CopyMap(Conv::Tensor const&, unsigned long, unsigned long, Conv::Tensor&, unsigned long, unsigned long) () from /home/ubuntu/cn24/libcn24.so
(gdb)  bt
#0  0x00007ffff7b7be43 in Conv::Tensor::CopyMap(Conv::Tensor const&, unsigned long, unsigned long, Conv::Tensor&, unsigned long, unsigned long) () from /home/ubuntu/cn24/libcn24.so
#1  0x00007ffff7b7bcab in Conv::Tensor::CopySample(Conv::Tensor const&, unsigned long, Conv::Tensor&, unsigned long) () from /home/ubuntu/cn24/libcn24.so
#2  0x00007ffff7b6d398 in Conv::TensorStreamDataset::GetTestingSample(Conv::Tensor&, Conv::Tensor&, Conv::Tensor&, Conv::Tensor&, unsigned int, unsigned int) () from /home/ubuntu/cn24/libcn24.so
#3  0x00007ffff7ba8baa in Conv::DatasetInputLayer::FeedForward() () from /home/ubuntu/cn24/libcn24.so
#4  0x00007ffff7b8bec3 in Conv::NetGraph::FeedForward(Conv::NetGraphNode*) ()
   from /home/ubuntu/cn24/libcn24.so
#5  0x00007ffff7b8bd89 in Conv::NetGraph::FeedForward(std::vector<Conv::NetGraphNode*, std::allocator<Conv::NetGraphNode*> >&, bool) () from /home/ubuntu/cn24/libcn24.so
#6  0x00007ffff7b8bcb4 in Conv::NetGraph::FeedForward() () from /home/ubuntu/cn24/libcn24.so
#7  0x00007ffff7b84d1b in Conv::Trainer::Test() () from /home/ubuntu/cn24/libcn24.so
#8  0x000000000040d04d in parseCommand(Conv::NetGraph&, Conv::NetGraph&, Conv::Trainer&, Conv::Trainer&, bool, std::string&) ()
#9  0x000000000040bdf9 in main ()

Confirmed that it has to do with test set size. Well, I guess one just has to use small sets for testing.

Just for reference, we have no problems using >70GB datasets without any special measures. You may want to check your system configuration because CN24 does not do anything out of the ordinary in terms of system calls.

@clrokr What kind of hardware (in terms of main and GPU memory) do you use for those datasets?

Training usually takes place on a Xeon E3-1231v3 workstation with 16GB main memory and an R9 290X with 4GB of VRAM.