pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

torch.norm on CPU gives incorrect results for large Tensors

soumith opened this issue · comments

πŸ› Bug

>>> import torch
>>> x = torch.ones(40000)
>>> torch.norm(x)
tensor(266.0605)
>>> torch.norm(x.to(dtype=torch.float32, device='cuda'))
tensor(200., device='cuda:0')

Originally reported at https://discuss.pytorch.org/t/output-of-torch-norm-x-depends-on-size-of-x-not-in-a-good-way/33299/8
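
For reference, the expected 2-norm of a vector of 40000 ones is sqrt(40000) = 200, which matches the CUDA result above; a quick check:

>>> import math
>>> math.sqrt(40000)  # expected 2-norm of 40000 ones
200.0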

cc @xhzhao - I was able to bisect this down to 99c0b96 - do you have time to also take a look?

The script I used to bisect this is

python -c "import torch; import sys; sys.exit(0) if ((2 * torch.norm(torch.ones(10000))) == torch.norm(torch.ones(40000))) else sys.exit(1)"

For now I'll write a PR to revert the optimizations

cc @colesbury for ReduceOps

@xhzhao - I think you're using at::parallel_reduce incorrectly. parallel_reduce takes a function that reduces a section of the input to a partial result, and a function that combines two partial results.

For example, you might have a tensor of 10000 entries and want to sum all of its elements. parallel_reduce with a grain_size of 2500 will then allocate an intermediate result tensor with 4 elements. It executes the function "f" you provide on each chunk of 2500 values, i.e. 0-2499, 2500-4999, and so on, writing the result for each chunk into the intermediate result tensor. After that it reduces the partial results from the chunks into a single number using the scalar function "sf", which for sum would be "+". This is similar to TBB's approach, where you provide a function to accumulate a subrange, a function to combine two partial results, and an identity.
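
A minimal Python sketch of that contract (chunked_reduce is a hypothetical stand-in for illustration, not the actual at::parallel_reduce implementation):

# Hypothetical sketch of the chunked-reduction scheme described above:
# "f" reduces one chunk to a partial result, "sf" combines two partial results.
def chunked_reduce(values, grain_size, ident, f, sf):
    partials = [f(values[i:i + grain_size]) for i in range(0, len(values), grain_size)]
    result = ident
    for p in partials:
        result = sf(result, p)
    return result

# Summing 10000 ones with grain_size 2500: four chunks of 2500, combined with "+".
data = [1.0] * 10000
print(chunked_reduce(data, 2500, 0.0, sum, lambda a, b: a + b))  # 10000.0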

In the norm kernel code you also pass this plus operator to combine two partial results. That means that for a tensor of 40000 ones, as in the reproduction script, parallel_reduce with the default grain size splits the tensor into two chunks, 0-32767 and 32768-39999. Your partial reduction function returns the norm of each chunk, 181.019 and 85.0412 respectively, but combining those two values with add does not give the correct overall result. You need a different per-scalar combination function; for the 2-norm that means squaring the partial results, summing them and taking the square root again, though this can introduce numerical instability depending on the values and the number of chunks.
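
Continuing the sketch with the same chunk sizes (a hypothetical Python illustration, not the kernel code): combining the per-chunk norms with "+" reproduces the buggy CPU value, while squaring, summing and taking the square root of the partials recovers the expected 200.

import math

data = [1.0] * 40000
grain = 32768  # the default grain size mentioned above
# Per-chunk 2-norms for chunks 0-32767 and 32768-39999: ~181.019 and ~85.041.
chunk_norms = [math.sqrt(sum(v * v for v in data[i:i + grain]))
               for i in range(0, len(data), grain)]

print(sum(chunk_norms))                            # ~266.06, the buggy CPU result
print(math.sqrt(sum(n * n for n in chunk_norms)))  # ~200.0, the expected norm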

closed via #15885