black lanes in batched texture lookup

Question

kfjahnke opened this issue 3 months ago

Describe the bug

This is an odd bug. I can't reproduce it reliably, but it occurs quite regularly. When I use OIIO's batched texture lookup, every now and then, all batches of results come back with a specific set of lanes which contain black pixels. The pattern of lanes remains the same through the entire program, but it varies - sometimes it's just one lane, sometimes several.

I see this behaviour both with OIIO 2.4.17 from my package management and from a build from master (2.6.something).

Please run oiiotool --buildinfo and paste the output here.

I suppose this will be enough:

Input formats supported: bmp, cineon, dds, dicom, dpx, ffmpeg, fits, gif, hdr, heif, ico, iff, jpeg, jpeg2000, null, openexr, openvdb, png, pnm, psd, raw, rla, sgi, softimage, targa, tiff, webp, zfile
Output formats supported: bmp, dpx, fits, gif, hdr, heif, ico, iff, jpeg, jpeg2000, null, openexr, png, pnm, rla, sgi, targa, term, tiff, webp, zfile
OpenColorIO 2.1.3, color config: built-in
Known color spaces: "linear", "default", "rgb", "RGB", "sRGB", "Rec709"
Filters available: box, triangle, gaussian, sharp-gaussian, catmull-rom, blackman-harris, sinc, lanczos3, radial-lanczos3, nuke-lanczos6, mitchell, bspline, disk, cubic, keys, simon, rifman
Dependent libraries: OpenEXR 3.1.5, LIBTIFF Version 4.5.1, jpeg-turbo 2.1.5/jp62, dcmtk 3.6.7, FFMpeg 6.0 (Lavf60.16.100), gif_lib 5.2.2, libheif 1.17.6, OpenJpeg 2.5.0, null 1.0, OpenVDB 10.0.1abi10, libpng 1.6.43, libraw 0.21.2-Release, Webp 1.3.2
OIIO 2.4.17.0 built for C++17/201703 sse2
Running on 4 cores 15.3GB sse2,sse3,ssse3,sse41,sse42,avx,avx2,fma,f16c,popcnt,rdrand

Also please tell us if there was anything unusual about your environment or
nonstandard build options you used.

I'm running this on debian testing, nothing unusual. I also tried on a different machine running ubuntu 22.04 LTS with OIIO 2.2.18, same thing there.

To Reproduce

Steps to reproduce the behavior:

compile this program:

#include <OpenImageIO/texture.h>
#include <cassert>   // assert
#include <iostream>  // std::cout
#include <random>

#define LANES 16

using namespace OIIO ;
int main ( int argc , char * argv[] )
{
  float s[LANES] ;
  float t[LANES] ;
  float dsdx[LANES], dtdx[LANES], dsdy[LANES], dtdy[LANES] ;
  float px[3*LANES] ;

  for ( int i = 0 ; i < LANES ; i++ )
    dsdx[i] = dtdx[i] = dsdy[i] = dtdy[i] = 0 ;

  
  std::mt19937 gen(42); // level playing field
  std::uniform_real_distribution<> dis(0.0, 1.0);
  TextureSystem * ts = TextureSystem::create() ;
  ustring ufilename ( argv[1] ) ;
  auto th = ts->get_texture_handle ( ufilename ) ;
  TextureOptBatch batch_options ;
  for ( int i = 0 ; i < LANES ; i++ )
    batch_options.swidth[i] = batch_options.twidth[i] = 0 ;

  for ( int k = 0 ; k < 100000 ; k++ )
  {
    for ( int i = 0 ; i < LANES ; i++ )
    {
      s[i] = dis(gen) ;
      t[i] = dis(gen) ;
    }
    bool result =
    ts->texture ( th , nullptr , batch_options , Tex::RunMaskOn ,
                  s , t ,
                  dsdx, dtdx, dsdy, dtdy ,
                  3 , px ) ;

    assert ( result ) ;
    for ( int i = 0 ; i < LANES ; i++ )
    {
      if ( px[i] == 0.0f && px[LANES+i] == 0.0f && px[2*LANES+i] == 0.0f )
        std::cout << "bingo " << k << " " << i << std::endl ;
    }
  }
  return 0 ;
}

let's say the program is called texture_fuzz.cc, compile like this (clang++ does the same):

g++ -O3 -std=c++17 -otexture_fuzz texture_fuzz.cc -lOpenImageIO -lOpenImageIO_Util

Then run it, passing an image file as the sole argument.

Then THIS happens (reproduce the exact error message if you can)
... this is the hard bit. The expected behaviour is something like this:

bingo 1063 15
bingo 1174 13
bingo 1982 9
bingo 2080 8
bingo 4386 4
bingo 5574 10
bingo 8361 0
bingo 8744 10
bingo 10935 11
bingo 12076 1
bingo 18688 4
bingo 19130 9

so, the program detects the odd output pixel which is entirely black - if there are no black pixels in the input, there may be no output.
But every now and then, the bug happens and the output becomes something like this:

bingo 0 1
bingo 0 3
bingo 0 9
bingo 0 11
bingo 1 1
bingo 1 3
bingo 1 9
bingo 1 11
and so on.
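
To tell the two cases apart without reading thousands of "bingo" lines, the per-batch check can be condensed into a lane bitmask. The following is a minimal sketch, assuming the channel-major result layout that the test program above relies on (px[c*LANES+i] holds channel c of lane i). The names black_lane_mask and StuckLaneTracker are hypothetical helpers, not part of OIIO.

```cpp
#include <cassert>
#include <cstdint>

// Returns a bitmask with bit i set when lane i is black in every channel.
// Layout assumption: px[c * lanes + i] holds channel c of lane i.
std::uint32_t black_lane_mask(const float* px, int lanes, int nchannels)
{
    assert(lanes > 0 && lanes <= 32 && nchannels > 0);
    std::uint32_t mask = 0;
    for (int i = 0; i < lanes; ++i) {
        bool black = true;
        for (int c = 0; c < nchannels; ++c)
            black = black && (px[c * lanes + i] == 0.0f);
        if (black)
            mask |= (std::uint32_t(1) << i);
    }
    return mask;
}

// AND-accumulates the masks of successive batches. A bit that survives
// marks a lane that was black in every batch seen so far.
struct StuckLaneTracker {
    std::uint32_t stuck = ~std::uint32_t(0);
    int batches = 0;

    void add(std::uint32_t mask)
    {
        stuck &= mask;
        ++batches;
    }

    // No batches seen yet means nothing can be called stuck.
    std::uint32_t result() const { return batches ? stuck : 0; }
};
```

In the test loop one would call tracker.add(black_lane_mask(px, LANES, 3)) after each lookup. Random black texels in the input almost never hit the same lane in every batch, so a non-zero result after many batches points at the stuck-lane pattern.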

This is driving me nuts. Maybe I'm missing something, but I can't find any hint in the documentation, e.g. more stuff I have to pass to OIIO to make it perform correctly - I tried all sorts of batch options, all to no avail. It looks like data being accessed through a pointer to memory which may or may not contain the correct content (like, it was there but may have been altered in the meantime) - or loading from a dangling pointer - just guessing. Like the mask gone random and then stuck, but passing in Tex::RunMaskOn should set all lanes. So I'm at a loss. I'd appreciate it if someone could at least reproduce the behaviour - or tell me what I'm missing.

kfjahnke · Answer 1 · Wed Apr 10 2024 01:11:58 GMT+0800 (China Standard Time)

One step further: the bug seems to occur only with MipModeAniso - at least, so far, I've been unable to trigger it with other mipmapping modes. So I suppose that's why I got to see it: MipModeAniso is the default, and the most interesting one, offering the anisotropic antialiasing filter.
So far, except for one trial which only experienced the bug in the second invocation of 'texture' (this is really odd!), all occurrences of the bug happened immediately in the first invocation, and most of the time several lanes were affected. Trials with no affected lanes were significantly more frequent than trials with one or several. From this behaviour my guess is now one or several uninitialized variables somewhere in the RGB gleaning code.

kfjahnke · Answer 2 · Thu Apr 11 2024 19:17:22 GMT+0800 (China Standard Time)

I think I found it! It's the sblur and tblur parameters in the batch options. They are not initialized in the c'tor, and the swidth and twidth parameters aren't initialized either. The plain TextureOpt c'tor does initialize them, but in the 'texture' function the value in the TextureOpt is overwritten with the (uninitialized) value from the batch options. I think it might be a good idea to initialize these values to the defaults in TextureOptBatch's c'tor, e.g. in the same loop where rnd is initialized. I'll try that and report again.

kfjahnke · Answer 3 · Fri Apr 12 2024 01:02:56 GMT+0800 (China Standard Time)

That fixes it. I did this:

~/src/OpenImageIO$ git diff
diff --git a/src/include/OpenImageIO/texture.h b/src/include/OpenImageIO/texture.h
index 2fb88bc94..5713156cf 100644
--- a/src/include/OpenImageIO/texture.h
+++ b/src/include/OpenImageIO/texture.h
@@ -311,8 +311,15 @@ public:
     /// Create a TextureOptBatch with all fields initialized to reasonable
     /// defaults.
     TextureOptBatch () {
-        for (int i = 0; i < Tex::BatchWidth; ++i)
-            rnd[i] = -1.0f;
+        for (int i = 0; i < Tex::BatchWidth; ++i) {
+            rnd[i] = -1.0f ;
+            sblur[i] = 0.0f ;
+            tblur[i] = 0.0f ;
+            rblur[i] = 0.0f ;
+            swidth[i] = 1.0f ;
+            twidth[i] = 1.0f ;
+            rwidth[i] = 1.0f ;
+        }
     }
 
     // Options that may be different for each point we're texturing

This way, the width and blur parameters are initialized to the same default as in a plain TextureOpt.
Until the bug is fixed, the workaround is to initialize these members explicitly, like:

  TextureOptBatch batch_options ;

  for ( int i = 0 ; i < 16 ; i++ )
    batch_options.swidth[i] = batch_options.twidth[i] = 1 ;
  
  for ( int i = 0 ; i < 16 ; i++ )
    batch_options.sblur[i] = batch_options.tblur[i] = 0 ;

... use the batch options
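
The workaround can also be wrapped in a small helper so that every batch options object gets the same treatment. This is a minimal sketch: set_lane_defaults is a hypothetical helper, not an OIIO function, and BatchOptsSketch is only a stand-in with the six member names that appear in the diff above, so the example compiles without OIIO. The default values (blur 0, width 1) are the ones used in the diff.

```cpp
#include <cassert>

// Sets the per-lane blur and width members to the defaults from the diff
// above: blur 0, width 1. Works with any type that has these six
// array members, which is an assumption about the options struct.
template <typename Opt>
void set_lane_defaults(Opt& opt, int width)
{
    assert(width > 0);
    for (int i = 0; i < width; ++i) {
        opt.sblur[i] = opt.tblur[i] = opt.rblur[i] = 0.0f;
        opt.swidth[i] = opt.twidth[i] = opt.rwidth[i] = 1.0f;
    }
}

// Hypothetical stand-in for a batched options struct, NOT OIIO's
// TextureOptBatch. The lane count 16 mirrors LANES in the test program.
constexpr int kSketchWidth = 16;

struct BatchOptsSketch {
    float sblur[kSketchWidth], tblur[kSketchWidth], rblur[kSketchWidth];
    float swidth[kSketchWidth], twidth[kSketchWidth], rwidth[kSketchWidth];
};
```

With OIIO itself the call would presumably read set_lane_defaults(batch_options, Tex::BatchWidth), which avoids hard-coding the lane count.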

kfjahnke · Answer 4 · Fri Apr 12 2024 14:28:01 GMT+0800 (China Standard Time)

Let me add a bit of opinion. Failing to initialize the batch options happened to me because I trusted the documentation in so far as to assume that the batch options would behave like the single-point texture options. With my fix, this should now be the case. The documentation was the first place where I tried to find a solution for my problem, so I studied it carefully. I happen to be quite involved in SIMD programming, so this made perfect sense to me:

On CPU architectures with SIMD processing, texturing entire batches of samples at once may provide
a large speedup compared to texturing each sample point individually.

I assumed that, offering a SIMDized API, you would indeed provide SIMDized code. But you make no such claim: you say 'may provide'. Looking at the code, when the documentation did not help me fix the bug (nor the community of contributors who ignored this thread), what did I see:

TextureSystemImpl::texture(TextureHandle* texture_handle,
                                              Perthread* thread_info, TextureOptBatch& options,
  ...
{
    // (FIXME) CHEAT! Texture points individually

And then you proceed to chop up the incoming data, which are perfectly ready for SIMD processing, into parameter sets for calls to single-point texture lookups and run the code in a loop over the lanes. CHEAT indeed. No surprise the code is so slow - I had assumed it was due to the complex nature of the anisotropic antialiasing filter and the ripmap you would have to set up for it, but now I suspect it's merely due to the lack of SIMD processing. The acrobatics of first deinterleaving the incoming data into SIMD-load-ready memory to feed them to the batched lookup and then processing this memory with single-float reads and the conditionals to cater for the mask is sure to slow the process down additionally, so I wouldn't be surprised to see that going via the batched lookup is actually slower than using single-point lookups straight away. So what's the point of offering an API for batched texture lookup? Providing an empty shell hoping that someone will come along to provide proper code? I assume that you, @lgritz, are the author of this code, so I invoke you by name, maybe you can shed some light on this topic.
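
The pattern described here, a batched entry point that loops over the lanes and calls a single-point lookup for each active one, can be sketched as follows. This is an illustration only, not OIIO's code: lookup_one is a hypothetical procedural stand-in for a real single-point texture lookup, and the mask and result layout are assumptions modelled on the test program in the original report.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;  // assumption: same lane count as the test program

// Hypothetical single-point lookup: a procedural RGB "texture".
void lookup_one(float s, float t, float rgb[3])
{
    rgb[0] = s;
    rgb[1] = t;
    rgb[2] = 0.5f * (s + t);
}

// Naive batched lookup: one scalar call per active lane.
// Inputs are structure-of-arrays (s[i], t[i]); results are channel-major
// (result[c * kLanes + i]). Lanes whose mask bit is clear are left untouched.
bool lookup_batch_naive(std::uint32_t mask, const float* s, const float* t,
                        float* result)
{
    assert(s && t && result);
    for (int i = 0; i < kLanes; ++i) {
        if (!(mask & (std::uint32_t(1) << i)))
            continue;  // inactive lane
        float rgb[3];
        lookup_one(s[i], t[i], rgb);  // scalar reads from SIMD-ready arrays
        for (int c = 0; c < 3; ++c)
            result[c * kLanes + i] = rgb[c];  // scatter back per channel
    }
    return true;
}
```

Each lane pays for a scalar read, a mask test and a per-channel scatter, which is the overhead the comment above is complaining about.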

SIMD is a big issue, and it requires a different style of programming. But there are tools to make the transition easier. When I started out with SIMD programming, the library to use was - in my opinion - Vc, which has mutated into what's now std::simd, losing a few good features on the way because the C++ standard committee did not think it a good idea to have std::simd provide stuff like gather and scatter operations, and the version of std::simd which made it into the GNU version did drop another few good features like the SIMDized version of atan2. The whole show slowed down, the author of Vc moved on from his thesis project, and now Vc is in maintenance mode and std::simd has not evolved much. But Vc's - and std::simd's - way of tackling the SIMD access by providing SIMD-capable types which behave very much like 'normal' scalar operands is still valid, and I did use it to good effect in my code. When I decided to move to highway to get support for other architectures and for AVX512, I decided to build an abstraction layer which provides pretty much the same interface as Vc, but can use several SIMD back-ends under the hood - and at the same time offer the familiar interface, so I did not have to modify my Vc-based code. Later I factored that out into zimt. It's easy to use, it's free, and one can use it to program SIMD code without the pain involved in using intrinsics - even highway's portable intrinsics are quite a mouthful, but the zimt layer provides a uniform, Vc-like interface.

Let me add this: SIMD is now available in pretty much every CPU on the market. But it's still not used much. Why is that? Because of the needed change in programming style, the difficulty of addressing the different architectures and the inertia of extant code which would have to be modified. But using SIMD pays off big time. A lot of stuff which is now done on the GPU (which is, as a programming option, much harder to tackle than SIMD but nevertheless very popular) can be done with SIMD, and then it's often enough so fast that the GPU is not even needed. AFAICT, OIIO is CPU-based. Texture lookup is the Achilles heel of the entire texture system, every pixel you emit goes through this bit of code. It is about the worst place not to use SIMD. Please do consider using it - if you don't like zimt because it's from an independent source with no industry backing, I recommend you use highway directly - it's sort of from google ("not an officially supported Google product"), who are members of ASWF, and they have helped me a lot to get my use of their library up and running. They might even be interested in helping with OIIO's batched texture lookup, because it can showcase their library and help it get a wider user base.

Larry Gritz · Answer 5 · Sat Apr 13 2024 05:36:12 GMT+0800 (China Standard Time)

You don't need to educate me about SIMD basics, thanks.

You may have noticed that OIIO's simd.h provides SIMD data type classes that can be used without anybody needing to know about intrinsics, etc., and thus OIIO probably would not benefit from any of those other 3rd party dependencies you mention that do mostly the same thing.

You may also have noticed that the current single-lookup texture operations do use SIMD extensively in their implementation -- it's just that it's mainly used to parallelize all the math on the 3 or 4 color channels needed for each texture lookup, rather than trying to use SIMD to compute many texture lookups at once. You can't SIMD-ize both at the same time. So it already isn't simply a naive scalar implementation, it does use SIMD, just not in the way you might expect.
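
The two styles contrasted in this thread can be illustrated with a bilinear blend. This is a minimal sketch in plain C++ and not OIIO code: the loops stand in for SIMD registers, and the function names and the 4-channel layout are assumptions made for the example. In the first function the vector lanes are the channels of one lookup; in the second they are one channel of many lookups.

```cpp
#include <cassert>

// Vectorizing across channels: one sample point, four channels.
// The loop over c is what a 4-wide SIMD register would do in one step.
void bilinear_across_channels(const float c00[4], const float c10[4],
                              const float c01[4], const float c11[4],
                              float fx, float fy, float out[4])
{
    for (int c = 0; c < 4; ++c) {
        float top = (1.0f - fx) * c00[c] + fx * c10[c];
        float bot = (1.0f - fx) * c01[c] + fx * c11[c];
        out[c] = (1.0f - fy) * top + fy * bot;
    }
}

// Vectorizing across lookups: one channel, many sample points.
// Every input is an array with one entry per lane, so the loop over i
// maps onto a SIMD register of any width.
void bilinear_across_lanes(const float* c00, const float* c10,
                           const float* c01, const float* c11,
                           const float* fx, const float* fy,
                           float* out, int lanes)
{
    assert(lanes > 0);
    for (int i = 0; i < lanes; ++i) {
        float top = (1.0f - fx[i]) * c00[i] + fx[i] * c10[i];
        float bot = (1.0f - fx[i]) * c01[i] + fx[i] * c11[i];
        out[i] = (1.0f - fy[i]) * top + fy[i] * bot;
    }
}
```

The first form caps the useful width at the channel count (3 or 4), while the second scales with the register width but needs a full batch of coherent lookups to fill it.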

At some point, we provided a batched API to TextureSystem, but at first only provided a naive implementation that loops over the points. The reason we never got around to replacing the implementation with a more sophisticated SIMD implementation that parallelizes across shade lookups (instead of across channels) is simple:

1. My company's renderer, like most other high-end renderers, traces individual rays rather than coherent batches, and so would not benefit from any improvement to the batched API, which we don't currently use. For that reason, it makes no sense for me to spend time implementing it.
2. Nobody else stepped forward to do the implementation work.
3. Also nobody else who uses OIIO's texture system has particularly asked for it.

Note that points 1 and 3 also explain how the bug reported in this issue has gone so long without being previously reported -- it's very possible that nobody is relying on the batched texture lookups.

I would welcome a PR that would improve the implementation behind the batched API texture calls.

kfjahnke · Answer 6 · Sat Apr 13 2024 17:59:20 GMT+0800 (China Standard Time)

You don't need to educate me about SIMD basics, thanks.

Sorry, I didn't mean to step on your toes. My code is using SIMD throughout, and when I saw the vectorized API I was happy, because I could just pass on the SIMD data I already had going. When I saw later that it was just a loop over the lanes I was disappointed, because it sort of threw a spanner in the works.

parallelize all the math on the 3 or 4 color channels needed for each texture lookup

That's the point which irked me. I did see the vectorization across the colour channels. Using vertical vectorization was okay for SSE, when a vector didn't have many lanes. But as CPUs improve and the lane count increases, you need horizontal vectorization to exploit the wider SIMD units. I recommended highway, which is modern and good for back-end work, providing 'portable intrinsics' which work across architectures. But it's a bit low-level, that's why I also mentioned my library zimt, which gives it a 'friendlier face'. I'm trying to give helpful hints - I've done a lot of research and reading trying to find the best solution.

My company's renderer, like most other high-end renderers, traces individual rays rather than coherent batches, and so would not benefit from any improvement to the batched API, which we don't currently use. For that reason, it makes no sense for me to spend time implementing it.

Okay, I see where you're coming from. In a way, lux also does ray tracing, but of course it's much less complex and focused on speed, so that I can produce 60fps for fluid animations of single-image visualizations - most synoptic views are too complex to run in real-time, even with SIMD. But if the rendition becomes very complex and varies a lot between the rays, using SIMD does become difficult indeed.

the bug reported in this issue has gone so long without being previously reported

So you also consider this a bug. Can you simply fix it, adding the missing initialization?

it's very possible that nobody is relying on the batched texture lookups.

That's what it felt like. In a way I found it surprising to find stuff like the image cache and the texture system in an image i/o library, but since my work is pretty much along the same lines, I see how it's logical to move on in that direction. I do appreciate your design, and I understand that it's hard to get it all implemented.

Larry Gritz · Answer 7 · Sat Apr 13 2024 22:38:09 GMT+0800 (China Standard Time)

That's the point which irked me. I did see the vectorization across the colour channels. Using vertical vectorization was okay for SSE, when a vector didn't have many lanes. But as CPUs improve and the lane count increases, you need horizontal vectorization to exploit the wider SIMD units. I recommended highway, which is modern and good for back-end work, providing 'portable intrinsics' which work across architectures. But it's a bit low-level, that's why I also mentioned my library zimt, which gives it a 'friendlier face'. I'm trying to give helpful hints - I've done a lot of research and reading trying to find the best solution.

I do appreciate the suggestions. I think I have both of these bookmarked and will take a look. If we wanted a more "horizontal" style of SIMD, we might use one of those, or add more to our simd.h to make it even easier.

But really, it comes down to the fact that my own use case for the texture system doesn't generate batched queries, so I personally don't have the time to do this work. As far as I know, none of the other major renderers that use OIIO's TextureSystem do, either. And most of them are all-hands-on-deck doing GPU ports and for the most part have decided that CUDA+OptiX is the way forward for fast parallel ray tracing, rather than AVX-512.

But if the rendition becomes very complex and varies a lot between the rays, using SIMD does become difficult indeed.

That's exactly the problem. We did implement it all at one point in our renderer, but it was really difficult to maintain enough coherence in the rays to fill out enough lanes to consistently come out ahead, and when we finally got rid of that extra mechanism and went back to single point shading, the renderer code got a lot simpler. And it's easier to optimize simple code. We do use SIMD in the renderer, too, but again in a more "vertical" way, like using SIMD to test several bounding boxes against one ray at a time, rather than trying to juggle many rays at once.
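
The "several bounding boxes against one ray" approach mentioned above can be sketched with a slab test over boxes stored in structure-of-arrays form. This is an illustration under stated assumptions, not code from any renderer: Boxes4 and ray_hits_boxes4 are hypothetical names, the inner loops stand in for 4-wide SIMD operations, and the caller is assumed to pass a direction with no zero components so that 1/direction is finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

constexpr int kBoxes = 4;

// Four axis-aligned boxes, structure-of-arrays: lo[axis][box], hi[axis][box].
struct Boxes4 {
    float lo[3][kBoxes];
    float hi[3][kBoxes];
};

// Slab test of one ray against four boxes at once.
// inv_dir[axis] is 1 / direction[axis]; assumed finite (no zero components).
// Returns a bitmask: bit b is set when the ray (t >= 0) hits box b.
std::uint32_t ray_hits_boxes4(const float origin[3], const float inv_dir[3],
                              const Boxes4& boxes)
{
    float tnear[kBoxes], tfar[kBoxes];
    for (int b = 0; b < kBoxes; ++b) {
        tnear[b] = 0.0f;  // starting at 0 rejects boxes behind the origin
        tfar[b] = std::numeric_limits<float>::infinity();
    }
    for (int axis = 0; axis < 3; ++axis) {
        // Inner loop over boxes: the same ray data is broadcast to all lanes.
        for (int b = 0; b < kBoxes; ++b) {
            float t0 = (boxes.lo[axis][b] - origin[axis]) * inv_dir[axis];
            float t1 = (boxes.hi[axis][b] - origin[axis]) * inv_dir[axis];
            tnear[b] = std::max(tnear[b], std::min(t0, t1));
            tfar[b] = std::min(tfar[b], std::max(t0, t1));
        }
    }
    std::uint32_t mask = 0;
    for (int b = 0; b < kBoxes; ++b) {
        assert(tnear[b] == tnear[b]);  // NaN would mean a zero direction slipped in
        if (tnear[b] <= tfar[b])
            mask |= (std::uint32_t(1) << b);
    }
    return mask;
}
```

All four lanes are always filled here because the boxes come from one tree node, which is why this style does not depend on ray coherence.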

So you also consider this a bug. Can you simply fix it, adding the missing initialization?

Of course. I'll try to do it today, on my weekend, because nobody else seems willing to just submit the PR themselves even though they know exactly what to fix.

In a way I found it surprising to find stuff like the image cache and the texture system in an image i/o library,

OpenImageIO was very much conceived as "the image IO parts of a renderer", and also it's a companion project to Open Shading Language. Sometimes it's hard to know precisely where the boundary between them should live, and in an alternate universe, the IC/TS might be on the OSL side, or exist as yet a third project that sits between them.

Larry Gritz · Answer 8 · Sun Apr 14 2024 00:04:41 GMT+0800 (China Standard Time)

Proposed fix in #4226

kfjahnke · Answer 9 · Sun Apr 14 2024 02:42:41 GMT+0800 (China Standard Time)

Thanks for the prompt fix, even on a weekend! I tried it out here, and it works for me.

OpenImageIO was very much conceived as "the image IO parts of a renderer"

That's how lux came to be: as a demo program for my SIMDized b-spline library. And now I found 'the image IO part' to replace the one I'm currently using, and, lo and behold, it can even do rendering! So I couldn't resist and gave it a spin.

I am currently working on environments, both lat/lon and cubemap. I am unhappy with openEXR's exrenvmap utility because it seems not to produce entirely correct results - see this issue if you're interested. And I took this as a hint to look at cubemaps again, because I'm not entirely happy with the cubemap processing in lux either. So far this has been a rewarding journey, and I've come up with a few interesting new ideas which I'm testing out.

I was trying to interest them in looking at their code again, but it came out pretty much like 'we won't fix it but if you do it we'll take it'. They said that they wanted something with 'better mathematics' (their code is 20 years old) - so I thought that using OIIO's environment lookup would be just the ticket. At least I could use their utility to create a cubemap to their standards to try and feed it to OIIO, when OIIO didn't like the cubemaps I had made. Then it turned out that OIIO only supports lat/lon (at least that's the impression I get) so I had to use not only the batched environment function but also the batched texture function - and that's how this bug surfaced, showing that precisely the two functions I wanted weren't SIMDized all through and had this bug... argh!

Now I won't go on much longer on a Saturday, but I'd say that a texture lookup is probably simple enough to fully do in SIMD, even if complex ray tracing is not. I have a clear idea of how this code works - interpolation is one of my special interests. I see the biggest stumbling stone in the tiled nature of the data. Whenever the iteration hits upon a tile which isn't already in RAM, the code has to get off its set pattern and deal with it, and this is detrimental to speed and doesn't work well with SIMD. But I also want to switch to using tiled data, so I've been thinking about this for some time now - maybe I'll figure something out.

Larry Gritz · Answer 10 · Sun Apr 14 2024 03:01:40 GMT+0800 (China Standard Time)

No problem, I'm happy to do it.

When we added the batched texture API calls, I really did mean to get back to do the implementation. But like I said, our renderer turned away from batching rays (thus, no batched texture lookups), and then later moved more toward GPU for ways to get higher performance, so the priority for doing it was greatly diminished. I don't think it would be very hard to do a good horizontal SIMD implementation. I think it would be fun, it's really all I can do not to just jump on it myself. :-) But I have so many other things to do that are directly needed by my employers or other people in the industry, it would be unwise for me to spend the time on it myself.

Does OIIO not do cubemap lookups? Maybe I'm thinking of the last texture system I wrote, which did. I think I meant to add that to OIIO like 15 years ago, but the first renderer we added it to at SPI was at that point only latlong, so finishing the direct cubemap lookups was postponed, and somehow it just never bubbled up the priority list (and apparently none of the other renderers using OIIO::TS have complained loudly enough). I'm not sure why it is, exactly, I was always partial to cube maps myself, but it does seem that latlongs have mostly taken over in that style of renderer.

Larry Gritz · Answer 11 · Sun Apr 14 2024 03:02:28 GMT+0800 (China Standard Time)

Anyway, if you could go to the PR and approve it if the code looks right to you, I will merge it and also backport to the next release patch that will happen in a couple weeks.

kfjahnke · Answer 12 · Sun Apr 14 2024 03:17:36 GMT+0800 (China Standard Time)

Does OIIO not do cubemap lookups?

Well, as far as I can tell it doesn't. I fed a cubemap and tried to process it with calls to 'environment', but it clearly did not understand the format. This was confusing, because in the comments to your code you tell an interesting story about how openEXR has specific ideas about cubemaps which aren't necessarily what one would expect (like, repeating all edges, probably to avoid trouble with bilinear lookups) - so I thought it should be there somehow and I was just doing something wrong.

I always thought they were simply using a standard which coincides with openGL cubemaps and so would everyone else, but now I'm not so sure... but my cubemap code is maturing, now it works with OIIO's texture lookup as well, and I even already have a very fast fully SIMDized bilinear pickup, which is easy because I hold all the data in RAM and don't have to deal with tiles.

kfjahnke · Answer 13 · Tue Apr 16 2024 17:59:21 GMT+0800 (China Standard Time)

I think it would be fun, it's really all I can do not to just jump on it myself. :-)
I was always partial to cube maps myself

It is fun indeed, and I've enjoyed coding it. I've now come up with a nice utility to convert between lat/lon and cubemap. It's written using zimt and OIIO. It started out as a demo program for zimt, but since it was so much fun coding it, and since openEXR's conversion tool doesn't really work for me, I elaborated:

arbitrary size of output
choose between OIIO's antialiased lookup and (fast) direct bilinear interpolation with zimt
process files with one, three or four channels
choice of SIMD back-end by compile-time switch (Vc, highway, std::simd or zimt's own)

The code is here - it's amply commented, and I hope it conveys a good idea of how zimt is helpful to code stuff like this in multithreaded SIMD without the pain of using intrinsics and the like - and it optionally uses OIIO's batched texture and environment lookups.

One thing which I'd like to point out as a distinguishing feature is the way this program represents cubemaps internally. The raw cube face images are surrounded with an additional frame of pixels interpolated from neighbouring cube faces to produce an IR image with sufficient support for good interpolators and mip-mapping. This is to make sure that there are no artifacts along the cube's edges.
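
At the core of any lat/lon-to-cubemap conversion is a mapping from a 3D direction to each format's coordinates. The following is a minimal sketch of that mapping and not the code of the utility described above. The axis convention (y up, +z at the image centre, t = 0 at the top) is an assumption chosen for the example and may differ from what OIIO, OpenEXR or the utility use. The face order +X, -X, +Y, -Y, +Z, -Z follows the usual OpenGL enumeration.

```cpp
#include <cassert>
#include <cmath>

struct LatLon {
    float s;  // horizontal coordinate in [0, 1]
    float t;  // vertical coordinate in [0, 1], 0 at the top
};

// Direction to lat/lon image coordinates.
// Assumed convention: y is up, +z maps to the image centre,
// s grows toward +x, t grows downward.
LatLon dir_to_latlon(float x, float y, float z)
{
    const float pi = 3.14159265358979f;
    float len = std::sqrt(x * x + y * y + z * z);
    assert(len > 0.0f);
    float lon = std::atan2(x, z);    // in [-pi, pi]
    float lat = std::asin(y / len);  // in [-pi/2, pi/2]
    return { 0.5f + lon / (2.0f * pi), 0.5f - lat / pi };
}

// Direction to cube face index, chosen by the component with the largest
// magnitude. Face order: 0:+X 1:-X 2:+Y 3:-Y 4:+Z 5:-Z.
int dir_to_cube_face(float x, float y, float z)
{
    float ax = std::fabs(x), ay = std::fabs(y), az = std::fabs(z);
    if (ax >= ay && ax >= az)
        return x >= 0.0f ? 0 : 1;
    if (ay >= az)
        return y >= 0.0f ? 2 : 3;
    return z >= 0.0f ? 4 : 5;
}
```

A converter would walk over the target image, turn each pixel into a direction, and use a mapping like this to find where to sample the source. The in-face coordinates are left out here because their sign conventions vary between cubemap formats.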