NVIDIA / MatX

An efficient C++17 GPU numerical computing library with Python-like syntax

Home Page: https://nvidia.github.io/MatX

[BUG] tests "OperatorTests.cu" and "ReductionTests.cu" fail to compile

siegLoesch opened this issue · comments

Being aware that you were working hard to improve the compilation of the unit tests, I upgraded MatX to revision 20230506.134328. Nearly all of the unit tests compile without failure, except "OperatorTests.cu" and "ReductionTests.cu" [see:
MatXConfBuildInstallFailedTests.log

]. For the moment I excluded those two from the list <test_sources> given in the file <$MATX_HOME/test/CMakeLists.txt>.
The same failures, in the same files, appear as soon as I activate the CMake benchmarks option with -DMATX_BUILD_BENCHMARKS:BOOL=YES, which I have deactivated for the moment because I do not know whether it is necessary to run the benchmarks.

Please find the system and CUDA versions inside the file attached above, and let me know if I am doing something wrong or whether you are already aware of this.

Best regards
Siegfried

Hi @siegLoesch, this may be related, but in our readme we have:

Note: Both CUDA 12.0.0 and CUDA 12.1.0 have an issue that causes building MatX unit tests to show a compiler error or cause a segfault in the compiler. We are looking into this issue. CUDA 11.8 does not have either of these issues.

I should highlight this more prominently, but there's a compiler bug preventing this from working at the moment. Have you tried older CUDA versions?

Hello Cliff,
thank you for your fast response!
I had your note in mind, but thought you had already solved these issues. The reason is that I tried CUDA 11.8 before, with much less success. That might be because my default host compiler is gcc 12.2, which is not compatible with CUDA 11.8 and requires <-allow-unsupported-compiler> to be used [see:
MatXConfBuildInstallFailedTests-gcc-12_2-Cuda-11_8.log].
But using a combination of gcc 11.3 with CUDA 11.8 [see:
MatXConfBuildInstallFailedTests-gcc-11_3-Cuda-11_8.log
] also results in many more compile errors in the tests than gcc 12.2 / CUDA 12.1 (where only 2 of all tests fail).
I know that Debian Bookworm is not yet stable, but it should be soon (it is nearly stable).
MatX will be a great tool for my numerical computations. I will wait for your further progress, and I thank you all for your efforts!

Best regards
Siegfried

Thanks @siegLoesch! We will look today and report back.

Hi @siegLoesch, unfortunately the compiler problem is not something we (MatX) can fix, but it is already being addressed by the compiler team for an upcoming CUDA release. You are correct that gcc 12 is not officially supported by CUDA 11.8, but gcc 11 should be, I believe. Are you sure nothing is wrong with the install shown in your gcc log? It appears that even though your compiler is gcc 11.3, it is still looking for libstdc++ in /usr/include/c++/12, when I think it should be in /11/.

Can you ensure that the compiler in the case of 11.3 is installed correctly?

Hi Cliff,
the gcc 11.3 that was used is installed via the apt package manager, therefore I assume it is installed correctly. The logged access to /usr/include/c++/12 is mysterious. As far as I know, this might be an issue with CMake when using a non-default CXX; I will check with them.
Meanwhile I set up a fresh Debian Bullseye installation, because on Bookworm CUDA 11.8 complains about using gcc 11.3 unless -allow-unsupported-compiler is given (message: no gcc later than 11 [whatever 11 means]). Bullseye has gcc 10.2.1 as its default compiler and enabled me to use the NVIDIA Debian repos for the driver and CUDA. I compiled with two setups:

Both of them ended in failures and, strangely enough, different ones. libstdc++ is also picked up correctly.

Based on that I will stay with bookworm and wait for the next cuda release.

Best regards
Siegfried

Thanks @siegLoesch, I will try to take a look at this today if I get time.

Hi @siegLoesch, can you please open iterator.h and change line 114 to:

  #pragma GCC diagnostic push
  #pragma GCC diagnostic ignored "-Werror=aggressive-loop-optimizations"
      offset_++;
  #pragma GCC diagnostic pop

that will disable the warning. I don't know why it's happening yet since I need to reproduce it here, but that might get you a little further.

Hello @cliffburdick,
your proposal helped a lot. I had to slightly modify it so that gcc would accept it without error:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Waggressive-loop-optimizations"
       offset_++;
#pragma GCC diagnostic pop

I put it in a patch (in case someone wants to use it): iteratorH.zip

Please note the following remarks:
Remark 1: The operating system used was Debian Bullseye, with g++-10, CUDA 11.8 and Python 3.9. The CMake configure options were:

cmake --fresh -S . -B "$BUILD_DIR" -GNinja \
		-DCMAKE_BUILD_TYPE:STRING=Release \
		-DCMAKE_INSTALL_PREFIX="$INSTALL_PATH_PREFIX" \
		-DCMAKE_CXX_FLAGS:STRING="${cxxFlags[*]}" \
		-DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=YES \
		-DCMAKE_VERBOSE_MAKEFILE:BOOL=NO \
		-DCMAKE_PREFIX_PATH:PATH="$(printf %s "${cmakePrefixPath[@]}")" \
		-DCMAKE_FIND_PACKAGE_PREFER_CONFIG:BOOL=YES \
		-DCPM_USE_LOCAL_PACKAGES:BOOL=NO \
		-DMATX_EN_PYBIND11:BOOL=YES \
		-DMATX_EN_FILEIO:BOOL=YES \
		-DMATX_EN_VISUALIZATION:BOOL=YES \
		-DMATX_BUILD_EXAMPLES:BOOL=NO \
		-DMATX_BUILD_TESTS:BOOL=YES \
		-DMATX_BUILD_BENCHMARKS:BOOL=NO \
		-DMATX_BUILD_DOCS:BOOL=NO

Remark 2: I had to increase the CMake version in use, because the minimum version in the top-level CMakeLists.txt file is 3.18, whereas CPM requires 3.20.1 together with the FATAL_ERROR option in the file cmake/rapids-cmake/CMakeLists.txt. That inconsistency causes a configuration error for anyone using a CMake version >= 3.18 and < 3.20.1! If necessary I will open a bug report - let me know.
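One way the inconsistency could be resolved is to raise the top-level minimum to the strictest requirement in the tree. This is only a sketch based on the version numbers quoted above (3.20.1 is the version rapids-cmake insists on in this report, not something verified against the MatX sources):

```cmake
# Sketch only: align the top-level minimum with the strictest requirement
# in the tree, so configuration no longer trips over the mismatch between
# 3.18 (top level) and 3.20.1 (cmake/rapids-cmake with FATAL_ERROR).
cmake_minimum_required(VERSION 3.20.1 FATAL_ERROR)
```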

Remark 3: When using the CMake option CPM_USE_LOCAL_PACKAGES:BOOL=NO, as indicated above, CPM downloads GTest, which is then installed together with MatX, and CPM seems to remember that (I have to mention that I am not used to CPM). During any rebuild, CPM looks for GTest at the install location, and there is a target_link_libraries command for the matx_test binary in test/CMakeLists.txt. I had to change the entry gtest to GTest::gtest because the former target could not be found.
Some of the tests in 00_io need CSV files and the binary file test.mat, all of which are located at CMAKE_SOURCE_DIR/test/00_io. When calling the test executable matx_test from its location in CMAKE_BINARY_DIR/test, the search paths according to FileIOTests.cu are:
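For illustration, the target name change might look as follows in test/CMakeLists.txt (the PRIVATE keyword and the exact call shape are assumptions, not the actual MatX file contents):

```cmake
# Sketch only: link against the namespaced imported target that GTest's
# CMake package config exports, instead of the bare 'gtest' name.
target_link_libraries(matx_test PRIVATE GTest::gtest)
```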

  • '../test/00_io/small_csv_comma_nh.csv' and
  • '../test/00_io/small_csv_complex_comma_nh.csv' respectively.

Therefore they must be copied to the correct location (done via CMake - see the patch attached below).
cupy is needed for some tests, and I have also added a short check for this dependency in test/CMakeLists.txt (there is no cupy package in Bullseye by default; I installed it via pip). All changes mentioned in Remark 3 can be seen in this patch:
testCMakeLists.zip
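A minimal sketch of the test-data copy described in Remark 3 (this is not the attached patch itself, just one way to do it): copying test/00_io into the binary test directory makes the relative paths '../test/00_io/...' resolve when matx_test runs from CMAKE_BINARY_DIR/test.

```cmake
# Sketch only: copy the 00_io test data (CSV files and test.mat) at
# configure time, so the relative paths hard-coded in FileIOTests.cu
# resolve at run time from CMAKE_BINARY_DIR/test.
file(COPY ${CMAKE_SOURCE_DIR}/test/00_io
     DESTINATION ${CMAKE_BINARY_DIR}/test)
```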

Remark 4: Even after applying the above patches, some tests still fail:

[  FAILED  ] 13 tests, listed below:
[  FAILED  ] BasicTensorTestsAll/0.DLPack, where TypeParam = matx::matxHalf<__half>
[  FAILED  ] BasicTensorTestsAll/1.DLPack, where TypeParam = matx::matxHalf<__nv_bfloat16>
[  FAILED  ] BasicTensorTestsAll/2.DLPack, where TypeParam = bool
[  FAILED  ] BasicTensorTestsAll/3.DLPack, where TypeParam = unsigned int
[  FAILED  ] BasicTensorTestsAll/4.DLPack, where TypeParam = int
[  FAILED  ] BasicTensorTestsAll/5.DLPack, where TypeParam = unsigned long
[  FAILED  ] BasicTensorTestsAll/6.DLPack, where TypeParam = long
[  FAILED  ] BasicTensorTestsAll/7.DLPack, where TypeParam = float
[  FAILED  ] BasicTensorTestsAll/8.DLPack, where TypeParam = double
[  FAILED  ] BasicTensorTestsAll/9.DLPack, where TypeParam = cuda::std::__4::complex<float>
[  FAILED  ] BasicTensorTestsAll/10.DLPack, where TypeParam = cuda::std::__4::complex<double>
[  FAILED  ] BasicTensorTestsAll/11.DLPack, where TypeParam = matx::matxHalfComplex<matx::matxHalf<__half> >
[  FAILED  ] BasicTensorTestsAll/12.DLPack, where TypeParam = matx::matxHalfComplex<matx::matxHalf<__nv_bfloat16> >

The failure details (a pointer mismatch) can be seen in the log file: MatXConfBuildInstall.zip. Maybe you will have time to check these failures and let me know.

Last but not least, I carried that knowledge over to Bookworm. This is still WIP, because with Python 3.11 (instead of Python 3.9 on Bullseye) all tests that need cupy complain about not being able to load the module.

Thanks again and kind regards
Siegfried

Remark 2: Thanks! I will update our minimum version.

Remark 3: CPM is what we use internally for package management, so it will be used when you build the tests. It looks like that test is failing to release the tensor properly after it goes out of scope. For now you can safely ignore those tests, since you likely aren't using DLPack unless you're interfacing with Python. We will look into those failures and see if we can reproduce them.

Hi @siegLoesch I incorporated your CMake change for the files going to the binary directory here: adad5b3

I modified the cupy check to make it more general: it now disables the test rather than skipping its compilation, which is more granular. We still have not reproduced your DLPack issue. If we can't reproduce it soon, I'll request more info from you.

Nice solution - thanks a lot!
I compiled and ran the unit tests with and without cupy on Bullseye, which yields the same result: only the DLPack tests fail.

Regards
Siegfried

Hi @siegLoesch, I found a real fix for the iterator error and will push it shortly. You can remove those pragmas once it's in. Still looking into the dlpack bug.

Hi @siegLoesch, all issues have been reproduced and fixed. Please give main a try and let us know if you have more issues.

Works like a charm! Here is the result of my build script:
matx_test successfully passed all tests!
Great job - thank you!
I will proceed with your notebooks now and let you know!
Best regards
Siegfried