ebiggers / libdeflate

Heavily optimized library for DEFLATE/zlib/gzip compression and decompression


Unaligned access on PowerPC: forgotten macros in `common_defs.h`

barracuda156 opened this issue

The code here defines `UNALIGNED_ACCESS_IS_FAST` for ppc64 on Linux/BSD:

libdeflate/common_defs.h, lines 402 to 419 at 275aa51:

```c
#if (defined(__GNUC__) || defined(__clang__)) && \
	(defined(ARCH_X86_64) || defined(ARCH_X86_32) || \
	 defined(__ARM_FEATURE_UNALIGNED) || defined(__powerpc64__) || \
	 defined(__riscv_misaligned_fast) || \
	 /*
	  * For all compilation purposes, WebAssembly behaves like any other CPU
	  * instruction set. Even though WebAssembly engine might be running on
	  * top of different actual CPU architectures, the WebAssembly spec
	  * itself permits unaligned access and it will be fast on most of those
	  * platforms, and simulated at the engine level on others, so it's
	  * worth treating it as a CPU architecture with fast unaligned access.
	  */ defined(__wasm__))
#  define UNALIGNED_ACCESS_IS_FAST 1
#elif defined(_MSC_VER)
#  define UNALIGNED_ACCESS_IS_FAST 1
#else
#  define UNALIGNED_ACCESS_IS_FAST 0
#endif
```

But it does not define it for 32-bit ppc (on any OS) or for ppc64 on Darwin.

Is this intended, or a mere omission? If this should apply to big-endian PowerPC, then at least `__POWERPC__` (Darwin, both 32- and 64-bit) needs to be added there, and probably `__powerpc__` (32-bit Linux?) as well.
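For concreteness, here is a rough sketch of what the amended condition could look like (hypothetical, not a final patch; whether `__POWERPC__` and `__powerpc__` are the right macro set is exactly the open question here):

```c
/* Hypothetical sketch: extend the existing check with the PowerPC
 * macros discussed above. The macro choices are assumptions. */
#if (defined(__GNUC__) || defined(__clang__)) && \
	(defined(ARCH_X86_64) || defined(ARCH_X86_32) || \
	 defined(__ARM_FEATURE_UNALIGNED) || \
	 defined(__powerpc64__) || /* 64-bit Linux/BSD */ \
	 defined(__powerpc__)   || /* 32-bit Linux/BSD (assumed) */ \
	 defined(__POWERPC__)   || /* Darwin 32/64-bit, BeOS */ \
	 defined(__riscv_misaligned_fast) || defined(__wasm__))
#  define UNALIGNED_ACCESS_IS_FAST 1
#elif defined(_MSC_VER)
#  define UNALIGNED_ACCESS_IS_FAST 1
#else
#  define UNALIGNED_ACCESS_IS_FAST 0
#endif
```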

I mainly focus on x86 and arm; the PowerPC support isn't something I've really worked on yet. From searching around it sounds like all PowerPC processors support unaligned memory accesses. Do you agree? I'm not familiar with the differences between the various PowerPC macros. I'm surprised that `__powerpc64__` is OS-specific, as normally for these architecture macros it's the compiler that matters, not the operating system. In any case, if you think the code should be checking for something else, please go ahead and open a pull request with your proposed change, with an explanation.

@ebiggers Thank you for responding!

> I mainly focus on x86 and arm; the PowerPC support isn't something I've really worked on yet. From searching around it sounds like all PowerPC processors support unaligned memory accesses. Do you agree?

AFAIK, yes, they do, including G4/G5. (I am not a hardware expert, but I can look for documentation on this.)

> I'm not familiar with the differences between the various PowerPC macros. I'm surprised that `__powerpc64__` is OS-specific, as normally for these architecture macros it's the compiler that matters, not the operating system.

PowerPC has OS-specific macros. I am not sure exactly which ones are used for Linux, BSD, and AIX (though I believe AIX at least has its own unique macro), but I know for sure what Darwin uses (and, well, BeOS). Oddly, they may also be compiler-specific.

Adding `__POWERPC__` (in caps) will cover all of Darwin (both the 32- and 64-bit ABIs), as well as BeOS. I have seen some versions of Clang define it on BSD too, though I cannot confirm that; as far as general detection goes, no extra condition is needed there. However, for 32-bit Linux/BSD, I think another macro is needed (`__powerpc__`)?

AIX apparently uses `__PPC`: https://reviews.llvm.org/D108917

Apple code used a whole bunch of macros for PowerPC detection: https://opensource.apple.com/source/WTF/WTF-7601.1.46.42/wtf/Platform.h.auto.html
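Drawing on the macro spellings in that Apple header and the LLVM review above, one way to tame the zoo would be to first collapse everything into two local macros (just a sketch; the `ARCH_PPC*` names are mine, and the per-OS/per-compiler coverage still needs verification):

```c
/* Sketch: normalize the various compiler/OS-specific PowerPC macros.
 * The ARCH_PPC/ARCH_PPC64 names are illustrative; the macro list is
 * collected from Apple's Platform.h and the LLVM review above. */
#if defined(__powerpc64__) || defined(__ppc64__) || defined(__PPC64__)
#  define ARCH_PPC64 1
#endif
#if defined(ARCH_PPC64) || defined(__powerpc__) || defined(__ppc__) || \
    defined(__PPC__) || defined(__PPC) || defined(__POWERPC__) || \
    defined(_ARCH_PPC)
#  define ARCH_PPC 1
#endif
```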

> In any case, if you think the code should be checking for something else, please go ahead and open a pull request with your proposed change, with an explanation.

Thank you, I can do that.

P.S. If you add optimized code for PowerPC at some point, we will need to detect support for the specific ISA extensions there. In particular, more recent POWER CPUs have VSX support, while the G4 and G5 have AltiVec only, and some early POWER CPUs and the G3 have neither. Basic info can be checked here: https://en.wikipedia.org/wiki/Power_ISA (the specifics are in IBM's specification documents, which are publicly available).
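For compile-time detection, GCC and Clang already expose feature macros for these, so a first pass could look like the following (a sketch; runtime dispatch, e.g. via `getauxval(AT_HWCAP)` on Linux, would be a separate question):

```c
/* Sketch: compile-time SIMD level on PowerPC. __ALTIVEC__ and __VSX__
 * are defined by GCC/Clang when -maltivec / -mvsx (or an -mcpu that
 * implies them) are in effect. */
#if defined(__VSX__)
#  define PPC_SIMD_LEVEL 2	/* POWER7 and later: VSX */
#elif defined(__ALTIVEC__)
#  define PPC_SIMD_LEVEL 1	/* G4/G5: AltiVec only */
#else
#  define PPC_SIMD_LEVEL 0	/* G3, early POWER: scalar only */
#endif
```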

By the way, how can I test whether enabling unaligned access for PowerPC actually works here, or at least does not break anything or introduce unwanted effects (assuming the build itself succeeds)?

> By the way, how can I test whether enabling unaligned access for PowerPC actually works here, or at least does not break anything or introduce unwanted effects (assuming the build itself succeeds)?

You could run the benchmark program to compare compression and decompression performance before and after.

If you want, you can run the tests too, using `ctest`. But the code does not rely on `UNALIGNED_ACCESS_IS_FAST` being correct for correctness anyway; it only affects performance.
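To illustrate why (a generic sketch, not necessarily libdeflate's exact code): the macro typically just selects between two equally correct ways of reading a word from a possibly unaligned pointer.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: read a 32-bit little-endian value from a possibly
 * unaligned pointer. Both paths return the same result, so a wrong
 * UNALIGNED_ACCESS_IS_FAST costs speed, never correctness.
 * (Endianness handling is simplified here; __BIG_ENDIAN__ is the
 * Darwin-style macro.) */
static inline uint32_t get_le32(const uint8_t *p)
{
#if UNALIGNED_ACCESS_IS_FAST && !defined(__BIG_ENDIAN__)
	uint32_t v;

	memcpy(&v, p, sizeof(v)); /* compiles to one unaligned load */
	return v;
#else
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
#endif
}
```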

Sure, will do it now.

UPD: How long is the benchmark supposed to take, by the way? It seems to do nothing: the CPU is not loaded, and the process simply hangs, like this:

```
svacchanda@Sergeys-MacBook-Air ~ % /opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build/programs/benchmark -0
Benchmarking DEFLATE compression:
	Compression level: 0
	Chunk size: 1048576
	Compression engine: libdeflate
	Decompression engine: libdeflate
Processing standard input...
```

Here it is with level 0; initially it was with the default level 6, but the behavior is the same. (This is on an M1, where I expected it to be blazing fast, but it is the same on PowerPC too.)

FWIW, here are tests on PowerPC. There is no meaningful difference, and the time between runs of the same code can differ more than between runs of different code (random variation obscures any deterministic difference).

1. Without the patch for unaligned access:

```
Running tests...
/opt/local/bin/ctest --force-new-ctest-process 
Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ...................   Passed    0.12 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ...............   Passed    0.01 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............   Passed    0.01 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams .............   Passed    0.01 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow ..........   Passed    0.04 sec
    Start 6: test_overread
6/8 Test #6: test_overread ....................   Passed    0.01 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression ..........   Passed    0.01 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes ..............   Passed    0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) =   0.24 sec
```
2. With the patch:

```
Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ...................   Passed    0.13 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ...............   Passed    0.01 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............   Passed    0.01 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams .............   Passed    0.01 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow ..........   Passed    0.05 sec
    Start 6: test_overread
6/8 Test #6: test_overread ....................   Passed    0.01 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression ..........   Passed    0.01 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes ..............   Passed    0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) =   0.25 sec
```

But another run with the same patch:

```
Running tests...
/opt/local/bin/ctest --force-new-ctest-process 
Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ...................   Passed    0.11 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ...............   Passed    0.01 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............   Passed    0.01 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams .............   Passed    0.01 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow ..........   Passed    0.05 sec
    Start 6: test_overread
6/8 Test #6: test_overread ....................   Passed    0.01 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression ..........   Passed    0.01 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes ..............   Passed    0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) =   0.23 sec
```

`benchmark` does not seem to run for me (not just on PowerPC, but also on aarch64), or I am doing something wrong with it.

Somewhat ironically, the tests run slower, and considerably so, on the M1:

```
--->  Testing libdeflate
Executing:  cd "/opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build" && /usr/bin/make test 
Running tests...
/opt/local/bin/ctest --force-new-ctest-process 
Test project /opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ...................   Passed    0.46 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ...............   Passed    0.22 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............   Passed    0.22 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams .............   Passed    0.22 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow ..........   Passed    0.23 sec
    Start 6: test_overread
6/8 Test #6: test_overread ....................   Passed    0.22 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression ..........   Passed    0.23 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes ..............   Passed    0.23 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) =   2.02 sec
```

Or do they just skip test cases on PowerPC, so that the result is misleading in this sense?

@ebiggers Going by the documentation:

1. There is unaligned access support for 32-bit data. According to IBM:

   > The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small.
   >
   > On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.

   https://developer.ibm.com/articles/pa-dalign/

2. There is no unaligned access support for vector instructions: https://math-atlas.sourceforge.net/devel/assembly/vector_simd_pem.ppc.2005AUG23.pdf

> FWIW, here are tests on PowerPC. There is no meaningful difference, and the time between runs of the same code can differ more than between runs of different code (random variation obscures any deterministic difference).

The `ctest` suite (which is separate from the benchmark program) only tests for correctness, not performance.

> `benchmark` does not seem to run for me (not just on PowerPC, but also on aarch64), or I am doing something wrong with it.

Perhaps you invoked it with no arguments, causing it to read its input file from standard input?

@ebiggers Is there documentation on how to test it with the benchmark program?

No. It's pretty straightforward, though: just build the benchmark program (`-DLIBDEFLATE_BUILD_TESTS` needs to be passed to cmake) and pass it some large-ish files. For what you're trying to test, the default options should be fine, but you can also run `./build/programs/benchmark --help` to see the available options, e.g. for setting the compression level.
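For example (a hypothetical invocation; the file name is just a placeholder):

```
# Build with the test programs enabled, then feed benchmark a file.
cmake -B build -DLIBDEFLATE_BUILD_TESTS=1
cmake --build build
./build/programs/benchmark big-file.tar      # default compression level
./build/programs/benchmark -1 big-file.tar   # compression level 1
```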