Is there SIMD support?

ntysdd opened this issue 2 years ago

ntysdd commented 2 years ago

For example, using SIMD intrinsics explicitly, or using long long to process 8 bytes together?

Ulya Trofimovich · Answer 1 · Fri Aug 12 2022 22:21:22 GMT+0800 (China Standard Time)

Not yet. I had some ideas about it, but it's not in the works yet.

ntysdd · Answer 2 · Mon Aug 15 2022 08:18:55 GMT+0800 (China Standard Time)

Thanks, that's good to know.

Ulya Trofimovich · Answer 3 · Sun Sep 18 2022 15:39:12 GMT+0800 (China Standard Time)

A note about alignment: since we cannot guarantee input data alignment, it would be impossible to use multi-byte reads (not without an explicit guarantee from the user that the input is aligned). However, it should still be possible to combine multiple bytes into a 2/4/8-byte value and do a single switch on this combined value rather than 2/4/8 consecutive switch statements. This optimization is not straightforward, as the underlying DFA may not have many linear segments that can be combined this way (due to the grammar, or due to the possible end-of-input after each byte). This needs some study and experiments.
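
To illustrate the idea of switching on a combined value, here is a minimal hand-written sketch in C. It is not re2c output: the names `KEY2`, `classify2`, and the `token` enum are hypothetical, and the code assumes that at least 2 bytes are readable at `p` (for example, because the buffer is padded).

```c
#include <stdint.h>

/* Hypothetical token kinds, for illustration only. */
enum token { TOK_OTHER, TOK_IF, TOK_DO };

/* Pack two characters into a 16-bit key, first byte in the low bits.
   Built from single-byte values, so it does not depend on alignment
   or on the machine's byte order. */
#define KEY2(a, b) ((uint16_t)((unsigned char)(a) | ((unsigned char)(b) << 8)))

/* Assumes p[0] and p[1] are both readable. */
enum token classify2(const char *p) {
    uint16_t key = KEY2(p[0], p[1]);  /* two 1-byte reads, one combined value */
    switch (key) {                    /* one switch instead of two nested ones */
    case KEY2('i', 'f'): return TOK_IF;
    case KEY2('d', 'o'): return TOK_DO;
    default:             return TOK_OTHER;
    }
}
```

This only works for a linear segment of the DFA where no match can end, and no end-of-input check is needed, between the two bytes.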

Perry E. Metzger · Answer 4 · Thu Sep 22 2022 22:37:14 GMT+0800 (China Standard Time)

I wonder if explicitly open coding the SIMD compiler intrinsics beats trying to emit constructs that the compiler can SIMD itself.

Ulya Trofimovich · Answer 5 · Fri Sep 23 2022 05:02:49 GMT+0800 (China Standard Time)

I don't think compilers can perform such optimizations, as they require a bit of high-level insight.

Consider this simple regular grammar:

bool lex(const char* YYCURSOR) {
    const char* YYMARKER;
/*!re2c
    re2c:yyfill:enable = 0;
    re2c:define:YYCTYPE = char;

    "abcd" { return true; }
    *      { return false; }
*/
}

It has just two rules: either the string "abcd" or anything else. It would be easy for a human to read four bytes (as four 1-byte reads with arbitrary alignment), then combine them into a single 4-byte value and compare it against the numeric value of "abcd". Of course, it would be necessary to ensure that there are at least 4 bytes in the buffer, so this optimization would only work with padding-based end-of-input checks.
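
A hand-written version of that idea might look like the sketch below. This is only an illustration, not something re2c generates: `load4` and `lex_combined` are hypothetical names, and the code assumes that 4 bytes are always readable at `cursor` (the padding requirement mentioned above).

```c
#include <stdbool.h>
#include <stdint.h>

/* Four 1-byte reads combined in a fixed order (first byte in the low bits),
   so the result does not depend on alignment or on the machine's byte order.
   Assumes p[0..3] are readable. */
static uint32_t load4(const char *p) {
    return (uint32_t)(unsigned char)p[0]
         | (uint32_t)(unsigned char)p[1] << 8
         | (uint32_t)(unsigned char)p[2] << 16
         | (uint32_t)(unsigned char)p[3] << 24;
}

/* Hypothetical replacement for lex(): one comparison instead of four branches. */
bool lex_combined(const char *cursor) {
    /* 'a' = 0x61, 'b' = 0x62, 'c' = 0x63, 'd' = 0x64, packed like load4 does. */
    const uint32_t abcd = 0x64636261u;
    return load4(cursor) == abcd;
}
```

Unlike the byte-by-byte version, this always reads all four bytes, even when the first byte already rules out a match, which is why the buffer has to be padded.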

Currently re2c generates the following "branchy" code (re2c 1.re -is -o 1.c, where -s is for compactness only and does not affect the reasoning):

bool lex(const char* YYCURSOR) {
    const char* YYMARKER;
{
	char yych;
	yych = *YYCURSOR;
	if (yych == 'a') goto yy2;
	++YYCURSOR;
yy1:
	{ return false; }
yy2:
	yych = *(YYMARKER = ++YYCURSOR);
	if (yych != 'b') goto yy1;
	yych = *++YYCURSOR;
	if (yych == 'c') goto yy4;
yy3:
	YYCURSOR = YYMARKER;
	goto yy1;
yy4:
	yych = *++YYCURSOR;
	if (yych != 'd') goto yy3;
	++YYCURSOR;
	{ return true; }
}
}

Which is compiled to very similar "branchy" assembly (g++ -O2 -c 1.c -o 1.o && objdump -d 1.o):

0000000000000000 <_Z3lexPKc>:
   0:	31 c0                	xor    %eax,%eax
   2:	80 3f 61             	cmpb   $0x61,(%rdi)
   5:	74 09                	je     10 <_Z3lexPKc+0x10>
   7:	c3                    	ret
   8:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
   f:	00
  10:	80 7f 01 62          	cmpb   $0x62,0x1(%rdi)
  14:	75 f1                	jne    7 <_Z3lexPKc+0x7>
  16:	80 7f 02 63          	cmpb   $0x63,0x2(%rdi)
  1a:	75 eb                	jne    7 <_Z3lexPKc+0x7>
  1c:	80 7f 03 64          	cmpb   $0x64,0x3(%rdi)
  20:	0f 94 c0             	sete   %al
  23:	c3                   	ret

Both GCC and Clang with -O2 generate almost identical code. And they cannot reorder the branches with the reads: this kind of optimization is too unsafe for a compiler to perform (at least in my limited understanding of C++ compilers).

Perry E. Metzger · Answer 6 · Wed Sep 28 2022 22:30:13 GMT+0800 (China Standard Time)

Precisely. This is why I suggest that using the compiler intrinsics is probably the correct path. clang, gcc, etc. support mostly the same set.

Ulya Trofimovich · Answer 7 · Fri Sep 30 2022 14:16:03 GMT+0800 (China Standard Time)

@pmetzger What intrinsics specifically do you mean? I don't see how an intrinsic can restructure the program and squash the four check-and-branch pieces in the example into one.

Perry E. Metzger · Answer 8 · Sat Oct 01 2022 08:56:34 GMT+0800 (China Standard Time)

You can play games like the one you're proposing with the use of intrinsics. They're gross and have limited portability, but the end user could specify whether they wanted the use of intrinsics or not.

Intrinsics also let you call SIMD instructions directly from generated C code. gcc and clang both support a wide variety of intrinsics. Here, for example, is some explanation of the vector extensions both support.

https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Vector-Extensions.html

https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
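
As a rough illustration of what those vector extensions look like in generated-style C, here is a minimal sketch that compares 16 bytes at once. The type names `u8x16` and `i8x16` and the function `equal16` are hypothetical, the code relies on the GCC/Clang `vector_size` attribute (so it is not portable C), and it assumes 16 readable bytes behind each pointer.

```c
#include <stdbool.h>
#include <string.h>

/* GCC/Clang vector extension types: 16 lanes of one byte each. */
typedef unsigned char u8x16 __attribute__((vector_size(16)));
typedef signed char   i8x16 __attribute__((vector_size(16)));

/* Returns true if the 16 bytes at p equal the 16 bytes at q.
   Assumes both ranges are fully readable (e.g. padded buffers). */
bool equal16(const char *p, const char *q) {
    u8x16 a, b;
    memcpy(&a, p, sizeof a);   /* memcpy avoids any alignment requirement */
    memcpy(&b, q, sizeof b);

    /* Lane-wise comparison: each lane becomes -1 (equal) or 0 (different). */
    i8x16 eq = (i8x16)(a == b);

    for (int i = 0; i < 16; ++i) {
        if (eq[i] == 0) return false;
    }
    return true;
}
```

Whether the compiler turns this into a single SIMD compare depends on the target and optimization flags, so it would need to be checked in the generated assembly.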