InBetweenNames / gentooLTO

A Gentoo Portage configuration for building with -O3, Graphite, and LTO optimizations

Integration of Clear Linux patches

InBetweenNames opened this issue · comments

Clear Linux maintains a number of performance-related patches for open source projects. There are quite a few: https://github.com/clearlinux-pkgs

It would be interesting to integrate these into GentooLTO somehow, either synced into the overlay or through a user install mechanism.

I didn't look thoroughly, but if I'm seeing it right, these are just patches in their git repo? It would be very easy to put the patches into /etc/portage/patches, is there some downside to this?
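
As a rough sketch of the user-patch route mentioned above: Portage picks up patches placed under /etc/portage/patches/CATEGORY/PACKAGE. The helper name `install_user_patch`, the placeholder patch file, and the use of sys-devel/gcc as the target are assumptions for illustration only. The example writes into a scratch directory rather than the real /etc/portage.

```shell
# Copy one patch into the Portage user-patch directory for a package.
# Usage: install_user_patch PATCH_FILE CATEGORY/PACKAGE [PATCH_ROOT]
# (hypothetical helper; PATCH_ROOT defaults to the real user-patch location)
install_user_patch() (
    patch_file=$1
    atom=$2
    patch_root=${3:-/etc/portage/patches}
    mkdir -p "${patch_root}/${atom}" &&
        cp "$patch_file" "${patch_root}/${atom}/"
)

# Dry run against a scratch directory with a placeholder file.
scratch=$(mktemp -d)
printf 'placeholder, not a real patch\n' > "${scratch}/0001-example.patch"
install_user_patch "${scratch}/0001-example.patch" sys-devel/gcc "${scratch}/patches"
ls "${scratch}/patches/sys-devel/gcc"
```

On a real system the third argument would be dropped and the patch file would come from a checkout of the relevant clearlinux-pkgs repository.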

It appears that way; however, I'm guessing they fine-tune compiler options as well. Also, I'm unsure if they use icc or gcc for their builds. I'm guessing it depends on the package. At the very least, we could include the source code patches, I think.

Interesting flag: -fno-semantic-interposition. I'll be adding this one to the defaults I think.

Other interesting things: they appear to enable -ffast-math on certain packages, or flags that -ffast-math enables at the very least. Also interesting: they force all functions to be aligned on 32 byte boundaries with -falign-functions=32. I wonder if they do that for AVX512 compatibility? It'd be interesting to see what the benefits are for overaligning functions, if any.

It appears that -falign-functions=32 has some kind of impact on autovectorization. By default on x86_64, it is set to 16.

I just tested an architecture that has AVX512 instructions:

> gcc -march=skylake-avx512 -flto -Ofast -Q --help=optimizer | grep falign-functions
  -falign-functions                     [disabled]
  -falign-functions=                    16

It seems that by default on x86_64, -falign-functions is always set to 16 (as long as -O2 or higher is specified).
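
To repeat this check across optimization levels, a small filter over the `gcc -Q --help=optimizer` output can help. This is only a sketch: the function name `falign_value` is made up here, and the sample text is the output quoted above rather than a fresh gcc run.

```shell
# Extract the numeric -falign-functions value from "gcc -Q --help=optimizer" output.
# (falign_value is a hypothetical helper name.)
falign_value() {
    awk '$1 == "-falign-functions=" { print $2 }'
}

# Sample taken from the output quoted above. On a real system, pipe gcc into it:
#   for level in -O1 -O2 -O3 -Ofast; do
#       gcc "$level" -Q --help=optimizer | falign_value
#   done
sample='  -falign-functions                     [disabled]
  -falign-functions=                    16'
printf '%s\n' "$sample" | falign_value
```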

Phoronix claims that they build using GCC/Clang (differs from package to package).

Makes sense - icc probably doesn't support a lot of the GNU extensions that are used out there in the wild. Not to mention, GCC is highly competitive with ICC when the right options are used. I'm thinking I may update the recommendations for -falign-functions, not adding it by default but mentioning it in make.conf.lto. The reason being, I want to support more than just Intel processors, or even just x86_64, and this feels very much like an Intel-specific thing.

I heard that Solus is also using some of their optimizations.

@InBetweenNames one thing that clear linux does that isn't really necessary for us is function multiversioning. I think the linker links different functions based on eg. AVX support. since everyone here is probably building with -march=native we can get smaller binaries than Clear can, may have better LTO opportunities, etc.

It would be interesting to see if we can get Michael to bench lto-overlay vs. Clear, esp. once we steal some of their fancy tricks.

regarding -falign-functions I'd expect this to be about aligned jumps/instruction cache line reads. maybe RIP relative addressing? but once you are executing in the function instruction alignment is going to be off no matter what you do thanks to variable length instructions. even if AVX cared about instruction alignment, I'm not sure this would help.

is it possible that there is a more compact way to load immediates if they are 32-aligned, so function call sites are smaller?

Agreed -- in fact we should do even better than function multi versioning since we're compiling our system exactly tailored for the system it's running on. This means more opportunities for LTO all around. Not to mention, this should be highly portable across many architectures.

If one could link their system using mainly static libraries, I bet the LTO benefits would be even more profound. You can link-optimize across static library boundaries, and you can't do that with shared objects. I don't believe this is possible as-is however, since Portage seems to really prefer shared objects, and configure scripts, etc, also prefer shared objects.

I was wondering the same about -falign-functions today, but there are other -falign-* flags that affect those other cases you mention. I hadn't considered immediate operands however. It might make a difference if there's some static storage for a function as well. I've been looking around all day for more uses of -falign-functions=32 and I've been having serious trouble. I found a slide deck:

http://hpac.rwth-aachen.de/teaching/sem-accg-16/slides/08.Schmitz-GGC_Autovec.pdf

I also found a StackOverflow question that indirectly touches on it:

https://stackoverflow.com/questions/19470873/why-does-gcc-generate-15-20-faster-code-if-i-optimize-for-size-instead-of-speed

If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiples of 16, the code is not sensitive to that either.

Without delving into the GCC internals, I can't find many resources that recommend this flag. I'll see if the Intel guys will shed some light on it.

More goodies: -fno-common

Line 481: https://github.com/clearlinux/autospec/blame/master/autospec/specfiles.py

In that commit, -fno-common is mentioned.
Detailed here: https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html -- seems beneficial when it would work.

I find it interesting they enable -fno-math-errno, -fno-trapping-math by default. It's not -ffast-math, but it's partway there.

> gcc -march=skylake-avx512 -flto -Ofast -Q --help=common | grep fcommon
  -fcommon                              [enabled]

Even with the most aggressive optimization package, this is on by default.

I notice that on packages that are optimized for size, they enable -ffunction-sections and -fdata-sections for dead code removal, along with a -Wl,--gc-sections. However, these are two flags I want to research more before enabling by default -- I'm unsure how these interact with LTO. I assumed that LTO kind of did dead code elimination on its own, since the entire program would be visible at link time (minus definitions in shared objects).

After researching a bit more, it looks like -Wl,--gc-sections is a weak form of LTO, and it is often compared to full LTO like GentooLTO uses. I'm not sure if there's a benefit to using both at the same time.

https://lwn.net/Articles/741494/

OK -- so locally, I have enabled -fno-common and -fno-semantic-interposition and have started building a few packages with them. I'll try them out for a few days before pushing them. I've also emailed the Clear Linux developers about -falign-functions=32. If it turns out to be beneficial for some systems, I will add it as a recommendation but I won't enable it by default in the overlay -- it will be opt-in behaviour.

OK -- I think I figured it out:

https://software.intel.com/en-us/forums/intel-c-compiler/topic/635646

For more info:

https://lkml.org/lkml/2015/5/19/1009

It looks like the historical reason for -falign-functions=16 is:

The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle

From Agner Fog's docs:

https://www.agner.org/optimize/microarchitecture.pdf

See "Instruction Fetch" sections for details.

However, consider that cache lines are usually 64 bytes long -- depending on your processor. From Ingo's post:

So based on those measurements, I think we should do the exact
opposite of my original patch that reduced alignment to 1 bytes, and
increase kernel function address alignment from 16 bytes to the
natural cache line size (64 bytes on modern CPUs).

As for why -falign-functions=32 was chosen? I have a feeling it's actually a compromise. See this reply by Linus: https://lkml.org/lkml/2015/5/19/1142

Is there some way to get gcc to take the size of the function into
account? Because aligning a 16-byte or 32-byte function on a 64-byte
alignment is just criminally nasty and wasteful.

So, for functions that are larger than the cache line size, aligning on a cache line boundary makes the most sense. For functions that are smaller than the cache line size, this isn't ideal as it wastes I$ space. Of course, when inlining is taken into account, which is much more likely since we are using system-wide LTO, this whole discussion becomes moot. However, this is still a problem for shared objects.
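
To put numbers on the waste Linus is complaining about, here is a minimal arithmetic sketch. It assumes the simplest possible model, where every function start is rounded up to a multiple of the alignment, and the helper name `footprint` is made up for illustration.

```shell
# Bytes of address space a function occupies when every function start
# is rounded up to a multiple of ALIGN.
# Usage: footprint SIZE ALIGN   (hypothetical helper)
footprint() (
    size=$1
    align=$2
    echo $(( (size + align - 1) / align * align ))
)

for size in 16 32 48 64 100; do
    used=$(footprint "$size" 64)
    echo "size=$size footprint=$used padding=$((used - size))"
done
```

Under this model a 16-byte function aligned to 64 bytes takes a full 64-byte slot, so 48 of those bytes are padding.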

Ideally, GCC/ld would be smarter about how it aligns functions.

Work has been done to this end: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240
It has been merged in trunk!

So, looking at -falign-functions once again:

Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance, -falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 aligns to the next 32-byte boundary only if this can be done by skipping 23 bytes or less.

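In other words, -falign-functions=24 aligns a function to a 32-byte boundary only when that takes at most 23 bytes of padding. A function that would otherwise start 1 to 8 bytes past a 32-byte boundary (needing 24 to 31 bytes of padding) is left where it is. Note that the limit applies to the padding, not to the size of the function.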
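
The skip rule can be worked through with a small model of the assembler directive `.p2align LOG,,MAX_SKIP`. This is a sketch under assumptions: the function name `p2align` is made up, and it assumes -falign-functions=24 is emitted as `.p2align 5,,23`, which follows from the documentation quoted above. It models only this one directive and ignores any target-specific secondary alignment.

```shell
# Model of ".p2align LOG,,MAX_SKIP": pad up to the next 2^LOG boundary,
# but only if that needs at most MAX_SKIP bytes of padding.
# Usage: p2align ADDR LOG MAX_SKIP   -> prints the resulting address
p2align() (
    addr=$1
    log=$2
    max_skip=$3
    boundary=$((1 << log))
    pad=$(( (boundary - addr % boundary) % boundary ))
    if [ "$pad" -le "$max_skip" ]; then
        addr=$((addr + pad))
    fi
    echo "$addr"
)

# Assumed encoding of -falign-functions=24: align to 32, skip at most 23 bytes.
for start in 0 1 8 9 31; do
    echo "start=$start -> $(p2align "$start" 5 23)"
done
```

In this model, starts at offsets 1 and 8 stay where they are, while offsets 9 and 31 are moved up to 32.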

And another goodie -flimit-function-alignment:

If this option is enabled, the compiler tries to avoid unnecessarily overaligning functions. It attempts to instruct the assembler to align by the amount specified by -falign-functions, but not to skip more bytes than the size of the function.

This flag is off by default!

Delving into the GCC source code, in gcc/config/i386/x86-64.h:

#define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP)                    \
  do {                                                                  \
    if ((LOG) != 0) {                                                   \
      if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG));  \
      else {                                                            \
        fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP));     \
        /* Make sure that we have at least 8 byte alignment if > 8 byte \
           alignment is preferred.  */                                  \
        if ((LOG) > 3                                                   \
            && (1 << (LOG)) > ((MAX_SKIP) + 1)                          \
            && (MAX_SKIP) >= 7)                                         \
          fputs ("\t.p2align 3\n", (FILE));                             \
      }                                                                 \
    }                                                                   \
  } while (0)

and the calling code in gcc/varasm.c:

...
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
      int align_log = align_functions_log;
#endif
      int max_skip = align_functions - 1;
      if (flag_limit_function_alignment && crtl->max_insn_address > 0
          && max_skip >= crtl->max_insn_address)
        max_skip = crtl->max_insn_address - 1;

#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, align_log, max_skip);
#else
      ASM_OUTPUT_ALIGN (asm_out_file, align_functions_log);
#endif
    }
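
To see how the macro and its caller interact, the two snippets above can be modelled with shell arithmetic. This is a sketch, not GCC's implementation: the helper names are made up, the function size stands in for crtl->max_insn_address, and N is assumed to be a power of two.

```shell
# Model of ".p2align LOG,,MAX_SKIP" (pad only if it costs <= MAX_SKIP bytes).
p2align() (
    addr=$1; log=$2; max_skip=$3
    boundary=$((1 << log))
    pad=$(( (boundary - addr % boundary) % boundary ))
    if [ "$pad" -le "$max_skip" ]; then addr=$((addr + pad)); fi
    echo "$addr"
)

# Model of the x86-64 ASM_OUTPUT_MAX_SKIP_ALIGN macro quoted above.
x86_align() (
    addr=$1; log=$2; max_skip=$3
    if [ "$log" -ne 0 ]; then
        if [ "$max_skip" -eq 0 ]; then
            # ".p2align LOG" without a limit: always align.
            addr=$(p2align "$addr" "$log" $(( (1 << log) - 1 )))
        else
            addr=$(p2align "$addr" "$log" "$max_skip")
            # Secondary 8-byte alignment when the primary one may be skipped.
            if [ "$log" -gt 3 ] && [ $((1 << log)) -gt $((max_skip + 1)) ] \
               && [ "$max_skip" -ge 7 ]; then
                addr=$(p2align "$addr" 3 7)
            fi
        fi
    fi
    echo "$addr"
)

# Model of the caller: N is -falign-functions, LIMIT=1 models
# -flimit-function-alignment, FN_SIZE stands in for crtl->max_insn_address.
# Usage: function_start ADDR N FN_SIZE LIMIT
function_start() (
    addr=$1; n=$2; fn_size=$3; limit=$4
    log=0
    while [ $((1 << log)) -lt "$n" ]; do log=$((log + 1)); done
    max_skip=$((n - 1))
    if [ "$limit" -eq 1 ] && [ "$fn_size" -gt 0 ] && [ "$max_skip" -ge "$fn_size" ]; then
        max_skip=$((fn_size - 1))
    fi
    x86_align "$addr" "$log" "$max_skip"
)

function_start 1 16 10 0   # 10-byte function, no limit
function_start 1 16 10 1   # same function with the limit flag
function_start 1 16 4 1    # 4-byte function with the limit flag
```

In this model, with -falign-functions=16 a 10-byte function that would start at offset 1 is moved to 16 without the limit flag and to 8 with it, while a 4-byte function stays at offset 1 because MAX_SKIP drops below 7.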

So in the worst case, we still get 8-byte function alignment for functions that are smaller than the -falign-functions value, provided they are at least 8 bytes long (the MAX_SKIP >= 7 check). So, with the default and -flimit-function-alignment, you get at most 16-byte alignment and, for such functions, at least 8-byte alignment. It would probably make more sense to align to the L1 cache line size by default, and to at least 16 bytes with -flimit-function-alignment. This is a pretty trivial change to make:

#define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP)                    \
  do {                                                                  \
    if ((LOG) != 0) {                                                   \
      if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG));  \
      else {                                                            \
      fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP));       \
        if ((1 << (LOG)) > ((MAX_SKIP) + 1))                            \
        {                                                               \
        /* Make sure that we have at least 16 byte alignment            \
           if > 16 byte alignment is preferred.  */                     \
          if ((LOG) > 4 && (MAX_SKIP) >= 15)                            \
            fputs ("\t.p2align 4\n", (FILE));                           \
        /* Make sure that we have at least 8 byte alignment if > 8 byte \
           alignment is preferred.  */                                  \
          else if ((LOG) > 3 && (MAX_SKIP) >= 7)                        \
            fputs ("\t.p2align 3\n", (FILE));                           \
        }                                                               \
      }                                                                 \
    }                                                                   \
  } while (0)

The above should guarantee the following, for a function that takes b bytes, with -falign-functions=n and -flimit-function-alignment:

  • If b >= n: for sure will be aligned to n
  • If n > 16 and 16 <= b < n: will be at least aligned to a 16 byte boundary
  • Otherwise, if n > 8 and 8 <= b < n: will be at least aligned to a 8 byte boundary

The check is done in this order to prevent wasting space.

So, it seems to me we should be using -falign-functions=${L1ICACHELINESIZE} -flimit-function-alignment.
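
A minimal sketch of how that value could be filled in, assuming a glibc system where `getconf LEVEL1_ICACHE_LINESIZE` reports the L1 instruction cache line size. The fallback to 64 and the variable names are assumptions for illustration; some systems report 0 or nothing at all.

```shell
# Ask the C library for the L1 instruction cache line size.
line_size=$(getconf LEVEL1_ICACHE_LINESIZE 2>/dev/null || true)

# Fall back to 64 bytes (an assumption) if the value is missing, zero,
# or not a plain number.
case "$line_size" in
    ''|0|*[!0-9]*) line_size=64 ;;
esac

align_flags="-falign-functions=${line_size} -flimit-function-alignment"
echo "$align_flags"
```

The resulting string could then be appended to CFLAGS and CXXFLAGS in make.conf.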

I will test out my GCC patch and if it works OK, I will submit it upstream.

Heh, of course the GCC devs beat me to the punch.

gcc-mirror/gcc@bc9f52f

It looks like they are reworking the -flimit-function-alignment stuff in the next GCC version, given the commit message.

Thanks! In GCC trunk, we have this nice thing:

  {
    /* N2[:M2] is not specified.  This arch has a default for N2.
       Before -falign-foo=N:M:N2:M2 was introduced, x86 had a tweak.
       -falign-functions=N with N > 8 was adding secondary alignment.
       -falign-functions=10 was emitting this before every function:
      .p2align 4,,9
      .p2align 3
       Now this behavior (and more) can be explicitly requested:
       -falign-functions=16:10:8
       Retain old behavior if N2 is missing: */

So, we may be able to say something like -falign-functions=64:48:16:8 which should:

  • Align to 64 bytes if it can be done by skipping 47 bytes or less (M is the maximum skip plus one, as in the .p2align 4,,9 example above)
  • Otherwise, align to 16 bytes if it can be done by skipping 7 bytes or less

Obviously these values would need to be tweaked. But it would give the desired result at least.
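
The two-level form can be tried out with a small model. This is a sketch under assumptions: the helper names are made up, N and N2 are passed as log2 values, and M is taken to be the maximum skip plus one, as the `.p2align 4,,9` example for -falign-functions=16:10:8 in the quoted comment suggests.

```shell
# Model of ".p2align LOG,,MAX_SKIP" (pad only if it costs <= MAX_SKIP bytes).
p2align() (
    addr=$1; log=$2; max_skip=$3
    boundary=$((1 << log))
    pad=$(( (boundary - addr % boundary) % boundary ))
    if [ "$pad" -le "$max_skip" ]; then addr=$((addr + pad)); fi
    echo "$addr"
)

# Model of -falign-functions=N:M:N2:M2 as two consecutive directives.
# Usage: two_level_align ADDR LOG_N M LOG_N2 M2
two_level_align() (
    addr=$1; log1=$2; m1=$3; log2=$4; m2=$5
    addr=$(p2align "$addr" "$log1" $((m1 - 1)))
    addr=$(p2align "$addr" "$log2" $((m2 - 1)))
    echo "$addr"
)

# -falign-functions=64:48:16:8  ->  log2(64)=6, M=48, log2(16)=4, M2=8
for start in 1 10 16 17 20; do
    echo "start=$start -> $(two_level_align "$start" 6 48 4 8)"
done
```

In this model a start at offset 10 falls back to the 16-byte boundary, offsets 17 and 20 are moved to 64, and offset 1 is left alone because both paddings exceed their limits.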

This doesn't appear to be documented anywhere however.

Okay, it is documented and I simply didn't look hard enough. It's hard to retain the old behaviour with the new method, since the secondary alignment will only be triggered if -flimit-function-alignment is not passed in. Sigh.

Got a response from Arjan van de Ven!

without going into too many cpu microarchitecture details... Intel cpus like hot code to start at a 32 byte boundary.

Very interesting. So, Ingo's findings confirm this to a degree, and suggest even stronger alignment requirements are beneficial. He says it best here: https://lkml.org/lkml/2015/5/21/443
I think, with -falign-functions=n and -flimit-function-alignment we get 90% of the way there, actually:

  • Functions greater than n bytes are aligned to an n byte boundary
  • Functions less than n bytes are tightly packed, unless they will cross a n byte boundary

For fun, here's an attempt to restore the functionality in my previous patch on GCC trunk:

  /* Handle a user-specified function alignment.
     Note that we still need to align to DECL_ALIGN, as above,
     because ASM_OUTPUT_MAX_SKIP_ALIGN might not do any alignment at all.  */
  if (! DECL_USER_ALIGN (decl)
      && align_functions.levels[0].log > align
      && optimize_function_for_speed_p (cfun))
    {
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
      int max_skip1 = align_functions.levels[0].maxskip;
      int max_skip2 = align_functions.levels[1].maxskip;
      if (flag_limit_function_alignment)
      {
        if (crtl->max_insn_address > 0
          && max_skip1 >= crtl->max_insn_address)
        max_skip1 = crtl->max_insn_address - 1;
        
        if (crtl->max_insn_address > 0
          && max_skip2 >= crtl->max_insn_address)
        max_skip2 = crtl->max_insn_address - 1;
      }
      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
                           align_functions.levels[0].log,
                           max_skip1);
      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
                           align_functions.levels[1].log,
                           max_skip2);
#else
      ASM_OUTPUT_ALIGN (asm_out_file, align_functions.levels[0].log);
#endif
    }

So, with m < n, -falign-functions=n:n:m:m -flimit-function-alignment, for a b byte function would:

  • If n <= b, will be at least aligned to an n byte boundary
  • If m <= b < n, will be at least aligned to a m byte boundary
  • If b < m, if the function would cross the boundary m, it will be aligned to m
  • Otherwise, will use target default function alignment (unknown what this is defined as in GCC, but I suspect for x86_64 it is either 8 or 16 -- if anyone knows please let me know). If this is 0, then it's tightly packed.

Examples for the above would be n = 64 or n = 32 and m = 16

Obviously such a scheme would need benchmarks to show it's worth doing over the default. It could potentially waste space, too, since a function with m <= b < n may align to an n boundary, instead of a potentially closer m boundary. It's too bad Ingo's scheme is too hard to implement in a quick patch, as I'd love to test his out.

Regardless of whether the default schemes or the one I posted above is used, based on what we have seen, n should be either 64 or 32 for Intel processors, and we may or may not want to tightly pack small functions with -flimit-function-alignment. We'd need benchmarks to show for sure what's worth enabling, but I think it's safe to go with Arjan van de Ven's choice of -falign-functions=32 for Intel processors and not tightly packing functions in the meantime. I will update README.md accordingly.

As I find this issue to be very interesting, I'd like to leave it up for discussion, especially in the hopes we get some benchmarks using combinations of these flags. My diff against GCC trunk for my own alignment scheme is below, in case anyone wants to try it on GCC trunk:

diff --git a/gcc/varasm.c b/gcc/varasm.c
index 545e13fef6a..6ed87298ec9 100644
--- a/gcc/varasm.c
+++ b/gcc/varasm.c
@@ -1809,19 +1809,24 @@ assemble_start_function (tree decl, const char *fnname)
       && optimize_function_for_speed_p (cfun))
     {
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-      int align_log = align_functions.levels[0].log;
-#endif
-      int max_skip = align_functions.levels[0].maxskip;
-      if (flag_limit_function_alignment && crtl->max_insn_address > 0
-	  && max_skip >= crtl->max_insn_address)
-	max_skip = crtl->max_insn_address - 1;
+      int max_skip1 = align_functions.levels[0].maxskip;
+      int max_skip2 = align_functions.levels[1].maxskip;
+      if (flag_limit_function_alignment)
+      {
+        if (crtl->max_insn_address > 0
+          && max_skip1 >= crtl->max_insn_address)
+        max_skip1 = crtl->max_insn_address - 1;
 
-#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, align_log, max_skip);
-      if (max_skip == align_functions.levels[0].maxskip)
-	ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
-				   align_functions.levels[1].log,
-				   align_functions.levels[1].maxskip);
+        if (crtl->max_insn_address > 0
+          && max_skip2 >= crtl->max_insn_address)
+        max_skip2 = crtl->max_insn_address - 1;
+      }
+      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
+                           align_functions.levels[0].log,
+                           max_skip1);
+      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
+                           align_functions.levels[1].log,
+                           max_skip2);
 #else
       ASM_OUTPUT_ALIGN (asm_out_file, align_functions.levels[0].log);
 #endif

In addition to the function alignment, there's also data-alignment which takes a cacheline option:
-malign-data=cacheline

I also build my system with:
-mtls-dialect=gnu2

The Clear Linux "fast-math" options can be very beneficial to auto-vectorization: they give the compiler many more opportunities than the default strict IEEE 754 compliance.

@sjnewbury:
-mtls-dialect is a nice one!
-malign-data=cacheline too.

I got the impression that -malign-data, if changed, may not be compatible with code compiled with GCC 4.8 or older. Do you know if this one affects binary compatibility? If so, this would mostly affect closed source software, and possibly users of -bin packages. I see a number of recommendations for high performance code to use malign-data=cacheline, so this may be a non-issue at this time.

I've been hemming and hawing about the strict IEEE compliance myself, and I've decided we can support it as an opt-in enhancement. I know some users of this overlay are using it for scientific computations, and I don't want to interfere with that automatically.

One more thing: is malign-data documented in detail anywhere? I can't find much in the official GCC docs. I see references to Clear Linux using -malign-data=abi at one point. If necessary, I'll go look through the GCC code again.

I use the code from this bug report to benchmark -falign-functions: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58863
I just compile it with my cflags and then time ./align 32 32
When I see a difference, I compile php 7.2 and run phpbench to see if I'm getting any benefit.
For a 7900X and gcc 8.2 (with Clear Linux patches applied), -falign-functions=8 seems to be the fastest.

Also you should look into --param inline-unit-growth=5 --param max-unroll-times=2
http://hubicka.blogspot.com/2014/04/linktime-optimization-in-gcc-2-firefox.html
https://www.phoronix.com/forums/forum/software/programming-compilers/47966-intel-broadwell-gcc-4-9-vs-llvm-clang-3-5-compiler-benchmarks/page2
These two are still providing some benefit.
-funroll-loops should be set with --param max-unroll-times=2 to get the improvement

Well, I don't want to use -funroll-loops and friends because those override the compiler's judgement. Even when you add in --param max-unroll-times=2 --param inline-unit-growth=5, you're still telling the compiler to do something unconditionally. Certain packages may benefit, but the idea is we should be letting the compiler decide what to do. Hence why -falign-functions=32 is documented as being an optional thing for Intel chips, based on my conversation with a Clear Linux developer.

Now of course, we may want to enable these flags on a per-package basis where improvement has been proven via benchmarks (as with your php example). Otherwise, it should be the defaults with possibly a few optional tweaks on a per package basis.
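
One way to do the per-package opt-in is Portage's package.env mechanism. The sketch below writes the two pieces into a scratch directory by default. The file name unroll.conf and the `PORTAGE_CONFIG` variable are assumptions for illustration; the flags and the php example come from the discussion above.

```shell
# PORTAGE_CONFIG defaults to a scratch directory here; point it at
# /etc/portage to apply the change for real.
PORTAGE_CONFIG="${PORTAGE_CONFIG:-$(mktemp -d)}"
mkdir -p "${PORTAGE_CONFIG}/env"

# Extra flags layered on top of the global CFLAGS/CXXFLAGS.
cat > "${PORTAGE_CONFIG}/env/unroll.conf" <<'EOF'
CFLAGS="${CFLAGS} -funroll-loops --param max-unroll-times=2"
CXXFLAGS="${CXXFLAGS} -funroll-loops --param max-unroll-times=2"
EOF

# Apply the file only to a package where a benchmark showed a gain.
echo "dev-lang/php unroll.conf" >> "${PORTAGE_CONFIG}/package.env"

cat "${PORTAGE_CONFIG}/package.env"
```

Every other package keeps the global defaults, so the compiler's own judgement still applies everywhere else.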

Is -falign-functions=32 helpful for all intel CPUs, even old ones?

Anything since Sandy Bridge, I'm pretty sure.

One more thing I found:
https://github.com/clearlinux-pkgs/gcc

Clear Linux is actually using a patched GCC. The patches are small, but they affect codegen -- namely, autovectorization. I'm not sure how many of these patches are applicable to us, but I'm noting this here in case anyone wants to take a look.

The patches work. I'm using them, but without the Gentoo patches, so it's just vanilla 8.2 with the Clear Linux patches on top. They are also using glibc patches, which you could look into.

Excellent -- thank you for the suggestion. I'm considering either adding these to the LTO patch framework or forking the ebuilds with a clearlinux-gcc and clearlinux-glibc option. I'll give it some thought first.

Anything since Sandy Bridge, I'm pretty sure.

Maybe that should be specified in the README. For example, I have 2 Intel machines older than that using this overlay.

@funghetto , I couldn't get much useful info out of the Intel team (EDIT: nevermind, I was just sent a reference in the Intel Optimization Manual). We know the Pentium Pro and derived architectures work best with 16-byte aligned jump targets and functions. However, since Core 2, things have changed internally in how instructions get decoded. We also know that Clear Linux uses -falign-functions=32 unconditionally, and according to their public documentation, the minimum requirement is a 2nd generation Core processor, which is Sandy Bridge. So, I figure Sandy is the minimum architecture where this flag has a benefit.

I'll see about putting this in the README.md at some point. It seems like this may be a good candidate for the Wiki eventually, since this is an evolving discussion.

https://en.wikichip.org/w/images/8/8c/intel-ref-248966-038.pdf See section B.5.7.3 here -- it looks like the 32 number comes from the instruction u-op cache. So, it should be used whenever there's a u-op cache.

Interesting, on wikipedia it says that even:
AMD implemented a μop cache in their Zen (microarchitecture).
Do you think it is relevant?

I looked into it:

Code that doesn't fit into the micro-operations cache run from the traditional code cache at a maximum rate of four instructions per clock. However, the rate of fetching code from the code cache is not 32 bytes per clock, as some documents seem to indicate, but mostly around 16 bytes per clock. The maximum I have seen is 17.3 bytes per clock. This is a likely bottleneck since most instructions in vector code are more than four bytes long.

From Agner's blog: https://www.agner.org/optimize/blog/read.php?i=839

So, for Zen, it seems that the default is still optimal. For Intel, I'm guessing that the 32-byte aligned function helps with the u-op cache stuff somehow. If I had to guess, the cache's lookup hardware prefers code sections that are aligned to 32-byte boundaries.
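
A rough heuristic for acting on this from a script: treat "Intel vendor plus the avx CPU flag" as a stand-in for "Sandy Bridge or newer". This proxy is my assumption, not something from Intel's documentation, and it will miss Intel parts that ship without AVX. The function name `suggest_align_flag` is made up, and it reads cpuinfo-style text on stdin.

```shell
# Prints "-falign-functions=32" when the cpuinfo text on stdin looks like an
# Intel CPU with AVX, and prints nothing otherwise.
# (Hypothetical helper; AVX is only a proxy for "Sandy Bridge or newer".)
suggest_align_flag() {
    awk -F': *' '
        /^vendor_id/ && $2 == "GenuineIntel" { intel = 1 }
        /^flags/ && $2 ~ /(^| )avx( |$)/     { avx = 1 }
        END { if (intel && avx) print "-falign-functions=32" }
    '
}

# Typical use on Linux:
#   suggest_align_flag < /proc/cpuinfo
printf 'vendor_id\t: GenuineIntel\nflags\t\t: fpu sse2 avx avx2\n' | suggest_align_flag
```

For AMD Zen the function prints nothing, which matches the conclusion above that the default is still the better choice there.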

OK -- more research into malign-data from GCC itself:

  switch (ix86_align_data_type)
    {
    case ix86_align_data_type_abi: opt = false; break;
    case ix86_align_data_type_compat: max_align = BITS_PER_WORD; break;
    case ix86_align_data_type_cacheline: break;
    }

This is the key code in ix86_data_alignment that dictates how things get aligned.

So, from what I understand, GCC by default actually aligns certain objects more strictly than they need to be. This is still conforming to the psABI, but potentially wastes cache space. The idea, I think, is to retain compatibility with older GCCs. So, -malign-data=compat is this default. I suspect the use case here, given the use of MAX_OFILE_ALIGNMENT, is for re-linking code against object files generated with an older GCC. An important use case to be sure, but unlikely to be one that matters much for GentooLTO. In practice, objects (aggregates or arrays) larger than 32 bytes are aligned to a 32-byte boundary (half cache line width).

They also introduced -malign-data=abi, which tends to minimize alignment for objects while retaining compatibility with the psABI. Reference. It's the "bare minimum" alignment option.

The last option introduced is -malign-data=cacheline, which seems to optimally align for the target system's cache line size. This means that objects that are >= max_align in size get aligned to max_align and objects that are >= max_align_compat and < max_align are aligned to max_align_compat. In practice, this means objects that are >= 64 bytes in size are aligned to a 64 byte boundary, and objects that are >= 32 and < 64 bytes in size are aligned to a 32 byte boundary. It does this by going through the maximum compatibility alignment check before going through the maximum alignment check. Objects less than the maximum compatible alignment size are subject to additional alignment requirements (which is not done when using abi -- this is presumably for optimization).

The compat option also tries to optimally align smaller objects as well, but it will force aggregate types to align to the compatible alignment (in practice 32 bytes) instead of the cache line size (64 bytes).

So, essentially, abi reduces cache usage by packing more data in a smaller space, compat uses more cache, trying to align optimally while retaining compatibility -- but perhaps not aligning strictly enough to justify the space wasted, and cacheline uses even more cache, but the idea is to prevent variables from spilling across cache lines. For example, if you have a 64-byte array, compat may align it to be half on one cache line and half on the next (since it looks like the largest the maximum compatible alignment can be is 32 bytes -- it may be even smaller depending on the MAX_OFILE_ALIGNMENT macro), whereas cacheline would prefer to give that array its own cache line so that all of its elements sit in a single cache line. abi would simply align it to the minimum, which is 16 bytes according to the x86_64 ABI, which of course runs into the same problem.

For users using -Os globally, -malign-data=abi seems to make the most sense. For -O3 (and beyond) users, cacheline seems to make the most sense. So, I'm testing -malign-data=cacheline locally. I'll recommend it as a default if things work okay for a week or two. I doubt it will cause problems.
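
The three modes as described above can be summarised in a small model. This is a simplification built only from this thread's reading of ix86_data_alignment: it assumes a 64-byte cache line and a 32-byte compat limit, covers only aggregates and arrays, and ignores the extra handling of smaller objects. The helper name `data_align` is made up.

```shell
# Usage: data_align MODE SIZE NATURAL
#   MODE     abi | compat | cacheline
#   SIZE     object size in bytes (aggregate or array)
#   NATURAL  alignment the ABI already requires, in bytes
# Prints the modelled alignment in bytes.
data_align() (
    mode=$1; size=$2; align=$3
    case "$mode" in
        abi)
            ;;  # keep the ABI minimum
        compat)
            if [ "$size" -ge 32 ] && [ "$align" -lt 32 ]; then align=32; fi
            ;;
        cacheline)
            if [ "$size" -ge 32 ] && [ "$align" -lt 32 ]; then align=32; fi
            if [ "$size" -ge 64 ] && [ "$align" -lt 64 ]; then align=64; fi
            ;;
        *)
            echo "unknown mode: $mode" >&2
            exit 1
            ;;
    esac
    echo "$align"
)

# The 64-byte array from the example above, with a 16-byte ABI minimum.
for mode in abi compat cacheline; do
    echo "$mode: $(data_align "$mode" 64 16)"
done
```

In this model the 64-byte array gets 16, 32, and 64 bytes of alignment respectively, which is the trade-off between cache footprint and straddling described above.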

One thing I never expected GentooLTO to do was make me learn the GCC internals!

A note about -mtls-dialect=gnu2 on x86:

https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg170586.html

It seems there aren't any testcases in GCC for this flag, which is why it's not default. So, I'll add this to the recommended list but I'll leave it out of the defaults for now. Interested users can add this flag manually. I'll probably add another section to the README to discuss these kinds of flags, or perhaps create a wiki page.

Indeed, there are a few packages that don't play nice with this flag, including glibc. However, that doesn't mean we can't include exclusions for it anyway, as an optional feature.

I just completed a server build with that flag, but without flag-o-matic overrides (VM host, will try next on test database servers and nodejs app servers). No problems so far. I'll report back here if I run into trouble.

CFLAGS="-O3 -pipe -march=haswell -fstack-check=specific -fno-plt -mindirect-branch=thunk -mfunction-return=thunk -flto=30 -fno-semantic-interposition -malign-data=cacheline -mtls-dialect=gnu2 -falign-functions=32 -fuse-linker-plugin -fipa-pta -fgraphite-identity -floop-nest-optimize"
CXXFLAGS="-O3 -pipe -march=haswell -fstack-check=specific -fno-plt -mindirect-branch=thunk -mfunction-return=thunk -flto=30 -fno-semantic-interposition -malign-data=cacheline -mtls-dialect=gnu2 -falign-functions=32 -fuse-linker-plugin -fipa-pta -fgraphite-identity -floop-nest-optimize"
LDFLAGS="-O3 -pipe -march=haswell -fstack-check=specific -fno-plt -mindirect-branch=thunk -mfunction-return=thunk -flto=30 -fno-semantic-interposition -malign-data=cacheline -mtls-dialect=gnu2 -falign-functions=32 -fuse-linker-plugin -fipa-pta -fgraphite-identity -floop-nest-optimize -Wl,-O1 -Wl,-z,now -Wl,-z,relro -Wl,--as-needed -Wl,--hash-style=gnu"
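
Since the same flag list appears three times, a make.conf along these lines may be easier to maintain. This is only a restructuring sketch of the configuration above: the `COMMON_FLAGS` variable name follows the usual Gentoo convention but is not part of the original post, and the flags themselves are copied unchanged.

```shell
# make.conf-style fragment: the flag list is written once and reused.
COMMON_FLAGS="-O3 -pipe -march=haswell -fstack-check=specific -fno-plt -mindirect-branch=thunk -mfunction-return=thunk -flto=30 -fno-semantic-interposition -malign-data=cacheline -mtls-dialect=gnu2 -falign-functions=32 -fuse-linker-plugin -fipa-pta -fgraphite-identity -floop-nest-optimize"

CFLAGS="${COMMON_FLAGS}"
CXXFLAGS="${COMMON_FLAGS}"

# Linker-only options are appended after the shared compiler flags.
LDFLAGS="${COMMON_FLAGS} -Wl,-O1 -Wl,-z,now -Wl,-z,relro -Wl,--as-needed -Wl,--hash-style=gnu"
```

Dropping a flag such as -mtls-dialect=gnu2 then takes one edit instead of three.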

So far I have stumbled upon three "TLS relocation against invalid instruction" errors with -mtls-dialect=gnu2 enabled.
btrfs-progs-4.19.build.log
emerge-info.txt
eudev-3.2.7.build.log
python-3.6.6-r1.build.log

Interesting. I didn't get any error using the GNU ld (BFD), but I'm getting an error too on eudev if I switch my linker to gold. Not reproducing the issue on python however.

Ah, of course switching the linker didn't occur to me. btrfs-progs and eudev indeed do build for me with -fuse-ld=bfd.

As for python, turns out I get the error only when combining gold and pgo.

This is all very good to know. Definitely will hold off on -mtls-dialect=gnu2 by default for a bit.

FWIW -mtls-dialect=gnu2 also requires a glibc patch to make it work with a patched prelink... Yeah, I'm the last user of prelink! ;-)

(Currently building gentooLTO+x32+auto-prelink+autopar+jemalloc)

Just a heads up, I'm considering creating some ebuilds with the Clear Linux patches in my overlay. I tested their kernel patches today to great success (reduced boot time by about 25%) so I will create an ebuild for at least the patched kernel and maybe some other packages, depending on how much success I have with them.

@aw1cks I see they change the perf_bias to default to performance instead of normal. How much does this account for? This setting makes a big difference on my Ivy Bridge laptop, but I have it set to toggle on power events to maximise battery life.

@sjnewbury I don't use it on a laptop. I haven't got round to rebuilding my laptop with Gentoo yet; on my desktop I didn't apply all the patches, only the ones relating to boot speed and performance, without much regard for the patches claiming to reduce wakelocks. As far as I can tell, in the Clear Linux use case the difference in power is more than offset by the other tweaks they have made. How much of this extra battery life comes from their kernel, I couldn't say, seeing as they apply custom patches to many userland applications and even to gcc itself. You would have to benchmark it to know for sure.

If you want to test the kernel yourself, you can use any 4.19 series kernel and put the patches into /etc/portage/patches/sys-kernel/${KERNEL_SOURCE_PKG_NAME}-${KERNEL_VERSION}/. The patches are available here (they also keep their boot parameters in a text file in this repository, which is perhaps worth trying).

As a note, I do hope to eventually create ebuilds that include their userland patches for various programs behind a USE flag, at which point we can make a fair comparison. I also wonder whether systemd vs OpenRC could play a role here (in my experience, I have had higher power consumption with init systems other than systemd; maybe it's something I'm doing wrong, I couldn't tell you). If you do find anything out, please let me know, as I'm quite interested in this myself for my laptop.
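
The user-patch step described above could look roughly like this. It is only a sketch: the function name is made up, and the package name-version in the usage line (gentoo-sources-4.19.0) and the patch source directory are placeholders for whatever you actually have installed and checked out.

```shell
# Sketch: copy *.patch files into Portage's user-patch directory for a
# kernel sources package.
# Arguments: patch source dir, package name-version, optional patch root
# (defaults to /etc/portage/patches).
install_kernel_patches() {
    src="$1"
    pkg="$2"
    dest="${3:-/etc/portage/patches}/sys-kernel/${pkg}"
    mkdir -p "${dest}" || return 1
    # User patches are applied in lexical order, so keep the file names.
    cp "${src}"/*.patch "${dest}/"
}
```

For example, `install_kernel_patches ~/clear-linux-kernel gentoo-sources-4.19.0` followed by re-emerging the sources package would apply the patches during the prepare phase, provided the ebuild supports user patches.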

@fenrus75 thanks for pointing me in the right direction. How can I build this package without Clear Linux userspace tools? I can't find any binaries in the repository, nor in any of the releases, and the Makefile references a file not included in the repository.

Great, thanks. However I'm having an issue with autoconf.

[alex@xps13 clr-power-tweaks-174]$ autoconf
configure.ac:6: error: possibly undefined macro: AM_INIT_AUTOMAKE
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:7: error: possibly undefined macro: AM_SILENT_RULES

Some missing include maybe?
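
A note on the error above: the AM_* macros come from Automake and are normally pulled in by aclocal, so running plain autoconf on a fresh checkout typically fails this way. A sketch of the usual regeneration step, assuming autoconf and automake (and libtool, if the project uses it) are installed; the wrapper function is only illustrative:

```shell
# Sketch: regenerate the build system of an Autotools project.
regenerate_configure() {
    dir="${1:-.}"
    if [ ! -f "${dir}/configure.ac" ]; then
        echo "no configure.ac in ${dir}" >&2
        return 1
    fi
    # autoreconf runs aclocal, autoconf, automake and friends in the right
    # order; --install adds missing auxiliary files such as install-sh.
    (cd "${dir}" && autoreconf --force --install)
}
```

Running `regenerate_configure clr-power-tweaks-174` and then the usual `./configure && make` would be the next thing to try.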

Leaving a link to this new Phoronix post here, which benchmarks some of the performance gains from Clear Linux, to give some extra context to this issue and some idea of what we can hope for from following up: https://www.phoronix.com/scan.php?page=article&item=clear-faster-blas&num=1

I'm prelinking Gentoo too, with my new install: Gentoo no-multilib, LTO, no PIE, no SSP.

@InBetweenNames Did you by any chance look into -fdata-sections -ffunction-sections -Wl,--gc-sections further? I enabled those globally a couple of months ago, and while I'm not sure there's a clear benefit (I couldn't directly compare binary sizes with my previous build, as I changed some other things as well), my system does not appear to be broken.

Even if it's not worth enabling globally, perhaps packages that can't be built with full LTO could benefit from something like /"${FLTO}"/"${GCSECTIONS}"?
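
To make the substitution idea concrete, here is a bash sketch (ebuild-style pattern substitution, so it needs bash rather than plain sh). The values of FLTO, GCSECTIONS and CFLAGS are hypothetical stand-ins; the overlay's real variables may hold different flags.

```shell
#!/usr/bin/env bash
# Hypothetical values, for illustration only.
FLTO="-flto=30"
GCSECTIONS="-fdata-sections -ffunction-sections -Wl,--gc-sections"
CFLAGS="-O3 -pipe -flto=30 -fno-semantic-interposition"

# Bash pattern substitution: replace the first occurrence of the LTO flag
# with the section-garbage-collection flags, leaving other flags untouched.
NOLTO_CFLAGS=${CFLAGS/"${FLTO}"/"${GCSECTIONS}"}
echo "${NOLTO_CFLAGS}"
```

Quoting the pattern keeps characters in the flag from being treated as glob syntax, which matters once the LTO variable holds several flags.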

So, let's make it straight: is -falign-functions=32 beneficial on AMD CPUs? Or only Intel?

I'm late to the party but I was having fun with your awesome LTO patches and various flags today, testing them with sysbench.

On my Intel i7-7700K, -falign-functions=32 degrades sysbench cpu run results (total number of events) from around 15.5k to barely 13.9-14k. For comparison, I also tried -falign-functions=8, which gave around 14.3k, so again a heavy drop, but to a lesser extent than with 32. I made triple sure I was testing and interpreting things correctly, re-emerging sysbench (and only sysbench) after every change and running it several times while keeping everything in the background as quiet as possible. The flags I used were the current gentooLTO defaults as of today with only -march=native added, so implicitly -O3 and all LTO/Graphite optimizations.

Now I know this is one very specific, maybe even a bit stupid benchmark to test those flags with, but it shows that it's definitely not universally true that newer Intel CPUs should use 32 globally. Maybe there are benchmarks or other apps where it's beneficial, I don't doubt that, but there is at least one place (and from what I've read, more than one) where it heavily degrades performance, so much that the result drops all the way down to that of -O2 (without -march), which clocks in at around 13.9-14k as well.

Just my 3 cents, maybe it'll help somebody, maybe it won't. I suggest running benchmarks to verify whether the flag is helping or not. Personally, I dug this up because after applying the LTO flags the benchmark dropped by approx. 10% compared to just -O2 -march=native, and I was looking for the cause; it turns out it was -falign-functions=32. There is a chance that this benchmark is flawed and is the exception rather than the rule, but I'd be very doubtful about that. Once I get some time and motivation, I might test other benchmarks to compare the results.
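
For anyone who wants to repeat this kind of comparison, the repeated runs can be scripted. A rough sketch, assuming the sysbench 1.0 command line (sysbench cpu run) and its "total number of events" report line; the function names are made up:

```shell
# Sketch: average "total number of events" over several sysbench CPU runs.
extract_events() {
    # Pull the number out of a line like "total number of events:   15500".
    awk -F: '/total number of events/ { gsub(/[ \t]/, "", $2); print $2 }'
}

mean() {
    # Arithmetic mean of one number per input line.
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

bench_cpu() {
    runs="${1:-5}"
    i=0
    while [ "$i" -lt "$runs" ]; do
        sysbench cpu run | extract_events
        i=$((i + 1))
    done | mean
}
```

Re-emerge sysbench with each flag set, run `bench_cpu 5` on an otherwise idle machine, and compare the averages; the spread between individual runs is worth a look too before reading much into a small difference.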

@JustArchi Very interesting observation. I was contemplating today whether -falign-functions=32 is worth it for my i7-4790K.

It may just be a fluke in the testing patterns of sysbench, but even with all the research @InBetweenNames has done, it seems that this optimization really depends on a combination of workload and CPU-internal optimizations and alignment/cache assumptions.

Is -falign-functions=32 actually profitable? Some research led me to: