mesonbuild / meson

The Meson Build System

Home Page: http://mesonbuild.com


Speed-up compiler checks by parallelizing where possible

nirbheek opened this issue · comments

Some compiler checks can be parallelized, for example:

  • get_supported_arguments()
  • get_supported_link_arguments()
  • cross_compute_int(): there are two branches at each guess, so we can run both at the same time, which will improve the worst-case time.

We should also add a map-like API that batch-processes several has_function() and has_header() checks at once and sets values in a configuration_data() object, usually of the form:

cdata = configuration_data()
if cc.has_function(func_name, prefix: includes)
  cdata.set(func_name.to_upper().underscorify(), 1)
endif

Normally this would require lambda support (or similar), but since the function in this case is almost always the same, we can hard-code it in Meson. It would make a good default even if we want to extend the API later.
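For illustration, such a hard-coded batch helper might look roughly like this inside Meson (a hypothetical Python sketch; set_supported_functions and the thread-safety of the compiler object's has_function() are assumptions, not existing API):

import re
from concurrent.futures import ThreadPoolExecutor

def set_supported_functions(compiler, cdata, func_names, prefix=''):
    # Hypothetical helper: run the has_function() checks in parallel
    # and hard-code the usual "define HAVE_FOO on success" behaviour.
    with ThreadPoolExecutor() as pool:
        results = pool.map(
            lambda name: (name, compiler.has_function(name, prefix=prefix)),
            func_names)
    for name, supported in results:
        if supported:
            # Equivalent of func_name.to_upper().underscorify() in the DSL.
            cdata.set(re.sub(r'[^A-Za-z0-9]', '_', name).upper(), 1)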

One simple way to parallelize has_header() and has_function() is to add get_supported_headers() and get_supported_functions(). That would definitely speed up that kind of foreach loop: https://gitlab.gnome.org/GNOME/glib/blob/master/meson.build#L260.

I have copied the list of 47 headers from @xclaesse's link into an empty project, hacked in parallelism using concurrent.futures, and timed meson setup. The results I get (laptop with 2 cores, minimum of multiple runs) are:

  • Same thread: 0m0.904s.
  • ThreadPoolExecutor(1): 0m0.932s.
  • ThreadPoolExecutor(2): 0m0.687s.
  • ThreadPoolExecutor(3) and up: same as 2 threads.
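For reference, the hack was along these lines (a minimal sketch, not the actual patch; run_header_check stands in for Meson's internal has_header machinery):

from concurrent.futures import ThreadPoolExecutor

def check_headers_parallel(run_header_check, headers, max_workers=2):
    # Submit every header check at once and collect the results in order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(headers, pool.map(run_header_check, headers)))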

Could you run that same project but with cc.compiles('#include <@0@>'.format(header)) instead of cc.has_header(), please? That could give some useful data for issue #2246.

  • Same thread: 0m1.401s.
  • ThreadPoolExecutor(1): 0m1.403s.
  • ThreadPoolExecutor(2): 0m1.008s.
  • ThreadPoolExecutor(3): 0m0.926s.

The problem here is that the result of most compiler checks is used immediately afterwards, so there are never many pending checks in flight at the same time. Parallelizing the potential new functions described by xclaesse seems easier (though you probably can't tell for real without measuring).

For reference, what I measured is an implementation of get_supported_headers.

Regarding whether parallelization is worth it, I am not sure. Since the checks are performed in an imperative fashion, there tend not to be many batches to perform in parallel.

What is a good example of a meson project where the configure phase takes a long time? Maybe glib? I'd be interested to get a sense of how long meson setup of such a project takes.

gst-build with --wrap-mode=forcefallback perhaps, but glib is also good on its own probably.

hacked in parallelism using concurrent.futures
...
Regarding whether parallelization is worth it, I am not sure. Since the checks are performed in an imperative fashion, there tend not to be many batches to perform in parallel.

concurrent.futures + ThreadPoolExecutor is going to have a huge overhead, so you will see some disappointing numbers.

The last time I investigated this, await + async subprocess seemed like it would yield much better outcomes because it's single-threaded in the interpreter, and allows us to spawn several processes at once and aggregate their results before returning from get_supported_headers().

I would recommend making an artificial test case that runs has_header() on 100 unique headers from /usr/include sequentially, and in parallel.
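A minimal sketch of such a test with asyncio subprocesses (assumes a cc on PATH and Python 3.7+; the command line is illustrative, not what Meson runs):

import asyncio

async def check_header(header, sem):
    # Preprocess a one-liner; exit code 0 means the header exists.
    async with sem:
        proc = await asyncio.create_subprocess_exec(
            'cc', '-E', '-x', 'c', '-', '-o', '/dev/null',
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.DEVNULL,
            stderr=asyncio.subprocess.DEVNULL)
        await proc.communicate(('#include <%s>\n' % header).encode())
        return header, proc.returncode == 0

async def check_headers(headers, jobs=5):
    # jobs ~ num_cpus + 1; the semaphore caps concurrent compilers.
    sem = asyncio.Semaphore(jobs)
    return dict(await asyncio.gather(
        *(check_header(h, sem) for h in headers)))

# asyncio.run(check_headers(['stdio.h', 'stdlib.h', 'no_such.h']))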

There is overhead. Perhaps not significant. Result of this gist:

Popen(1): 2.9786059856414795
Popen(2): 2.530482292175293
Popen(3): 2.346256732940674
Thred(1): 4.479013442993164
Thred(2): 3.3195695877075195
Thred(3): 2.823971748352051

Might be significantly larger on slow systems like a Raspberry Pi. I think async/await will also look more elegant when implemented, but that remains to be seen. :)

Result of this gist:

Actually, looking at the gist again, you don't use async await, you use a synchronous wait, so the Popen in the next loop won't be called till the previous ones finish. That's not representative of what I was suggesting because we lose some parallelism there. Not sure how much.

You are right of course. I've hacked the gist to do it correctly (I think). The results:

Popen(1): 2.926682710647583
Popen(2): 2.033327579498291
Popen(3): 2.030749797821045
Thred(1): 4.3277623653411865
Thred(2): 3.157244920730591
Thred(3): 2.8426740169525146

I was actually thinking of the async version of subprocess which returns a coroutine, and then you can await on N of them, where N = num_cpus + 1.

It does something very similar to what you implemented, except that the Python overhead should be lower and you don't need to manually manage things.

But I think your numbers already show that this method should be faster. What do you think?

I've added an asyncio version to the gist.

The asyncio version seems to interfere with the others, so I split each of the versions and run them separately. Maybe it's an fd leak or some SIGCHLD problem.

Also the results differ between python3.6 and python master so I include results for both.

Python 3.6:

Popen(1): 2.4133994579315186
Popen(2): 1.6318774223327637
Popen(3): 1.592148780822754
Thred(1): 4.405722618103027
Thred(2): 3.226512908935547
Thred(3): 2.80383038520813
Async(1): 4.152508497238159
Async(2): 4.115453243255615
Async(3): 3.9369168281555176

Python master (05f1c8902c78dce66aed067444e2b973221bae2b):

Popen(1): 2.4290401935577393
Popen(2): 1.6083247661590576
Popen(3): 1.5807161331176758
Thred(1): 3.6656012535095215
Thred(2): 2.5956883430480957
Thred(3): 2.36556339263916
Async(1): 3.67629075050354
Async(2): 3.5747973918914795
Async(3): 3.4514098167419434

Note: I did make sure that concurrency works in the asyncio version (by replacing /bin/true with /bin/sleep 0.1).

These are very interesting results, thanks for taking the time to do all that :)

I guess Popen.wait() is our best option!
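That is, launch every check first and only then collect the results, roughly like this (a sketch; a real implementation would want to cap the number of concurrent processes):

import subprocess

def run_checks(commands):
    # Start all check processes up front; the kernel runs them in
    # parallel while Python merely collects the exit codes.
    procs = [subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
             for cmd in commands]
    return [p.wait() == 0 for p in procs]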

Scanning through the meson.build for mesa, I see a pile of checks of the form

if cc.compiles('int foo(void) __attribute__((__noreturn__));',
               name : '__attribute__((__noreturn__))')
  pre_args += '-DHAVE_FUNC_ATTRIBUTE_NORETURN'
endif

None of the cc.compiles() calls are dependent on each other; they all have to execute. If meson could do some sort of lazy evaluation thing where the effects of the if are delayed, allowing a bunch of the checks to run in parallel, that would seem to do the trick.

The problem there is that the result of the check is needed immediately after it is started. The interpreter cannot proceed until the test is finished, thus the tests cannot be interleaved.

We could add compiler-esque reordering of code, but that's way too much overengineering. It's simpler to add new API to do the same job better.

It's simpler to add new API to do the same job better.

Unfortunately, for large projects like mesa that need to be able to build on old distros, requiring new syntax for this sort of optimization means that we either need to bump our base meson version (not that easy to do) or we have to wait, possibly several years, before we can use it.

The problem there is that the result of the check is needed immediately after it is started. The interpreter cannot proceed until the test is finished, thus the tests cannot be interleaved.

Yes, that's a problem that needs solving.

We could add compiler-esque reordering of code, but that's way too much overengineering.

I don't think that compiling it into an IR and re-ordering instructions is really needed. I gave this a bit of thought and I think there's a far simpler solution that still allows for an in-order interpreter. The idea I had was as follows:

  1. Choose some subset of things that can be handled lazily: string append, list append, and set add would make for a good start
  2. Make the compiler check functions return some sort of token (I'll call it a LazyConditional for ease of terminology) instead of an actual boolean. The moment token.__bool__() is called, it stalls for the result and gives you the actual bool.
  3. Add some sort of LazyConditionalExpression class which takes a LazyConditional and represents any one of the operations from step 1. If any method other than a few specific to LazyConditionalExpression is called, it stalls, evaluates, and then behaves exactly like the thing it evaluated to. (Yay, Python and duck typing let us do this pretty easily.)
  4. When you see an if whose condition is a LazyExpression, you speculatively enter the if. If you ever encounter anything inside the if which cannot be lazily evaluated, you stall for the actual expression and bail out if it's false. If you encounter something which can be lazily evaluated, you create a LazyConditionalExpression in place of the actual expression and continue on.

With the above scheme, most of mesa's configuration as well as the header example from glib would just automatically be parallelized without any change to the meson syntax. Without knowing anything about the meson code-base, it also doesn't seem like it'd be all that much code. Just a thought, take it or leave it. 😄
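To make step 2 concrete, a LazyConditional could be little more than a wrapper around a future (a sketch assuming a background executor; the names come from the proposal above, not from Meson's code):

from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor()

class LazyConditional:
    # Boolean-like token for a compiler check running in the background.
    def __init__(self, check_fn, *args):
        self._future = _pool.submit(check_fn, *args)

    def __bool__(self):
        # Stall for the real result only when someone actually needs it.
        return bool(self._future.result())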

I was wondering if we could have threads that jump straight to any compile/link/run checks, regardless of any conditionals, and, if all arguments are literals, execute them to populate the cache we already have. Then when the real interpreter arrives at that place, the result will already be in the cache.

It could do some simple variable handling; it does not have to be perfect. In the worst case it does useless compilations and the real check gets a cache miss.

@jekstrand's proposal sounds remarkably like asyncio.Future (or coroutines, if you prefer async/await).

I think some of mesa's slowness could be solved via me finishing #2393, and then maybe we could add caching for them, since they're purely compiler dependent.

We benchmarked asyncio/coroutines above and found that the overhead is far too high. subprocess.Popen(), which runs processes asynchronously, had the lowest overhead.

Coroutines by their nature work better the more of them you use. Setting off 4 coroutines and then waiting on them is going to be slow; that's what thread pools are for.

The most expensive things I see in mesa are:

  1. checking for builtin functions
  2. checking for gcc attributes

I think we can solve some of those with helper functions in meson, which could be cached. The same build of GCC will have the same set of builtin functions and __attribute__ support every time.

I see that cc.get_supported_arguments() checks whether the given arguments are supported one by one. But it seems like gcc and clang both emit warnings for all unsupported arguments given.

So running clang/gcc twice (or thrice?) in total is enough to find all supported arguments, and the errors/warnings can be parsed to get the unsupported ones.

Eg;

gcc hello.c -Wcast-align -Wdeclaration-after-statement -Werror=address\
-Werror=array-bounds -Werror=empty-body -Werror=parenthesis -Wformat-nonliteral -mwindows

prints

gcc: error: unrecognized command line option ‘-mwindows’; did you mean ‘-m3dnow’?

Now, running gcc again, after removing the unsupported argument:

gcc hello.c -Wcast-align -Wdeclaration-after-statement -Werror=address\
-Werror=array-bounds -Werror=empty-body -Werror=parenthesis -Wformat-nonliteral

This gives:

cc1: error: -Werror=parenthesis: no option -Wparenthesis
cc1: warning: -Wformat-nonliteral ignored without -Wformat [-Wformat-nonliteral]

That's it. No matter how many arguments are to be tested, 2-3 invocations are enough to find the [un]supported arguments.

I don't know if there is a reason for not using this; maybe there is something I don't know. But anyway, this seems to work for clang and gcc, at least for my cases.
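A rough sketch of that probing loop (the diagnostic wording for gcc/clang is an assumption and locale-dependent; this is not how Meson currently does it):

import re
import subprocess

UNRECOGNIZED = re.compile(
    r"unrecognized command[- ]line option [‘']([^’']+)[’']")

def find_supported_args(compiler, source, args):
    # Repeat until an invocation reports no new unrecognized options;
    # as described above, this converges in 2-3 runs in practice.
    # (A second pattern would be needed for cc1-style "no option" errors.)
    args = list(args)
    while True:
        proc = subprocess.run(
            [compiler, source, '-c', '-o', '/dev/null'] + args,
            capture_output=True, text=True)
        bad = set(UNRECOGNIZED.findall(proc.stderr))
        if not bad:
            return args
        args = [a for a in args if a not in bad]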

Interesting that I found this thread.

We use meson now in many projects. EFL now has a meson build. On a very fast desktop it takes 9.7 seconds. That's after the patches submitted by Marcel that speed up the end phase (a result of him doing the hard yards of getting EFL a meson build). Meson still spends an eternity at the end thinking about stuff, though.

Parallelization would really be helpful. Almost all of our checks could be parallelized, as they just check if some lib or dependency or header exists and then set a variable to true/false (or append to a string).

The same meson configure stage above on a Rpi3 takes a LOT longer: 224.4 sec. I also have a high-core-count ARM workstation (Thunder X2, 256 cores) which actually struggles to keep up with the i7-7700k that does the above 9.7 seconds, because it gets hung in single-threaded parts of meson for far, far, far longer. The workstation catches up to the i7 quickly when it gets to compiling, but is held back by meson being single-threaded.

So we'd absolutely love some parallelization, even if it is new features we have to use and thus require a new meson version. Perhaps something as simple as some tagged blocks that you can wait for later on, just conceptually:

cpu_mmx = false
bob = async
  if compiler.check_header('immintrin.h') == true
    cpu_mmx = true
  endif
endasync
...
bob.wait()
if cpu_mmx == true
  config_h.set10('BUILD_MMX', true)
endif

The more async blocks we create, the more meson can queue in parallel in a work queue + thread/pipe+child exe pool, and then just the waits cause stalls.

Just an idea. Perhaps there are other ways. But just saying that this kind of thing would be massively useful and considerably improve build times. A cached build of EFL on the Rpi3 is spending ~22% of its time just in the meson configure stage right now (out of the whole configure, build and install). That could easily be halved if not cut down to 1/3 the time I think, so that would cut down total build time by 15% ... and that wouldn't be bad. On the ARM workstation I imagine the win would be even more dramatic, possibly cutting down 20% of build time.

Whatever approach ends up taken, changing the Meson language to expose asyncness or something similar to the end user will not be accepted. The syntax must be as declarative as possible.

As an alternative approach we could have something like a batch lookup function:

results = cc.has_headers('foo.h', 'bar.h', 'baz.h', ...)
if results['foo.h']
   do_something()
endif
# etc

+1 for a batch API that returns a dict. One easy case is parallelizing get_supported_flags().

Could be get_supported_headers() maybe.

I would say that a wait blah is about as declarative as if and for. It's saying that whatever is inside the wait is dependent on blah, thus indicating a dependency. Explicitly creating N threads where each is procedural would not be declarative.

Arguably this kind of thing could (and should) be done in Meson itself by building a dependency tree or such in the parser/interpreter instead of processing things on a line-by-line basis, rather than adding something to the language syntax?

If meson did that then we'd be super happy, as it'd be zero changes to our build files... but I got the gist that it was going to involve far too much work.

@tp-m I agree, but that sounds extremely difficult to do, no?

@xclaesse Perhaps, but I don't think it's intractable. You don't need a full compiler IR with code re-ordering; you just need a good async framework which kicks off the task immediately but doesn't wait for it until the result is actually needed. As long as such a framework is capable of handling the common cases of looking for a dozen headers or checking a bunch of flags, it'd already be a significant improvement.

@jekstrand sadly that would be too easy. Take for example this code in glib:

foreach h : headers
  if cc.has_header(h)
    define = 'HAVE_' + h.underscorify().to_upper()
    glib_conf.set(define, 1)
    glib_conf_prefix = glib_conf_prefix + '#define @0@ 1\n'.format(define)
  endif
endforeach

It won't be able to parallelize that code, because it must have the result for each iteration.

IMHO, by far the easiest fix is to rewrite that kind of code as

foreach h : cc.get_supported_headers(headers)
    define = 'HAVE_' + h.underscorify().to_upper()
    glib_conf.set(define, 1)
    glib_conf_prefix = glib_conf_prefix + '#define @0@ 1\n'.format(define)
endforeach

If the glib_conf set could handle a lazy conditional add (see description above) and only resolve the condition later when the actual set is needed, there wouldn't be a problem. (See my comment on Sep 3.) Admittedly, it's probably more complicated than I made it sound but I don't think it's intractable.

Yeah, so it quickly becomes much more complicated. Don't get me wrong, I agree that would be the ideal solution, but since it requires much more work, I fear it won't happen. I think a few batch APIs are easier, and could solve most of the performance issues. The syntax in my example above is also nicer.

A low hanging fruit would be to parallelize get_supported_args(), which already exists as a batch API.
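Concretely, since all candidate arguments are known up front, the per-argument probes could be started together and collected afterwards, reusing the Popen pattern benchmarked above (a sketch; compile_cmd_for is a hypothetical helper that builds the probe command for one argument):

import subprocess

def get_supported_arguments(args, compile_cmd_for):
    # Start one probe compile per candidate argument, then keep the
    # arguments whose probe exited cleanly.
    procs = [(arg, subprocess.Popen(compile_cmd_for(arg),
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL))
             for arg in args]
    return [arg for arg, p in procs if p.wait() == 0]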

It wouldn't be able to parallelise that loop, but it might be able to parallelise other things from after the loop already in that case.

Edit: actually, it would be able to parallelise the checks here I think, just not the variable updates.

How about:

bob = async
  something async...
endasync
...
wait bob
  do something that relies on bob
  bob2 = async
    something else...
  endasync
endwait

wait bob2
  do something that relies on results of bob2
endwait

It's pretty declarative and clear what depends on what. It expresses the dependency that logic block B has on the async work in logic block A. The dependency tree can be resolved and put on work queues. It allows any arbitrary number of things to be done in parallel. Long term it would end up being less work than the long tail of trying to make batch APIs for more and more things.

I'm really nervous about adding async semantics to the meson build files, particularly because at that point we're going to allow users to write race conditions in their code. One of the best things about meson, IMHO, is that it forces you to write correct code due to its simplicity. Beyond that, writing any kind of async code is hard; humans are really, really bad at doing work asynchronously. Personally, I'd much rather scope out and consider implementing async internally than rely on users doing it right. This fits with meson's philosophy of "do it right once in core meson."

I'm also concerned about the long-term maintenance burden of having both a serial and a batch interface for everything (as we would end up needing one for everything, since some project would need it).

While writing and maintaining internally asynchronous calls might be more complicated than batch APIs, it would have several advantages:

  • It would be an implementation detail
  • It would make every meson build system faster
  • It wouldn't make the API more complicated

I think those 3 alone are reasons to consider what it would take to use some kind of async internally.

The problem with trying to do the async completely hidden is the one mentioned above: the default, natural way of writing Meson build definitions is to first check and immediately use the result. In order to write fast checks you need to "know" that is not the case and instead store the intermediate results and only use any of them after everything has been submitted. A batch API is nicer in that it is immediately readable, and might even be less code than an incremental implementation of the same in some cases.

A low hanging fruit would be to parallelize get_supported_args(), which already exists as a batch API.

Yes. Any volunteers? 😁

The problem with trying to do the async completely hidden is the one mentioned above: the default, natural way of writing Meson build definitions is to first check and immediately use the result.

The LazyConditional approach I suggested on this thread back in September would, I believe, solve this problem. It's entirely possible I'm wrong and things are far more complicated than I think and it's intractable in reality. However, I have yet to see a single comment that actually tells me why I'm wrong.

However, I have yet to see a single comment that actually tells me why I'm wrong.

Let me quote myself from the previous comment:

the default, natural way of writing Meson build definitions is to first check and immediately use the result

That means that there is at most one background check running and the interpreter stalls almost immediately for its result since it is used almost immediately after submission. We would want to have many checks ongoing at the same time to get good parallelism.

That being said, if we did transparent async sort of stuff, something like that is how it would be implemented.

Add usual caveat here about making performance assumptions without having personally measured it.

I mean, the most common case I can think of for checks is:

if cc.has_header('foo')
  add_project_arguments('-DHAVE_FOO_H', language : ['c'])
endif

or

if cc.has_header('foo')
  conf.set('HAVE_FOO_H', 1)
endif

You could easily parallelize either of these using a Promise/Future interface; each caller just has to know that they're getting a Promise and has to wait.
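For instance, the configuration object could accept futures and resolve them only when the file is finally written out (a minimal sketch; LazyConfigurationData is a hypothetical name, not Meson API):

from concurrent.futures import Future

class LazyConfigurationData:
    def __init__(self):
        self._values = {}

    def set(self, name, value):
        # value may be a plain value or a Future from a pending check.
        self._values[name] = value

    def resolve(self):
        # Called once, when the config header is actually generated.
        return {name: (v.result() if isinstance(v, Future) else v)
                for name, v in self._values.items()}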

Since I'm convinced this will work, and no one else seems to be, I'll hack something up so we can evaluate.

How would the second example benefit from async? You can't do anything in the if statement until the result of the check is actually there. That might even be slower than not doing async, since the synchronous version doesn't have to do any setup for async work.

I mean if you do this, then the async thing would work and improve things:

c1 = cc.has_header('c1.h')
c2 = cc.has_header('c2.h')
c3 = cc.has_header('c3.h')
c4 = cc.has_header('c4.h')

if c1
  # set something
endif
if c2
  # set something
endif
if c3
  # set something
endif
if c4
  # set something
endif

But this one wouldn't:

if cc.has_header('c1.h')
  # set something
endif
if cc.has_header('c2.h')
  # set something
endif
if cc.has_header('c3.h')
  # set something
endif
if cc.has_header('c4.h')
  # set something
endif

the default, natural way of writing Meson build definitions is to first check and immediately use the result

That means that there is at most one background check running and the interpreter stalls almost immediately for its result since it is used almost immediately after submission. We would want to have many checks ongoing at the same time to get good parallelism.

Which is why I said

  4. When you see an if whose condition is a LazyExpression, you speculatively enter the if. If you ever encounter anything inside the if which cannot be lazily evaluated, you stall for the actual expression and bail out if it's false. If you encounter something which can be lazily evaluated, you create a LazyConditionalExpression in place of the actual expression and continue on.

A single level of speculative execution should be sufficient to gobble up 95% of the loops that people are likely to write.

I have a commitment coming up shortly, so I'll be offline for a bit, but I'm hacking something up currently.

That massively complicates the implementation of an interpreter.

It's entirely possible that the interpreter changes would be too deep. As someone who's never worked on the meson interpreter, I'm not really in a position to know what the changes would look like. I was mostly trying to suggest one potential solution (in theory) for how it could be done without changing meson syntax.

The problem comes from programs like these, which we should be able to handle:

if cc.has_header('foo.h')
  has = 1
else
  has = 0
endif

#do something with has

When you get to the statement following the if block you need to either:

  • track all possible variables that could have changed in any block, see if any of them are used in the current statement and stall for the results
  • unconditionally wait for all pending async operations to finish and get their results

The former is hideously complicated to implement reliably, while the latter means that you cannot launch more than one async operation per if block (and you usually need the result of the check inside the block anyway).

Right, I wasn't really considering trying to accelerate if statements with a non-trivial else. You would either have to speculatively execute both branches or have an IR and do transformations. Since you don't want to add an IR (I don't blame you), we can throw that one out. In the example you gave above, any speculative execution scheme is likely going to stall the moment it uses has, unless the interpreter goes to extreme heroics with conditional values. That said, how common is that case? I genuinely don't know. Most of the examples that have been given, and most of what I've seen in mesa, only have the if.

On the one hand, there's the pragmatic "accelerate what we can" view, which says that if 95% of if statements are one-level with no else then we've accelerated 95% of if statements. On the other hand, no one likes a performance cliff, and someone is going to get grumpy if changing the way they write their meson a small amount makes giant differences in whether or not it can be parallelized.

For #5418 I made a demo using asyncio-subprocess, actually running the compiler.
https://github.com/scivision/asyncio-subprocess-examples

I haven't yet reread the excellent discussion above to make the benchmark more realistic, but in contrast to the gist, this test actually runs the compiler.

I've updated the example to compile 15 tiny test files -- like what get_supported_arguments() and similar would do internally in Meson.

https://github.com/scivision/asyncio-subprocess-examples?tab=readme-ov-file#benchmarks

The example compares asyncio, ThreadPoolExecutor, and ProcessPoolExecutor, with asyncio giving a ~20x speedup on a 32-core Linux workstation and a ~4x speedup on an 8-core Apple Silicon laptop.