ceandrade / brkga_mp_ipr_cpp

The Multi-Parent Biased Random-Key Genetic Algorithm with Implicit Path Relinking - C++ version


Number of threads versus parallel mating reproducibility

afkummer opened this issue

Hi @ceandrade

Good to see this project evolving!

So, I tried to keep our communication short, but I think that was counterproductive. I have tried several implementations to guarantee the reproducibility of the parallel mating algorithm. This (old) commit contains an implementation I devised over the past couple of weeks. As you can see, it already includes the OpenMP clause schedule(static, 1), as you observed in your last message.
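For context, what I have been testing has roughly the shape of the sketch below (a minimal sketch, not the library's actual mating code; rngs_per_thread and the inner key loop are placeholders):

```cpp
#include <omp.h>
#include <random>
#include <vector>

// One RNG per thread plus schedule(static, 1): iteration i is always handled
// by thread i % num_threads, and therefore by the same RNG -- as long as the
// number of threads does not change between runs.
void parallel_mating_sketch(std::vector<std::mt19937>& rngs_per_thread,
                            std::vector<std::vector<double>>& offspring) {
    #pragma omp parallel for schedule(static, 1)
    for (long i = 0; i < static_cast<long>(offspring.size()); ++i) {
        auto& rng = rngs_per_thread[omp_get_thread_num()];
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        for (auto& key : offspring[i])
            key = dist(rng);   // stand-in for the real biased mating step
    }
}
```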

I did a git fetch of your new branch and ran two short experiments with the TSP example included in the repository.
First, I ran ./main_complete -c config.conf -s 27001 -r I -a 999999 -t 60 -i ../instances/rd400.dat -n 4 and got the following output:

(Only one run with this configuration)

[Thu Dec  2 17:40:55 2021] Generating initial tour...
Initial cost: 19183

[Thu Dec  2 17:40:55 2021] Building BRKGA...
New population size: 2000

[Thu Dec  2 17:40:55 2021] Injecting initial solution...

[Thu Dec  2 17:40:55 2021] Initializing BRKGA...

[Thu Dec  2 17:40:55 2021] Evolving...
* Iteration | Cost | CurrentTime
* 0 | 19183 | 0.07
* 99 | 19182 | 6.80
* 108 | 19172 | 7.41
* 110 | 19171 | 7.55
* 125 | 19168 | 8.56
* 129 | 19141 | 8.84
* 150 | 19138 | 10.27
* 151 | 19137 | 10.34
* 162 | 19136 | 11.10
* 165 | 19133 | 11.32
* 181 | 19132 | 12.44
* 184 | 19127 | 12.64
* 187 | 19108 | 12.84
* 196 | 19103 | 13.46
* 212 | 19092 | 14.57
* 222 | 19086 | 15.26
* 265 | 19080 | 18.21
* 301 | 19078 | 20.70
* 363 | 19043 | 25.04
* 422 | 19034 | 29.46
* 525 | 19021 | 36.74
* 655 | 18996 | 45.77
* 707 | 18978 | 49.42
* 729 | 18953 | 50.92
* 741 | 18947 | 51.73
* 785 | 18916 | 54.75

[Thu Dec  2 17:41:55 2021] End of optimization

Then I changed the number of threads to 8 and ran the algorithm twice. These are the two outputs.

Run 1

[Thu Dec  2 17:39:47 2021] Generating initial tour...
Initial cost: 19183

[Thu Dec  2 17:39:47 2021] Building BRKGA...
New population size: 2000

[Thu Dec  2 17:39:47 2021] Injecting initial solution...

[Thu Dec  2 17:39:47 2021] Initializing BRKGA...

[Thu Dec  2 17:39:47 2021] Evolving...
* Iteration | Cost | CurrentTime
* 0 | 19183 | 0.05
* 106 | 19181 | 5.69
* 122 | 19175 | 6.51
* 127 | 19173 | 6.76
* 141 | 19172 | 7.47
* 156 | 19162 | 8.20
* 160 | 19161 | 8.42
* 343 | 19159 | 17.44
* 380 | 19155 | 19.27
* 483 | 19150 | 24.35
* 513 | 19145 | 25.84
* 533 | 19143 | 26.83
* 601 | 19130 | 30.16
* 644 | 19098 | 32.37
* 672 | 19072 | 33.75
* 689 | 19069 | 34.61
* 700 | 19068 | 35.17
* 745 | 19046 | 37.36
* 754 | 19038 | 37.79
* 763 | 19016 | 38.25
* 774 | 18987 | 38.77
* 892 | 18969 | 44.59
* 911 | 18964 | 45.54
* 919 | 18942 | 45.93
* 930 | 18940 | 46.49
* 1112 | 18934 | 55.42
* 1155 | 18929 | 57.52
* 1161 | 18928 | 57.81
* 1175 | 18923 | 58.51
* 1189 | 18871 | 59.17
* 1195 | 18866 | 59.49

[Thu Dec  2 17:40:47 2021] End of optimization

Run 2

[Thu Dec  2 18:12:32 2021] Generating initial tour...
Initial cost: 19183

[Thu Dec  2 18:12:32 2021] Building BRKGA...
New population size: 2000

[Thu Dec  2 18:12:32 2021] Injecting initial solution...

[Thu Dec  2 18:12:32 2021] Initializing BRKGA...

[Thu Dec  2 18:12:33 2021] Evolving...
* Iteration | Cost | CurrentTime
* 0 | 19183 | 0.08
* 106 | 19181 | 5.47
* 122 | 19175 | 6.32
* 127 | 19173 | 6.62
* 141 | 19172 | 7.38
* 156 | 19162 | 8.12
* 160 | 19161 | 8.34
* 343 | 19159 | 17.88
* 380 | 19155 | 19.77
* 483 | 19150 | 24.86
* 513 | 19145 | 26.38
* 533 | 19143 | 27.43
* 601 | 19130 | 30.91
* 644 | 19098 | 33.09
* 672 | 19072 | 34.46
* 689 | 19069 | 35.31
* 700 | 19068 | 35.83
* 745 | 19046 | 38.08
* 754 | 19038 | 38.52
* 763 | 19016 | 39.02
* 774 | 18987 | 39.57
* 892 | 18969 | 45.93
* 911 | 18964 | 46.91
* 919 | 18942 | 47.33
* 930 | 18940 | 47.86
* 1112 | 18934 | 57.25
* 1155 | 18929 | 59.39
* 1161 | 18928 | 59.70

[Thu Dec  2 18:13:33 2021] End of optimization

So, as you can see, both runs with eight threads are mostly identical until iteration 1161. Furthermore, notice that the "improvements" along the runs happen at the same iterations, which is very good (and also what we expected). The problem appears when we compare either of the 8-thread runs with the 4-thread run. As you can see, reproducibility depends on two things: the seed and the number of threads!

Do you think this is an issue?
Thanks!

Yes, indeed, we have a dependence on the seed and the number of threads. The reason is that we build one RNG per thread; therefore, runs with a different number of threads will have a different number of RNGs, each with a different state. I think it is OK if we state this very clearly in the documentation, so the user is aware of possible issues.

But I have a second thought: you could combine your first and second suggestions. We can allocate one RNG per thread a priori, as you did in the 1st version. Then, we use those RNGs in the mating with the seeding approach from the 2nd version (we may need to drop the warm-up for performance). In this case, we eliminate the overhead of allocating a new RNG per loop iteration (but we still have the seeding overhead).
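In sketch form, I mean something like this (the per-offspring seeding rule below, base_seed plus the offspring index, is only a placeholder for whatever the 2nd version actually does):

```cpp
#include <omp.h>
#include <random>
#include <vector>

// RNGs are allocated once, up front; inside the loop we only re-seed them.
void hybrid_mating_sketch(unsigned base_seed,
                          std::vector<std::mt19937>& rngs_per_thread,
                          std::vector<std::vector<double>>& offspring) {
    #pragma omp parallel for schedule(static, 1)
    for (long i = 0; i < static_cast<long>(offspring.size()); ++i) {
        auto& rng = rngs_per_thread[omp_get_thread_num()];
        // Re-seeding from (base_seed, i) makes the stream used for offspring i
        // independent of which thread runs it and of how many threads exist.
        rng.seed(base_seed + static_cast<unsigned>(i));
        // (warm-up dropped here for performance, as discussed)
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        for (auto& key : offspring[i])
            key = dist(rng);   // stand-in for the real mating step
    }
}
```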

So, if you will, compare this 3rd version with the first two. You may fix the number of generations at 1000 and run it on a small instance (burma or brazil). Run each version 3 times and compare the times.

My impression is that the 3rd version will land in the middle. If so, I think it is a better compromise because we don't need to expose the thread dependency to the user.

BTW, a quick tip: if you know you will reuse some data structure several times over the course of the program, just preallocate all the memory you need at the beginning. This will save you a HUGE amount of running time by avoiding asking the OS for memory all the time. Indeed, this goes hand in hand with the RAII idiom recommended in C++.
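A hypothetical illustration of what I mean (not code from the library; the decoder-like loop is just a placeholder):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// The scratch vector is allocated once and reused across iterations,
// instead of being created (and freed) inside the loop every time.
void decode_many(const std::vector<std::vector<double>>& chromosomes) {
    std::vector<std::pair<double, unsigned>> permutation;
    if (!chromosomes.empty())
        permutation.reserve(chromosomes.front().size());   // single allocation

    for (const auto& chr : chromosomes) {
        permutation.clear();                 // keeps the capacity: no new allocation
        for (unsigned j = 0; j < chr.size(); ++j)
            permutation.emplace_back(chr[j], j);
        std::sort(permutation.begin(), permutation.end());
        // ... evaluate the tour/cost from 'permutation' here ...
    }
}
```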

Don't worry about long messages. It is better to be thorough and clear.

@ceandrade, I have no comments about your implementation of the parallel mating. I did some tests with a hybrid implementation, as you suggested, and verified the impact on the heuristic's performance. In the end, if we can explain how to reproduce the experiments (seed and # of threads), I think it is better to stick with your code, since it is the most performant.

I am just curious where you got the idea of initializing the per-thread PRNGs in a "chain". To warm up the generators, I think you can also use rng.discard(rng.state_size).

I still have a small concern about false sharing in some portions of the code that concurrently access sequential memory regions. For example here. That was the reason I was allocating memory within the parallel loop, contrary to your recommendation. Hopefully false sharing is not a big problem for brkga_mp_ipr_cpp, because a std::vector stores its data through an internal pointer (so the elements live in a separate heap block), but it heavily depends on how the OS allocates memory when constructing the "xxx_per_thread" fields of BRKGA_MP_IPR.
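For reference, the usual mitigation I have in mind looks roughly like the sketch below (illustrative only, with hypothetical field names; I am not claiming brkga_mp_ipr_cpp needs it, since the per-thread structures may already be large enough for the problem not to appear):

```cpp
#include <cstddef>
#include <vector>

// Requires C++17 for over-aligned allocation inside std::vector.
constexpr std::size_t kCacheLine = 64;   // typical cache-line size (assumption)

struct alignas(kCacheLine) PerThreadScratch {
    // Hypothetical small, frequently written per-thread fields.
    double best_fitness = 0.0;
    unsigned improvements = 0;
    // alignas rounds the struct size up to a full cache line, so slot t and
    // slot t + 1 never share a line, and threads writing to "their" slot do
    // not invalidate each other's cache lines.
};

std::vector<PerThreadScratch> scratch_per_thread;  // one entry per thread
```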

As there is nothing left to solve, I think I can close this thread. I am just waiting for any comments you may have regarding the points I mentioned above.

Have a nice weekend!

The rationale behind the RNG chain initialization is to push each RNG into a state as different from the others as possible, while keeping everything controlled and reproducible. Note that if we used rng.discard(rng.state_size) (or any other fixed number) for all generators (as we did in the previous versions), all generators would end up in the same state. We could also do something like rng.discard(rng.state_size * gen_id * t), where t is large. However, if one of the generators is called more than t times, we repeat states.

The chain initialization does not guarantee that we never repeat states, of course. However, since the period of the Mersenne Twister is huge (in our case, $2^{19937} - 1$), we probably have a better chance than just fixing a t.
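In sketch form, the chain initialization looks roughly like this (illustrative; the actual code in the library may differ in its details):

```cpp
#include <random>
#include <vector>

// RNG 0 gets the user seed; each subsequent RNG is seeded with a draw from
// the previous one. Controlled, reproducible, and (likely) distinct states.
std::vector<std::mt19937> build_rng_chain(std::mt19937::result_type seed,
                                          unsigned num_rngs) {
    std::vector<std::mt19937> rngs;
    rngs.reserve(num_rngs);
    for (unsigned i = 0; i < num_rngs; ++i)
        rngs.emplace_back(i == 0 ? seed : rngs.back()());
    return rngs;
}
```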

But the truth is that, instead of using a single number as a seed, we should use an entire state as the seed. But that can be a little overkill for our case.

Now, I'm curious: is rng.discard(rng.state_size) enough to warm up the RNG (at least for our case)? Do you have some references for that?

I'll stop here. Let's discuss the RNG first, then we come back to false sharing.

Note that if we used rng.discard(rng.state_size) (or any other fixed number) for all generators (as we did in the previous versions), all generators would end up in the same state.

Sure, you still need to use some strategy (e.g. your chain approach) to seed the _per_thread generators. Definitely discard(state_size) does not replace that!

I saw this rng.discard(rng.state_size) in some C++ documentation a few years ago, when I was studying the then-new C++11 library and language constructs. My understanding is that, by discarding state_size values, we put the generator into an entirely new state, which seems to be your objective with the rng.discard(10000) I saw in earlier versions of your library.

This was my reference: https://www.cplusplus.com/reference/random/mersenne_twister_engine/discard/
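In code, the warm-up I had in mind is just the following (a minimal sketch):

```cpp
#include <random>

// Advance the engine by state_size draws right after seeding
// (state_size == 624 for std::mt19937).
std::mt19937 make_warmed_up_rng(std::mt19937::result_type seed) {
    std::mt19937 rng(seed);
    rng.discard(std::mt19937::state_size);
    return rng;
}
```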

But the truth is that, instead of using a single number as a seed, we should use an entire state as the seed. But that can be a little overkill for our case.

I also agree that seeding a std::mt19937 with more than a single number is overkill. According to some threads I saw on Stack Overflow, the only "forbidden" seed for a Mersenne Twister generator is zero, but modern implementations (such as the C++ one) already have additional mechanisms to circumvent this issue.
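For reference, a heavier seeding along those lines would look roughly like this with std::seed_seq (just a sketch; the stream_id parameter and the extra constant are arbitrary choices, not something from the library):

```cpp
#include <random>

// std::seed_seq mixes several 32-bit inputs and fills the engine's whole
// state, so two streams differ even for "close" inputs.
std::mt19937 make_rng_from_seed_seq(unsigned user_seed, unsigned stream_id) {
    std::seed_seq seq{user_seed, stream_id, 0x9E3779B9u};
    return std::mt19937(seq);
}
```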

Let's discuss the RNG first, then we come back to false sharing.

I also did some benchmarking with linux-perf to check for first-/last-level data cache misses (which would be evidence of false sharing), and I did not spot any harmful behavior in the code. In any case, the data structures seem to be big enough that false sharing naturally cannot happen. To be sure, I would like to run a few more tests, but I have a deadline next week, so I can only return to these tests by the end of next week.

Congrats on brkga_mp_ipr v2.0! 🎉🎉

Yes, I get what you mean by rng.discard(rng.state_size). But even so, it may not be enough. This old paper tells us why. I do recommend reading at least the first part of it. It is really enlightening.

BTW, I have mentioned one of your papers in BRKGA 2.0. Shoot me an email if you want to cite any other (even one that is still to come).

Wow, I was not aware of these difficulties for initializing a PRNG, @ceandrade!

I skimmed the paper you indicated, and it seems that Mersenne Twister generators have some robustness against the issues pointed out by Matsumoto et al. (at least in the implementation shipped with glibc).

I also checked the NVIDIA CUDA reference API, hoping to find some hints on how to manage the initialization of a massive number of random generators. Unfortunately, their cuRAND library does not specify a standard method for seeding lots of generators; even the official documentation uses time(NULL) in its examples. Some external references advocate simple LCG algorithms, mostly due to efficiency and ease of implementation on GPU-based architectures.

Although I am aware of these problems, I do not have any ideas for improving the code of brkga_mp_ipr_cpp w.r.t. PRNG management. Despite that, I did a lot of tests, and the evidence shows that the library is working fine.

I saw the reference to our GECCO'20 paper in the online documentation of the library. Thank you very much for that! We are also working on an improved version of that paper that uses brkga_mp_ipr_cpp, so maybe you could update the reference in the future. For now, we are working on the revision of our paper.

As the original reason for opening this issue has been resolved, I am closing it.
Thank you again, @ceandrade!