shikokuchuo / mirai

mirai - Minimalist Async Evaluation Framework for R

Home Page: https://shikokuchuo.net/mirai/

Hanging tasks on Github Actions Ubuntu Runners (R CMD Check)

wlandau opened this issue

As you know, I have been struggling with the final stages of ropensci/targets#1044, which integrates crew into targets. targets encodes instructions in special classed environments which govern the behavior of tasks and data. In R CMD check on GitHub Actions Ubuntu runners, when many of these objects are sent to and from mirai() tasks, the overall work stalls and times out. It only happens on GitHub Actions Ubuntu runners (probably Windows too, but it didn't seem worth checking), and it only happens inside R CMD check.

After about a week of difficult troubleshooting, I managed to reproduce the same kind of stalling using just mirai and nanonext. I have one example with 1000 tasks at https://github.com/wlandau/mirai/blob/reprex/tests/test.R, and I have another example at https://github.com/wlandau/mirai/blob/reprex2/tests/test.R which has 300 tasks and uses callr to launch the server process. In the first example, you can see time stamps starting at https://github.com/wlandau/mirai/actions/runs/4670004460/jobs/8269199542#step:9:105. The tasks get submitted within about a 20-second window, then something appears to freeze, and then the 5-minute timeout is reached. In the second example, the timestamps at https://github.com/wlandau/mirai/actions/runs/4670012640/jobs/8269219432#step:9:99 show activity within the first 8 seconds, and only 5 of the 300 tasks run within the full 5 minutes. (I know you have a preference against callr, but it was hard to find ways to get this problem to reproduce, and I think mirai servers can be expected to work if launched from callr::r_bg().)

Sorry I have not been able to do more to isolate the problem. I still do not understand why it happens, and I was barely able to create examples that do not use targets or crew. I hope this much is helpful.

I can see the work you've put into this and fully commend the effort! Is your hypothesis that the instances are running out of memory on the Github runners? I can imagine they'd be quite resource constrained.

On my Ubuntu laptop, the 1,000 task reprex showed 3.55GB of memory usage at the end vs the 300 task example, which only took up about 400MB and ran quite snappily.

I can see the callr example on the Github runner only used 80MB memory in the end, but maybe that instance only had that much RAM remaining - so I'm not quite sure what to make of this.

I can see the work you've put into this and fully commend the effort! Is your hypothesis that the instances are running out of memory on the Github runners? I can imagine they'd be quite resource constrained.

Thanks! I was trying to rule out memory as a possible explanation. I can see it may be a factor for https://github.com/wlandau/mirai/blob/reprex/tests/test.R. Maybe even https://github.com/wlandau/mirai/blob/reprex2/tests/test.R, but that seems less likely. I wonder how much memory the server daemon in the reprex2 branch is using during R CMD check.

I just pushed wlandau@30b6268 to take more memory readings, and memory usage on the dispatcher and server do not look very different from beginning to end.

I was thinking just in terms of the client actually, as it's all on one machine. I assume caching to disk is set up so it doesn't OOM, but it will likely slow to a crawl - and that may be what we are experiencing.

It seems it is also only on the Ubuntu runners. I ran the same 1000 task reprex on Windows/Mac and it seems to succeed there. With Mac I have the printouts: https://github.com/shikokuchuo/mirai/actions/runs/4672218165/jobs/8274151885
I couldn't see the printouts on Windows, but judging by the test timing it is well within the timeout.

https://github.com/shikokuchuo/mirai/actions/runs/4672298324/jobs/8274326971 just calls rnorm() but succeeds on Mac and fails on Ubuntu. On the face of it, this suggests a memory issue or some other peculiarity of the runners rather than something about the complexity of the input/output objects.

From my tests, it really seems that it is just the fact that the Ubuntu runners are memory constrained.

Everything works for small tasks of rnorm(1e3) size: https://github.com/shikokuchuo/mirai/actions/runs/4675768366
But not for larger objects e.g. rnorm(3e5): https://github.com/shikokuchuo/mirai/actions/runs/4675934509
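
For a sense of scale, a quick sketch of the payload sizes involved (the 1,000-task figure is carried over from the earlier reprex, not measured from these runs):

# A double is 8 bytes, so:
object.size(rnorm(1e3))  # ~ 8 KB per result
object.size(rnorm(3e5))  # ~ 2.4 MB per result
# With on the order of 1,000 such results retained by the client, the larger
# payload alone accounts for roughly 2.4 GB.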

For the Mac machine the tests consistently run all the way through without problem.
The only command is rnorm() and the return value is just a vector of doubles so nothing complex going on.

Given the above, I don't believe this is a cause for concern.
Indeed you may wish to try your targets tests again, but ensuring the payloads are minimal.

Seems consistent with how much trouble I have been having when I peel back layers to isolate the cause. But it's still strange. In the case of targets, it does not take many tasks to hit the deadlock. Sometimes it only takes one, as in https://github.com/ropensci/targets/actions/runs/4670430552/jobs/8270146282. That workflow got stuck at https://github.com/ropensci/targets/blob/13f5b314cd4ac5c46f86551650b6f37fd54dffe4/tests/testthat/test-tar_make.R#L17-L31, so I ended up having to skip all the crew tests on Ubuntu R CMD check.

In all these tests, the data objects are small. An empty target object like the one in https://github.com/ropensci/targets/blob/13f5b314cd4ac5c46f86551650b6f37fd54dffe4/tests/testthat/test-tar_make.R#L17-L31 is only around 30 Kb, and each of the task data objects from https://github.com/wlandau/mirai/blob/reprex2/tests/test.R is only 20 Kb. I have been using clustermq in situations like this for several years and have never encountered anything like this, which made me suspect something about mirai is a factor.

If it is just R CMD check + Ubuntu + GitHub Actions + 1 GB memory, this is not so limiting. But it makes me worry about how this may affect data-heavy workloads on normal machines.

In targets, I just re-enabled crew tests on Mac OS. In the previous workflow without crew tests, the check time was around 9 minutes. With crew tests, check time was around 12 minutes. (It also appears to have trouble saving the package cache, but that is probably unrelated.)

If it is just R CMD check + Ubuntu + GitHub Actions + 1 GB memory, this is not so limiting. But it makes me worry about how this may affect data-heavy workloads on normal machines.

Honestly I'm not worried until I see it outside of Github actions. Our main production machine runs Ubuntu on 'data-heavy' workloads using mirai literally 24/7.

The tests vary on Ubuntu - sometimes it finishes only 1 task, sometimes up to 10 - so we can rule out a deterministic reason. It strongly suggests an external cause, which, given that we can't replicate the behaviour anywhere else, we can't diagnose.

In targets, I just re-enabled crew tests on Mac OS. In the previous workflow without crew tests, the check time was around 9 minutes. With crew tests, check time was around 12 minutes. (It also appears to have trouble saving the package cache, but that is probably unrelated.)

Yes, that's a good idea to enable on Mac. Should also work on Windows?

However, all your hard work does have a good result:

Seeing the memory usage on your 1,000 test case caused me to re-visit memory handling.

In fact I had attempted to optimise this a couple of weeks ago, but this led to the failures in the throughput tests (with the memory corruption errors). I am now taking a slightly different approach by simply removing the reference to the 'aio' external pointer when the results have been retrieved (and cached). This will allow the garbage collector to reclaim the resources in its own time.

If you have additional testing scripts, please can you re-run on nanonext 0.8.1.9016 and mirai 0.8.2.9023. I would like to make sure I haven't broken anything inadvertently.

I will also need to run this on a staging machine to monitor, but if successful I should be in a position to release nanonext to CRAN at some point tomorrow. The recent builds have been really solid.

Thank you so much! Your updates to nanonext and mirai appear to have eliminated problems in the targets tests, which was what I was mainly worried about. The load tests in crew still appear to hang and time out, but because of the high loads in those tests, I am willing to believe that those remaining instances are indeed due to constrained memory on Ubuntu runners.

I am really excited for the next CRAN releases of nanonext and mirai. After those builds complete, I can release crew and then targets, broadcast our progress, and invite others to write their own crew launchers.

Thanks that's really good news!

I wasn't actually trying to fix the test issues... just a note before I forget - the 'aios' now pretty much clean up after themselves as soon as unresolved() returns successfully or after call_mirai() or after the $data is actually accessed. Note that it'd be a mistake to make a copy of an 'aio' before that point as you'd duplicate the reference to the external pointer. Just an FYI as I know you hang on to the 'aios'.
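
A minimal sketch of the three resolution paths described above (the daemons setup and task are purely illustrative):

library(mirai)

daemons(n = 1L)                        # illustrative local daemon setup
m <- mirai(Sys.sleep(1))               # an 'aio'; holds an external pointer until resolved

# Any one of the following retrieves the result and lets the aio clean up after itself:
while (unresolved(m)) Sys.sleep(0.1)   # 1. poll until unresolved() returns FALSE
call_mirai(m)                          # 2. or block until the mirai resolves
m$data                                 # 3. or access $data once resolved

daemons(n = 0L)                        # reset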

Thanks, noted. When crew checks a task, it uses the more minimal .unresolved() from nanonext, then moves the resolved mirai object from the list of tasks to the list of results. After that, the user can call controller$pop() to download the data and release the aio. From what I know about R's copy-on-modify system, I believe this approach avoids copying the pointer. At any rate, the tests I need to pass are passing, thanks to you.

Yes, actually I was being too conservative - the mirai are environments so if you copy them just the reference to the environment is duplicated. The external pointer sits within the environment, and is the same no matter through which reference you access it. And the cleanup works regardless. Forget my previous comment!!

FYI it appears I am still getting intermittent hanging tests on R CMD check. The good news is that it still works most of the time, so I have implemented timeouts and retries as a workaround. I think that's good enough for me for now. Maybe we could just keep an eye on it going forward?

For example, the workflow at https://github.com/ropensci/targets/actions/runs/4684600263/jobs/8300914514 reached a timeout in a crew test, but the first retry succeeded.

Maybe we could just keep an eye on it going forward?

I might be able to fix it in the next couple of days hopefully. I'm trying to pin down this elusive segfault on the CRAN OpenBLAS machine (I mentioned before that all these exotic setups tend to get me!!)

I'm in the middle of simplifying and making certain things in nanonext more robust, which will push back the release slightly.

However, I've also cut out the new features we're not using yet, so that mirai only depends on the existing nanonext 0.8.1. So I should be able to release around the same time.

OK! Do you want to re-run your tests with nanonext fb8b05d v0.8.1.9020 and mirai 4e1862e v0.8.2.9027?

I have taken to testing across (a good selection of) the rhub platforms and the segfault no longer appears (it was quite consistently, although randomly, reproducible for the last CRAN release). The changes have a good chance of also fixing your hanging tests.

Sorry - not completely fixed - am on to it!

Thanks for working on this. I just pushed updates to crew and targets that use those versions of mirai and nanonext, and I look forward to trying more updates when you are ready.

I think nanonext v0.8.1.9020 or the latest v0.8.1.9022 should do the trick actually. crew always uses the dispatcher, right? There are some edge cases without the dispatcher I need to make safe. But otherwise I think it should be fine. Do let me know if anything seems to be off. Thanks!

With nanonext v0.8.1.9021 and mirai 0.8.2.9027, I did observe one hanging test which succeeded on the first retry: https://github.com/ropensci/targets/actions/runs/4690189013/jobs/8315468527#step:9:258. Another just like it ran on Mac OS: https://github.com/ropensci/targets/actions/runs/4690189013/jobs/8315467814#step:10:257. Currently trying again with nanonext 0.8.1.9022. Interestingly, I am seeing "read ECONNRESET" in the annotations at https://github.com/ropensci/targets/actions/runs/4690189013. Not sure if that is related.

With nanonext v0.8.1.9022, it looks like all the targets checks succeeded on the first try. I am running the jobs again to confirm.

Meant to say v0.8.1.9022 above.

In later commits of targets using mirai 0.8.2.9027 and nanonext 0.8.1.9022, I unfortunately still notice sporadic hanging. Examples:

nanonext 0.8.1.9025 and mirai 0.8.2.9028 are the release candidates. If final testing doesn't throw any errors, nanonext is on track for release tomorrow.

If you notice any particular occasions when it hangs, e.g. when scaling up/down, I will have more to go on. I have covered the general bases - it is much more robust (and actually more performant, it seems).

Glad you're seeing performance gains, and I'm glad the packages are poised for CRAN. I'm afraid I do still see the same timeouts with nanonext 0.8.1.9025 and mirai 0.8.2.9028. The commits to https://github.com/ropensci/targets/tree/try-builds since ropensci/targets@8b4cf6e show intermittent failures testing crew. I disabled the retries on that branch so they would be easier to see.

That seems like it will be helpful - at the moment I'm not seeing a pattern, unless you can spot something.

In any case I've added more safety in mirai 0.8.2.9030. I think this could be the one that solves it. Either way, I think I'll be able to refine further through changes to mirai alone, and nanonext is good to go. No segfaults at least, so it solves the CRAN issue - if you see any please let me know straight away. Thanks!

I am still noticing sporadic hanging e.g. https://github.com/ropensci/targets/actions/runs/4698586073/jobs/8331053726#step:9:249. I do not see a pattern either, but it only happens in a niche setting, and maybe more clues will arise a couple releases later. Thank you for all the progress you have made despite incomplete information from me.

I am sure you have seen this, but just in case you haven't, it looks like mirai 0.8.2.9030 is not compatible with nanonext 0.8.1.9026:

> packageVersion("nanonext")
[1] ‘0.8.1.9026’
> packageVersion("mirai")
[1] ‘0.8.2.9030’
> library(mirai)
> library(nanonext)
> daemons(
+   n = 1L,
+   url = "ws://127.0.0.1:5000",
+   dispatcher = TRUE,
+   token = TRUE
+ )
[1] 1
> while (is.null(nrow(daemons()$daemons))) {
+   msleep(10)
+ }
> socket <- as.character(rownames(daemons()$daemons))
> launch_server(url = socket)
> task <- mirai(TRUE)
Error in request(context(..[[.compute]][["sock"]]), data = envir, send_mode = 1L,  : 
  unused argument (socket = length(..[[.compute]][["sockc"]]) || stat(..[[.compute]][["sock"]], "pipes"))

Yes, thanks - you beat me to it. I was having connection issues here at the CMS in Cambridge so found a place up top with mobile reception. mirai v0.8.2.9031 is the corresponding version, just uploaded. It's good you're still tracking this. Would be great to get a final test run based on those versions before I release nanonext.

Behaviour should be the same, although the implementation is miles better. But you never know, it might solve the hanging...

Thanks! The intermittent hanging is still there, but I really appreciate the lengths you are going to refine nanonext and mirai.

nanonext has been submitted but is now in 'waiting', as I inadvertently broke mirai by removing a function I didn't want to maintain going forward and forgot to check against the old version! I did check against your existing version of crew.

Edited: Back in business! Pressure on to deliver the next version of mirai. I think we're almost there, a few more tweaks...

The past few builds of mirai have been really quite solid, and I am prepared for a CRAN release. The latest dev version is 0.8.2.9037.

I have added a 'force' argument to saisei() so you can rotate listeners even when there is a connection, just in case it is useful for a hung process (it releases the task and re-assigns it - it doesn't lose the task any more as we are not terminating sockets). The result is the same as terminating the server process directly - but doing it locally like this is easier than if the process were actually remote.

I'll give you time to test things thoroughly on Monday. I hope you are comfortable with where things are. I think I have done as much as I can at this stage and I'm sure real life usage reports will help narrow down the cause of any hangs.

Awesome! mirai 0.8.2.9037 and 0.8.2.9038 work just as well with crew and targets as before. Still intermittent hanging for those tests on R CMD check on Ubuntu runners, but the real tests that matter work really well. I am ready for a mirai release when you are.

I have added a 'force' argument to saisei() so you can rotate listeners even when there is a connection, just in case it is useful for a hung process (it releases the task and re-assigns it - it doesn't lose the task any more as we are not terminating sockets). The result is the same as terminating the server process directly - but doing it locally like this is easier than if the process were actually remote.

Thanks for this. I added saisei(force = TRUE) to the crew code that terminates "lost" workers that time out trying to launch, just for a little extra robustness.
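
For reference, a minimal sketch of that rotation step (the index and the relaunch are illustrative; crew's actual bookkeeping differs):

library(mirai)

# Rotate the listener at URL index 1 even though a connection may be present,
# then relaunch a server at the fresh URL that saisei() returns.
new_url <- saisei(i = 1L, force = TRUE)
if (length(new_url)) launch_server(url = new_url)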

Fantastic! There is one last thing that came to me whilst I was working on the above - and this is to do with NNG's retry mechanism, which we have 'magically' relied on without seeking to control directly.

This could well be responsible for hangs or other odd behaviour. In mirai 0.8.2.9039 I have tuned this so it won't try to resend to another socket. It will still retry on the same socket, i.e. if an instance times out, the listener is rotated, etc. The rather critical reason is that currently it could try to send a long-running task multiple times to different servers!

There are other ways to gain even more granular control, but if the above works on your real tests then that would be ideal. The crew tests pass, but you probably need to run for targets etc.

Thanks!

Edit: the obvious advantage is that problematic code is automatically isolated - you can simply choose not to re-instate that particular server. You'll never run into the case that the same code proceeds to crash all your servers!!

Thanks for trying again. I saw hanging in 4 out of the last 16 targets GHA test runs, but hopefully that’s another clue.

Ok, sure we’ll have to leave that for another day then. But nothing else has broken? I just wanted to double check with you as it is a real change in behaviour - once a task gets sent to a socket it will now remain at that socket.

I just reran the throughput tests of crew and targets locally with mirai 0.8.2.9039 (as well as one of the throughput tests of the upcoming crew.cluster, still not open-source yet). All the results look good.

Btw. of the 4 test failures, 2 do not seem to be within class_crew - and then of the class_crew ones it is a different one each time.

There is also a tar_make() test that uses crew, and sometimes that is the only one which fails: https://github.com/ropensci/targets/blob/f5202dbe20f82922a630feede9714197bfb34e1c/tests/testthat/test-tar_make.R#L17-L47

Fantastic! All tests looking good on my side as well. Will leave it to run for a while on our workflow, and then release to CRAN later this evening.

Sounds great!

By the way, I have been printing out more diagnostic info from the crew controller. Many times, the worker is assigned a task, but it does not complete: https://github.com/ropensci/targets/actions/runs/4715544592/jobs/8362602856#step:9:278. Other times (fewer cases), daemons()$daemons is not a matrix even though the dispatcher is supposed to be running: https://github.com/ropensci/targets/actions/runs/4715546457/jobs/8362605843#step:9:225

More clearly printed examples of each are at https://github.com/ropensci/targets/actions/runs/4715910056/jobs/8363186686#step:9:242 and https://github.com/ropensci/targets/actions/runs/4715910056/jobs/8363186593#step:9:238, respectively. To be clear, I am still fine to simmer on this one and return to it if an idea occurs to you or we get more information from real-world usage.

Also: in those output logs, "daemons: NULL" just means daemons()$daemons was not an integer matrix. Most likely it returned 0 and the crew router object did not accept it and instead recorded NULL internally.

FWIW, if those log outputs are useful for future troubleshooting, you may want to preserve the output in your comment above before GitHub clears the logs.

Thanks! Unfortunately it still only tells us what we already know which is that sometimes processes can hang or appear to hang.

First - just on the results:

  • the non-matrix return value is not a bug - it is behaving as expected. All NNG messages either return what you expect or an error value, which can be tested with is_error_value(). In this case it will be an errorValue 5 ('timed out'); see the sketch after this list. I realise this isn't documented explicitly for this case, so I will fix that.
  • the unfinished task - that is helpful in that we are now fairly certain the process has simply not completed the task - it has hung or is so slow that it appears to be hung.
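
A minimal sketch of the status check described in the first point (assuming daemons have already been set with a dispatcher):

library(mirai)
library(nanonext)

status <- daemons()$daemons
if (is_error_value(status)) {
  # e.g. errorValue 5 'timed out' if dispatcher did not respond in time
  message("status query failed: ", status)
} else if (is.matrix(status)) {
  print(status)  # the usual per-server status matrix
}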

Now to summarise where we are, this occurs:

  • only on Github Actions Ubuntu, and only up to ~25% of the time
  • even on Github Actions it works for Mac and Windows
  • it works on Ubuntu outside of GHA

Given the above, my only hypothesis at the moment is that, due to the low amount of resources assigned to the Ubuntu instances (which we saw in the early tests), the kernel is simply held by other processes and not returned to us in a timely manner. targets tests are longer-running and that is why it manifests there. Sometimes when I am waiting for test results I see the Github process hang for up to 20 mins or so on another step (outside of tests), so I believe this is entirely possible.

If you have other hypotheses (however unlikely) please let me know as otherwise we'll suffer from confirmation bias in looking to prove/disprove the above when it could be that we should be looking somewhere else entirely.

I suggest if you can reproduce on rhub we might look at it again, otherwise we can wait for real world issues. I have been using the below combination to debug the recent OpenBLAS issue (nothing to do with OpenBLAS - it just happened to trigger on that configuration):

rhub::check_for_cran(platforms = c("linux-x86_64-rocker-gcc-san", "fedora-clang-devel", "debian-gcc-devel", "debian-clang-devel"))

I consider this issue closed unless reproducible on rhub.

Also mirai 0.8.3 has just been released on CRAN (was held in a queue since last night)! :)

Amazing! I am so excited to update crew when the mirai builds complete on CRAN.

I will also try Rhub on the platforms you suggested for testing the targets/crew integration.

Exciting indeed! Again, it's been a pleasure :)

Changing tack, can you test if 38e77e3 might have solved this?

I was reimplementing the solution to #32 using synchronisation 7992b03 when I had a brainwave. If it solves the problem there, it could also solve the problem here.

Once again, thanks for trying. I still do see a persistently unresolved mirai task in workflows like https://github.com/ropensci/targets/actions/runs/4752769291/jobs/8445327593#step:9:206.

I do believe this is now fixed via f303182 (v0.8.3.9027).
The mirai-only reprex completes every time: https://github.com/shikokuchuo/mirai/actions/runs/4915843147

Thanks so much for keeping this problem in mind. I have been thinking about it too, and over the past few days I have been trying to isolate it.

Unfortunately, tests on my end are still hanging and timing out using mirai 0.8.3.9027. https://github.com/ropensci/targets/actions/runs/4916196900/jobs/8779637216 is a large number of replications of the targets tests that are giving me trouble.

https://github.com/ropensci/targets/blob/mirai-with-targets-task/tests/test-mirai-with-targets-task.R is another reproducible example that intermittently hangs (results: https://github.com/ropensci/targets/actions/runs/4916144545/jobs/8779524136). It might be more useful because it calls mirai directly (although the task in .expr still uses targets). Here is the code:

library(mirai)
library(R.utils)
library(targets)
for (rep in seq_len(100)) {
  print(rep)
  target <- tar_target(y, 1)
  daemons(n = 1L, url = "ws://127.0.0.1:0", token = TRUE)
  url <- NULL
  while (!length(url)) {
    url <- as.character(rownames(daemons()$daemons))
  }
  launch_server(url = url)
  task <- mirai(
    .expr = targets:::target_run(target, globalenv(), "_targets"),
    .args = list(target = target)
  )
  withTimeout(
    while (TRUE) {
      Sys.sleep(1)
      data <- task$data
      if (inherits(data, "tar_target")) {
        break
      }
    },
    timeout = 60
  )
  daemons(n = 0L)
  if (!inherits(data, "tar_target")) {
    stop("task still unresolved after a minute")
  }
  rm(target)
  gc()
}

I can occasionally reproduce the hanging using a GitHub Codespace created from https://github.com/ropensci/targets/blob/mirai-with-targets-task. I don't know how to share the image with you directly, but the branch has the Dockerfile and the .devcontainer/devcontainer.json.

Alternatively, you might also be able to explore the hanging by inserting a tmate step in https://github.com/ropensci/targets/blob/mirai-with-targets-task/.github/workflows/check.yaml. Although there it's a bit less convenient because there does not seem to be a way to open more than one terminal process.

There is also a test at https://github.com/ropensci/targets/blob/mirai-with-targets-task/tests/test-crew-with-targets-task.R which uses crew directly and only invokes targets inside the task.

Ok - these tests are failing every time. Are you able to run the original tests that corresponded to the mirai-only reprex you produced? Because those are now working consistently for me. The ones where we were getting 3 or 4 failures in 12 or 15 tests. I want to be clear exactly what we're fixing - as I do believe the issue we had before has been fixed.

Ok - these tests are failing every time.

The original tests were failing intermittently, so I made the script loop over 100 reps.

Are you able to run the original tests that corresponded to the mirai-only reprex you produced?

Those original tests are still failing intermittently on GitHub Actions. In https://github.com/ropensci/targets/tree/debug-targets, I picked one of those problematic tests and looped it 100 times: https://github.com/ropensci/targets/blob/bb11a35e6b35fcf460d78b769a54355c2e87d343/tests/testthat/test-class_crew.R#L3-L58 (results: https://github.com/ropensci/targets/actions/runs/4916196900).

Because those are now working consistently for me.

Are you running them locally? They run without error on my local machine too. (Sometimes my Ubuntu installation hangs, but rarely.)

Ok - these tests are failing every time.

The original tests were failing intermittently, so I made the script loop over 100 reps.

Ok, I just wanted to make sure we're not chasing a moving target.

Because those are now working consistently for me.

Are you running them locally? They run without error on my local machine too. (Sometimes my Ubuntu installation hangs, but rarely.)

I'm referring to the results: https://github.com/shikokuchuo/mirai/actions/runs/4915843147
using your test code here: https://github.com/shikokuchuo/mirai/blob/dev/tests/tests.R
There are 2 sets of test runs that worked - that's why I thought the problem had been fixed.

Ok, I just wanted to make sure we're not chasing a moving target.

Yeah, I agree that's a good thing to check. Just to be sure, I created a fresh new branch with the tests closer to their original state: https://github.com/ropensci/targets/tree/debug-targets-original. Tests are running now: https://github.com/ropensci/targets/actions/runs/4916913580

I'm referring to the results: https://github.com/shikokuchuo/mirai/actions/runs/4915843147
using your test code here: https://github.com/shikokuchuo/mirai/blob/dev/tests/tests.R
There are 2 sets of test runs that worked - that's why I thought the problem had been fixed.

That's definitely progress! Odd that the targets tests still have trouble even though the mirai-only ones work now.

That's definitely progress! Odd that the targets tests still have trouble even though the mirai-only ones work now.

Thanks - I've run the tests 5x now - so 40 in total (35 on Ubuntu, 5 on Mac). So it does appear to address that particular situation.

The other tests will be useful in letting me know if I need to do more on this particular fix, or it's likely to be another issue altogether!

Yeah, I think the current issue may be different from the one you fixed (which is news to me). Looks like many of the jobs at https://github.com/ropensci/targets/actions/runs/4916913580/jobs/8781292763 are timing out.

Fantastic - this is very helpful!

mirai 3dbe12c v0.8.3.9028. This either solves these hanging issues or we know it is something else. Please have a re-run of all the tests.

FYI the issue I fixed was with the use of cv_reset() - as CVs are being modified asynchronously, we cannot be certain of the order in which events occur. So to be extra safe, and to ensure it is not this that is causing any problems, I have eliminated this use entirely. All this means is that 'instance' doesn't reset to zero when saisei() is called. This doesn't break the crew tests - just to be noted in case you rely on this anywhere else - it is definitely not worth the potential problems for this tiny bit of state!

Sorry, but it appears 3dbe12c broke the semi-automated test in https://github.com/wlandau/crew/blob/main/tests/throughput/test-transient.R. crew relies on the instance counter to check if a worker is "discovered": https://github.com/wlandau/crew/blob/29fbcd53991022ac6fa6b596cd5b7e5eb805b9d8/R/crew_controller.R#L567-L576. Reverting to f303182 appears to fix that issue.

Sorry, but it appears 3dbe12c broke the semi-automated test in https://github.com/wlandau/crew/blob/main/tests/throughput/test-transient.R. crew relies on the instance counter to check if a worker is "discovered": https://github.com/wlandau/crew/blob/29fbcd53991022ac6fa6b596cd5b7e5eb805b9d8/R/crew_controller.R#L567-L576. Reverting to f303182 appears to fix that issue.

Will it break the hanging tests though? If it solves the hanging, we can find a workaround for this or vice versa.

Oh, right, I can check the hanging even though the transient worker tests are temporarily failing. I will do that.

Another thing we could try is to run the Dockerfile at https://github.com/ropensci/targets/blob/mirai-with-targets-task/Dockerfile locally with limited resources (i.e. --memory="4g" and --cpus="1").

Great - let's see what happens! I no longer think this is a resource issue btw. From what I've seen, and the seemingly stochastic nature of failures, I believe it is to do with the timing of asynchronous events that we reason to be in order, but actually occur out of order at times.

Comparing test results before and after, there looks to be noticeably less hanging (from 7 failures down to 2 failures out of 20 test runs). So this looks to be the right area to focus on. Very exciting!

Right! Given that some of the successful tests take > 40s, I wonder if they might pass if the time limit was set a bit wider than the 60s?

I will try that right now.

That time limit was only ever meant to guard against the 6-hour case anyway.

Hmm... looks as though there is still some hanging even with a 360-second timeout: https://github.com/ropensci/targets/actions/runs/4918659918/jobs/8785280957. In real life, I would not expect a pipeline so short to take as long as 6 minutes.

Hmm... looks as though there is still some hanging even with a 360-second timeout: https://github.com/ropensci/targets/actions/runs/4918659918/jobs/8785280957. In real life, I would not expect a pipeline so short to take as long as 6 minutes.

Yes agree, it looks like something else is going on here. But I think it is still an improvement on before, so I think it does guard against some of the hangs.

Ok, you have your previous 'instance' behaviour back in d56d042 v0.8.3.9029. This is just a counter (some arithmetic) and doesn't drive anything in mirai so I'm confident it doesn't break anything that has been fixed. The mirai-only tests are all working.

Thanks! Except for the hanging, which I am checking now just for information, all tests are passing on my end, including that transient worker test. Also, 0.8.3.9029 feels faster too, which is really nice.

Only 2 timeouts out of 20 runs this time: https://github.com/ropensci/targets/actions/runs/4919322001/jobs/8786791656

Given how much more actionable this issue is, do you think we should formally reopen it?

library(mirai)
library(R.utils)
library(targets)
for (rep in seq_len(100)) {
  print(rep)
  target <- tar_target(y, 1)
  daemons(n = 1L, url = "ws://127.0.0.1:0", token = TRUE)
  url <- NULL
  while (!length(url)) {
    url <- as.character(rownames(daemons()$daemons))
  }
  launch_server(url = url)
  task <- mirai(
    .expr = targets:::target_run(target, globalenv(), "_targets"),
    .args = list(target = target)
  )
  withTimeout(
    while (TRUE) {
      Sys.sleep(1)
      data <- task$data
      if (inherits(data, "tar_target")) {
        break
      }
    },
    timeout = 60
  )
  daemons(n = 0L)
  if (!inherits(data, "tar_target")) {
    stop("task still unresolved after a minute")
  }
  rm(target)
  gc()
}

Btw. I just tried the above test you suggested on Github here: https://github.com/shikokuchuo/mirai/actions/runs/4919282417 and all tests pass.

I changed it to a slightly more efficient version of the same test that uses a mirai with a timeout and call_mirai() in place of the polling, and got one failure: https://github.com/shikokuchuo/mirai/actions/runs/4919433912
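
As a sketch, that variant might look something like the following drop-in for the withTimeout() polling section of the reprex above (the actual test code is not reproduced here, so the details are assumptions):

# Task-level timeout in place of withTimeout()/Sys.sleep() polling:
task <- mirai(
  .expr = targets:::target_run(target, globalenv(), "_targets"),
  .args = list(target = target),
  .timeout = 60000L          # resolves to an errorValue 5 'timed out' after 60s
)
call_mirai(task)             # block until the task resolves or the timeout fires
daemons(n = 0L)
if (!inherits(task$data, "tar_target")) {
  stop("task still unresolved after a minute")
}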

Again, it seems this one is 'almost fixed'.

Only 2 timeouts out of 20 runs this time: https://github.com/ropensci/targets/actions/runs/4919322001/jobs/8786791656

Given how much more actionable this issue is, do you think we should formally reopen it?

That's good. I think I've taken it as far as I can along this line of thinking, and it has yielded a definite improvement. I am trying something else in v0.8.3.9030 - package testing on all platforms at the moment.

Awesome! I'm noticing a huge improvement on my end too, only one timeout at https://github.com/ropensci/targets/actions/runs/4920543195 using mirai 0.8.3.9031.

Using the simple test code here: https://github.com/shikokuchuo/mirai/blob/dev/tests/tests.R
I am getting a pass on all except one https://github.com/shikokuchuo/mirai/actions/runs/4925269509/jobs/8799289113 - which really hangs (does not time out as it's supposed to).

It's quite puzzling as I've walked through the code and nowhere should it hang (those places have timeouts implemented).

The only thing that is 'wrong' with the test code is this spin cycle:

while (!length(url)) {
  url <- as.character(rownames(daemons()$daemons))
}

I am wondering if in crew/targets usage daemons() is ever called that often? I don't think this is an issue as NNG can handle much higher throughput than R can generate. But I wonder...

crew does daemons() in a couple places: in particular, when it starts the dispatcher and makes sure it is running. I can try making this polling interval more gentle in the tests.

And come to think of it, targets does its own polling. I might have to lighten that too, just for testing.

Hmm... GitHub appears to be malfunctioning today, which will delay my progress on this.

crew does daemons() in a couple places: in particular, when it starts the dispatcher and makes sure it is running. I can try making this polling interval more gentle in the tests.

I see your poll method has a seconds_interval argument so should be reasonable in real life.

And come to think of it, targets does its own polling. I might have to lighten that too, just for testing.

Just to be clear, I'm not saying make the tests less stringent than real life. But then it's probably not necessary to test clear anti-patterns as opposed to still likely corner cases.

Hmm, seems like a case of quantum mechanics - as soon as I instrument the tests to find where it's hanging (afe22c7), it no longer hangs: https://github.com/shikokuchuo/mirai/actions/runs/4926527459. I'll keep trying.

I think I've isolated the issue in this run: https://github.com/shikokuchuo/mirai/actions/runs/4926942122/jobs/8803210469. It triggers when I remove the cat() line between the following 2 commands. So it might be that there needs to be an infinitesimal gap of time between a server connecting and a task being launched. I'm afraid I can't sync the daemon launch as it connects to the dispatcher rather than the client.

launch_server(url = url)
task <- mirai(
  .expr = targets:::target_run(target, globalenv(), "_targets"),
  .args = list(target = target),
  .timeout = 60000L
)

So the above uses launch_server(), but for the equivalent processx command you might want to try adding a delay or something similar as a stopgap, to see if it solves the hanging. Then we can think about whether there is a more optimal solution.
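
As a stopgap, that delay could be as simple as the following (the 100 ms figure is arbitrary, not a recommendation):

launch_server(url = url)   # or the equivalent processx/callr launch
Sys.sleep(0.1)             # brief pause so the server can connect before a task is sent
task <- mirai(
  .expr = targets:::target_run(target, globalenv(), "_targets"),
  .args = list(target = target),
  .timeout = 60000L
)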

Thanks for tracking that down. For testing purposes, I will try to make these steps synchronous:

  1. Launch the worker.
  2. Wait for it to connect.
  3. Launch the task.

Related: I made improvements to polling in both targets and crew today, but unfortunately those don't seem to fix the hanging: https://github.com/ropensci/targets/actions/runs/4929468926/jobs/8809141804

Thanks for tracking that down. For testing purposes, I will try to make these steps synchronous:

1. Launch the worker.

2. Wait for it to connect.

3. Launch the task.

Sorry - my bad - the hanging was triggered when I removed that cat() line; however, looking at the actual test print results, it hangs during that daemons() poll loop https://github.com/shikokuchuo/mirai/actions/runs/4926942122/jobs/8803210469#step:7:209. Let me investigate further.

Thanks! I tried #53 (comment), but it looks like there is still hanging: https://github.com/ropensci/targets/actions/runs/4929755261. Trying the same thing again at https://github.com/ropensci/targets/actions/runs/4929798098 but with a 100ms delay after detecting a connection.

Yes, it seems to be that poll loop which is problematic: https://github.com/shikokuchuo/mirai/actions/runs/4929764794/jobs/8809815414#step:7:673 You can see that it always stops at 'daemons set', it never reaches the next marker which is 'got status', so it hangs during that loop.

This is after I decreased the polling frequency to 0.5s. So it can't be the case that messages are being dropped or anything like that. The receive also has a timeout built in. It's not obvious so I'll have to mull over this one.

But if there is a way to work around having to poll in the first place, I'd definitely go for that.

Sounds like a tough problem, but it's really encouraging how closely you have nailed it down.

An update: unsurprisingly, my version of #53 (comment) with the 100ms delay still showed hanging tests: https://github.com/ropensci/targets/actions/runs/4929798098.

I believe I have it! As I mentioned elsewhere, in mirai 0.8.4, the status matrix is constructed by the client, not sent as an object by dispatcher. This means that the server URLs are actually cached by mirai whenever the bus connection is established / saisei() is called.

Hence to find out the URL, you don't need to query daemons() at all - the URLs are stored as a character vector under the relevant .compute environment at $urls, e.g. here: https://github.com/shikokuchuo/mirai/blob/dev/tests/tests.R#L9

There is no public interface at present, but I am sure you can implement a workaround for now. I can make one available in the next release (am afraid the current update is already in a CRAN queue).

It seems to work in this test run: https://github.com/shikokuchuo/mirai/actions/runs/4930394270 - there is no need to poll! The URLs are written as part of setting daemons.
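
A sketch of that workaround, reading the cached URLs from mirai's internal compute-profile environment (the '..' object is unexported internal state, as seen in the error message earlier in this thread, so treat this strictly as a stopgap rather than a public API):

library(mirai)

daemons(n = 1L, url = "ws://127.0.0.1:0", token = TRUE)

# No polling: the server URLs are cached when daemons are set.
urls <- get("..", envir = asNamespace("mirai"))[["default"]][["urls"]]
launch_server(url = urls[1L])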

At a high level, I can see how this fixes the example from #53 (comment). And it's extremely helpful that you have isolated the problem. However, crew and its users frequently poll daemons() for other reasons besides discovering the URLs.

  1. When a crew controller starts, it polls daemons() until it receives confirmation that the dispatcher is running (or it times out). This still seems like a necessary step before launching servers because crew needs asyncdial = FALSE. (I forget where we talked about asyncdial = FALSE, but I think crew needs it in order to avoid the case of orphaned servers that try to connect after the startup window expired and the socket already rotated with saisei().)
  2. Every call to controller$pop() auto-scales the workers, which is important for handling transient workers with a long backlog of tasks. Every auto-scaling operation needs to poll daemons() again in order to check the online and instance fields to see which workers are connected, discovered, and lost. In targets and Shiny, these calls to controller$pop() need to happen several times a second in order to be efficient.
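
For point 2, a rough sketch of reading those fields from the status matrix (the derived labels are illustrative, not crew's actual names):

status <- daemons()$daemons
if (is.matrix(status)) {
  online   <- status[, "online"]    # 1 if a server is currently connected at that URL
  instance <- status[, "instance"]  # counts server instances that have connected
  connected  <- online == 1L
  discovered <- instance > 0L       # has connected at least once since launch
}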

At a high level, I can see how this fixes the example from #53 (comment). And it's extremely helpful that you have isolated the problem. However, crew and its users frequently poll daemons() for other reasons besides discovering the URLs.

I realise I've just given you the answer without the explanation! To be clear, after the further investigation yesterday, I've found no issue whatsoever with polling. The precise issue was calling for a daemons() status immediately after setting daemons, which fails in possibly 1 in 20 cases on Github. I have yet to reproduce this locally.

1. When a `crew` controller starts, it [polls `daemons()` until it receives confirmation that the dispatcher is running](https://github.com/wlandau/crew/blob/d92d2c4e23cd737429aa718393875033d77da5c2/R/crew_router.R#L231) (or it times out). This still seems like a necessary step before launching servers because `crew` needs `asyncdial = FALSE`. (I forget where we talked about `asyncdial = FALSE`, but I think `crew` needs it in order to avoid the case of orphaned servers that try to connect after the startup window expired and the socket already rotated with `saisei()`.)

This is absolutely not necessary any more as the socket connection is synced since mirai 0.8.4. If setting daemons returns you are good to go - the call will error if the client cannot sync with dispatcher and times out. Just retrieve the URLs directly to launch the servers - again these will be available if the daemons call returns (polling doesn't help). FYI the dispatcher pid is also available at $pid in the .compute environment.

2. Every call to `controller$pop()` [auto-scales the workers](https://github.com/wlandau/crew/blob/d92d2c4e23cd737429aa718393875033d77da5c2/R/crew_controller.R#L368-L369), which is important for handling transient workers with a long backlog of tasks. Every auto-scaling operation [needs to poll `daemons()` again](https://github.com/wlandau/crew/blob/d92d2c4e23cd737429aa718393875033d77da5c2/R/crew_controller.R#L228) in order to [check the `online` and `instance`](https://github.com/wlandau/crew/blob/d92d2c4e23cd737429aa718393875033d77da5c2/R/crew_controller.R#L567-L576) fields to see which workers are connected, discovered, and lost. In `targets` and Shiny, these calls to `controller$pop()` need to happen several times a second in order to be efficient.

I don't see any issue with this. Hope everything is clear now. Good luck with the tests!

That's really useful info, and it allowed me to simplify crew quite a lot: wlandau/crew@ff833e7. But unfortunately, I still see hanging tasks on my end: https://github.com/ropensci/targets/actions/runs/4936193312/jobs/8823439823. Based on previous logs, this is happening when checks on daemons() are okay, but a mirai task stays in an unresolved state indefinitely. It's really hard to reproduce without targets.

I wonder, could it have something to do with repeated calls to daemons() combined with repeated calls to nanonext::.unresolved()?

That's really good that it's allowed you to simplify things. I don't see how repeated calls to the above functions can cause problems.

Would you mind testing the latest build of mirai v. 0.8.4.9004 85953f8? I got CRAN feedback and I'm doing final tests on it. Notable change is that the default asyncdial is now FALSE across the package for additional safety. Also caught one bug in saisei().

I wonder, could it have something to do with repeated calls to daemons() combined with repeated calls to nanonext::.unresolved()?

This is a tricky issue. I just ran the tests from yesterday again just to confirm - and they do appear to be fixed - just through not polling daemons() right at the start. But perhaps it is still something to do with polling.

Did you have any specific concern with nanonext::.unresolved()? I don't believe that function has the capacity to hang as it is essentially just reading an int in a C struct on the local process.

With mirai 0.8.4.9004, the hanging is rare, but it still happens: https://github.com/ropensci/targets/actions/runs/4936588271/jobs/8824297563. From previous tests, I believe this is still a case where the dispatcher is running, daemons() polls just fine, and the server is running fine, but the task is still stuck at unresolved.

I have not been able to isolate this in a small reproducible example without targets, even after weeks of trying, so I wonder if it would be possible to run the same tests on a dev fork of mirai which prints a verbose log (or even a trace).

With mirai 0.8.4.9004, the hanging is rare, but it still happens: https://github.com/ropensci/targets/actions/runs/4936588271/jobs/8824297563. From previous tests, I believe this is still a case where the dispatcher is running, daemons() polls just fine, and the server is running fine, but the task is still stuck at unresolved.

Thanks! At least it is no worse than before.

I have not been able to isolate this in a small reproducible example without targets, even after weeks of trying, so I wonder if it would be possible to run the same tests on a dev fork of mirai which prints a verbose log (or even a trace).

Let me know what you want to try. I was going to suggest if it was at all possible to instrument the tests a bit (so we can try to isolate where exactly it is hanging).

I just had a thought - the tests from yesterday were all just repeatedly doing one run, which now succeeds. In the targets tests that fail, is saisei() being called? I wonder if it is to do with switching the listeners between tasks. Do you have any tests in which saisei() isn't called, for comparison?

Let me know what you want to try. I was going to suggest if it was at all possible to instrument the tests a bit (so we can try to isolate where exactly it is hanging).

If you have any suggestions for how to isolate the tests, I am eager to try. As soon as I try to peel back the layers, either the test passes or the error is different.

I just had a thought - the tests from yesterday were all just repeatedly doing one run, which now succeeds. In the targets tests that fail, is saisei() being called?

I checked locally, and those targets tests actually do not call saisei() at all.