golang / go

The Go programming language

Home Page: https://go.dev


runtime: time.Sleep takes more time than expected on Windows (1ms -> 10ms)

egonelbre opened this issue

This seems to be a regression with Go 1.16 time.Sleep.

What version of Go are you using (go version)?

$ go version
go version go1.16 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=Z:\gocache
set GOENV=C:\Users\egone\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=F:\Go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=F:\Go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=c:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.16
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=f:\temp\sleep\go.mod
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=Z:\Temp\go-build3014714148=/tmp/go-build -gno-record-gcc-switches

What did you do?

package main

import (
	"fmt"
	"time"

	"github.com/loov/hrtime"
)

func main() {
	b := hrtime.NewBenchmark(100)
	for b.Next() {
		time.Sleep(50 * time.Microsecond)
	}
	fmt.Println(b.Histogram(10))
}

hrtime is a package that uses RDTSC for time measurements.

Output on Windows 10:

> go1.16 run .
  avg 15.2ms;  min 55.3µs;  p50 15.5ms;  max 16.3ms;
  p90 16.1ms;  p99 16.3ms;  p999 16.3ms;  p9999 16.3ms;
     55.3µs [  2] █
        2ms [  0]
        4ms [  0]
        6ms [  0]
        8ms [  0]
       10ms [  1] ▌
       12ms [  0]
       14ms [ 75] ████████████████████████████████████████
       16ms [ 22] ███████████▌
       18ms [  0]

> go1.15.8 run .
  avg 1.03ms;  min 63.9µs;  p50 1ms;  max 2.3ms;
  p90 1.29ms;  p99 2.3ms;  p999 2.3ms;  p9999 2.3ms;
     63.9µs [  1] ▌
      500µs [ 47] ███████████████████████████████████████
        1ms [ 48] ████████████████████████████████████████
      1.5ms [  1] ▌
        2ms [  3] ██
      2.5ms [  0]
        3ms [  0]
      3.5ms [  0]
        4ms [  0]
      4.5ms [  0]

Output on Linux (Debian 10):

$ go1.16 run test.go
  avg 1.06ms;  min 1.06ms;  p50 1.06ms;  max 1.08ms;
  p90 1.07ms;  p99 1.08ms;  p999 1.08ms;  p9999 1.08ms;
     1.06ms [  4] █▌
     1.07ms [ 84] ████████████████████████████████████████
     1.07ms [  7] ███
     1.08ms [  3] █
     1.08ms [  1]
     1.09ms [  1]
     1.09ms [  0]
      1.1ms [  0]
      1.1ms [  0]
     1.11ms [  0]

$ go1.15.8 run test.go
  avg 86.7µs;  min 57.3µs;  p50 83.6µs;  max 132µs;
  p90 98.3µs;  p99 132µs;  p999 132µs;  p9999 132µs;
     57.3µs [  2] █
       60µs [  1] ▌
       70µs [ 13] ████████
       80µs [ 64] ████████████████████████████████████████
       90µs [ 11] ██████▌
      100µs [  2] █
      110µs [  1] ▌
      120µs [  3] █▌
      130µs [  3] █▌
      140µs [  0]

Even on Windows, the timer granularity shouldn't be that bad, so something seems to be going wrong somewhere.

That same CL shook out a number of kernel and runtime bugs in various configurations. (See previously #43067, #42515, #42237; cc @prattmic.)

This is reproducible with a trivial benchmark in the time package:

func BenchmarkSimpleSleep(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Sleep(50 * Microsecond)
	}
}

amd64/linux, before/after http://golang.org/cl/232298:

name            old time/op  new time/op   delta
SimpleSleep-12  86.9µs ± 0%  609.8µs ± 5%  +601.73%  (p=0.000 n=10+9)

For reference, across different sleep times:

name                  old time/op  new time/op   delta
SimpleSleep/1ns-12     460ns ± 3%    479ns ± 1%    +4.03%  (p=0.000 n=10+9)
SimpleSleep/100ns-12   466ns ± 3%    476ns ± 2%    +2.35%  (p=0.001 n=10+9)
SimpleSleep/500ns-12  6.47µs ±11%   6.70µs ± 5%      ~     (p=0.105 n=10+10)
SimpleSleep/1µs-12    10.3µs ±10%   12.2µs ±13%   +18.23%  (p=0.000 n=10+10)
SimpleSleep/10µs-12   81.9µs ± 1%  502.5µs ± 4%  +513.45%  (p=0.000 n=10+10)
SimpleSleep/50µs-12   87.0µs ± 0%  622.9µs ±18%  +615.69%  (p=0.000 n=8+10)
SimpleSleep/100µs-12   179µs ± 0%   1133µs ± 1%  +533.52%  (p=0.000 n=8+10)
SimpleSleep/500µs-12   592µs ± 0%   1137µs ± 1%   +91.97%  (p=0.000 n=10+10)
SimpleSleep/1ms-12    1.12ms ± 2%   1.14ms ± 1%    +1.36%  (p=0.000 n=9+10)
SimpleSleep/10ms-12   10.2ms ± 0%   10.3ms ± 0%    +0.79%  (p=0.000 n=9+9)
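
For reference, the per-duration variants above can be produced with sub-benchmarks along these lines (a sketch, not necessarily the exact code used; like the snippet above, it assumes the time package's own tests, hence the unqualified Sleep and Duration):

func BenchmarkSimpleSleep(b *testing.B) {
	durations := []Duration{
		Nanosecond, 100 * Nanosecond, 500 * Nanosecond,
		Microsecond, 10 * Microsecond, 50 * Microsecond,
		100 * Microsecond, 500 * Microsecond,
		Millisecond, 10 * Millisecond,
	}
	for _, d := range durations {
		d := d // capture for the sub-benchmark closure
		b.Run(d.String(), func(b *testing.B) {
			// Each sub-benchmark name matches the table rows, e.g. SimpleSleep/50µs.
			for i := 0; i < b.N; i++ {
				Sleep(d)
			}
		})
	}
}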

Looking at the 100µs case, the immediate problem is the delay resolution in netpoll.

Prior to http://golang.org/cl/232298, 95% of timer expirations in the 100µs case are detected by sysmon, which calls startm to wake an M to handle the timer (though this is not a particularly efficient approach).

After http://golang.org/cl/232298, this path is gone and the wakeup must come from netpoll (assuming all Ms are parked/blocked). netpoll on Linux only has 1ms resolution, so it must sleep at least that long before detecting the timer.
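
Concretely, the 1ms floor comes from converting the nanosecond delay into the millisecond timeout that epoll_wait accepts. Paraphrasing the logic in the runtime's netpoll_epoll.go (a simplified sketch, not the verbatim source):

// epollWaitMs mirrors how netpoll rounds its nanosecond delay to
// epoll_wait's millisecond timeout: any delay under 1ms becomes 1ms.
func epollWaitMs(delay int64) int32 {
	switch {
	case delay < 0:
		return -1 // block indefinitely
	case delay == 0:
		return 0 // non-blocking poll
	case delay < 1e6:
		return 1 // sub-millisecond delays round up to 1ms
	case delay < 1e15:
		return int32(delay / 1e6)
	default:
		return 1e9 // arbitrary cap on timer waits (~11.5 days)
	}
}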

I'm not sure why I'm seeing ~500µs on the 10µs and 50µs benchmarks, but I may be seeing a bimodal distribution where in ~50% of cases a spinning M is still awake long enough to detect the timer before entering netpoll.

I'm also not sure why @egonelbre is seeing ~14ms on Windows, as that also appears to have 1ms resolution on netpoll.

I think the ideal fix to this would be to increase the resolution of netpoll. Even for longer timers, this limited resolution will cause slight skew to timer delivery (though of course there are no real-time guarantees).

As it happens, Linux v5.11 includes epoll_pwait2 which switches the timeout argument to a timespec for nanosecond resolution. Unfortunately, Linux v5.11 was released ... 3 days ago, so availability is not widespread to say the least.

In the past, I've also prototyped changing the netpoll timeout to being controlled by a timerfd (with the intention of being able to adjust the timer earlier without a full netpollBreak). That could be an option as well.

Both of these are Linux-specific solutions, I'd have to research other platforms more to get a sense of the options there.

We also may just want to bring the sysmon wakeup back, perhaps with slight overrun allowed to avoid excessive M wakeups.

I guess that wakeNetPoller doesn't help, because there is no poller sleeping at the point of calling time.Sleep.

Perhaps when netpoll sees a delay that is shorter than the poller resolution it should just do a non-blocking poll. That will effectively turn findrunnable into a busy wait when the next timer is very soon.

While working on CL232298 I definitely observed anecdotal evidence that the netpoller has more latency than other ways of sleeping. From #38860 (comment):

My anecdotal observation here is that it appears the Linux netpoller implementation has more latency when waking after a timeout than the Go 1.13 timerproc implementation. Most of these benchmark numbers replicate on my Mac laptop, but the darwin netpoller seems to suffer less of that particular latency by comparison, and is also worse in other ways. So it may not be possible to close the gap with Go 1.13 purely in the scheduler code. Relying on the netpoller for timers changes the behavior as well, but these new numbers are at least in the same order of magnitude as Go 1.13.

I didn't try to address that in CL232298 primarily because it was already risky enough that I didn't want to make bigger changes. But an idea for something to try occurred to me back then. Maybe we could improve the latency of non-network timers by having one M block on a notesleep call instead of the netpoller. That would require findrunnable to compute two wakeup times, one for net timers to pass to the netpoller and one for all other timers to use with notesleep depending on which role it takes on (if any) when it cannot find any other work.

I haven't fully gauged how messy that would get.

Questions and concerns:

  • Coordinating two sleeping M's probably has complicating edge cases to figure out.
  • I haven't tested the latency of notesleep wakeups to know if it would actually help.
  • Would it require duplicating all the timer fields on each P, one for net timers and one for the others?

One other oddity that I noticed when testing CL232298: The linux netpoller sometimes wakes up from the timeout early by several microseconds. When that happens, findrunnable usually does not find any expired timers since they haven't actually expired yet. A new--very short--pollUntil value gets computed and the M reenters the netpoller. The subsequent wakeup is then typically rather late, maybe up to 1ms, but I am going from memory here. I might be able to dig up some trace logs showing this behavior if I still have them and people are interested.

I guess that wakeNetPoller doesn't help, because there is no poller sleeping at the point of calling time.Sleep.

wakeNetPoller shouldn't matter either way, because even if we wake the netpoller, it will just sleep again with a new timeout of 1ms, which is too long. (Unless wakeNetPoller happens to take so long that the timer has expired by the time the woken M gets to checkTimers).

Perhaps when netpoll sees a delay that is shorter than the poller resolution it should just do a non-blocking poll. That will effectively turn findrunnable into a busy wait when the next timer is very soon.

Maybe we could improve the latency of non-network timers by having one M block on a notesleep call instead of the netpoller.

As somewhat of a combination of these, one potential option would be to make netpoll with a short timeout do non-blocking netpoll, short notetsleep, non-blocking netpoll. Though this has the disadvantage of slightly increasing latency of network events from netpoll.

One other oddity that I noticed when testing CL232298: The linux netpoller sometimes wakes up from the timeout early by several microseconds. When that happens, findrunnable usually does not find any expired timers since they haven't actually expired yet. A new--very short--pollUntil value gets computed and the M reenters the netpoller. The subsequent wakeup is then typically rather late, maybe up to 1ms, but I am going from memory here. I might be able to dig up some trace logs showing this behavior if I still have them and people are interested.

Hm, this sounds like another bug, or perhaps a spurious netpollBreak from another M.

It seems that on Windows, notetsleep has 1ms precision in addition to netpoll, so the explanation in #44343 (comment) doesn't explain the increase in latency on Windows.

My first thought on the Windows behavior is that somehow osRelax is being mismanaged, allowing the timer resolution to fall back to its resting mode. That thought is driven by the spike in the above histograms at ~15ms. I haven't yet thought through how that might happen.

Hm, this sounds like another bug, or perhaps a spurious netpollBreak from another M.

That could be, but I was logging at least some of the calls to netpollBreak as well and don't recall seeing that happen. I saved my logging code in case it can help. https://github.com/ChrisHines/go/tree/dlog-backup

For reference, output on my Windows 10:

> go1.16 run .
  avg 1.06ms;  min 475µs;  p50 1.01ms;  max 1.99ms;
  p90 1.13ms;  p99 1.99ms;  p999 1.99ms;  p9999 1.99ms;
      475µs [  1] ▌
      600µs [  1] ▌
      800µs [ 36] ██████████████████████████
        1ms [ 55] ████████████████████████████████████████
      1.2ms [  0]
      1.4ms [  0]
      1.6ms [  0]
      1.8ms [  7] █████
        2ms [  0]
      2.2ms [  0]

Totally different results from #44343 (comment).

go env Output
$ go env
set GO111MODULE=on
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\user\AppData\Local\go-build
set GOENV=C:\Users\user\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Projects\Go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Projects\Go
set GOPRIVATE=
set GOPROXY=
set GOROOT=C:\Tools\Go\go1.16
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Tools\Go\go1.16\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.16
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=NUL
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=C:\Users\user\AppData\Local\Temp\go-build2862485594=/tmp/go-build -gno-record-gcc-switches

@vellotis this could be because there's something running in the background changing the Windows timer resolution. This could be some other Go service/binary built using an older Go version. Of course, there can be plenty of other programs that may change it.

You can use https://github.com/tebjan/TimerTool to see what the current value is. There's some more detail in https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted.
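
One way to test that hypothesis is to raise the resolution from inside the benchmark process itself and compare the results. A minimal sketch for Windows, calling the documented winmm.dll timeBeginPeriod API (the same thing TimerTool adjusts); the 100-iteration loop is just for illustration:

package main

import (
	"fmt"
	"syscall"
	"time"
)

func measure(label string) {
	const n = 100
	start := time.Now()
	for i := 0; i < n; i++ {
		time.Sleep(50 * time.Microsecond)
	}
	fmt.Printf("%s: avg %v per Sleep\n", label, time.Since(start)/n)
}

func main() {
	measure("default resolution")

	// Request 1ms global timer resolution, as pre-1.16 Go runtimes
	// effectively did via osRelax/timeBeginPeriod.
	winmm := syscall.NewLazyDLL("winmm.dll")
	winmm.NewProc("timeBeginPeriod").Call(1)

	measure("after timeBeginPeriod(1)")
}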

For what it's worth, go1.16.2 darwin/amd64 is also exhibiting this. In a program I'm running, the line time.Sleep(1 * time.Hour) takes roughly an hour and 3 minutes each time.

@zobkiw It sounds like this is more likely to be related to #44868. I'm also curious whether the failure is consistent: is it really always 1hr 3min, not 1hr 2min (or rather, ranging from 2-4min, since alignment with the 2min forcegc will vary)?

@prattmic Strangely, I just checked the output of the script now and (although I just restarted it a few hours before) it was spot on at one hour twice in a row. However, yesterday it was "generally" 3 minutes. I don't have an exact time since we were only seeing the minute logged. It was always 3 minutes (rounded) from the previous run. 1:00pm, 2:03pm, 3:06pm, 4:09pm, etc.

Some things to note about this loop I'm running: it calls a shell script using exec.Command before sleeping for an hour, then does it again. The script takes about 10 seconds to execute and completes its job. It is also running in a screen session, but so was the one this morning that was doing fine, so I don't think that was it. The machine it was running on, a mac mini, was basically idle other than this lightweight job.

If you have some code you would like me to run I would be happy to - otherwise I will keep an eye on it here as well and report anything else I see. Hopefully some of this is helpful in narrowing down the cause if you haven't already.

UPDATE 3-20-2021 10am ET: I ran my program for about the last 24 hours and sure enough, it would run fine for the first few iterations and then start to sleep longer, based on the logging. Mostly it would hover between 3-5 minutes late, but once it was 9 minutes! This morning I wrote another program using the same logic but with simulated activity (since the actual tasks did not seem to be part of the problem) and much more detailed logging. I have it running now in 3 environments, two using screen and one not, on an M1 and an Intel mac mini, both running Big Sur 11.2.3. One thing that struck me this morning was wondering whether the machine being asleep at all caused the delay, since it runs OK for a while (while I was using the machine, presumably) and then has delays overnight. This morning, the latest reading (while I was using the machine again to install the new test code) was back to 1 hour spot on, no delay. Once I get some more data at the end of the day or tomorrow morning I will report back and share the code if there remains a problem.


3 minutes per hour is rather worrisome, especially if cron-like functionality is required in a long-running service...
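
For cron-like scheduling, one way to keep the error from compounding (whatever its cause) is to sleep until absolute deadlines instead of chaining fixed-length sleeps; each run may still start late, but the drift doesn't accumulate. A minimal sketch:

package main

import (
	"log"
	"time"
)

func main() {
	// Derive each deadline from the wall clock rather than sleeping a
	// fixed hour, so a 3-minute overshoot doesn't add up hour after hour.
	next := time.Now().Truncate(time.Hour).Add(time.Hour)
	for {
		time.Sleep(time.Until(next))
		log.Printf("tick; overshoot %v", time.Since(next))
		next = next.Add(time.Hour)
	}
}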


In case it's related: on Windows, timers tick while the system is in standby/suspend, but they pause on other platforms.

This is supposed to have been addressed already; see #36141.

@prattmic et al, I took the liberty of writing a testbed to experiment with this issue. See all the details of test runs on Linux, Intel, and M1, along with source code here:

https://github.com/zobkiw/sleeptest

Comments welcome - especially if you find a bug in the code. Also, see the results directory for the output of the tests. Regardless, these results seem strange to me. I look forward to other input on this issue.


cc @zx2c4 :-)

@zobkiw could it be something due to cpu frequency scaling? Can you try locking the cpu to a specific frequency (something low) and see whether the problem disappears?

I added a comment to my testbed README about a short Python test - Python has the same issues.

@egonelbre I'm not familiar enough with CPU frequency scaling to mess with that. There may be others researching this issue that are better suited to take a step digging in at that level. Thanks for the suggestion.

@zobkiw It would be handy if you could see if Go 1.15 suffers from the same problem in your environment. But given that you are seeing similar issues with Python, this sounds like it is probably an OS issue.

@zobkiw It would be handy if you could see if Go 1.15 suffers from the same problem in your environment. But given that you are seeing similar issues with Python, this sounds like it is probably an OS issue.

I agree it is likely an OS issue.

I can't change the version on the Intel machine since others depend on it, but can on the M1. Mind you, will go1.15.10.darwin-amd64 run properly on the M1 and be a reasonable test?

@zobkiw I failed to mention earlier that #44868 has already been fixed, and the fix was just released in 1.16.3. I would love to know if your issue still occurs on 1.16.3.

@prattmic i'll try it - thanks for the heads up! Will report back tomorrow.

UPDATE: Seems much better. Runs on an M1 under 1.16.3 included the following times:

amd64: [59m59.90839s 1h0m0.062829s 59m59.992974s]
arm64: [59m59.979776s 1h0m0.019405s 59m59.98669s 1h0m0.001715s 1h0m0.001987s 59m59.976767s]

I ignored a final arm64 time that took over 2 hours (although it is in the results in the repo) since the machine was asleep (closed) and then woken back up which allowed it to finally complete. The repo with the sample code and explanation has been updated. Thanks everyone for looking into this and resolving it!

I was pointed to this issue after posting the following on Reddit: https://www.reddit.com/r/golang/comments/mq9jt5/poor_rate_limiter_performance_on_windows/

The gist is that rate limiters (golang.org/x/time/rate, go.uber.org/ratelimit, etc.) are unusable in v1.16.3 on Windows for even modest rates (hundreds of tokens per second). v1.15.11 can do a 10x higher rate. For my use case (rates less than 15,000), there is no impact on Linux.

@prattmic Do you know the status of this issue now that #44868 has been fixed: is there more to do here? Specifically, do you think there's more we need to do for 1.17, or can the rest of the work happen in a later release? (If so, we should move this to Backlog.)

The latest comment from @freb could be made into a new issue for the golang.org/x/time/rate package directly, if this runtime issue is no longer very actionable.

@dmitshur

I just ran this:

# go run .
  avg 830µs;  min 81.1µs;  p50 1.13ms;  max 1.97ms;
  p90 1.48ms;  p99 1.63ms;  p999 1.77ms;  p9999 1.97ms;
     81.1µs [3658] ████████████████████████████████████████
      200µs [ 487] █████
      400µs [  48] ▌
      600µs [  77] ▌
      800µs [ 276] ███
        1ms [ 867] █████████
      1.2ms [2722] █████████████████████████████▌
      1.4ms [1630] █████████████████▌
      1.6ms [ 227] ██
      1.8ms [   8] 

# go version
go version go1.16.4 linux/amd64

This is still a huge regression in my opinion, far from fixed. I have been checking this issue regularly for months, because I can't believe that such a large regression wasn't considered worth reverting the CL, regardless of the other fixes that were included in it.

I would really like to be able to rely on timers again in Go 1.17. As @freb mentions, this directly impacts rate limiting logic, but timers are also fundamental to so much more. idk. That's just my opinion.

Perhaps a new issue is required.

EDIT: running on Go 1.15.12 on the same machine has the following results:

# go run .
  avg 211µs;  min 77.8µs;  p50 192µs;  max 1.62ms;
  p90 303µs;  p99 411µs;  p999 572µs;  p9999 1.62ms;
     77.8µs [ 151] █
      100µs [ 910] ███████▌
      150µs [4801] ████████████████████████████████████████
      200µs [2111] █████████████████▌
      250µs [ 969] ████████
      300µs [ 643] █████
      350µs [ 287] ██
      400µs [  84] ▌
      450µs [  24] 
      500µs+[  20] 

# go version
go version go1.15.12 linux/amd64

Still much higher than the 50 microseconds that the benchmark sleeps for, but it's an order of magnitude closer to the expected latency.

#44868 is unrelated to this issue, so there is still work to do here, both for 1.17 and likely backport to 1.16.

I've been a passive observer of this issue for a while, but I decided tonight to try to quantify it better. Clearly this is a problem with time.Sleep(50 * time.Microsecond), but where does the problem start, and where does it end?

Go 1.15 (ARM, Linux): [graph]

Go 1.16 (ARM, Linux): [graph]

Go 1.16 (ARM, Mac): [graph]

Go 1.15 (x86-64): [graph]

Go 1.16 (x86-64): [graph]

(EDIT: the last graph above says p95... it should say p99 too)

Unfortunately, I didn't have time to collect the data for any non-virtualized Linux machines tonight, but I could probably do that some other time if anyone thinks it's important and they simply don't have time to contribute similar graphs themselves.

Now, looking at these graphs, they are scaled logarithmically on the y-axis. The formula used to calculate the y-axis ("timing error") is (pX / sleep_ns) - 1, where pX is p50, p95, or p99. sleep_ns is the actual sleep time that we passed to time.Sleep.

Ideally, there would only be a flat line at 0, representing that we slept for exactly the requested amount of time every time. In the real world, this is obviously impossible.
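
In code, the metric works out to something like this (a sketch; samples holds the measured sleep durations for a single target):

import (
	"sort"
	"time"
)

// timingError returns (pX / sleep_ns) - 1 for the given percentile,
// e.g. pct = 0.50, 0.95, or 0.99; 0 means we slept exactly as requested.
func timingError(target time.Duration, samples []time.Duration, pct float64) float64 {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	p := sorted[int(pct*float64(len(sorted)-1))]
	return float64(p)/float64(target) - 1
}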

It's interesting that the x86-64 and ARM results diverged so heavily in their overall appearance.

On Go 1.16 on ARM, we can see a rhythmic motion to the line that seems to indicate a preference towards whole numbers of milliseconds -- I think it was mentioned earlier in this thread that this was pretty much expected fallout from the netpoller.

On Go 1.15 on ARM, we can see a much faster drop in the error, and then it stays lower.

Fascinatingly, Go 1.16 on Mac shows an extremely stable result... the error drops below 40% once the target reaches only 20µs (time.Sleep(20 * time.Microsecond)), and it quickly drops to 30-35% shortly after that. Weirdly, it never gets as low as it does on Linux... presumably this is macOS doing timer coalescing, even while plugged in? That would be my guess.

The results on x86-64 appear to paint a similar, but still distinct picture. The Go 1.16 results appear to have about 3x as much error as compared to Go 1.15.

Beyond about 3ms as the target sleep period, Go 1.15 and Go 1.16 seemed to converge in terms of error, so I focused my final test runs on the period below 3ms.

I have attached the raw data and the code I was using, which is modeled after the benchmark code provided earlier in this discussion: benchmark.zip

EDIT: right after posting, I found an error with how I generated the graphs for x86-64, but I have uploaded corrected versions now

@coder543 Thanks for the analysis, that is helpful.

I definitely think we should fix this for 1.18, probably with some variant of the ideas in @ChrisHines' #44343 (comment) and replies.

For 1.16 and 1.17, I don't think we should do anything. There is no easy fix for this. I think the only possible backport would be to re-add sysmon timer checks, but that has pretty big efficiency effects, so it isn't great. I've not seen mention of bad effects this is causing in real applications (please correct me if you have a case), so I think waiting for 1.18 makes sense.

@prattmic you can see my comment for impact on a real application. I'm using a rate limiter to control the number of ports per second my port scanner is able to scan. With Go 1.16 on Windows, this maxes out at 500-600 per second. On Linux it is about 4 times that, but still hits a pretty low ceiling.

Any application that requires a rate limiter with more granular resolution than this is currently out of luck. Just how many applications that applies to, I can't say. But it does affect my real world application.

Ah, I missed that comment, thanks for the heads up @freb. I'll think a bit more about what we could do as a backport.

I think I'm hitting this too, in an emulator I'm writing. I call time.Sleep to slow down my emulated processor loop, as my real hardware is much faster than the system I'm emulating.
On Linux it works as expected (a sleep of about 30 microseconds works well), but on Windows even a sleep of 1 nanosecond causes a huge slowdown, and my processor cycle count drops about 1,000 below what I get with the sleep removed.


You can't sleep less than 1-2ms on Windows, without resorting to the WinAPI. This has been discussed at length in other golang issues; see the Windows tag.
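
For reference, the WinAPI route looks roughly like this: a waitable timer created with the high-resolution flag, the same mechanism CL 248699 later adopted for the runtime's usleep. A sketch, with the SDK constants inlined because package syscall doesn't export them:

package main

import (
	"fmt"
	"syscall"
	"time"
	"unsafe"
)

// Values from the Windows SDK headers.
const (
	createWaitableTimerHighResolution = 0x00000002
	timerAllAccess                    = 0x1F0003
	infinite                          = 0xFFFFFFFF
)

func main() {
	k32 := syscall.NewLazyDLL("kernel32.dll")
	createTimer := k32.NewProc("CreateWaitableTimerExW")
	setTimer := k32.NewProc("SetWaitableTimer")
	wait := k32.NewProc("WaitForSingleObject")

	h, _, err := createTimer.Call(0, 0, createWaitableTimerHighResolution, timerAllAccess)
	if h == 0 {
		panic(err) // the high-res flag needs Windows 10 1803 or later
	}
	defer syscall.CloseHandle(syscall.Handle(h))

	due := int64(-500 * 10) // relative 500µs, in negative 100ns units
	start := time.Now()
	if r, _, err := setTimer.Call(h, uintptr(unsafe.Pointer(&due)), 0, 0, 0, 0); r == 0 {
		panic(err)
	}
	wait.Call(h, infinite)
	fmt.Println("slept", time.Since(start))
}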

Thanks @networkimprov, I think a lot of Go devs aren't going to be diving through the issue history in this repo. A note to that effect in the Sleep function's docs would go a long way.


There aren't a lot of Windows devs working in/on Go at this point, from what I've seen.

FWIW, I can reproduce this on my home Windows PC with 1.16.3 and 1.17beta1, using the code from the original issue:

C:\Users\Chris\proj\goissue44343-slow-timer>go1.17beta1 run .
  avg 13.9ms;  min 178µs;  p50 15.8ms;  max 16.2ms;
  p90 16.1ms;  p99 16.2ms;  p999 16.2ms;  p9999 16.2ms;
      178µs [ 12] █████████▌
        2ms [  0]
        4ms [  0]
        6ms [  0]
        8ms [  0]
       10ms [  0]
       12ms [  0]
       14ms [ 50] ████████████████████████████████████████
       16ms [ 38] ██████████████████████████████
       18ms [  0]


C:\Users\Chris\proj\goissue44343-slow-timer>go run .          [this is Go 1.16.3]
  avg 14.4ms;  min 196µs;  p50 15.5ms;  max 16.2ms;
  p90 16ms;  p99 16.2ms;  p999 16.2ms;  p9999 16.2ms;
      196µs [  7] ████
        2ms [  0]
        4ms [  0]
        6ms [  0]
        8ms [  0]
       10ms [  1] ▌
       12ms [  0]
       14ms [ 67] ████████████████████████████████████████
       16ms [ 25] ██████████████▌
       18ms [  0]


C:\Users\Chris\proj\goissue44343-slow-timer>go1.15.9 run .    
  avg 988µs;  min 247µs;  p50 1ms;  max 1.47ms;
  p90 1ms;  p99 1.47ms;  p999 1.47ms;  p9999 1.47ms;
      248µs [  1] ▌
      400µs [  2] █
      600µs [  0]
      800µs [ 26] ██████████████▌
        1ms [ 70] ████████████████████████████████████████
      1.2ms [  0]
      1.4ms [  1] ▌
      1.6ms [  0]
      1.8ms [  0]
        2ms [  0]

No promises, but I'll try to find some time to test it with an instrumented build to see if I can spot any clues.

I found the source of the change in behavior. golang.org/cl/248699 improved the Windows version of runtime.usleep to use a high-res timer when available (currently when GOARCH == "386" || GOARCH == "amd64"). It also changed osRelax to a noop when the high-res timer is available. So as I suspected in #44343 (comment), the timer resolution used by runtime timers stays at the default Windows resolution of ~15ms.

I confirmed this by modifying the runtime to disable the high-res timer logic in go1.17beta1 and I got these results:

  avg 1.1ms;  min 897µs;  p50 1ms;  max 2ms;
  p90 1.52ms;  p99 2ms;  p999 2ms;  p9999 2ms;
      897µs [ 10] █████
        1ms [ 78] ████████████████████████████████████████
      1.2ms [  0]
      1.4ms [  5] ██▌
      1.6ms [  0]
      1.8ms [  4] ██
        2ms [  3] █▌
      2.2ms [  0]
      2.4ms [  0]
      2.6ms [  0]

/cc @alexbrainman


Are you sure the test is precise? That is to say - is it possible that we're sleeping with high levels of precision, but nanotime() is now in 16ms chunks because we're no longer using the global battery draining timeBeginPeriod() calls, so your test isn't precisely measuring timer precision?

That's a good question. I will try to verify that soon.

The test uses QueryPerformanceCounter for time measurement.

The library uses QueryPerformanceCounter internally for NewBenchmark.
Without the library the benchmark roughly looks like:

var experiments [N]int64
for i := range experiments {
    QueryPerformanceCounter(&experiments[i])
    time.Sleep(50 * time.Microsecond)
}
fmt.Println(createHistogram(experiments[:]))

Similarly, #44343 shows the same problem, but doesn't use fine-grained measurements.

Are you sure the test is precise? That is to say - is it possible that we're sleeping with high levels of precision, but nanotime() is now in 16ms chunks because we're no longer using the global battery draining timeBeginPeriod() calls, so your test isn't precisely measuring timer precision?

Yes, the test looks precise. As @egonelbre pointed out, the benchmark in the test uses QueryPerformanceCounter rather than nanotime to record times. In addition, I ran a version of the test on a debuglog instrumented version of the 1.17beta1 runtime to see how the time.Sleep is interacting with the netpoller and sysmon. Here are the interesting excerpts of the logs I got from that.

First, the version where high-res timers are enabled and osRelax always returns 0 without calling timeBeginPeriod:

Ten iterations produced this histogram.

  avg 14.7ms;  min 8.18ms;  p50 15.5ms;  max 15.9ms;
  p90 15.9ms;  p99 15.9ms;  p999 15.9ms;  p9999 15.9ms;
     8.18ms [  1] █████
        9ms [  0] 
       10ms [  0] 
       11ms [  0] 
       12ms [  0] 
       13ms [  0] 
       14ms [  1] █████
       15ms [  8] ████████████████████████████████████████
       16ms [  0] 
       17ms [  0] 

The debuglog output, starting from the point where main.main is called, is shown below. In particular note:

  • sysmon, which uses usleep, wakes up at ~500 micro-second intervals as measured by nanotime
  • netpoller, which uses GetQueuedCompletionStatusEx, wakes up 8ms late after the first time.Sleep and ~15ms late on the second time.Sleep; look for the lines that say "findrunnable return from netpoller ..."
   ...
[0.004087300 P 0] main.main
[0.004087300 P 0] gopark G 1 sleep
[0.004087300 P 0] wakeNetPoller wakep for target time 4137300
[0.004087300 P 0] wakep startm
[0.004087300 P 0] startm claim P 1 for M 0
[0.004087300 P 0] schedule M 0
[0.004087300 P 0] findrunnable ran checkTimers, pollUntil 4137300
[0.004087300 P 0] findrunnable skipping work stealing; spinning M's 1 busy M's 2
[0.004087300 P -1] findrunnable block on netpoller oldP 0 M 0 delay 50000 ns poll until 4137300 was spinning false
[0.004087300 P 1] findrunnable woke from stopm M 3
[0.004087300 P 1] findrunnable ran checkTimers, no pending timers
[0.004087300 P 1] findrunnable work stealing
[0.004087300 P -1] findrunnable last timer check for P 0 found timer 4137300 oldP 1 M 3
[0.004087300 P -1] findrunnable stopm oldP 1 M 3
[0.004598700 P -1] sysmon wake from usleep
[0.004598700 P -1] sysmon check, gcwaiting 0 npidle 16
[0.004598700 P -1] sysmon next timer target 4137300
[0.004598700 P -1] sysmon idle count 7 usleep 20
[0.005117200 P -1] sysmon wake from usleep
[0.005117200 P -1] sysmon check, gcwaiting 0 npidle 16
[0.005117200 P -1] sysmon next timer target 4137300
   ... sysmon loops elided ...
[0.011351000 P -1] sysmon idle count 20 usleep 20
[0.011865800 P -1] sysmon wake from usleep
[0.011865800 P -1] sysmon check, gcwaiting 0 npidle 16
[0.011865800 P -1] sysmon next timer target 4137300
[0.011865800 P -1] sysmon idle count 21 usleep 20
[0.012372100 P -1] findrunnable return from netpoller set lastpoll 12372100 late ns 8234800
[0.012372100 P 1] findrunnable return from netpoller M 0 delay was 50000 ns
[0.012372100 P 1] findrunnable ran checkTimers, no pending timers
[0.012372100 P 1] findrunnable work stealing
[0.012372100 P 1] findrunnable woke G 1 while stealing timers from P 0
[0.012372100 P 1] wakep startm
[0.012372100 P 1] startm claim P 0 for M 0
[0.012372100 P -1] sysmon wake from usleep
[0.012372100 P 1] schedule run G 1
[0.012372100 P -1] sysmon idle count 22 usleep 20
[0.012372100 P 0] findrunnable woke from stopm M 3
[0.012372100 P 0] findrunnable ran checkTimers, no pending timers
[0.012372100 P 0] findrunnable work stealing
[0.012372100 P 1] gopark G 1 sleep
[0.012372100 P -1] findrunnable stopm oldP 0 M 3
[0.012408900 P 1] wakeNetPoller wakep for target time 12422100
[0.012408900 P 1] wakep startm
[0.012408900 P 1] startm claim P 0 for M 0
[0.012408900 P 1] schedule M 0
[0.012408900 P 1] findrunnable ran checkTimers, pollUntil 12422100
[0.012408900 P 1] findrunnable skipping work stealing; spinning M's 1 busy M's 2
[0.012408900 P -1] findrunnable block on netpoller oldP 1 M 0 delay 13200 ns poll until 12422100 was spinning false
[0.012408900 P 0] findrunnable woke from stopm M 3
[0.012408900 P 0] findrunnable ran checkTimers, no pending timers
[0.012408900 P 0] findrunnable work stealing
[0.012408900 P -1] findrunnable last timer check for P 1 found timer 12422100 oldP 0 M 3
[0.012408900 P -1] findrunnable stopm oldP 0 M 3
[0.012920000 P -1] sysmon wake from usleep
[0.012920000 P -1] sysmon check, gcwaiting 0 npidle 16
[0.012920000 P -1] sysmon next timer target 12422100
[0.012920000 P -1] sysmon idle count 23 usleep 20
[0.013433200 P -1] sysmon wake from usleep
[0.013433200 P -1] sysmon check, gcwaiting 0 npidle 16
[0.013433200 P -1] sysmon next timer target 12422100
   ... sysmon loops elided ...
[0.026330900 P -1] sysmon idle count 49 usleep 20
[0.026844700 P -1] sysmon wake from usleep
[0.026844700 P -1] sysmon check, gcwaiting 0 npidle 16
[0.026844700 P -1] sysmon next timer target 12422100
[0.026844700 P -1] sysmon idle count 50 usleep 20
[0.027358100 P -1] sysmon wake from usleep
[0.027358100 P -1] sysmon check, gcwaiting 0 npidle 16
[0.027358100 P -1] sysmon next timer target 12422100
[0.027358100 P -1] sysmon idle count 51 usleep 40
[0.027868100 P -1] findrunnable return from netpoller set lastpoll 27868100 late ns 15446000
[0.027868100 P 0] findrunnable return from netpoller M 0 delay was 13200 ns
[0.027868100 P 0] findrunnable ran checkTimers, no pending timers
[0.027868100 P 0] findrunnable work stealing
[0.027884200 P 0] findrunnable woke G 1 while stealing timers from P 1
[0.027884200 P 0] wakep startm
[0.027884200 P 0] startm claim P 1 for M 0
[0.027884200 P -1] sysmon wake from usleep
[0.027884200 P 0] schedule run G 1
    ...

Second, the version where high-res timers are enabled and osRelax calls timeBeginPeriod as it did prior to https://golang.org/cl/248699 (I just commented out the return 0 line in the if haveHighResTimer check):

Ten iterations produced this histogram.

  avg 1.01ms;  min 856µs;  p50 1.03ms;  max 1.05ms;
  p90 1.05ms;  p99 1.05ms;  p999 1.05ms;  p9999 1.05ms;
      857µs [  1] █████
      900µs [  0] 
      950µs [  0] 
        1ms [  8] ████████████████████████████████████████
     1.05ms [  1] █████
      1.1ms [  0] 
     1.15ms [  0] 
      1.2ms [  0] 
     1.25ms [  0] 
      1.3ms [  0] 

The debuglog output, starting from the point where main.main is called, is shown below. In particular note:

  • sysmon, which uses usleep, wakes up at ~500 micro-second intervals as measured by nanotime
  • netpoller, which uses GetQueuedCompletionStatusEx, wakes up ~980µs late after each of the two time.Sleep calls shown
[0.003765200 P 1] main.main
[0.003765200 P 1] gopark G 1 sleep
[0.003765200 P 1] wakeNetPoller wakep for target time 3815200
[0.003765200 P 1] wakep startm
[0.003765200 P 1] startm claim P 0 for M 0
[0.003765200 P 1] schedule M 0
[0.003765200 P 1] findrunnable ran checkTimers, pollUntil 3815200
[0.003765200 P 1] findrunnable skipping work stealing; spinning M's 1 busy M's 2
[0.003765200 P -1] findrunnable block on netpoller oldP 1 M 0 delay 50000 ns poll until 3815200 was spinning false
[0.003765200 P 0] findrunnable woke from stopm M 3
[0.003765200 P 0] findrunnable ran checkTimers, no pending timers
[0.003765200 P 0] findrunnable work stealing
[0.003765200 P -1] findrunnable last timer check for P 1 found timer 3815200 oldP 0 M 3
[0.003765200 P -1] findrunnable stopm oldP 0 M 3
[0.004270000 P -1] sysmon wake from usleep
[0.004270000 P -1] sysmon check, gcwaiting 0 npidle 16
[0.004270000 P -1] sysmon next timer target 3815200
[0.004270000 P -1] sysmon idle count 7 usleep 20
[0.004791800 P -1] sysmon wake from usleep
[0.004791800 P -1] sysmon check, gcwaiting 0 npidle 16
[0.004791800 P -1] findrunnable return from netpoller set lastpoll 4791800 late ns 976600
[0.004791800 P -1] sysmon next timer target 3815200
[0.004791800 P -1] sysmon idle count 8 usleep 20
[0.004791800 P 0] findrunnable return from netpoller M 0 delay was 50000 ns
[0.004791800 P 0] findrunnable ran checkTimers, no pending timers
[0.004791800 P 0] findrunnable work stealing
[0.004797600 P 0] findrunnable woke G 1 while stealing timers from P 1
[0.004797600 P 0] wakep startm
[0.004797600 P 0] startm claim P 1 for M 0
[0.004797600 P 0] schedule run G 1
[0.004797600 P 0] gopark G 1 sleep
[0.004797600 P 1] findrunnable woke from stopm M 3
[0.004797600 P 0] wakeNetPoller wakep for target time 4847600
[0.004797600 P 0] schedule M 0
[0.004797600 P 1] findrunnable ran checkTimers, no pending timers
[0.004797600 P 0] findrunnable ran checkTimers, pollUntil 4847600
[0.004797600 P 1] findrunnable work stealing
[0.004797600 P 0] findrunnable skipping work stealing; spinning M's 1 busy M's 2
[0.004797600 P -1] findrunnable block on netpoller oldP 0 M 0 delay 50000 ns poll until 4847600 was spinning false
[0.004797600 P -1] findrunnable last timer check for P 0 found timer 4847600 oldP 1 M 3
[0.004797600 P -1] findrunnable stopm oldP 1 M 3
[0.005306600 P -1] sysmon wake from usleep
[0.005306600 P -1] sysmon check, gcwaiting 0 npidle 16
[0.005306600 P -1] sysmon next timer target 4847600
[0.005306600 P -1] sysmon idle count 9 usleep 20
[0.005822300 P -1] sysmon wake from usleep
[0.005822300 P -1] sysmon check, gcwaiting 0 npidle 16
[0.005822300 P -1] sysmon next timer target 4847600
[0.005822300 P -1] sysmon idle count 10 usleep 20
[0.005828000 P -1] findrunnable return from netpoller set lastpoll 5828000 late ns 980400
[0.005828000 P 1] findrunnable return from netpoller M 0 delay was 50000 ns
[0.005828000 P 1] findrunnable ran checkTimers, no pending timers
[0.005828000 P 1] findrunnable work stealing
[0.005828000 P 1] findrunnable woke G 1 while stealing timers from P 0
[0.005828000 P 1] wakep startm
[0.005828000 P 1] startm claim P 0 for M 0
[0.005828000 P 1] schedule run G 1

To reproduce the first test above, build my instrumented fork at ChrisHines@43c29fd and use it to compile the program below with -tags=debuglog, then run it.

package main

import (
	"time"

	"github.com/loov/hrtime"
)

func main() {
	b := hrtime.NewBenchmark(10)
	for b.Next() {
		time.Sleep(50 * time.Microsecond)
	}
	panic(b.Histogram(10).String())
}

To reproduce the second test above, comment out the return 0 on line 431 of osRelax in runtime/os_windows.go. Then rebuild and run the main program above.


Seems like a straightforward fix? If so, can we milestone for 1.17 and backport?

cc @dmitshur

Seems like a straightforward fix? If so, can we milestone for 1.17 and backport?

Not sure I see (yet) what the "straightforward" fix is. Going back to timeBeginPeriod isn't so appealing, for example.


(Sorry if it was suggested and rejected already)

Folks, how would it sound if we introduced a new call, something like NanoSleep() or HighResSleep() - something that identifies this sleep function as based on higher-precision logic?
On systems that don't have this problem, HighResSleep would just call Sleep and be done with it. On systems that have hi-res timers it would add a bit of a dance around which low-level function to call. If a system does not provide "long-shot" sleeps for high-resolution timers, we might have to add internal logic to invoke a low-res sleep for the majority of the requested time, and then fire up a hi-res one for the rest.

My reasoning is very simple - a lot of calls to Sleep are in the minutes and hours range, and quite often, even when they are smaller, the programming logic does not require 1ms precision and should work fine with 100ms (or even larger) imprecision.
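
A sketch of that split (hedged: HighResSleep and hiResSleep are hypothetical names, and the 2ms margin is an assumed error bound for the coarse sleep):

// HighResSleep spends the bulk of the wait in the ordinary coarse sleep,
// then finishes with a hypothetical high-resolution primitive.
func HighResSleep(d time.Duration) {
	const margin = 2 * time.Millisecond // assumed coarse-sleep error bound
	deadline := time.Now().Add(d)
	if d > margin {
		time.Sleep(d - margin) // low-res sleep for the majority of the time
	}
	hiResSleep(time.Until(deadline)) // hypothetical hi-res finish
}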

In general, I would expect the precision of the sleep to be proportional to the value passed in.

As a hypothetical:

If you are writing a game engine, you may need to serve a new frame every 16 milliseconds. After issuing the frame, you may want to sleep until the next frame, which is (16ms - the time it took to prepare this frame). On your powerful personal computer, this is often 8ms to 10ms! So then you choose time.Sleep and ship the game. Then the game runs poorly on a customer’s machine, and it turns out that processing is taking most of the 16ms on that machine, and then it is oversleeping. Clearly you should have chosen time.NanoSleep even though you sometimes sleep for 10ms!

Clearly, this whole outcome would have been avoidable if time.Sleep chose NanoSleep automatically for small values.

There are probably three general levels of sleep precision:

  1. the current sleep precision
  2. the more precise sleeping used previously in Go
  3. busy waiting.

If the sleep value is small enough and nothing else is runnable, it would likely make sense for the Go runtime to simply busy wait until the sleeping Goroutine is runnable again.
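
From outside the runtime, that busy-wait tier could look something like this (a sketch; the 100µs cutoff is invented for illustration):

import (
	"runtime"
	"time"
)

// preciseSleep spins (yielding to the scheduler) for very short sleeps
// and falls back to time.Sleep, with its coarser granularity, otherwise.
func preciseSleep(d time.Duration) {
	const spinLimit = 100 * time.Microsecond // illustrative threshold
	if d > spinLimit {
		time.Sleep(d)
		return
	}
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		runtime.Gosched() // let other goroutines run while we wait
	}
}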

Is there any downside to automatically selecting more precise sleep modes? If the time to sleep is more than 5ms, the current approach should definitely be precise enough, and it allows the runtime and the OS to take more liberties with task scheduling to achieve efficiency. But, if the value is smaller… selecting a more precise sleep mechanism seems unlikely to cause any harm that wasn’t directly requested.

If someone wants to sleep less precisely and coalesce timers further, they could round up to the nearest second, minute, or whatever themselves. But, if the runtime decided to sleep for an extra, say, 1% as an acceptable margin of error, this would give the runtime a lot of opportunity to coalesce large sleeps automatically, while still enabling high precision sleeps to be serviced appropriately.

The current API doesn’t expose a way to be more precise, and I’m not sure adding more APIs is the best solution here, since you’ll never be able to guarantee a given level of service — the whole computer could hibernate before the task becomes runnable again, and then there’s nothing the runtime can do.

But, those are just my thoughts.

I just built https://github.com/loov/hrtime/blob/master/_example/basic/main.go on current tip and ran it on one of my Windows 10 laptops. Here is the output:

C:\Users\Alex>test
1.6846ms
  avg 2.05ms;  min 43.3µs;  p50 1.99ms;  max 1s;
  p90 2.03ms;  p99 2.11ms;  p999 2.26ms;  p9999 1s;
     43.3µs [   2]
      500µs [ 114] █▌
        1ms [ 731] ████████████
      1.5ms [2356] ████████████████████████████████████████
        2ms [ 892] ███████████████
      2.5ms [   0]
        3ms [   0]
      3.5ms [   0]
        4ms [   0]
      4.5ms+[   1]


C:\Users\Alex>test
1.3457ms
  avg 1.82ms;  min 5.2µs;  p50 1.83ms;  max 1s;
  p90 2.01ms;  p99 2.11ms;  p999 2.31ms;  p9999 1s;
      5.2µs [  34] ▌
      500µs [ 133] ███
        1ms [1698] ████████████████████████████████████████
      1.5ms [1618] ██████████████████████████████████████
        2ms [ 611] ██████████████
      2.5ms [   1]
        3ms [   0]
      3.5ms [   0]
        4ms [   0]
      4.5ms+[   1]


C:\Users\Alex>test
1.0907ms
  avg 1.91ms;  min 3.4µs;  p50 1.96ms;  max 1s;
  p90 2.01ms;  p99 2.13ms;  p999 2.25ms;  p9999 1s;
      3.4µs [  47] ▌
      500µs [ 174] ███▌
        1ms [1143] ███████████████████████
      1.5ms [1973] ████████████████████████████████████████
        2ms [ 757] ███████████████
      2.5ms [   1]
        3ms [   0]
      3.5ms [   0]
        4ms [   0]
      4.5ms+[   1]


C:\Users\Alex>test
1.4356ms
  avg 1.97ms;  min 14µs;  p50 1.97ms;  max 1s;
  p90 2.02ms;  p99 2.15ms;  p999 2.31ms;  p9999 1s;
       14µs [  17]
      500µs [ 218] ████
        1ms [ 902] █████████████████
      1.5ms [2088] ████████████████████████████████████████
        2ms [ 869] ████████████████▌
      2.5ms [   0]
        3ms [   0]
      3.5ms [   0]
        4ms [   0]
      4.5ms+[   2]


C:\Users\Alex>test
157µs
  avg 1.95ms;  min 11.8µs;  p50 1.98ms;  max 1s;
  p90 2.02ms;  p99 2.08ms;  p999 2.27ms;  p9999 1s;
     11.8µs [   5]
      500µs [  95] █▌
        1ms [1227] ████████████████████████▌
      1.5ms [1989] ████████████████████████████████████████
        2ms [ 778] ███████████████▌
      2.5ms [   0]
        3ms [   0]
      3.5ms [   0]
        4ms [   0]
      4.5ms+[   2]


C:\Users\Alex>

The output looks reasonable to me. Most of the sleep times are around 1-2 ms. That is what I would expect on Windows.

Alex

@alexbrainman:

I just built https://github.com/loov/hrtime/blob/master/_example/basic/main.go on current tip and ran it on one of my Windows 10 laptops. Here is the output:

The output looks reasonable to me. Most of the sleep times are around 1-2 ms. That is what I would expect on Windows.

I don't see any changes to the runtime since go1.17beta1 that I would expect to change the behavior of time.Sleep on Windows or in general. Do you have any theories why your results differ from what I got on go1.17beta1 in #44343 (comment) or #44343 (comment)?

For me, Go tip still has the same problem; output from the code in #44343 (comment):

  avg 15.2ms;  min 2.1µs;  p50 15.4ms;  max 15.9ms;
  p90 15.6ms;  p99 15.7ms;  p999 15.9ms;  p9999 15.9ms;
      2.1µs [ 15] ▌
        2ms [  1]
        4ms [  0]
        6ms [  0]
        8ms [  0]
       10ms [  0]
       12ms [  0]
       14ms [984] ████████████████████████████████████████
       16ms [  0]
       18ms [  0]

And the _example/basic:

5.018ms
  avg 15.3ms;  min 2.2µs;  p50 15.4ms;  max 1.01s;
  p90 15.5ms;  p99 15.7ms;  p999 15.9ms;  p9999 1.01s;
      2.2µs [  90] ▌
        2ms [   0]
        4ms [   0]
        6ms [   0]
        8ms [   0]
       10ms [   0]
       12ms [   0]
       14ms [4004] ████████████████████████████████████████
       16ms [   1]
       18ms+[   1]

I don't see any changes to the runtime since go1.17beta1 that I would expect to change the behavior of time.Sleep on Windows or in general.

I don't doubt that you and I are running the same code. I think the problem is that the Windows versions we run are different.

Do you have any theories why your results differ from what I got on go1.17beta1 in #44343 (comment) or #44343 (comment)?

I managed to reproduce your problem on another PC of mine.

CL 248699 introduced the use of CreateWaitableTimerExW with CREATE_WAITABLE_TIMER_HIGH_RESOLUTION (see #8687 (comment), where this idea came from). If the call succeeds, the Go runtime uses that approach for waiting in usleep.

In my repro, I can see the runtime successfully calling and using CreateWaitableTimerExW with CREATE_WAITABLE_TIMER_HIGH_RESOLUTION, but unfortunately the real wait is long, about 10ms.

Perhaps @jstarks has an explanation of what is wrong here.

I managed to disable the effects of CL 248699 on my repro PC by setting highResTimerSupported in the runtime to false. If I do that, my wait times are back to around 1ms. But then timeBeginPeriod is back with a vengeance, and the whole point of CL 248699 was to stop using timeBeginPeriod.

No idea what to do here.

For me, Go tip still has the same problem, ...

I believe you. See my reply above. I can reproduce it too.

Alex

OK, @alexbrainman, it looks like we understand the problem the same way then. I agree with your analysis.

Hmm. This is unfortunate. I can look into the Windows timer code some more to see what might be going wrong. @alexbrainman, since you have seen both modes of this behavior, could you report your Windows OS build numbers for the two different machines?

@jstarks Did you see #44343 (comment)? I think I explained what's going on in detail there. What else do you think we need to look at?

Ah, sorry, I didn't. That analysis sounds correct to me.

Unfortunately there is no way to opt into a high-resolution synchronous wait (for GetQueuedCompletionStatusEx or otherwise).

I can think of a few different ways to address this:

  1. Always enable 1ms timer granularity. This isn't so bad for recent Windows versions since this has become a per-process property (at least in some configurations--not certain on the details here right now).
  2. A local fix--wait on a high-resolution timer locally in netpoll. With only local changes, you'd probably have to add a second wait, so that you first wait on the completion port + timer via WaitForMultipleObjects, then call GetQueuedCompletionStatusEx if the completion port wait succeeded. Of course, there's a race here and you'd have to loop. And this adds at least two more syscalls per short netpoll, which may be prohibitive.
  3. A non-local fix--maintain one or more high-resolution timers to represent the next sleep wakeup time(s). Associate these with the netpoll completion port via NtCreateWaitCompletionPacket + NtAssociateWaitCompletionPacket. Keep these up to date with the next wakeup time. Then you'll get the timeout via one of the threads at random, since the IOCP is global. I don't know enough about the Go scheduler to know if this is a good idea.

None of these are particularly nice.


cc @ianlancetaylor re these suggestions...

@jstarks, is there documentation on NtCreateWaitCompletionPacket and NtAssociateWaitCompletionPacket?

We do keep a high-res timer per M (minit creates it). Does already having those timers around simplify anything?

No, these APIs are currently undocumented. I'd like to fix that, but it may take some time. These are the APIs that back the CreateThreadpoolWait and SetThreadpoolWait functions, which are already documented but don't give you the level of control you want over your threads.

The basic idea is that a wait completion packet allows for associating a wait operation with an IOCP. You create one per concurrent wait you'd like to support (so one per M in this case, I guess), and you start the wait by associating the packet with the IOCP and a wait object. If the object becomes signaled, the signal is consumed (for auto-reset objects), the packet's information is posted to the IOCP, and the packet is disassociated. You must reassociate the packet again to arm it for another wait (similar to EPOLLONESHOT).

You can use these with any waitable object, including timers.

@alexbrainman, since you have seen both modes of this behavior, could you report your Windows OS build numbers for the two different machines?

@jstarks I cannot reproduce this (#44343 (comment)) behaviour anymore. Makes me look like a liar. I now only see an average of 10-15 ms like everyone else.

I thought perhaps my odd result was due to some programs running on the system at the time of my test. So I built another Go program that just calls timeBeginPeriod and then goes to sleep, and ran that program alongside my test. But that still gives me 10-15 ms sleeps.

You should be able to easily reproduce what we see with the instructions at #44343 (comment). If you have access to the Windows kernel, perhaps you can figure out why high-performance timers give us minimum sleeps of 10-15 ms instead of 1 ms.

Alex

If you have access to the Windows kernel, perhaps you can figure out why high-performance timers give us minimum sleeps of 10-15 ms instead of 1 ms.

I am not sure that's the right question given what we know. High performance timers are not involved in time.Sleep processing in Go 1.16+. Since CL232298 time.Sleep and other runtime timers rely completely on the netpoller, which on Windows relies on GetQueuedCompletionStatusEx. In #44343 (comment) @jstarks tells us

Unfortunately there is no way to opt into a high-resolution synchronous wait (for GetQueuedCompletionStatusEx or otherwise).

Which I think leaves us with either changing something in how the netpoller waits for short durations or finding a way not to rely on the netpoller for short duration sleeps or non-network related timers.

High performance timers are not involved in time.Sleep processing in Go 1.16+. Since CL232298 time.Sleep and other runtime timers rely completely on the netpoller, which on Windows relies on GetQueuedCompletionStatusEx

I was not aware of CL 232298. Indeed I can see that now time.Sleep is implemented by using GetQueuedCompletionStatusEx.

@jstarks I take back my complaint about high-res timers. They work as expected.

The problem is that GetQueuedCompletionStatusEx will sleep for a minimum of 15ms if timeBeginPeriod is not called.

I suspect, if we switch timeBeginPeriod back on again, GetQueuedCompletionStatusEx will sleep for a minimum of 1 ms as before. I am not sure we want to do that.

Alex

The code from the OP gives the same results as everyone else is seeing on my Windows 10 laptop, but on a Windows 2019 VM it runs like this:

  avg 1.57ms;  min 114µs;  p50 1.93ms;  max 2.53ms;
  p90 2.03ms;  p99 2.53ms;  p999 2.53ms;  p9999 2.53ms;
      115µs [  1] █
      500µs [ 12] █████████████▌
        1ms [ 29] █████████████████████████████████
      1.5ms [ 35] ████████████████████████████████████████
        2ms [ 22] █████████████████████████
      2.5ms [  1] █
        3ms [  0]
      3.5ms [  0]
        4ms [  0]
      4.5ms [  0]

Does anyone know what explains the difference? The executable is the same, built with the 1.17 release.

Since this issue hasn't been updated since Oct 5th, and we're already in the freeze, I'm not sure anything else is going to be done here for this cycle, unfortunately.

@prattmic @ChrisHines Moving to the backlog. Feel free (or anyone else) to move it back into the 1.18 milestone, or the 1.19 milestone. This should get resolved, but we need someone to dedicate time to it. As far as I can tell, the blocker is we don't have a clear fix here.

My best summary: the current timer system relies on blocking in netpoll, and netpoll's time resolution is limited on Windows. We could do something special here for short-lived timers on Windows specifically, but that's a lot of complexity and would need a clear design.


Is there some way to +1 this fix and to reiterate that this is breaking more than just Windows? It's been preventing Go upgrades to anything > 1.15 for highly CPU-intensive computing applications.

It's been preventing Go upgrades to anything > 1.15 for highly CPU-intensive computing applications.

@jhnlsn It's not clear to me how this issue impacts high CPU intensive applications. I've reread most of this thread and didn't pick up on that, but maybe I missed something. Please elaborate on your concern.


Chris - I originally decided not to include my performance regression case in this issue, as it looked like it was going to be fixed at some point. You are correct that my case is not fully represented in this ticket. I first found this issue when dealing with a performance regression in my application, after narrowing it down to changes in timer performance in the runtime. My use case deals with situations where compute resources are 100% utilized. In these cases Go performs ~10% slower on average, with all the additional time spent in epollwait.

@jhnlsn That sounds like a different enough situation to warrant a new issue to me. If I understand your situation, it does not involve timers firing late but rather a regression in the amount of overhead in the runtime for your application(s). Is that accurate?


Looking back at the beginning of this thread from February, the original discussion was that the netpoller had more latency than previous implementations. This aligned with our pprof data between 1.15.x and 1.16.x, where a large portion of time was spent in netpoll. Is that not still the case with this issue, or have I misinterpreted something?

My understanding is that this issue is about Go 1.16+ runtime timers having worse best-case granularity than earlier versions of Go. In other words, a 50 microsecond timer always takes longer than that to fire, and also takes longer to fire than on earlier versions of Go, especially on Windows. It does not require high CPU load to trigger the problem described here. It is fundamental to changes in how timers are serviced in the runtime in Go 1.16+.

I think a fix for this issue would be focused on reducing the timer latency and would not have to reduce the runtime overhead of epollwait to address the original problem report. That's why your concern seems like a separate issue to me.

I'm not aware of other platforms aside from Windows being affected by this issue (unless I missed an important comment). So, if the performance issue happens on other platforms, then it's probably something different.

@egonelbre my entire analysis was on Mac and Linux. Mac was unaffected by these changes, but Linux on both ARM and x86 was strongly impacted by a regression in Go 1.16, as I documented there.

This isn’t just about Windows.

I would also point to my comment discussing some ideas for how to move forward with resolving this regression without completely negating the original reasons that the regression was introduced, as far as I can tell.

@egonelbre Your original post shows a regression on Debian 10 from ~80µs to ~1ms. The regression is much less than on Windows, but it's still there.

The underlying issue here is that we depend on the netpoll timeout to wait for the next timer, but the netpoll timeout may have fairly coarse resolution (1ms on Linux, up to 15ms on Windows, depending on the current system timer settings). Timers that expire in less than that resolution can thus have comparatively high overshoot.

Due to the extremely coarse resolution on Windows, the issue is more extreme, but it still affects all systems to some extent.
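
For anyone who wants to observe this without extra dependencies, here is a minimal stdlib-only sketch that measures the overshoot directly; on Go 1.16+ the printed overshoot should land near the netpoll resolution described above:

package main

import (
	"fmt"
	"time"
)

func main() {
	const target = 50 * time.Microsecond
	for i := 0; i < 5; i++ {
		start := time.Now()
		time.Sleep(target)
		elapsed := time.Since(start)
		// On Go 1.16+ the overshoot is roughly the netpoll timeout
		// resolution: ~1ms on Linux, up to ~15ms on Windows.
		fmt.Printf("asked for %v, slept %v (overshoot %v)\n",
			target, elapsed, elapsed-target)
	}
}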

@ChrisHines I need to find the facepalm emote. Somehow I completely forgot, and my eyes glided over it.

In #44343 (comment), I noted that epoll_pwait2 can pretty simply resolve the problem for Linux. It is still quite new, but if we switch to that, at least there is a clear workaround for Linux: ensure you are running on a newer kernel.
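
For reference, a rough sketch of what epoll_pwait2 buys: unlike epoll_wait's millisecond timeout, it takes a timespec, so sub-millisecond timeouts are expressible. This assumes linux/amd64 and invokes the raw syscall (number 441) directly, since wrapper availability varies:

//go:build linux && amd64

package main

import (
	"fmt"
	"syscall"
	"time"
	"unsafe"
)

// epoll_pwait2 syscall number on the unified Linux syscall table
// (available since kernel 5.11).
const sysEpollPwait2 = 441

func main() {
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(epfd)

	// The timeout is a timespec, i.e. nanosecond resolution, instead of
	// epoll_wait's int milliseconds.
	ts := syscall.NsecToTimespec((50 * time.Microsecond).Nanoseconds())
	var events [1]syscall.EpollEvent

	start := time.Now()
	n, _, errno := syscall.Syscall6(sysEpollPwait2,
		uintptr(epfd),
		uintptr(unsafe.Pointer(&events[0])),
		uintptr(len(events)),
		uintptr(unsafe.Pointer(&ts)),
		0, // sigmask: NULL
		0) // sigsetsize: ignored when sigmask is NULL
	if errno == syscall.ENOSYS {
		fmt.Println("epoll_pwait2 requires Linux >= 5.11")
		return
	}
	fmt.Printf("events=%d, waited %v\n", n, time.Since(start))
}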

@egonelbre No worries, this issue has gotten rather long and I keep having to reread to remember as well. :)

I've thrown together the mostly-untested https://golang.org/cl/363417 to prototype using epoll_pwait2 (only linux-amd64 implemented) if anyone would like to give it a try. You'll need Linux kernel >=5.11 to take the new code path.

Change https://golang.org/cl/363417 mentions this issue: runtime: use epoll_pwait2 for netpoll if available

Looking back at the beginning of this thread from February, the original discussion was that the netpoller had more latency than previous implementations. This aligned with our pprof data between 1.15.x and 1.16.x, where a large portion of time was spent in netpoll. Is that not still the case with this issue, or have I misinterpreted something?

If you focus on the time spent in netpoll, this issue may be related: #43997
The recheck of timers and goroutines by non-spinning Ms spends a lot of time.
On some of our applications, such as etcd, CPU usage is higher on 1.16 than on 1.14, and the main difference we can find is the time spent in netpoll. We analysed its trace data: on 1.16 there are more proc start and stop events, which appear when the application goes from busy to idle.
Our team has backported the modification from that issue to our own Go, and that resolves the problem.
Maybe the refactoring and fix of findrunnable could be backported from 1.17 to 1.16 in an official release? @prattmic

By the way, could you review my CL https://go-review.googlesource.com/c/go/+/356253? It's for issue #49026.

We are testing it on our own Go, but we don't have enough confidence to use it in a production environment.
It would be helpful if you had some advice about it.
Or it could be accepted officially. Thank you. @prattmic


Dear Contributors,
we have the accuracy problem on different Windows 10 versions (in the current test: version 19043.1415) with Go versions after 1.15.
hrtimedemo shows avg 1ms on 1.15.15, but avg 13.8ms on 1.17.6.

This is very bad, since we use the sleep and ticker funcs for periodic hardware measuring. I think those funcs should always stay within about 1ms accuracy, otherwise they are not usable anymore - timer funcs are very important.
Isn't it possible to use the old, well-working fragments from 1.15.x?

Many thanks for your great work!

go version go1.15.15 windows/amd64
hrtimeDemo.exe
  avg 1.01ms;  min 77.1µs;  p50 1ms;  max 2.02ms;
  p90 1.19ms;  p99 2.02ms;  p999 2.02ms;  p9999 2.02ms;
     77.1µs [  2] █▌
      200µs [  0]
      400µs [  1] ▌
      600µs [  5] ████
      800µs [ 34] ███████████████████████████
        1ms [ 50] ████████████████████████████████████████
      1.2ms [  5] ████
      1.4ms [  0]
      1.6ms [  1] ▌
     1.8ms+ [  2] █▌

go version go1.17.6 windows/amd64
hrtimeDemo.exe
  avg 13.8ms;  min 16.1µs;  p50 15.4ms;  max 16.8ms;
  p90 16ms;  p99 16.8ms;  p999 16.8ms;  p9999 16.8ms;
     16.1µs [ 11] █████▌
        2ms [  0]
        4ms [  0]
        6ms [  0]
        8ms [  0]
       10ms [  0]
       12ms [  0]
       14ms [ 76] ████████████████████████████████████████
       16ms [ 13] ██████▌
       18ms [  0]

I want to pile in and say we are also experiencing this issue, causing us to roll back to Go 1.15 for our project (performance sensitivity; the degradation is not worth any other benefits right now). This has been proven present on Linux-based distributions, so it would be great if the tagging reflected that (or the Windows-specific label were removed) to help with prioritization.

I wish there were a way to vote on this ticket, as I would have given it a +10 (just like @jhnlsn). The current labeling of this issue as Windows-specific is, in my view, misleading.

If I read what's being written here correctly, Go 1.16 introduced several independent regressions in its event triggering subsystem:
The first issue is higher CPU pressure on CPU-intensive applications, in particular those that involve a lot of context switching and blocking operations.
This issue affects the whole range of solutions across the board and breaks linear CPU scaling (i.e., 2x CPU won't translate into 2x work due to the newly introduced overhead).

Second, on Linux-based systems, the entire time-based event system (which includes time.Sleep, but also time.After and time.Timer) was downgraded to millisecond-level accuracy.
This regression on its own could have been tolerated if "it had always been like that". Unfortunately, it puts into question any previously deployed platform using these calls, since such platforms need to be re-evaluated and tested with these regressions in mind before they can be upgraded to 1.16 (or above).

Third, on Windows, the regression (which I have not verified myself; I'll refer to @woidl's excellent post to speak for itself) makes the usage of these calls absurdly hard to justify.
A timing precision of 15ms was acceptable 25 years ago, when writing Windows programs that waited for WM_TIMER messages. Nowadays, Windows has evolved considerably and there is no justification for such a slow mechanism (maybe with the exception of UI-related activity).

I do realize that fixing these issues might not be trivial, especially as @prattmic noted with his Linux v5.11 solution. Could I suggest that, besides prioritizing this issue, the documentation itself be updated to reflect the above?
As is, these functions might be reasonable for coarse-resolution timers, but they are unsuitable for high-precision timers (as were previously supported in 1.15). Documenting this would reduce false expectations of these methods.

Do you agree with my conclusion above @mknyszek @ChrisHines ?

With regards to fixes, I've thought about this more today and my thoughts are:

  • For Linux, we should use epoll_pwait2, like https://go.dev/cl/363417. This system call is on the newer side, but this will improve things going forward and provide a workaround (upgrade the kernel) for particularly affected users.
  • For Windows, I think we can do something like @jstarks' (2) in #44343 (comment). We already have a per-thread high resolution timer, so netpoll would do:
    • SetWaitableTimer, to set the timeout.
    • WaitForMultipleObjects to wait for IO or timeout.
    • GetQueuedCompletionStatusEx, nonblocking, if the prior call indicated IO readiness.
    • This seems fairly straightforward; the main concern is how much more expensive making these two extra calls would be. (A sketch of the timer half follows below.)

cc @aclements
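
To make that concrete, here is a minimal, untested sketch of the key ingredient: a high-resolution waitable timer, loaded straight from kernel32 (constants are from the Windows SDK; CREATE_WAITABLE_TIMER_HIGH_RESOLUTION requires Windows 10 1803+). The real netpoll change would wait on the IOCP handle and the timer together via WaitForMultipleObjects; this demonstrates only the timer half:

//go:build windows

package main

import (
	"fmt"
	"syscall"
	"time"
	"unsafe"
)

// Constants from the Windows SDK.
const (
	createWaitableTimerHighResolution = 0x00000002 // CREATE_WAITABLE_TIMER_HIGH_RESOLUTION
	timerAllAccess                    = 0x1F0003   // TIMER_ALL_ACCESS
)

var (
	kernel32                  = syscall.NewLazyDLL("kernel32.dll")
	procCreateWaitableTimerEx = kernel32.NewProc("CreateWaitableTimerExW")
	procSetWaitableTimer      = kernel32.NewProc("SetWaitableTimer")
	procWaitForSingleObject   = kernel32.NewProc("WaitForSingleObject")
)

func main() {
	// Without the high-resolution flag, the wait is quantized to the
	// coarse system tick (up to ~15.6ms).
	h, _, callErr := procCreateWaitableTimerEx.Call(
		0, 0, createWaitableTimerHighResolution, timerAllAccess)
	if h == 0 {
		panic(callErr)
	}
	defer syscall.CloseHandle(syscall.Handle(h))

	// A negative due time means relative, in 100ns units.
	due := -((50 * time.Microsecond).Nanoseconds() / 100)
	ret, _, callErr := procSetWaitableTimer.Call(
		h, uintptr(unsafe.Pointer(&due)), 0, 0, 0, 0)
	if ret == 0 {
		panic(callErr)
	}

	start := time.Now()
	procWaitForSingleObject.Call(h, uintptr(syscall.INFINITE))
	fmt.Println("waited", time.Since(start))
}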

This seems fairly straightforward; the main concern is how much more expensive making these two extra calls would be.

If we're willing to assume that long timers require less precision, we could somewhat reduce this cost by only using the high-resolution timer if the timeout is short, and otherwise using the low-precision timeout to the blocking GetQueuedCompletionStatusEx.

We could go even further and record the precision when a timer is enqueued, so we know at creation-time how precise we have to be once it bubbles to the top of the heap. However, this is complicated because the top timer might be a low-precision timer firing soon, immediately followed by a high-precision timer and we actually need a high-precision wait to avoid missing the second timer.
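
A minimal sketch of that first idea, where waitHighRes and waitCoarse are hypothetical stand-ins for the two Windows wait paths and 16ms is a guessed cutoff:

package main

import (
	"fmt"
	"time"
)

// highResThreshold is a guessed cutoff: timeouts shorter than the coarse
// system tick take the more expensive high-resolution path.
const highResThreshold = 16 * time.Millisecond

// Hypothetical stand-ins: waitHighRes would be SetWaitableTimer +
// WaitForMultipleObjects; waitCoarse a plain blocking
// GetQueuedCompletionStatusEx.
func waitHighRes(d time.Duration) { fmt.Println("high-res wait", d) }
func waitCoarse(d time.Duration)  { fmt.Println("coarse wait", d) }

func netpollWait(timeout time.Duration) {
	if timeout < highResThreshold {
		waitHighRes(timeout)
	} else {
		waitCoarse(timeout)
	}
}

func main() {
	netpollWait(50 * time.Microsecond) // short sleep: needs precision
	netpollWait(2 * time.Minute)       // long RPC timeout: coarse is fine
}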

If we're willing to assume that long timers require less precision, we could somewhat reduce this cost by only using the high-resolution timer if the timeout is short, and otherwise using the low-precision timeout to the blocking GetQueuedCompletionStatusEx.

We could go even further and record the precision when a timer is enqueued, so we know at creation-time how precise we have to be once it bubbles to the top of the heap. However, this is complicated because the top timer might be a low-precision timer firing soon, immediately followed by a high-precision timer and we actually need a high-precision wait to avoid missing the second timer.

Would this involve returning to 1ms precision on the regular timer? There are probably many cases where +/-8ms is acceptable, but I doubt they account for the majority of sleeps in any given Go program.


Hi, I still hold the opinion that 1ms resolution is needed.
I hope you can solve the issue (still approx. 14ms resolution on Windows 10/11 with go1.17.8 windows/amd64) in the near future.
Thank you all for doing a great job!

Would this involve returning to 1ms precision on the regular timer? There are probably many cases where +/-8ms is acceptable, but I doubt they account for the majority of sleeps in any given Go program.

I'm not sure what you mean by a "regular" timer. If it's a short sleep or timeout, we would use a slightly more CPU-expensive way to wait that has higher precision. Hopefully we can do that without timeBeginPeriod.

I suspect it's actually quite common for programs to have long timers. RPC-driven programs often have a large number of long-duration timers for timeouts (often on the order of minutes). They also often go idle, which would enter a blocking netpoll the same way a sleep does.

I'm not sure what you mean by a "regular" timer. If it's a short sleep or timeout, we would use a slightly more CPU-expensive way to wait that has higher precision. Hopefully we can do that without timeBeginPeriod.

I suspect it's actually quite common for programs to have long timers. RPC-driven programs often have a large number of long-duration timers for timeouts (often on the order of minutes). They also often go idle, which would enter a blocking netpoll the same way a sleep does.

I see, you're right. The main concern I have, then, is that semaphores (and therefore notesleep/notewakeup) are still based on the system timer granularity. I understand that this is the "slow" path, but there's a difference between a 1ms futex wait and a 16ms (when unmodified) WaitForSingleObject.

Moving to 1.20 milestone.

Greetings! I am curious: what exactly got moved to the 1.20 milestone? Is the goal to get back to 1-2ms from 1/64s, to add a high-resolution timer, or something entirely different?

I am currently working on a project where we are targeting the VSYNC time without using VSYNC, essentially using VSYNC without double buffering. The current solution works but comes with a hefty performance cost. A solution that is 20 times more expensive than the current Go timers would still beat our solution by orders of magnitude! We have another workaround in the pipeline, but we are curious to see whether you will provide us with a better alternative.

I have been following a lot of the issues related to time on Windows, and every single time it comes back to the 1/64s update interval; a year passes and the issue is frozen.

On a time-related note, it would be lovely to have a time function that exposes the Windows performance counters, which I am guessing you are already using to measure benchmarks (a sketch of calling them directly follows below). .dll files are not exactly user friendly, or beginner friendly for that matter :)

It is an absolute pleasure programming in Go, keep up the good work!
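
For what it's worth, the performance counters are reachable from Go today with a couple of lazy procs; a small illustrative sketch (the helper name qpcNow is made up, the kernel32 functions are real):

//go:build windows

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

var (
	kernel32 = syscall.NewLazyDLL("kernel32.dll")
	procQPC  = kernel32.NewProc("QueryPerformanceCounter")
	procQPF  = kernel32.NewProc("QueryPerformanceFrequency")
)

// qpcNow returns the current performance counter value and its
// frequency in counts per second.
func qpcNow() (ticks, freq int64) {
	procQPF.Call(uintptr(unsafe.Pointer(&freq)))
	procQPC.Call(uintptr(unsafe.Pointer(&ticks)))
	return
}

func main() {
	t1, freq := qpcNow()
	t2, _ := qpcNow()
	fmt.Printf("one call pair took %.3fµs\n",
		float64(t2-t1)/float64(freq)*1e6)
}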

Calling the timeBeginPeriod function seems to solve the problem for me.

package main

import "syscall"

var (
	winmmDLL            = syscall.NewLazyDLL("winmm.dll")
	procTimeBeginPeriod = winmmDLL.NewProc("timeBeginPeriod")
)

func init() {
	// Request a 1ms system timer interrupt period for the whole process.
	procTimeBeginPeriod.Call(uintptr(1))
}

Hey @march1993, could you maybe provide some more details about your solution? Maybe your Go version and the Windows version on which you tested that code?

I tried reproducing your fix and didn't see any change in behavior;
providing some benchmarking would be nice too.

"solve the problem" is rather vague given the many-faceted nature of this issue, but I was able to return sleep granularity on windows 10 to 1ms with the following code (which roughly matches march1993):

package main

import (
	"fmt"
	"syscall"
	"time"
)

var (
	winmmDLL            = syscall.NewLazyDLL("winmm.dll")
	procTimeBeginPeriod = winmmDLL.NewProc("timeBeginPeriod")
)

func main() {
	// Ask Windows for a 1ms timer interrupt period before sleeping.
	procTimeBeginPeriod.Call(uintptr(1))

	// Each sleep should now be rounded up to ~1ms instead of ~16ms.
	for i := 0; i < 10; i++ {
		time.Sleep(time.Nanosecond)
		fmt.Println(time.Now())
	}
}

For those who were satisfied with 1ms and were upset to find the Windows granularity reduced to 16ms, this will be satisfactory, although I don't understand the os_windows.go code very well, and I'm worried that some circumstances may return the granularity to 16ms while the program is in flight.

For those of us who are concerned about the occasional reduction in darwin/linux timer granularity, this will obviously do nothing. For those of us who were hoping to see the improvements discussed above used to bring the Windows timer granularity up to that of darwin/linux, this also does not help.

It seems relevant to note that @rsc cited this issue in https://research.swtch.com/telemetry-uses as a surprising instance of a bug that might keep Windows users from upgrading from Go 1.15.

Hi @qmuntal, I'm curious if you have any thoughts on the Windows piece of this, especially option 2 outlined by @jstarks in #44343 (comment), and the follow-up comment by @prattmic in #44343 (comment) saying that option 2 looked promising for Windows but wondering about the overhead cost.

This is a very long issue at this point, and the discussion was likely complicated by the fact that there is some impact on Linux but much bigger impact on Windows.

If you don't want to read everything, here are some sample highlights from the discussion:

  • Here's an example concrete problem report from @freb in a comment above:

    The gist is that rate limiters (golang.org/x/time/rate, go.uber.org/ratelimit, etc.) are unusable in v1.16.3 on Windows for even modest rates (hundreds of tokens per second). v1.15.11 can do a 10x higher rate. For my use case (rates less than 15,000), there is no impact on Linux.

  • Russ wrote in one of the telemetry posts:

    For example, I recently learned about a Windows timer problem (#44343) that is keeping some Windows users on Go 1.15.

  • For some concrete numbers, including possibly some variation by Windows version, you could look at @ChrisHines' comments here and here, and the response to those by @alexbrainman here.