githubnemo / CompileDaemon

Very simple compile daemon for Go

filepath.Walk():bad file descriptor

wyatt-troia opened this issue:

I successfully ran CompileDaemon a month ago when I downloaded it, but now whenever I try to run it I get:

filepath.Walk():bad file descriptor

I've tried deleting the src and bin files for CompileDaemon and running go get "github.com/githubnemo/CompileDaemon", but it didn't change anything.

I'm on a 2017 MacBook Pro.

Sorry, I cannot reproduce this as is. Can you provide more detail? What are the directory contents? Is this inside a container? A mounted FS? What's the OS?

Same error on macOS Big Sur 11.1. It works without error in Docker (for Mac).

In my environment, adding the -directory foobar option resolved the error.
Maybe it is just "too many open files".

I ran into this same issue (macOS Big Sur 11.2.1). I'm not sure exactly what the cause is, but I traced it down to a function called register() in kqueue.go, which calls unix.Kevent, and that call fails. On my Mac, this function is defined in syscall_bsd.go, which calls kevent() in a generated file named zsyscall_darwin_amd64.go. The bug appears to be deep inside the OS, and it's beyond me to begin debugging the root cause.

Happily, I discovered that this call hierarchy is only invoked when using the NotifyWatcher. If I use the PollingWatcher instead (CompileDaemon --polling=true ...), the error goes away and all works as expected. Maybe worth adding to the README as a known issue.

> I ran into this same issue (macOS Big Sur 11.2.1). I'm not sure exactly what the cause is, but I traced it down to a function called register() in kqueue.go, which calls unix.Kevent, and that call fails. On my Mac, this function is defined in syscall_bsd.go, which calls kevent() in a generated file named zsyscall_darwin_amd64.go. The bug appears to be deep inside the OS, and it's beyond me to begin debugging the root cause.

Good job tracing the issue back to the syscall level. Without specifics it is hard to reason about what is going wrong but kqueue fails if the file descriptor it is supposed to watch is invalid. Whatever that means. I tried looking up FreeBSD's kqueue implementation as it is likely to be similar but there was no obvious place where the EBADF is coming from in this case. If I had to guess it is either a weird file type or has something to do with the underlying filesystem. Any info in this direction?
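To make the EBADF part concrete, here is a minimal Go sketch (not CompileDaemon or fsnotify code) of how a syscall on an invalid descriptor surfaces as "bad file descriptor" and how to test for that errno with errors.Is. It uses fstat on descriptor -1 as a stand-in for the failing kevent call, since kevent only exists on BSD-derived systems. The helper names statFD and isBadFD are made up for illustration, and the code assumes a Unix-like OS.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// isBadFD reports whether err is (or wraps) EBADF, the errno behind
// the "bad file descriptor" message. Hypothetical helper name.
func isBadFD(err error) bool {
	return errors.Is(err, syscall.EBADF)
}

// statFD calls fstat(2) on a raw descriptor number. It stands in for
// any syscall that takes a descriptor, such as kevent on macOS.
func statFD(fd int) error {
	var st syscall.Stat_t
	return syscall.Fstat(fd, &st)
}

func main() {
	// -1 is never a valid descriptor, so the kernel rejects it with EBADF.
	err := statFD(-1)
	fmt.Printf("fstat(-1): %v (EBADF: %t)\n", err, isBadFD(err))
}
```

A check like isBadFD around the failing call could at least log which path was being registered when the error occurred, which is the missing piece of information here.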

> Happily, I discovered that this call hierarchy is only invoked when using the NotifyWatcher. If I use the PollingWatcher instead (CompileDaemon --polling=true ...), the error goes away and all works as expected. Maybe worth adding to the README as a known issue.

Thanks for the suggestion, there is already a section in the README about Mac OS X + polling. Is there something missing?

> Good job tracing the issue back to the syscall level. Without specifics it is hard to reason about what is going wrong but kqueue fails if the file descriptor it is supposed to watch is invalid. Whatever that means.

Lol, that was my thinking as well. "Whatever that means".

> If I had to guess it is either a weird file type or has something to do with the underlying filesystem. Any info in this direction?

I wish I could be of more help but I wasn't able to figure anything out in this regard. I did try e.g. --exclude-dir=".git", as well as some other dirs, however none of them solved the issue. I didn't go through that approach with much rigor, though, and doing so could lead to further evidence (i.e. excluding all directories in my repo and then adding them back 1 by 1 until I trigger the bug). I wish I could point you to the repo that this came up for me in so that you could investigate further (assuming you have access to a macOS system), however it's been made private by the company I work for (obligatory check us out at https://goteleport.com!). No promises about the timeline, but I will add this to my running todo list to do it myself. Alternatively or in concert, perhaps @wyatt-troia or @ypresto have an open source repo to point you at that you can use to repro.

> Thanks for the suggestion, there is already a section in the README about Mac OS X + polling. Is there something missing?

Ah, indeed there is. Nope, no suggestions, I just need to read the README more carefully next time.

> If I had to guess it is either a weird file type or has something to do with the underlying filesystem. Any info in this direction?

> I wish I could be of more help but I wasn't able to figure anything out in this regard. I did try e.g. --exclude-dir=".git", as well as some other dirs, however none of them solved the issue. I didn't go through that approach with much rigor, though, and doing so could lead to further evidence (i.e. excluding all directories in my repo and then adding them back 1 by 1 until I trigger the bug).

I understand :) Just a quick check: there were no special files like FIFOs, symlinks or unusually big files involved and no special file systems (like for example, running inside a container or a VM)?

> Just a quick check: there were no special files like FIFOs, symlinks or unusually big files involved and no special file systems (like for example, running inside a container or a VM)?

There are no FIFOs and the largest individual file is 32K. One directory has symlinks but excluding it does not fix the bug.

Update:

I have an update for you, having just spent some time this morning investigating this. The approach I took initially was -- for each directory in the repo, navigate into that directory and run a CompileDaemon command.

This was fruitful in that I identified several directories .git, foo/, bar/, baz/, and foobar/ which exited with an error, the first 4 with watcher.Addfiles():filepath.Walk(): fw.add(path): bad file descriptor and the last with watcher.Addfiles():filepath.Walk(): open github.com/aws/aws-sdk-go/aws/ec2metadata: too many open files.

As a sanity check I next ran the CompileDaemon command from the top level of my repo with a -exclude-dir option for each of the identified directories and... I got another watcher.Addfiles():filepath.Walk(): fw.add(path): bad file descriptor. Well, I figured, there are some additional files and dot files in my top level directory, so perhaps it's one of those causing the issue. I moved all of those out of the repo into a temporary folder elsewhere, ran the command again and... same result. At this point I was a bit stumped, and ran some sanity checks to confirm that my -exclude-dir's were formatted properly and working (they were), and thus moved on to an even more meticulous approach.

I took all of the files and directories out of the repo, and then began adding them back in 1 by 1, running CompileDaemon each time to check whether I got one of the errors of interest. This led to another breakthrough: I identified a new directory qux/ which, when added back, caused the same "bad file descriptor" error.

The interesting thing here is, if I navigate into qux/ and run the CompileDaemon command, I don't get this error! CompileDaemon runs as expected! (It gives me another error since qux/ is not a Go directory and so the build command doesn't work, but I'm presuming that would only happen after the section of code that's throwing these errors has run successfully). Huh? So it seems to me that the program attempting to read the directory itself is what's causing the error.

As far as I can tell, there is nothing out of the ordinary about this directory. It is a perfectly normal directory of regular size with the same attributes as the other directories currently in the repo that aren't causing this issue:

$ ls -li
total 0
848305 drwxr-xr-x   6 ibeckermayer  staff  192 Apr 15 15:05 wibble
848321 drwxr-xr-x   5 ibeckermayer  staff  160 May  8 09:28 wobble
848329 drwxr-xr-x  11 ibeckermayer  staff  352 May  8 09:28 qux
848744 drwxr-xr-x  22 ibeckermayer  staff  704 May  8 09:28 wubble

At this point I'm thoroughly stumped, seeking suggestions of what I might try next.

Thank you so much for the debugging effort, this is quality information!

It might of course be that there is something special going on for that directory (or your file system/disk is broken :)), but for the sake of our sanity let's go through some more favorable theories/questions:

  • could it be that it is just the number of files that is exceeded at some point? Maybe there's a ulimit in place that triggers a check which fails and as a consequence the file descriptor is deemed 'bad'?
  • maybe the directory qux is not important and any other sufficiently 'large' (in terms of number of files contained or possibly the depth of subdirectories) directory might suffice, such as the .git directory for example - can you simply swap qux with .git and the same error pops out?
  • maybe there's some race condition of sorts - is some other program writing to qux simultaneously?
  • what makes qux special (if it really is)? does stat qux show something wildly different than for another 'working' directory?

Thanks again for putting time and effort into this :)