evolution of `grab_stdout` to `grab_all`

Question

evolution of `grab_stdout` to `grab_all`

lePereT opened this issue 2 years ago · comments

Richard Thanki commented 2 years ago

Hi Xopxe

There are a few things about the selector.grab_stdout tool that don't work for me:

Doesn't grab stderr which many tools such as wget use extensively
Doesn't provide feedback on when a long(ish) running command has completed and no way of finding the exit code of the program that has been run
Leaks FDs (because waitpid must be called on child processes otherwise they become zombie processes)

I've put together a demo of what a more comprehensive function might look like. Some things to note:

I use a variation of the double fork technique. The grandchild process is the one that executes the program and its direct parent (the child process) calls a blocking nixio.waitpid() on the grandchild's pid. Once this returns it sends the return code (the marker of success or failure) on a stream, much like the stdout and stderr streams
This uses a more complex input table to avoid code repetition
It should be easy to alter grab_stdout to use a function like this under the hood

Okay the function looks like this:

  M.grab_all = function(command, handles)
    local function run_shell_nixio(command)
      local out_r, out_w = nixio.pipe()
      local err_r, err_w = nixio.pipe()
      local ret_r, ret_w = nixio.pipe()
      local childpid = nixio.fork()
      assert(childpid ~= nil, errmsg)
      if childpid > 0 then --parent
        out_w:close()
        err_w:close()
        ret_w:close()
        return {out_r=out_r, err_r=err_r, ret_r=ret_r}, pid
      else
        local grandchildpid = nixio.fork()
        assert(grandchildpid ~= nil, errmsg)
        if grandchildpid > 0 then -- child
          out_r:close(); out_w:close()
          err_r:close(); err_w:close()
          ret_r:close()
          local _, _, exit_code = nixio.waitpid(grandchildpid)
          ret_w:write(exit_code)
          ret_w:close()
          os.exit()
        else -- grandchild
          nixio.dup(out_w, nixio.stdout)
          nixio.dup(err_w, nixio.stderr)
          out_r:close(); out_w:close()
          err_r:close(); err_w:close()
          ret_r:close(); ret_w:close()
          nixio.exec("/bin/sh", "-c", command)
        end
      end
    end
    local descriptors, pid = run_shell_nixio(command)
    local sockets = {}
    for k, v in pairs(descriptors) do
      local sktd=init_sktd({
        pattern = normalize_pattern(handles[k].pattern),
        handler = handles[k].handler,
        fd = descriptors[k],
        events = descriptors[k]
      })
      -- sktd.events = {data=sktd.fd}
      if sktd.pattern=='*l' and handles[k].handler == 'stream' then sktd.pattern=nil end
      register_client(sktd)
      sockets[k] = sktd
    end
    return sockets, pid
  end

An example of how it works is like this:

--look for packages one folder up.
package.path = package.path .. ";;;../../?.lua;../../?/init.lua"

local sched = require 'lumen.sched'
local stream = require 'lumen.stream'
local selector = require 'lumen.tasks.selector'
local nixio = require 'nixio'

selector.init({service='nixio'})

local handles = {
    out_r = {
        pattern = "*a",
        handler = stream.new()
    },
    err_r = {
        pattern = "*a",
        handler = stream.new()
    },
    ret_r = {
        pattern = "*a",
        handler = stream.new()
    },
}

-- task that issues command
sched.run(function()
    local sktd, pid = selector.grab_all('ping -c 5 8.8.8.8', handles)
    local a = handles.ret_r.handler:read()
    if a ~= nil then
        print("return code: "..a)
    end
    nixio.waitpid(pid)
    print("function completed!")
end)

-- task receives ping std_out
sched.run(function()
	while true do
        local a, b, c = handles.out_r.handler:read()
        if a ~= nil then
            print("stdout: "..a)
        else
            print("returning from stdout wait")
            return
        end
    end
end)

-- task receives ping std_err
sched.run(function()
	while true do
        local a, b, c = handles.err_r.handler:read()
        if a ~= nil then
            print("stderr: "..a)
        else
            print("returning from stderr wait")
            return
        end
    end
end)

sched.loop()

This is very much a first attempt. Would love your input. I'm sure this could be made more elegant (through functional composition perhaps to execute the double fork and waiting on the grandchild pid), but it works well and allows for efficient execution of external commands (no need to poll on the pid to ascertain program completion). Would love your thoughts on this before I submit a PR.

Also a very cool addition would be to bind stdin to this process, perhaps through a stream as well so that we can take full input/output command of external programs. I tried extending this just by adding another handler specified the same as the other 3 but just managed to achieve 100% CPU utilisation, which is no fun - I suspect you understand posix sockets much better than me :)

Jorge · Answer 1 · Tue Dec 27 2022 00:07:11 GMT+0800 (China Standard Time)

This is very nice. I don't see an obvious way to "improve" it. There's something that caught my attention, tough. nixio.waitpd is blocking, no? This should halt the whole application until something happens with that pid. Perhaps the waitpd should be polled in a loop with a nohang parameter, with a smallish sched.sleep between calls?
(also, the asserts after the fork() calls, they do nothing?)

I've thought about streaming into stdin in the past, but as I didn't need it at the time this just was left behind. Now looks like a good opportunity o look again into it.

The sockets are messy, and the selector api is overcomplicated :(

Richard Thanki · Answer 2 · Tue Dec 27 2022 10:15:32 GMT+0800 (China Standard Time)

Thank you!

In my code I use waitpid twice:

In the grab_all function itself, waitpid is called by the child process on the pid of the grandchild process. This is fine because this isn't running in the lumen process, we want this one to block :)
In the example lumen process code I only call waitpid on the pid of the child process once we've received output from the stream bound to the return code. This only happens once the waitpid in the child process has returned and the child itself has sent the return code. At this point all the child has left to do is call os.exit() and so the block, if any, would be miniscule.

Actually for 2. I tested with the following code:

sched.run(function()
    local sktd, pid = selector.grab_all('ping -c 5 8.8.8.8', handles)
    local a = handles.ret_r.handler:read()
    if a ~= nil then
        print("return code: "..a)
    end
    local start = sched.get_time()
    local method = "non-blocking"
    local attempts = 1
    if method == "non-blocking" then
        while true do
            local _, state, code = nixio.waitpid(pid, "nohang")
            if state == "exited" then break end
            attempts = attempts + 1
            sched.sleep(0.001)
        end
    elseif method == "blocking" then
        local state, code = nixio.waitpid(pid)
    end
    local time_taken = sched.get_time() - start
    print(attempts, "attempts were required at waitpid, took ",time_taken, "seconds" )
    print("function completed!")
end)

a typical result for method = "non-blocking" was

5       attempts were required at waitpid, took         0.0043909549713135      seconds

and a typical result for method = "blocking" was

1       attempts were required at waitpid, took         0.002899169921875       seconds

These times varied around 0.004 seconds for each approach. So the non-blocking approach might well be worth doing!

Re the Selector API, as I gain more experience with it I'll see if I can make any suggestions. As I've mentioned before, when I have time I'm studying the Snabb implementation of concurrent ML in Lua (#23 (comment)). This library integrates with epoll rather than poll and select but that should be a detail.

Only my spidey senses tingling at the moment but I suspect it would be really nice to maybe build a task/activity based multitasking system like Lumen (which I find immensely intuitive) on the basis of CML operations. Something that avoids the inherent ugliness of coloured functions (async/sync versions of everything) but can build more natural structured concurrency.

Jorge · Answer 3 · Tue Dec 27 2022 11:03:21 GMT+0800 (China Standard Time)

A quick note, is the waitpid in the main program neccesary? Can't you detect the 'closed' error when reading from the stream?

Richard Thanki · Answer 4 · Tue Dec 27 2022 11:22:39 GMT+0800 (China Standard Time)

We need to waitpid in the main program to avoid zombie processes hanging around eating up process descriptors https://duyanghao.github.io/ways_avoid_zombie_process/

Jorge · Answer 5 · Thu Jan 12 2023 00:46:21 GMT+0800 (China Standard Time)

Oh I see.

The fact that the library's user has call nixio.waitpid outside grab_all, in the main program, is not very safe/elegant.
Have you tried setting the SIGCHLD signal to SIG_IGN (the "3rd method" in the link you shared) using nixio.signal? As I understand it, you could do it inside grab_all before the fork.