evolution of `grab_stdout` to `grab_all`
lePereT opened this issue · comments
Hi Xopxe
There are a few things about the selector.grab_stdout
tool that don't work for me:
- Doesn't grab
stderr
which many tools such aswget
use extensively - Doesn't provide feedback on when a long(ish) running command has completed and no way of finding the exit code of the program that has been run
- Leaks FDs (because
waitpid
must be called on child processes otherwise they become zombie processes)
I've put together a demo of what a more comprehensive function might look like. Some things to note:
- I use a variation of the double fork technique. The grandchild process is the one that executes the program and its direct parent (the child process) calls a blocking
nixio.waitpid()
on the grandchild's pid. Once this returns it sends the return code (the marker of success or failure) on a stream, much like thestdout
andstderr
streams - This uses a more complex input table to avoid code repetition
- It should be easy to alter
grab_stdout
to use a function like this under the hood
Okay the function looks like this:
M.grab_all = function(command, handles)
local function run_shell_nixio(command)
local out_r, out_w = nixio.pipe()
local err_r, err_w = nixio.pipe()
local ret_r, ret_w = nixio.pipe()
local childpid = nixio.fork()
assert(childpid ~= nil, errmsg)
if childpid > 0 then --parent
out_w:close()
err_w:close()
ret_w:close()
return {out_r=out_r, err_r=err_r, ret_r=ret_r}, pid
else
local grandchildpid = nixio.fork()
assert(grandchildpid ~= nil, errmsg)
if grandchildpid > 0 then -- child
out_r:close(); out_w:close()
err_r:close(); err_w:close()
ret_r:close()
local _, _, exit_code = nixio.waitpid(grandchildpid)
ret_w:write(exit_code)
ret_w:close()
os.exit()
else -- grandchild
nixio.dup(out_w, nixio.stdout)
nixio.dup(err_w, nixio.stderr)
out_r:close(); out_w:close()
err_r:close(); err_w:close()
ret_r:close(); ret_w:close()
nixio.exec("/bin/sh", "-c", command)
end
end
end
local descriptors, pid = run_shell_nixio(command)
local sockets = {}
for k, v in pairs(descriptors) do
local sktd=init_sktd({
pattern = normalize_pattern(handles[k].pattern),
handler = handles[k].handler,
fd = descriptors[k],
events = descriptors[k]
})
-- sktd.events = {data=sktd.fd}
if sktd.pattern=='*l' and handles[k].handler == 'stream' then sktd.pattern=nil end
register_client(sktd)
sockets[k] = sktd
end
return sockets, pid
end
An example of how it works is like this:
--look for packages one folder up.
package.path = package.path .. ";;;../../?.lua;../../?/init.lua"
local sched = require 'lumen.sched'
local stream = require 'lumen.stream'
local selector = require 'lumen.tasks.selector'
local nixio = require 'nixio'
selector.init({service='nixio'})
local handles = {
out_r = {
pattern = "*a",
handler = stream.new()
},
err_r = {
pattern = "*a",
handler = stream.new()
},
ret_r = {
pattern = "*a",
handler = stream.new()
},
}
-- task that issues command
sched.run(function()
local sktd, pid = selector.grab_all('ping -c 5 8.8.8.8', handles)
local a = handles.ret_r.handler:read()
if a ~= nil then
print("return code: "..a)
end
nixio.waitpid(pid)
print("function completed!")
end)
-- task receives ping std_out
sched.run(function()
while true do
local a, b, c = handles.out_r.handler:read()
if a ~= nil then
print("stdout: "..a)
else
print("returning from stdout wait")
return
end
end
end)
-- task receives ping std_err
sched.run(function()
while true do
local a, b, c = handles.err_r.handler:read()
if a ~= nil then
print("stderr: "..a)
else
print("returning from stderr wait")
return
end
end
end)
sched.loop()
This is very much a first attempt. Would love your input. I'm sure this could be made more elegant (through functional composition perhaps to execute the double fork and waiting on the grandchild pid), but it works well and allows for efficient execution of external commands (no need to poll on the pid to ascertain program completion). Would love your thoughts on this before I submit a PR.
Also a very cool addition would be to bind stdin
to this process, perhaps through a stream as well so that we can take full input/output command of external programs. I tried extending this just by adding another handler specified the same as the other 3 but just managed to achieve 100% CPU utilisation, which is no fun - I suspect you understand posix sockets much better than me :)
This is very nice. I don't see an obvious way to "improve" it. There's something that caught my attention, tough. nixio.waitpd
is blocking, no? This should halt the whole application until something happens with that pid. Perhaps the waitpd
should be polled in a loop with a nohang
parameter, with a smallish sched.sleep
between calls?
(also, the asserts after the fork() calls, they do nothing?)
I've thought about streaming into stdin
in the past, but as I didn't need it at the time this just was left behind. Now looks like a good opportunity o look again into it.
The sockets are messy, and the selector api is overcomplicated :(
Thank you!
In my code I use waitpid
twice:
- In the
grab_all
function itself,waitpid
is called by the child process on the pid of the grandchild process. This is fine because this isn't running in the lumen process, we want this one to block :) - In the example lumen process code I only call
waitpid
on the pid of the child process once we've received output from the stream bound to the return code. This only happens once thewaitpid
in the child process has returned and the child itself has sent the return code. At this point all the child has left to do is callos.exit()
and so the block, if any, would be miniscule.
Actually for 2. I tested with the following code:
sched.run(function()
local sktd, pid = selector.grab_all('ping -c 5 8.8.8.8', handles)
local a = handles.ret_r.handler:read()
if a ~= nil then
print("return code: "..a)
end
local start = sched.get_time()
local method = "non-blocking"
local attempts = 1
if method == "non-blocking" then
while true do
local _, state, code = nixio.waitpid(pid, "nohang")
if state == "exited" then break end
attempts = attempts + 1
sched.sleep(0.001)
end
elseif method == "blocking" then
local state, code = nixio.waitpid(pid)
end
local time_taken = sched.get_time() - start
print(attempts, "attempts were required at waitpid, took ",time_taken, "seconds" )
print("function completed!")
end)
a typical result for method = "non-blocking"
was
5 attempts were required at waitpid, took 0.0043909549713135 seconds
and a typical result for method = "blocking"
was
1 attempts were required at waitpid, took 0.002899169921875 seconds
These times varied around 0.004 seconds for each approach. So the non-blocking approach might well be worth doing!
Re the Selector API, as I gain more experience with it I'll see if I can make any suggestions. As I've mentioned before, when I have time I'm studying the Snabb implementation of concurrent ML in Lua (#23 (comment)). This library integrates with epoll
rather than poll
and select
but that should be a detail.
Only my spidey senses tingling at the moment but I suspect it would be really nice to maybe build a task/activity based multitasking system like Lumen (which I find immensely intuitive) on the basis of CML operations. Something that avoids the inherent ugliness of coloured functions (async/sync versions of everything) but can build more natural structured concurrency.
A quick note, is the waitpid in the main program neccesary? Can't you detect the 'closed' error when reading from the stream?
We need to waitpid in the main program to avoid zombie processes hanging around eating up process descriptors https://duyanghao.github.io/ways_avoid_zombie_process/
Oh I see.
The fact that the library's user has call nixio.waitpid outside grab_all, in the main program, is not very safe/elegant.
Have you tried setting the SIGCHLD signal to SIG_IGN (the "3rd method" in the link you shared) using nixio.signal? As I understand it, you could do it inside grab_all before the fork.