[linux] please provide parameter that allows to control how `cmdline()` splits
calestyo opened this issue · comments
Summary
- OS: linux
- Type: core
Description
Hey.
While skimming through the code because of #2334 I've stumbled over:
https://github.com/giampaolo/psutil/blob/master/psutil/_pslinux.py#L1783-L1798
and #1179.
I think doing automagically this actually a problem (and depending on how the results are used, may even be a security issue, or e.g. lead to killing of wrong processes).
Simply splitting by spaces, just because some heuristics thinks it would have found out that arguments were split by them, may or may not be correct.
Consider the simple example:
$ cp -a /bin/bash 'foo bar'
$ ./foo\ bar
Then one has:
$ hd /proc/878729/cmdline
00000000 2e 2f 66 6f 6f 20 62 61 72 00 |./foo bar.|
0000000a
which already fits the condition. Trailing 0x0
, one list element, space contained.
So the user will get wrongly a two element list.
Another simple example might be:
unzip t x.zip
I is unzip
t
x.zip
or is it unzip
t x.zip
or even unzip t
x.zip
.
If programs change their cmdline, anything can happen... and I don't think it's good to further mess that up.
The main problem is however, that the caller of cmdline()
cannot know whether this happened or not, so he cannot even undo it.
My suggest would be:
cmdline()
should be safe, and return the contents split by 0x0
only.
But there could be an parameter for it, which enables the splitting by space in the (presumed) well known cases.
Such parameter would even allow to apply different styles of splitting.
So one could have: split=raw
which gives back the unmodified bytes
(no splitting at all), split="null"
(which IMO should be default), which splits by 0x0
only, and possibly further ones including the single-element with spaces case.
Thanks,
Chris.
PS: Is it really possibly that cmdline
does not end in 0x0
(even after a process has modified it)? I wasn't able to get that done... any examples?
I'm afraid adding a parameter which works on Linux only is a no-go from an API standpoint. A user interested in getting the raw cmdline for whatever reason should just read /proc/pid/cmdline. Still, he would have the exact same problem we have: how to interpret it?
IMO we can only improve the internal heuristic, assuming it can be improved (I don't know). As for your examples: I don't understand the hd
output. If we want to discuss examples then let's please use something like this instead:
$ python3 -c 'print(repr(open("/proc/self/cmdline").read()))'
'python3\x00-c\x00print(repr(open("/proc/self/cmdline").read()))\x00'
...so I can clearly see where the separators are. Can you re-paste the examples above by using this instead of hd
? I would like to understand how likely / realistic it is the bump into a wrong interpretation of the cmdline separators.
Well its the nature of such deep system information, that it may be platform dependent.
And things get worse, when even within a platform, there's no definite answer.
Linux e.g. says in proc(5)
:
/proc/pid/cmdline
This read-only file holds the complete command line for the
process, unless the process is a zombie. In the latter case,
there is nothing in this file: that is, a read on this file will
return 0 characters. The command-line arguments appear in this
file as a set of strings separated by null bytes ('\0'), with a
further null byte after the last string.
If, after an execve(2), the process modifies its argv strings,
those changes will show up here. This is not the same thing as
modifying the argv array.
Furthermore, a process may change the memory location that this
file refers via prctl(2) operations such as PR_SET_MM_ARG_START.
Think of this file as the command line that the process wants
you to see.
So it may be the "original" program name & arguments separated (and terminated) by 0x0
, or anything else (not sure whether it can be non-0x0
terminated though, but let's assume it for safety).
As the manpage says, it may not even be valid to think of it as a "command line" in the sense of a program name followed by arguments (which I think, the current heuristic tries to employ however).
It could be the current position based on GPS and local weather report.
hd
output was just a hex print, left side the addresses, middle hex values, right side their ASCII representation (except for control characters).
For the example I had given above, your command would print:
$ python3 -c 'print(repr(open("/proc/53755/cmdline").read()))'
'./foo bar\x00'
But I don't see why such parameter would only work on Linux? raw
would just return whatever the source of information gives, no further processing (i.e. on Linux it would be simply the raw bytes
contents of /proc/<pid>/cmdline
).
I'd also tend to say, that at least raw
should return bytes
and no str
.
What I've had named null
above (bad name, better perhaps something like native
), would be the "natively" (and perhaps also naively ;-) ) split, command line.
In case of Linux, that would IMO be simply /proc/<pid>/cmdline
split at 0x0
bytes. Really just that, no further "optimisations". That is, if there were were no 0x0
at all, it would still give only one field. If there were multiplie consecutive 0x0
e.g. /usr/bin/ssh\0\0\0
it would give [b'/usr/bin/ssh', b'', b'', b'']
. And there would be no magic trying to split at spaces o so.
I don't know what other platforms provide in terms of data, but for another platform, that e.g. doesn't use 0x0
, native
would still return what the platform would return as program name and arguments. E.g. if windows would have a syscall for that, which already returns an array,.. then this would be what it uses.
Last but not least, there could be some guess
(or a better name) mode, which does what's done right now: (hopefully) smart guesswork, e.g. trying to split on spaces and trying to find out, what the command name is.
But there should be a warning that this may also fail miserably, e.g. as in the examples I gave above with the various ways one could interpret unzip t x.zip
.
Last but not least, in a way this issue may also be a tiny bit "security" relevant (similar to #2334).
Well not exactly like that actually. With #2334, the main problem is that the kernel’s /proc/<pid>/exe
would be a trustworthy value, while /proc/<pid>/cmdline
is known to be not.
But what can still happen is e.g. the following scenario:
I'm stumbled over using psutil
and over these issues, when I was starting to re-implement an old shell script of mine, that kills OpenSSH [mux]
processes. When the network dies or changes the IP, these processes would still remain open for some timeout time, and during that even new SSH connections would fail (as they'd try to use the [mux]
process for it.
As my tool, would kill processes, I'm trying to make sure it's really an OpenSSH [mux]
process and not just one that might accidentally look like one. So for once I check exe
and then also cmdline
which must match some specific pattern. And actually there's even one more check, using a string extracted from cmdline
.
However, if the results of cmdline
are processed by some heuristics, this may quite easily falsify my checks, and I may even kill the wrong processes (or - less problematic - not kill the right ones).
Thanks,
Chris.
But I don't see why such parameter would only work on Linux? raw would just return whatever the source of information gives, no further processing (i.e. on Linux it would be simply the raw bytes contents of /proc//cmdline).
As I said above, because even with the original raw bytes you would still have the same problem of not knowing what the separator is, so you would end up guessing anyway. You can go back to the raw cmdline by converting the string into bytes if that's what you want:
is_possible_tmux = b"tmux" in bytes(" ".join(psutil.Process(pid).cmdline()), "utf8")
The trailing separator is trimmed, but other than that no information is lost.
Realistically I don't think any of this matter though. cmdline()
is not supposed to be an API to uniquely identify a process (heck, it can even be altered!). What you typically do is convert cmdline()
into string and search for a match ("tmux" in your case). Usually you also add name()
and exe()
to the mix to make it more robust, but even then you'll still have no guarantees that you're killing the right process. It's similar to using ps
/ top
followed by killall
basically.
With #2334, the main problem is that the kernel’s /proc//exe would be a trustworthy value, while /proc//cmdline is known to be not.
Agreed. exe()
is different than cmdline()
and name()
, and should give more guarantees.