[linux] please provide parameter that allows to control how `cmdline()` splits

Question

[linux] please provide parameter that allows to control how `cmdline()` splits

calestyo opened this issue 6 months ago · comments

Christoph Anton Mitterer commented 6 months ago

Summary

OS: linux
Type: core

Description

Hey.

While skimming through the code because of #2334 I've stumbled over:
https://github.com/giampaolo/psutil/blob/master/psutil/_pslinux.py#L1783-L1798
and #1179.

I think doing automagically this actually a problem (and depending on how the results are used, may even be a security issue, or e.g. lead to killing of wrong processes).

Simply splitting by spaces, just because some heuristics thinks it would have found out that arguments were split by them, may or may not be correct.

Consider the simple example:

$ cp -a /bin/bash 'foo bar'
$ ./foo\ bar

Then one has:

$ hd /proc/878729/cmdline 
00000000  2e 2f 66 6f 6f 20 62 61  72 00                    |./foo bar.|
0000000a

which already fits the condition. Trailing 0x0, one list element, space contained.

So the user will get wrongly a two element list.

Another simple example might be:

unzip t x.zip

I is unzip t x.zip or is it unzip t x.zip or even unzip t x.zip.

If programs change their cmdline, anything can happen... and I don't think it's good to further mess that up.

The main problem is however, that the caller of cmdline() cannot know whether this happened or not, so he cannot even undo it.

My suggest would be:
cmdline() should be safe, and return the contents split by 0x0 only.

But there could be an parameter for it, which enables the splitting by space in the (presumed) well known cases.

Such parameter would even allow to apply different styles of splitting.

So one could have: split=raw which gives back the unmodified bytes (no splitting at all), split="null" (which IMO should be default), which splits by 0x0 only, and possibly further ones including the single-element with spaces case.

Thanks,
Chris.

PS: Is it really possibly that cmdline does not end in 0x0 (even after a process has modified it)? I wasn't able to get that done... any examples?

Giampaolo Rodola · Answer 1 · Mon Dec 11 2023 03:10:30 GMT+0800 (China Standard Time)

I'm afraid adding a parameter which works on Linux only is a no-go from an API standpoint. A user interested in getting the raw cmdline for whatever reason should just read /proc/pid/cmdline. Still, he would have the exact same problem we have: how to interpret it?

IMO we can only improve the internal heuristic, assuming it can be improved (I don't know). As for your examples: I don't understand the hd output. If we want to discuss examples then let's please use something like this instead:

$ python3 -c 'print(repr(open("/proc/self/cmdline").read()))'
'python3\x00-c\x00print(repr(open("/proc/self/cmdline").read()))\x00'

...so I can clearly see where the separators are. Can you re-paste the examples above by using this instead of hd? I would like to understand how likely / realistic it is the bump into a wrong interpretation of the cmdline separators.

Christoph Anton Mitterer · Answer 2 · Tue Dec 12 2023 07:08:07 GMT+0800 (China Standard Time)

Well its the nature of such deep system information, that it may be platform dependent.

And things get worse, when even within a platform, there's no definite answer.

Linux e.g. says in proc(5):

       /proc/pid/cmdline
              This read-only file holds the  complete  command  line  for  the
              process,  unless  the  process is a zombie.  In the latter case,
              there is nothing in this file: that is, a read on this file will
              return 0 characters.  The command-line arguments appear in  this
              file  as a set of strings separated by null bytes ('\0'), with a
              further null byte after the last string.

              If, after an execve(2), the process modifies its  argv  strings,
              those  changes will show up here.  This is not the same thing as
              modifying the argv array.

              Furthermore, a process may change the memory location that  this
              file refers via prctl(2) operations such as PR_SET_MM_ARG_START.

              Think  of  this  file as the command line that the process wants
              you to see.

So it may be the "original" program name & arguments separated (and terminated) by 0x0, or anything else (not sure whether it can be non-0x0 terminated though, but let's assume it for safety).

As the manpage says, it may not even be valid to think of it as a "command line" in the sense of a program name followed by arguments (which I think, the current heuristic tries to employ however).

It could be the current position based on GPS and local weather report.

hd output was just a hex print, left side the addresses, middle hex values, right side their ASCII representation (except for control characters).

For the example I had given above, your command would print:

$ python3 -c 'print(repr(open("/proc/53755/cmdline").read()))'
'./foo bar\x00'

But I don't see why such parameter would only work on Linux? raw would just return whatever the source of information gives, no further processing (i.e. on Linux it would be simply the raw bytes contents of /proc/<pid>/cmdline).

I'd also tend to say, that at least raw should return bytes and no str.

What I've had named null above (bad name, better perhaps something like native), would be the "natively" (and perhaps also naively ;-) ) split, command line.

In case of Linux, that would IMO be simply /proc/<pid>/cmdline split at 0x0 bytes. Really just that, no further "optimisations". That is, if there were were no 0x0 at all, it would still give only one field. If there were multiplie consecutive 0x0 e.g. /usr/bin/ssh\0\0\0 it would give [b'/usr/bin/ssh', b'', b'', b'']. And there would be no magic trying to split at spaces o so.

I don't know what other platforms provide in terms of data, but for another platform, that e.g. doesn't use 0x0, native would still return what the platform would return as program name and arguments. E.g. if windows would have a syscall for that, which already returns an array,.. then this would be what it uses.

Last but not least, there could be some guess (or a better name) mode, which does what's done right now: (hopefully) smart guesswork, e.g. trying to split on spaces and trying to find out, what the command name is.

But there should be a warning that this may also fail miserably, e.g. as in the examples I gave above with the various ways one could interpret unzip t x.zip.

Last but not least, in a way this issue may also be a tiny bit "security" relevant (similar to #2334).

Well not exactly like that actually. With #2334, the main problem is that the kernel’s /proc/<pid>/exe would be a trustworthy value, while /proc/<pid>/cmdline is known to be not.

But what can still happen is e.g. the following scenario:

I'm stumbled over using psutil and over these issues, when I was starting to re-implement an old shell script of mine, that kills OpenSSH [mux] processes. When the network dies or changes the IP, these processes would still remain open for some timeout time, and during that even new SSH connections would fail (as they'd try to use the [mux] process for it.

As my tool, would kill processes, I'm trying to make sure it's really an OpenSSH [mux] process and not just one that might accidentally look like one. So for once I check exe and then also cmdline which must match some specific pattern. And actually there's even one more check, using a string extracted from cmdline.

However, if the results of cmdline are processed by some heuristics, this may quite easily falsify my checks, and I may even kill the wrong processes (or - less problematic - not kill the right ones).

Thanks,
Chris.

Giampaolo Rodola · Answer 3 · Tue Dec 12 2023 08:29:11 GMT+0800 (China Standard Time)

But I don't see why such parameter would only work on Linux? raw would just return whatever the source of information gives, no further processing (i.e. on Linux it would be simply the raw bytes contents of /proc//cmdline).

As I said above, because even with the original raw bytes you would still have the same problem of not knowing what the separator is, so you would end up guessing anyway. You can go back to the raw cmdline by converting the string into bytes if that's what you want:

is_possible_tmux = b"tmux" in bytes(" ".join(psutil.Process(pid).cmdline()), "utf8")

The trailing separator is trimmed, but other than that no information is lost.

Realistically I don't think any of this matter though. cmdline() is not supposed to be an API to uniquely identify a process (heck, it can even be altered!). What you typically do is convert cmdline() into string and search for a match ("tmux" in your case). Usually you also add name() and exe() to the mix to make it more robust, but even then you'll still have no guarantees that you're killing the right process. It's similar to using ps / top followed by killall basically.

With #2334, the main problem is that the kernel’s /proc//exe would be a trustworthy value, while /proc//cmdline is known to be not.

Agreed. exe() is different than cmdline() and name(), and should give more guarantees.