nightroman / SplitPipeline

Parallel Data Processing in PowerShell

Split-Pipeline should also accept PSObject[] for input object for performance reasons

EklipZgit opened this issue · comments

Hi, when piping large numbers of objects over the pipeline, you take a serious performance hit. Adding support for
$output = Split-Pipeline -InputObject $someLargeArray ....
would allow for drastic performance increases in some scenarios.

Try $input = 0..10000 as input and test piping vs. accepting a single PSObject[] array.
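For example, a rough way to see this overhead (an illustrative sketch only; the thread's actual measurements appear further down):

$items = 0..10000
Measure-Command { $items | ForEach-Object { $_ * 2 } }   # each item sent over the pipeline
Measure-Command { foreach ($i in $items) { $i * 2 } }    # whole array handled in one pass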

I know this from experience: I've written a similar cmdlet to this one, except wrapping RSJobs, and gained substantial performance increases by reducing the number of small objects sent individually over the pipeline this way.

Hi, I am listening. But why don't you just pipe the objects? The very name Split-Pipeline suggests that input should be piped.

(If I understand your request correctly) Some native PowerShell cmdlets support this and some "similar" do not. There are similar requests and the PowerShell team is not inclined to change the original design.

I would say, if a cmdlet provides a parameter named InputObject and its type is not an array then the standard usage is piping objects.
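For illustration, here is how this plays out with ForEach-Object, whose -InputObject is a scalar: an array passed by parameter arrives as a single item, while piping delivers the items one by one.

ForEach-Object -InputObject (1..3) { $_.GetType().Name }   # Object[] (the whole array is $_)
1..3 | ForEach-Object { $_.GetType().Name }                # Int32, Int32, Int32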

substantial performance increases by reducing the number of small objects sent individually over the pipeline this way.

If we have some convincing proof of this, then it would be a reason to consider the change. To be honest, I have some doubts. The difference may be visible only in the case of very fast processing of objects. And that is a case where Split-Pipeline is not recommended; it may work slower than a regular pipeline.

Do we have a proof that the change is needed?

commented

Maybe an -ArgumentList parameter would be better than a smart InputObject.

See PoshRSJob for an example,
and Invoke-Command, of course.

Hi @mazzy-ax
I do not quite understand how your suggestion is related to the topic :) PoshRSJob has InputObject for one thing and ArgumentList for another. So does Invoke-Command. OP is not against InputObject. He suggests making it [object[]].

commented

I agree with you - 'InputObject for one thing and ArgumentList for another'. I wanted to say:

  • if anyone wants to send an array to the script block
  • then maybe it's better to use param() in the script block instead of pipe hacking :)

I don't know how to use a script block's param() with the Split-Pipeline cmdlet. And I would like Split-Pipeline to get -ArgumentList, to send variables of any type to the script block (as Invoke-Command does).

And I hope that -ArgumentList would also cover EklipZgit's suggestion in this topic.
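For reference, the param()/-ArgumentList pattern described above, shown with Invoke-Command (Split-Pipeline itself currently has no -ArgumentList; this is only an illustration and the names are arbitrary):

$data = 1..100000
Invoke-Command -ScriptBlock { param($items) $items.Count } -ArgumentList (,$data)   # the unary comma passes the whole array as one argument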

Your suggestion makes sense. I have not investigated the pros and cons of this approach yet. But it is a different topic, as far as I can tell.

The thing is, Split-Pipeline is designed as a very close substitute for ... | .{process{ ... }} or the less efficient and slightly different ForEach-Object. In such forms, almost nobody uses script parameters; script blocks just use available variables. So Split-Pipeline provides the parameter Variable in order to make these variables available in the parallel pipelines. If one changes his mind about using Split-Pipeline (say, temporarily, for troubleshooting) then he just goes back to "native" pipelines easily.
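For example, a minimal sketch of the Variable approach described above ($factor is just an illustrative name, assuming the module's documented -Variable parameter):

$factor = 2
1..10 | Split-Pipeline -Variable factor -Script { process { $_ * $factor } }   # $factor is imported into each parallel pipeline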

I close the issue. Feel free to reopen with new input on the original topic or open a new issue for another topic.

I can't seem to reopen the issue. Sorry for not getting back to you on this in a timely manner.

Compare the following:

Measure-Command {
    $input = 1..100000
    $result = ForEach-Object -InputObject $input { $_ * 2 }
}

Measure-Command {
    $result2 = 1..100000 | ForEach-Object { $_ * 2 }
}
The second is 25x slower despite having the same input and resulting in the exact same output.

The reason is that the pipeline is very, very expensive relative to the operations on the data to process.

However you choose to implement it, my request is simply that there is a way to pass the individual objects as an array rather than pipelining them. It makes a big difference on large numbers of small objects where the pipeline cost is a significant portion of the objects' processing time, especially since this cost is on the parent thread rather than amortized across the multithreading.

PoshRSJob already supports this for Start-RSJob, ForEach-Object supports this, and so on, for this precise reason. It seems like a relatively important thing to support when writing a module that is specifically for optimizing compute time.

Whether you want to add another parameter for the array in a different parameter set from InputObject is at your discretion, but in my experience the standard practice is [object[]] $InputObject (or leaving that parameter untyped).
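For reference, a generic sketch of that convention (not the actual Split-Pipeline source; Invoke-Demo is a made-up name): an [object[]] InputObject that also accepts pipeline input, so both call styles work.

function Invoke-Demo {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [object[]] $InputObject
    )
    process {
        foreach ($item in $InputObject) { $item * 2 }
    }
}
Invoke-Demo -InputObject (1..5)   # whole array passed as an argument
1..5 | Invoke-Demo                # items piped one by one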

I'm afraid you are comparing with something different. Like I said before, Split-Pipeline is similar to | .{process{..}}. If we measure

Measure-Command {
    $result2 = 1..100000 | .{process{ $_ * 2 }}
}

then the difference is 2x, not 25x. And this is the case of very fast processing, when Split-Pipeline should not be used. If we measure cases with some "real" processing then the difference will be close to none, namely the same ~30 milliseconds (in my case) for 100000 items no matter how slow the processing is, and in real cases it will be much slower than $_ * 2.

P.S. ForEach-Object is very slow; pipelines themselves are not that bad when used properly, e.g. with the code above.

Do not get me wrong. What you propose requires some work. I am interested if it is worth it. For the moment I have doubts.

The branch 180516-issue-19 contains the quick-and-dirty implementation of the feature. Please experiment and tell the results.

NB: Use -Verbose for some info and time measurements. The new way writes "EXPERIMENTAL" at the beginning.

Couldn't get the whole module to build, but I built the DLL and hot-swapped it; results look good:

#control
Measure-Command {
	$results = foreach ($i in 1..1000000){
		$i * 2
	}
}
#1068 MS

#pipeline
Measure-Command {
	$out = 1..1000000 | Split-Pipeline -Script { process{ $_ * 2 }}
}
#5787 MS

#new way
Measure-Command {
	$in = 1..1000000;
	$out2 = Split-Pipeline -InputObject $in -Script { process{ $_ * 2 }}
}
#2199 MS

I say merge it if your tests pass.

Thanks!

I still have doubts. Your test clearly shows a scenario where Split-Pipeline must not be used, i.e. 1068 ms without Split-Pipeline vs. 2199 ms with it. But I will think about this.

Thinking practically, the performance gain is almost nothing. 3-4 seconds for 1 million items? How often do we send 1 million items as an array? And even if we do then with some real processing we are rather dealing with minutes.

But this feature may still be convenient as an alternative way of passing items to the cmdlet. We could even make InputObject a positional parameter so that the name is not required.

v1.6.0

I agree the performance gain is minimal (compared to a real scenario, the cost of processing 1 million files for example), but for performance-oriented code, every bit matters in my opinion. 4 seconds saved on 4 minutes of operations is still more than a 1% performance gain.

The main concern was consistency with the practice of allowing non-pipeline input, which most cmdlets offer. Thanks for adding this in!