nightroman / SplitPipeline

Parallel Data Processing in PowerShell

Split-Pipeline should also accept PSObject[] for input object for performance reasons

EklipZgit opened this issue · comments

Hi, when piping large numbers of objects over the pipeline, you take a serious performance hit. Adding support for
$output = Split-Pipeline -InputObject $someLargeArray ....
would allow for drastic performance increases in some scenarios.

Try $input = 0..10000 as input and test piping vs. accepting a single PSObject[] array.
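For example, a rough way to see this overhead (an illustrative sketch only; the thread's actual measurements appear further down):

$items = 0..10000
Measure-Command { $items | ForEach-Object { $_ * 2 } }   # each item sent over the pipeline
Measure-Command { foreach ($i in $items) { $i * 2 } }    # whole array handled in one pass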

I know this from experience: I've written a similar cmdlet to this one, except wrapping RSJobs, and gained substantial performance increases by reducing the number of small objects sent individually over the pipeline this way.

Hi, I am listening. But why don't you just pipe the objects? The very name Split-Pipeline suggests that input should be piped.

(If I understand your request correctly) Some native PowerShell cmdlets support this and some "similar" do not. There are similar requests and the PowerShell team is not inclined to change the original design.

I would say, if a cmdlet provides a parameter named InputObject and its type is not an array then the standard usage is piping objects.
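For illustration, here is how this plays out with ForEach-Object, whose -InputObject is a scalar: an array passed by parameter arrives as a single item, while piping delivers the items one by one.

ForEach-Object -InputObject (1..3) { $_.GetType().Name }   # Object[] (the whole array is $_)
1..3 | ForEach-Object { $_.GetType().Name }                # Int32, Int32, Int32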

substantial performance increases by reducing the number of small objects sent individually over the pipeline this way.

If we have some convincing proof of this, then it would be a reason to consider the change. To be honest, I have some doubts. The difference may be visible only in the case of very fast processing of objects. And that is a case where Split-Pipeline is not recommended; it may work slower than a regular pipeline.

Do we have a proof that the change is needed?

commented

Maybe an -ArgumentList parameter would be better than a smart InputObject.

See PoshRSJob for an example,
and Invoke-Command, of course.

Hi @mazzy-ax
I do not quite understand how your suggestion is related to the topic :) PoshRSJob has InputObject for one thing and ArgumentList for another. So does Invoke-Command. OP is not against InputObject. He suggests making it [object[]].

commented

I agree with you - 'InputObject for one thing and ArgumentList for another'. I wanted to say:

  • if anyone wants to send an array to the script block
  • then maybe it's better to use param() in the script block instead of pipe hacking :)

I don't know how to use a script block's param() with the Split-Pipeline cmdlet. And I would like Split-Pipeline to get -ArgumentList, to send variables of any type to the script block (as Invoke-Command does).

And I hope that -ArgumentList would also cover EklipZgit's suggestion in this topic.
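For reference, the param()/-ArgumentList pattern described above, shown with Invoke-Command (Split-Pipeline itself currently has no -ArgumentList; this is only an illustration and the names are arbitrary):

$data = 1..100000
Invoke-Command -ScriptBlock { param($items) $items.Count } -ArgumentList (,$data)   # the unary comma passes the whole array as one argument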

Your suggestion makes sense. I have not investigated the pros and cons of this approach yet. But it is a different topic, as far as I can tell.

The thing is, Split-Pipeline is designed as a very close substitute for ... | .{process{ ... }} or the less efficient and slightly different ForEach-Object. In such forms, almost nobody uses script parameters; script blocks just use available variables. So Split-Pipeline provides the parameter Variable in order to make these variables available in the parallel pipelines. If one changes his mind about using Split-Pipeline (say, temporarily, for troubleshooting) then he just goes back to "native" pipelines easily.
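For example, a minimal sketch of the Variable approach described above ($factor is just an illustrative name, assuming the module's documented -Variable parameter):

$factor = 2
1..10 | Split-Pipeline -Variable factor -Script { process { $_ * $factor } }   # $factor is imported into each parallel pipeline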

I close the issue. Feel free to reopen with new input on the original topic or open a new issue for another topic.

I can't seem to reopen the issue. Sorry for not getting back to you on this in a timely manner.

Compare the following:

Measure-Command {
    $input = 1..100000
    $result = ForEach-Object -InputObject $input { $_ * 2 }
}

Measure-Command {
    $result2 = 1..100000 | ForEach-Object { $_ * 2 }
}
The second is 25x slower despite having the same input and resulting in the exact same output.

The reason is that the pipeline is very, very expensive relative to the operations on the data to process.

However you choose to implement it, my request is simply that there is a way to pass the individual objects as an array rather than pipelining them. It makes a big difference on large numbers of small objects where the pipeline cost is a significant portion of the objects' processing time, especially since this cost is on the parent thread rather than amortized across the multithreading.

PoshRSJob already supports this for Start-RSJob, ForEach-Object supports this, and so on, for this precise reason. It seems like a relatively important thing to support when writing a module that is specifically for optimizing compute time.

Whether you want to add another parameter for the array in a different parameter set from InputObject is at your discretion, but in my experience the standard practice is [object[]] $InputObject (or leaving that parameter untyped).
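For reference, a generic sketch of that convention (not the actual Split-Pipeline source; Invoke-Demo is a made-up name): an [object[]] InputObject that also accepts pipeline input, so both call styles work.

function Invoke-Demo {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [object[]] $InputObject
    )
    process {
        foreach ($item in $InputObject) { $item * 2 }
    }
}
Invoke-Demo -InputObject (1..5)   # whole array passed as an argument
1..5 | Invoke-Demo                # items piped one by one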

I'm afraid you are comparing with something different. Like I said before, Split-Pipeline is similar to | .{process{..}}. If we measure

Measure-Command {
    $result2 = 1..100000 | .{process{ $_ * 2 }}
}

then the difference is 2x, not 25x. And this is the case of very fast processing, when Split-Pipeline should not be used. If we measure cases with some "real" processing then the difference will be close to none, namely the same ~30 milliseconds (in my case) for 100000 items no matter how slow the processing is, and in real cases it will be much slower than $_ * 2.

P.S. ForEach-Object is very slow; pipelines themselves are not that bad when used properly, e.g. with the code above.

Do not get me wrong. What you propose requires some work. I am interested if it is worth it. For the moment I have doubts.

The branch 180516-issue-19 contains the quick-and-dirty implementation of the feature. Please experiment and tell the results.

NB: Use -Verbose for some info and time measurements. The new way writes "EXPERIMENTAL" at the beginning.

Couldn't get the whole module to build, but I built the DLL and hot-swapped it; results look good:

#control
Measure-Command {
	$results = foreach ($i in 1..1000000){
		$i * 2
	}
}
#1068 MS

#pipeline
Measure-Command {
	$out = 1..1000000 | Split-Pipeline -Script { process{ $_ * 2 }}
}
#5787 MS

#new way
Measure-Command {
	$in = 1..1000000;
	$out2 = Split-Pipeline -InputObject $in -Script { process{ $_ * 2 }}
}
#2199 MS

I say merge it if your tests pass.

Thanks!

I still have doubts. Your test clearly shows a scenario where Split-Pipeline must not be used, i.e. 1068 ms without Split-Pipeline vs. 2199 ms with it. But I will think about this.

Thinking practically, the performance gain is almost nothing. 3-4 seconds for 1 million items? How often do we send 1 million items as an array? And even if we do then with some real processing we are rather dealing with minutes.

But this feature may still be convenient as an alternative way of passing items to the cmdlet. We could even make InputObject a positional parameter so that the name is not required.

v1.6.0

I agree the performance gain is minimal (compared to a real scenario, the cost of processing 1 million files for example), but for performance-oriented code, every bit matters in my opinion. 4 seconds saved on 4 minutes of operations is still more than a 1% performance gain.

The main concern was consistency with the practice of allowing non-pipeline input, which most cmdlets offer. Thanks for adding this in!