mihaifm / linq

linq.js - LINQ for JavaScript

Iterable support

alexkunin opened this issue · comments

I've stumbled upon the following behavior (JSFiddle):

const s = new Set([ 1, 2, 3 ]);
const a = Enumerable.from(s.entries());
console.log(a.toArray());
console.log(a.toArray());

The output is:

[[1, 1], [2, 2], [3, 3]]
[]

So the iterator is traversed once and never gets reset, which is understandable by itself.
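For reference, the same one-shot behavior can be reproduced in plain JavaScript without linq.js. This is only a minimal sketch of the language behavior, not of what the library does internally: the object returned by s.entries() is its own iterator, so a second traversal continues from the exhausted state.

```javascript
const s = new Set([1, 2, 3]);
const it = s.entries();

// A Set iterator returns itself from [Symbol.iterator](), so there is
// nothing to "restart": every traversal shares the same position.
const isOwnIterator = it[Symbol.iterator]() === it;

const first = [...it];  // consumes the iterator
const second = [...it]; // the iterator is already exhausted

console.log(isOwnIterator); // true
console.log(first);         // [[1, 1], [2, 2], [3, 3]]
console.log(second);        // []
```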

But it also leads to two "kinds" of IEnumerable -- ones that can be traversed multiple times and ones that cannot, which might be very confusing and can lead to obscure bugs (ask me how I know...).

Wouldn't it be better to throw an exception (somewhere in this snippet: https://github.com/mihaifm/linq/blob/master/linq.js#L310)? That would be a breaking change, of course, but the exception will be triggered only in cases where double iterator traversal already happens, and I don't think anyone would rely on this behavior anyway.

Or maybe I'm missing something?

commented

IMO it should work as in .NET: allow multiple enumerations.

Agreed, looks like a bug. s.entries() is an iterable object in JavaScript, but it should still be reinitialized when calling toArray(). I'll see if I can come up with a solution.

I found a potential solution, but unfortunately it involves buffering. The problem is that you can't really reset or clone an iterator in JavaScript. Some details are here:

https://stackoverflow.com/questions/46416266/how-to-clone-an-iterator-in-javascript

The answer there is a pretty impressive piece of code, but still it revolves around buffering. When the data comes in from the iterator, we save it to a buffer and then we feed values from that buffer.
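The buffering idea can be sketched roughly as follows. This is not the code from the Stack Overflow answer or from the attached file; bufferIterator is a hypothetical helper name used only for illustration, and it keeps every value in memory, which is exactly the cost discussed in this thread.

```javascript
// Hypothetical helper (not part of linq.js): wraps a one-shot iterator in a
// re-iterable object by remembering every value pulled from the source.
function bufferIterator(iterator) {
  const buffer = []; // values already pulled from the source
  let done = false;  // true once the source iterator is exhausted

  return {
    *[Symbol.iterator]() {
      let i = 0;
      while (true) {
        if (i < buffer.length) {
          yield buffer[i++]; // replay a buffered value
        } else if (done) {
          return; // nothing buffered is left and the source is exhausted
        } else {
          const step = iterator.next(); // pull one more value from the source
          if (step.done) {
            done = true;
          } else {
            buffer.push(step.value);
          }
        }
      }
    },
  };
}

const buffered = bufferIterator(new Set([1, 2, 3]).values());
const firstPass = [...buffered];  // reads from the source and fills the buffer
const secondPass = [...buffered]; // served entirely from the buffer

console.log(firstPass);  // [1, 2, 3]
console.log(secondPass); // [1, 2, 3]
```

Values are pulled lazily, so a traversal that stops early buffers only what it has seen; a full traversal still ends up holding the whole sequence.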

I've included that function in linq.js and it works fine (file attached).
linq.zip

But the question of buffering still remains. What if you have a stream of 1M records? Do you want to buffer all that data so you can convert it to an array later?

Also, it seems like JavaScript and C# handle iterators differently. Take these two pieces of code:

JS:

function* words() {
    yield "abc";
    yield "def";
}
var w = words();
var a = w;

for(var item of a)
{
    console.log(item);
}
for(var item of a)
{
    console.log(item);
}

Output:

abc
def

C#:

static IEnumerable<string> words()
{
    yield return "abc";
    yield return "def";
}
var w = words();
var a = w;

foreach (var item in a)
{
    Console.WriteLine(item);
}
foreach (var item in a)
{
    Console.WriteLine(item);
}

Output:

abc
def
abc
def

Seems like C# does some serious wizardry there, since it doesn't "consume" the iterator, even when using an alias to the object twice. I'm not sure we can replicate this behavior, not without buffering at least.
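One way to read the difference: in C#, each foreach calls GetEnumerator() on the IEnumerable and starts a fresh enumeration, whereas the JS generator object above is itself the iterator. A closer JS analogue, shown here only as an illustrative sketch, is an iterable object whose [Symbol.iterator] method is a generator, so each for...of loop gets a new generator and no buffering is needed.

```javascript
// An iterable (not an iterator): every for...of loop calls
// [Symbol.iterator]() and receives a brand-new generator.
const words = {
  *[Symbol.iterator]() {
    yield "abc";
    yield "def";
  },
};

const seen = [];
for (const item of words) {
  seen.push(item);
}
for (const item of words) {
  seen.push(item);
}

console.log(seen); // ["abc", "def", "abc", "def"]
```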

Leaving this open for discussion.

commented

@mihaifm looks like there is no good solution. We should decide what type of behavior the Enumerable provides: JS-like or .NET-like. If JS-like, then the repeated enumeration should always return an empty result (even in the cases when it currently returns the actual result). If .NET-like, then it has to do the buffering.

Just a few thoughts, in random order:

  1. Implicit buffering does not look good; the "stream of 1M records" argument is very real.

  2. Library user could introduce buffering easily: Enumerable.from([ ...iterator ]).

  3. Restarting an iterator would be great in some cases, but not something JS supports. What if we introduce the following form for restartable iterators: Enumerable.from(() => set.entries())

  4. If an iterator is supplied directly (i.e. not as in p.3 above), it cannot be restarted, so the library user gets a non-empty result only the first time, which might cause hard-to-debug errors. Effectively, Enumerable has two modes -- repeatable and non-repeatable (and if non-repeatable, you can't tell whether it was already evaluated once) -- and if it comes from outside, you can't tell one from the other.

  5. One way to handle p.4 is to throw an exception when restart is attempted.

  6. Another way is to introduce "guard" methods like materialize(), ensureRestartable(), etc. But this would transfer the complexity of handling the situation to the library user.

  7. Finally, it might be beneficial to separate the use cases from the very beginning: if Enumerable.from() gets an iterator as input, the next method call should decide the behavior, something like .buffer(), .throwOnRestart(), etc. Any other method would throw immediately.

  8. There are other places where things are implicitly converted to Enumerable, e.g. first argument of .join(). Covering all possible places with more complex approaches like additional methods could be really hard to implement. Just throwing an exception seems like the smallest evil here.
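Points 3 and 5 above can be sketched in plain JavaScript, independent of linq.js. The names fromFactory and throwOnRestart are hypothetical and exist only for this illustration; they are not library APIs, and the sketch makes no claim about how linq.js implements either idea.

```javascript
// Hypothetical sketch, not linq.js code.

// Point 3: a restartable source. Every traversal asks the factory
// for a fresh iterator, so nothing has to be buffered.
function fromFactory(factory) {
  return { [Symbol.iterator]: () => factory() };
}

// Point 5: a one-shot source that fails loudly on a second traversal
// instead of silently yielding nothing.
function throwOnRestart(iterator) {
  let started = false;
  return {
    [Symbol.iterator]() {
      if (started) {
        throw new Error("iterator was already traversed");
      }
      started = true;
      return iterator;
    },
  };
}

const set = new Set([1, 2, 3]);

const restartable = fromFactory(() => set.values());
const pass1 = Array.from(restartable);
const pass2 = Array.from(restartable);
console.log(pass1, pass2); // [1, 2, 3] [1, 2, 3]

const oneShot = throwOnRestart(set.values());
const only = Array.from(oneShot);
let restartError = null;
try {
  Array.from(oneShot); // second traversal attempt
} catch (e) {
  restartError = e;
}
console.log(only);                 // [1, 2, 3]
console.log(restartError.message); // "iterator was already traversed"
```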

What if we introduce the following form for restartable iterators: Enumerable.from(() => set.entries())

This is actually a great suggestion. I've made the code change and run some tests, and it's looking good. I will add this to the library.

As for passing in the iterator directly, I'm not sure yet and am still considering options. I'm not a fan of adding exceptions, since the behavior is present in vanilla JavaScript.

But I think if we add the lambda input that you suggested, we should be good: there will be three different ways of initializing the object, depending on the needs:

Enumerable.from(() => set.entries()) // reusable iterator
Enumerable.from(set.entries()) // iterator that is consumed after the first iteration
Enumerable.from([ ...set.entries() ]) // all the values are loaded into memory

Committed the changes: 45a45fb