google / guava

Google core libraries for Java

New method: public <T> void Streams.iterate(Stream<T> stream, Consumer<? super T> action);

archiecobbs opened this issue · comments

1. What are you trying to do?

I am trying to make it easier to avoid a common bug, and maybe also make the bug more widely known.

Here's the bug: Most people probably think that this:

Stream<T> s = ...
List<T> list = new ArrayList<>();
s.forEachOrdered(list::add);
System.out.println(list);

is the same thing as this:

Stream<T> s = ...
List<T> list = new ArrayList<>();
for (Iterator<T> i = s.iterator(); i.hasNext(); )
    list.add(i.next());
System.out.println(list);

but that's not the case, because neither Stream.forEach() nor Stream.forEachOrdered() is guaranteed to run the action in the current thread (see this link for a discussion).

So while the second example always works as expected, the first may print an empty list, or a half-populated list, or throw an exception or who-knows-what (as ArrayList is not thread-safe).

For regular people who are likely to fall into this trap, having Guava provide a Streams.iterate() method would help clue them in.

But even for those smarties who understand this bug and want to avoid it, doing so is somewhat awkward, and Guava providing a Streams.iterate() method would make their lives a little easier.

2. What's the best code you can write to accomplish that without the new feature?

Option 1:

Stream<T> s = ...
List<T> list = new ArrayList<>();
for (Iterator<T> i = s.iterator(); i.hasNext(); )
    list.add(i.next());

The problem with option 1 is that it's wordy and inelegant.

Option 2:

Stream<T> s = ...
List<T> list = new ArrayList<>();
for (T t : (Iterable<T>)s::iterator)
    list.add(t);

Slightly less wordy, but still ugly because of the cast, which is something of a moral violation: Streams are not really Iterable, because they can only be used once.

Also, this option is less efficient, because it compiles down into an INVOKEDYNAMIC which actually creates an Iterable object, which in turn just invokes Stream.iterator(). Those are unnecessary extra hops.

3. What would that same code look like if we added your feature?

Stream<T> s = ...
List<T> list = new ArrayList<>();
Streams.iterate(s, list::add);

Ah, yes. Short, sweet, and race-free.

(Optional) What would the method signatures for your feature look like?

public static <T> void iterate(Stream<T> stream, Consumer<? super T> action);
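For illustration only, here is a minimal sketch of how a method with that signature could be written on top of Stream.iterator(). The class name StreamIteration is hypothetical and nothing here is part of Guava; the sketch also assumes the caller remains responsible for closing the stream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;
import java.util.function.Consumer;
import java.util.stream.Stream;

/** Hypothetical helper class; not an existing Guava API. */
final class StreamIteration {
    private StreamIteration() {}

    /**
     * Passes each remaining element of the stream to the action.
     * The loop below calls the action directly, so the action always
     * runs on the thread that called iterate().
     */
    static <T> void iterate(Stream<T> stream, Consumer<? super T> action) {
        Objects.requireNonNull(stream, "stream");
        Objects.requireNonNull(action, "action");
        Iterator<T> it = stream.iterator();
        while (it.hasNext()) {
            action.accept(it.next());
        }
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        iterate(Stream.of("a", "b", "c"), list::add);
        System.out.println(list); // prints [a, b, c]
    }
}
```

Because the consumer is invoked from an ordinary loop, a plain ArrayList is safe as the target here, which is the property the request is after.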

Concrete Use Cases

Here's my real world story:

I used maven-modernizer-plugin, which complains if you use methods like Guava's Iterables.filter() because it says you should be using the Stream API now.

OK fine, I'd love to use the Stream API, and so when my code had a major version bump, I took the opportunity.

I reworked a bunch of my code so methods that previously returned Iterable now return Stream.

Then I replaced a bunch of for-each loops over Iterables with calls to Stream.forEach() and Stream.forEachOrdered().

Great! Modern code, right?

Oops. I suddenly realized that I had just moved all of the code in my loops onto some other random thread (possibly), and therefore inadvertently spammed my code with race condition bugs.

For a concrete example of how this matters: I have several APIs that previously returned Iterable<byte[]> and now return Stream<byte[]>. Guess what? The bytes in each byte[] array are no longer guaranteed to be up-to-date when they are accessed in the loops, because that code could now be running on some random thread, and nothing in my application sets up the synchronization required to make the array contents visible to that thread.

This is something of a booby trap and I have to imagine there are lots of people with the same issue (whether they know it or not). It would be nice if there were a simple, well-known (via Guava) solution to this common problem.

Packages

com.google.common.collect

This may not be helpful, but other alternative solutions are to use Collections.synchronizedList, Iterators.addAll, or Iterator.forEachRemaining. I personally never minded the classic for-each loop idiom or a while loop when dealing with iterators. Apache Commons also has IteratorUtils.forEach, if you use that library.
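As a rough sketch of the first alternative mentioned above, using only the JDK: wrapping the target list with Collections.synchronizedList keeps forEachOrdered usable even if the action were run on another thread. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical example class, JDK only. */
final class SynchronizedListAlternative {
    private SynchronizedListAlternative() {}

    static <T> List<T> collectOrdered(Stream<T> stream) {
        // Every add() goes through the wrapper's lock, so the list stays
        // consistent regardless of which thread performs the action.
        List<T> list = Collections.synchronizedList(new ArrayList<>());
        stream.forEachOrdered(list::add);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(collectOrdered(Stream.of("x", "y", "z"))); // prints [x, y, z]
    }
}
```

Note that this only protects the list itself. It does nothing for code in the action that depends on thread-confined state such as a ThreadLocal.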

fwiw, contrary to what the BaseStream Javadoc claims, I have observed iterator() eagerly materialize the full I/O stream before traversal, resulting in an initialization pause and a memory spike. In that case I switched to forEach to lazily materialize the next element. I'd be afraid that anyone running into your problem might still be surprised, and may find the implementation shifting underneath them, as their assumptions are not contractual.

I applaud the desire to avoid depending on the current implementation. In this particular case, though, I don't think we should add this API.

Part of that is that I'm not sure the hermeneutics discussion—thanks for the link!—has resolved whether the choice of thread is unspecified. I could see reading "The behavior of this operation is explicitly nondeterministic" as referring only to ordering, with the following sentence's "For parallel stream pipelines" as a qualifier that extends all the way to "in whatever thread the library chooses." That may still be wrong: The Goetz quote speaks of removing the one sentence about parallel pipelines, which would leave us with a clearly unqualified statement that threading is unspecified. Then we could try to contend with passages like this one, in which we're told that a forEach* call like yours is unsafe "if executed in parallel"... which likewise doesn't prove that it otherwise is safe :)

I admit that it's not totally implausible to imagine that a Stream might be backed by, say, an RPC, which might receive responses on multiple threads (though I probably wouldn't recommend it). In that case, even if we allow for multiple threads to run the Consumer, I'd still be tempted to say that it is defensible to expect the Stream to set up appropriate happens-before edges so that operations like ArrayList::add are well defined. (Certainly that's true for the specific example of forEachOrdered, which makes an explicit guarantee of that.) Now even that could lead to problems if a caller were depending on ThreadLocal state.

But at some point, this becomes a case in which I wouldn't try to spend our "prescriptiveness budget" in advance of real problems: If we push users to use our API because forEach might someday lead to problems, they're going to be annoyed at the loss of single-statement stream operations, and some of them still aren't going to do what we recommend. So, if someday we see real problems from this, then we're going to have a cleanup to perform no matter what. It will be more efficient for us to do it for all of Google than for us to ask hundreds of developers to do it gradually over the years. (The world outside Google is another story, of course, but we can't force developers to use our method there any more than we can migrate them to it.)

Part of that is that I'm not sure the hermeneutics discussion—thanks for the link!—has resolved whether the choice of thread is unspecified...

For the record, there's no ambiguity on this point. The Javadoc for Stream.forEach() says:

For any given element, the action may be performed at whatever time and in whatever thread the library chooses.

And for Stream.forEachOrdered() it says:

...for any given element, the action may be performed in whatever thread the library chooses.

Seems pretty clear to me.

Regarding this:

If we push users to use our API because forEach might someday lead to problems...

It sounds like you are OK with developers using an API incorrectly in such a way that could lead to difficult-to-find bugs and security holes, as long as the way that the API happens to be implemented today by the particular implementation that most people happen to be using today just happens to not trigger these bugs today.

Hmm, OK. I'm sure it will be fine.... what could go wrong? :)

Right, my first contention is that it's not clear whether "the action may be performed at whatever time and in whatever thread the library chooses" is qualified by "For parallel stream pipelines" or not. My second contention is that the library provides any necessary happens-before edges to keep your example working correctly (and might also do so in the case of forEach, but that's unclear).

My third contention is that we have almost no power to prevent people from using this API incorrectly. If it starts to bite people in practice and they come crawling back, I am happy to belatedly offer to rescue them :)

but isn't this feature request equivalent to s.iterator().forEachRemaining(action)?

Ah, true! I'd almost suggested that forEachOrdered would be good enough, but there was the ThreadLocal caveat. But your way solves that. @archiecobbs, what do you think?
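To make the ThreadLocal point concrete, here is a small sketch of the s.iterator().forEachRemaining(action) form. CONTEXT is a hypothetical stand-in for whatever per-thread state a caller might rely on, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical example class, JDK only. */
final class ForEachRemainingSketch {
    private ForEachRemainingSketch() {}

    // Stand-in for any ThreadLocal state the caller depends on.
    static final ThreadLocal<String> CONTEXT = ThreadLocal.withInitial(() -> "unset");

    static List<String> tagWithContext(Stream<String> stream) {
        List<String> out = new ArrayList<>();
        // forEachRemaining is invoked by this thread and calls the action
        // from inside that invocation, so the action reads this thread's CONTEXT.
        stream.iterator().forEachRemaining(e -> out.add(CONTEXT.get() + ":" + e));
        return out;
    }

    public static void main(String[] args) {
        CONTEXT.set("caller");
        System.out.println(tagWithContext(Stream.of("a", "b"))); // prints [caller:a, caller:b]
    }
}
```

The action sees the value set by the calling thread, and a plain ArrayList is enough because nothing else touches it.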