eclipse / jnosql

Eclipse JNoSQL is a framework whose goal is to help Java developers create Jakarta EE applications with NoSQL databases.

Support for iterables

rkrashr opened this issue · comments

DocumentCollectionManager::search produces a list.
Do we assume that result will always comfortably fit in memory?
In the NoSQL world, it often doesn't, and this prevents the whole (very convenient) API from being used at all.

We need a separate search implementation that returns an iterable, or a stream, or an observable

This would be easy to create from the existing search implementation. Looking at the drivers, most of the time they do iterable -> stream(iterable).collect(toList).
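For context, that pattern looks roughly like this (runNativeQuery and query are placeholders for driver-specific pieces): the native result is already lazy, but collecting materializes every entity before anything is returned.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical driver code: the native cursor behind the Iterable is lazy...
Iterable<DocumentEntity> nativeResult = runNativeQuery(query);

// ...but collecting forces the whole result set into memory at once.
List<DocumentEntity> entities = StreamSupport.stream(nativeResult.spliterator(), false)
        .collect(Collectors.toList());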

Hey, that totally makes sense. Furthermore, Iterable does not have mutation methods, so it is perfect for read-only operations.
I'll also move this discussion to the mailing list, so please stay tuned.

Thank you for the feedback

It’s been on my mental back-burner to try out some lazy-loading options in my driver. I’ve considered if it’d work to make a List implementation that lazy-loads internally, since that’d keep some handy methods that Iterable doesn’t have. On the other hand, maybe Stream would be most idiomatic now, and could have efficient count, etc. operations while also having a stronger implication of lazy-loading.

-Jesse
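Not something from a driver yet, just a minimal sketch of that lazy-loading List idea, assuming the driver can expose its cursor as an Iterator: elements are pulled only as they are first accessed, while size() still has to drain the whole cursor.

import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch: a read-only List view over a lazily consumed Iterator (e.g. a database cursor).
public class LazyList<T> extends AbstractList<T> {

    private final Iterator<T> source;
    private final List<T> fetched = new ArrayList<>();

    public LazyList(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public T get(int index) {
        fetchUpTo(index);
        return fetched.get(index);
    }

    @Override
    public int size() {
        // size() cannot be answered lazily: it drains the remaining cursor.
        while (source.hasNext()) {
            fetched.add(source.next());
        }
        return fetched.size();
    }

    @Override
    public Iterator<T> iterator() {
        // AbstractList's default iterator calls size(), so provide a truly lazy one.
        return new Iterator<T>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                fetchUpTo(next);
                return next < fetched.size();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return fetched.get(next++);
            }
        };
    }

    private void fetchUpTo(int index) {
        while (fetched.size() <= index && source.hasNext()) {
            fetched.add(source.next());
        }
    }
}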

Hey @otaviojava , thanks for your comments. I am new to this project; I just thought this would be useful.
Stream would work; in a generic way, I would leave it as an Iterable or an Rx Observable. This way it is easier to handle exceptions.

Ok, I checked several drivers and yes, they usually return an Iterable.
I have two solutions in mind; both will change the return type of all search results.

We can change the result to Iterable itself, e.g.:

Iterable<DocumentEntity> query(String query);

What I propose is to have an interface that represents this result as an Iterable, therefore:

  1. We can have a Stream default method to make it easier for the end user.
  2. We can create a default supplier that, given the entity result from the database and a converter that translates the entities to the Jakarta NoSQL API, creates a default Result for the drivers, which makes driver support easier as well.
import java.util.Objects;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public interface Result<T> extends Iterable<T> {
    Stream<T> toStream();

    static <E, T> Result<T> of(Iterable<E> entities, Function<E, T> converter) {
        Objects.requireNonNull(converter, "converter is required");
        Objects.requireNonNull(entities, "entities is required");
        final ResultSupplier<T, E> supplier = ServiceLoaderProvider.get(ResultSupplier.class); //resolved via the ServiceLoader mechanism
        return supplier.apply(converter, entities);
    }

    interface ResultSupplier<T, E> extends BiFunction<Function<E, T>, Iterable<E>, Result<T>> {

    }
}

The new return:

Result<DocumentEntity> query(String query);

So:

   //database provider side
  Iterable<DatabaseEntity> entities = ...; //native query result
  Function<DatabaseEntity, DocumentEntity> converter = ...; //the translation logic
  final Result<DocumentEntity> result = Result.of(entities, converter); //return this value to the user

  //lazy value
  for (DocumentEntity entity : result) {
       System.out.println(entity);
  }
  //return as Stream
  final Stream<DocumentEntity> entityStream = result.toStream();
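Not part of the proposal itself, just a minimal sketch of what a default ResultSupplier implementation could look like (DefaultResultSupplier is a made-up name; Result and ResultSupplier are the types from the proposal above): the driver's Iterable is kept as-is and the converter is applied lazily, element by element.

import java.util.Iterator;
import java.util.function.Function;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Sketch of a default ResultSupplier for the drivers.
public class DefaultResultSupplier<T, E> implements Result.ResultSupplier<T, E> {

    @Override
    public Result<T> apply(Function<E, T> converter, Iterable<E> entities) {
        return new Result<T>() {

            @Override
            public Iterator<T> iterator() {
                Iterator<E> source = entities.iterator();
                return new Iterator<T>() {

                    @Override
                    public boolean hasNext() {
                        return source.hasNext();
                    }

                    @Override
                    public T next() {
                        // the conversion only happens when the element is consumed
                        return converter.apply(source.next());
                    }
                };
            }

            @Override
            public Stream<T> toStream() {
                return StreamSupport.stream(spliterator(), false);
            }
        };
    }
}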

@rkrashr That is an amazing way to start!
Thank you.

On the topic of working with iterables instead of lists: I like having either Iterable or Stream as the default internally, and then having the intermediary code able to convert between them. The nice thing there is that it could end up being whatever is most efficient for the driver... in Darwino, the internal Cursor maps pretty closely to Stream operations, so it could be very efficient to implement that directly.
Maybe ideally it'd be up to the driver: default to Iterable as the standard mechanism of exchange and use StreamSupport conversion as the default, but also have the option for drivers to produce customized Streams.

by @jesse-gallagher

Yes, looks good to me - easy to take current List-based implementations and switch to that, but then also smooth to add in efficient specialized versions.

Having lazy collections in general would take care of one of the places in my apps where I have to dip into the native API - namely, getting collection counts. The more I can work with Repository objects, the better.

I've created the first PR: jakartaee/nosql#19

@keilw what do you think?

@otaviojava @jesse-gallagher Can you explain the need for a duplicate of List or Collection? Hazelcast or Apache Ignite offer distributed implementations of List, Set or both, so I'm not sure if duplicating parts of the Java Collections API makes NoSQL more attractive. Collection is also Iterable, so where is the need to introduce a custom implementation or extension to that?

@keilw

We are not duplicating Collection; indeed, we're using Iterable instead.

Why?

That is because several NoSQL databases use the cursor pattern, which means they retrieve the information lazily instead of putting everything in memory.
The idea of using Iterable instead of Collection is to keep this behavior in the API. Furthermore, Collection has methods to add and remove elements, but since the result is read-only, Iterable is more suitable. I believe that is why several NoSQL drivers use it.

Vendors such as:

  • MongoDB
  • Cassandra
  • ArangoDB

However, I would strongly argue against the toStream() method, even if for some reason we stick to Iterable only. The behavior should be similar to Collection.
See the JavaDoc saying "Returns a sequential {@code Stream} with this collection as its source." This is pretty much the same as the toStream method, so why not call it stream() instead?

Whether or not we need a parallel stream, we can decide later.
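For reference, Collection.stream() is itself just a default method built on the spliterator, so a Result that extends Iterable could offer the same thing without any extra work for drivers; a sketch, not the exact API change:

import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public interface Result<T> extends Iterable<T> {

    /**
     * Returns a sequential {@link Stream} with this result as its source,
     * mirroring what {@code Collection.stream()} does.
     */
    default Stream<T> stream() {
        return StreamSupport.stream(spliterator(), false);
    }
}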

Ok, that makes sense.

@keilw
Done,

    /**
     * Returns a sequential {@link Stream} with this result as its source.
     *
     * @return a {@link Stream} from the result
     */
    Stream<T> stream();

@rkrashr @jesse-gallagher @keilw

I found an issue with this path.
Iterable allows the iterator to be obtained and run multiple times, and that does not work with several vendors because they use a cursor.

My question:
What do you think about using Stream directly instead of this Result interface?

We can create a utility class to help the driver create a Stream from an Iterable; that is not a big deal.

So:

Stream<DocumentEntity> query(String query);

That's a good argument for implementing Stream as the default lowest level. And, like you say, the important thing is that these are all largely interchangeable if desired. So List-based implementations now are good with Iterable and Stream, and then either of those two could convert to the other pretty easily. It'd just be up to the driver's choice to pick the one that is most efficient for it.

From an Iterable I can create a Stream easily:

Stream<?> stream = StreamSupport.stream(iterable.spliterator(), false);

In the drivers' case:

Iterable<NativeEntity> nativeEntity = nativeEntityResult;
Stream<DocumentEntity> stream = StreamSupport.stream(nativeEntity.spliterator(), false)
        .map(convertToDocumentEntity()); //the driver's own mapping function

And it will only execute when a terminal operation is called, so it stays lazy.
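A minimal sketch of the utility class mentioned above (the names StreamUtils and NativeEntity are hypothetical): it bundles the StreamSupport conversion and the mapping so each driver does not have to repeat them.

import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper: a native Iterable in, a lazy Stream of API entities out.
public final class StreamUtils {

    private StreamUtils() {
    }

    public static <E, T> Stream<T> of(Iterable<E> entities, Function<E, T> converter) {
        Objects.requireNonNull(entities, "entities is required");
        Objects.requireNonNull(converter, "converter is required");
        // nothing is read from the underlying cursor until a terminal operation runs
        return StreamSupport.stream(entities.spliterator(), false).map(converter);
    }
}

A driver's query method would then simply return StreamUtils.of(nativeEntities, converter), with the converter being its own mapping code.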

About keeping both options, that looks like a challenge.
What do you have in mind?

That's exactly what I mean - in the cases where a driver is best off using Iterable internally, it's easy to convert it to a naive Stream. Then, for the ones (like mine) that fit Stream better, that gives the most capability and flexibility.

I know, the team is a bit smaller so far (as are many Jakarta EE specs, btw, which are sometimes only worked on by 2-4 people ;-)), so I wonder if we could run this by a few people from the old JPA EG (https://www.jcp.org/en/jsr/detail?id=338) like @lukasj (Jakarta Persistence Lead), @emmanuelbernard (also Hibernate), @alexsnaps (he forked Hibernate OGM and was also at JCrete, although I don't know if he joined the Jakarta EE sessions) or @odrotbohm, who has already made suggestions here earlier?

I agree that forcing the API into List might be a bit invasive, especially when it comes to large datasets as that means that you'll have to read all data into objects first before you hand them all out as result. This might result in long execution times and higher memory usage. In fact, the latter can't even be countered with asynchronous executions via a CompletableFuture as that still has to prepare the entire list. If the work on a particular thread takes a bit of time, chances are high that you're exhausting the thread pool pretty quickly and are back to blocking behavior.

Ultimately, we think reactive APIs are the right approach as that is able to really stream data out of the store, enable continuous queries and at the same time keep the memory profile low, even if you process GBs of data. That said, I totally understand that this is probably out of scope for at least right now. Also, this changes the programming model completely and users might not be eager to get into that in the first place.

So for Spring Data staying with Iterable in the core interfaces had a couple of upsides and downsides:

Pros

  • We don't prescribe having to use List to implementations.
  • APIs that are list based can still use List as return types by simply redeclaring the method making stronger guarantees about the returned value. Users can then refer to the store specific type that made the re-declarations if they absolutely need List in their code. However, this limits portability between stores but API-level portability has never been a primary goal for Spring Data.
  • We were able to introduce an optional type Streamable that makes up for Iterable's lack of methods to easily stream over the result and provides basic functionality to map over the results etc. We don't use it in our type signatures (yet), but users are free to redeclare methods with it and the underlying proxy will adapt to it naturally (see the sketch after this list).
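As a rough illustration of the redeclaration idea (Person and PersonRepository are made-up names; Streamable is org.springframework.data.util.Streamable from Spring Data Commons):

import org.springframework.data.repository.Repository;
import org.springframework.data.util.Streamable;

// Made-up domain type; mapping details omitted.
class Person {
    Long id;
    String lastname;
}

// The core interfaces stick to Iterable; an application-facing repository can
// redeclare a query method with Streamable to get map/filter/stream on the result.
interface PersonRepository extends Repository<Person, Long> {

    Streamable<Person> findByLastname(String lastname);
}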

Cons

  • While using Iterable and even more so Stream as return type avoids the collection of objects into a list in the best case, it raises the question about the usage of that Iterable and how to properly handle resources required by the Iterable. Let's take JPA's Query.getResultStream() for example: to eventually process the results contained in it, the underlying EntityManager, database connection etc. still has to stay open. Also, we need a signal that the consumption of the stream has ended, to know when we actually have to release the handles. I.e. Stream instances returned from repositories have to be used within a try-with-resources block (to get the ….close() signal) and the Spring specific @Transactional method, as that controls the lifecycle of the EntityManager.

This is quite a constraint on how users can actually make use of Stream, and it's also tricky to properly prevent misuse. If user code hands the Stream out of such a code block, it will happily process a few results depending on how long the closing of the actual database connection takes. This leads to very subtle bugs which look totally weird from the outside. Even worse, with a connection pool in place you might still run a query on a connection that's already been returned to the pool and see weird errors from a different thread that by accident got this particular connection handed out from the pool.

In Spring Data, we at least check that the invocation of a Stream-returning method happens within a transactional method. Be sure to understand that @Transactional does not necessarily mean a database transaction here, but rather a unit of work that just controls resource lifecycle (EntityManager, database connection etc.).
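To make that concrete, a rough sketch of the pattern this implies (Cursor and openCursor stand in for driver-specific pieces; the Cursor is assumed to be Iterable and closeable): the driver attaches its cleanup with Stream.onClose, and the caller delivers the close() signal via try-with-resources.

import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Driver side (hypothetical): the cursor holds the handle that must be released.
Cursor cursor = openCursor(query);
Stream<DocumentEntity> result = StreamSupport.stream(cursor.spliterator(), false)
        .onClose(cursor::close); // runs when the Stream is closed

// User side: the try-with-resources block is what triggers onClose.
try (Stream<DocumentEntity> entities = result) {
    entities.forEach(System.out::println);
}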

Thanks for the detailed input. Especially the last part also relates to ideas we discussed here and in other places (e.g. the MicroProfile mailing list or the JCrete Unconference) about how this spec and a possible subset of JPA could interact with or build upon each other in Jakarta EE from 9 on.

Thank you @odrotbohm for the feedback; I really appreciate the time you took to write it.
Yes, that is a tough decision.
If we go the Stream way, we need to make this clear through documentation.

I'll admit that I've only used Stream fairly straightforwardly to date, but would it suffice to decide that terminal operations are expected to have the side effect of closing the backing resources?

The iterator() method would be a weird one there, since it's officially terminal, but a user would reasonably expect it to still be lazy-loading from a DB perspective. Even then, that method is explicitly called out in the Javadocs as being weird: "Only the terminal operations iterator() and spliterator() are not; these are provided as an "escape hatch" to enable arbitrary client-controlled pipeline traversals in the event that the existing operations are not sufficient to the task."

@odrotbohm, very well put.
I would not call handling open resources a con, though. It just sets the requirements for handling associated resources and makes sure they are properly closed and disposed of when the stream is drained. And this includes an associated cursor, but not the associated DB connection.
And roughly, the user-side code goes like this:

try (var connectionManager = openConnection()) {
    Stream<?> dataset = connectionManager.build().stream();
    dataset.forEach(System.out::println);
}

but not like this:

Stream<?> dataset;
try (var connectionManager = openConnection()) {
    dataset = connectionManager.build().stream();
}
dataset.forEach(System.out::println);

@jesse-gallagher

You can use the same concept you'd use with Iterable: when there is no next element, you can close it. So we will face the same issues with both Stream and Iterable.
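A rough sketch of that idea, assuming a hypothetical Cursor type with hasNext(), next() and close(): the wrapping Iterator closes the cursor as soon as it reports exhaustion, and it works the same whether the result is exposed as an Iterable or wrapped into a Stream. It covers the fully drained case; abandoning the result early still needs the explicit close() signal discussed above.

import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical: Cursor is the driver's native, closeable result handle.
public class ClosingIterator<T> implements Iterator<T> {

    private final Cursor<T> cursor;
    private boolean closed;

    public ClosingIterator(Cursor<T> cursor) {
        this.cursor = cursor;
    }

    @Override
    public boolean hasNext() {
        if (closed) {
            return false;
        }
        if (cursor.hasNext()) {
            return true;
        }
        cursor.close(); // release the handle as soon as the result is drained
        closed = true;
        return false;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return cursor.next();
    }
}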