Improve inconsistent collection return types in RowMetadata

lukaseder opened this issue 3 years ago

Feature Request

Is your feature request related to a problem? Please describe

The various RowMetadata methods send inconsistent signals about which kind of collection holds the ColumnMetadata. We have:

ColumnMetadata getColumnMetadata(int index); suggesting it is a List<ColumnMetadata>
ColumnMetadata getColumnMetadata(String name); suggesting it is a Map<String, List<ColumnMetadata>>
Iterable<? extends ColumnMetadata> getColumnMetadatas();
Collection<String> getColumnNames();

Describe the solution you'd like

Given that it is definitely a List since we can access elements by index, I would suggest these two are adapted:

List<? extends ColumnMetadata> getColumnMetadatas();
List<String> getColumnNames();

Apart from the oracle driver, I've seen that all of the drivers already covariantly override getColumnMetadatas() to return a List anyway.
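To illustrate why such a change is source-compatible for drivers that already return a List, here is a minimal sketch of a covariant override. The types RowInfo, ColumnInfo, SimpleColumn, and ListBackedRowInfo are hypothetical stand-ins, not the SPI interfaces; only the return types matter.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a column descriptor (not the real SPI type).
interface ColumnInfo {
    String getName();
}

// Hypothetical stand-in mirroring the current, weaker return type.
interface RowInfo {
    Iterable<? extends ColumnInfo> getColumnMetadatas();
}

final class SimpleColumn implements ColumnInfo {
    private final String name;

    SimpleColumn(String name) {
        this.name = name;
    }

    @Override
    public String getName() {
        return name;
    }
}

final class ListBackedRowInfo implements RowInfo {
    private final List<SimpleColumn> columns;

    ListBackedRowInfo(SimpleColumn... columns) {
        this.columns = Collections.unmodifiableList(Arrays.asList(columns));
    }

    // Covariant override: List<SimpleColumn> is a subtype of
    // Iterable<? extends ColumnInfo>, so the compiler accepts the narrower type.
    @Override
    public List<SimpleColumn> getColumnMetadatas() {
        return columns;
    }
}
```

Callers that hold the concrete type get size() and get(int), while callers that only know the interface still see an Iterable. Narrowing the interface itself to List would give the second group the same access.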

Teachability, Documentation, Adoption, Migration Strategy

This also affects #218, where similar, new API is being created for OUT parameters. I found this issue here while reviewing #218.

Mark Paluch · Answer 1 · Wed Apr 28 2021 21:45:44 GMT+0800 (China Standard Time)

That's basically the same for Row.get(String/int), isn't it? getColumnMetadata uses the same lookup mechanism as Row.get and that is due to the duality that we want to use names to identify result columns (aliases?) but actually a row has multiple columns and the only way to address those reliably is by index.

I think for getColumnMetadatas, we should be able to return List. getColumnNames is defined as:

Returns an unmodifiable collection of unique column names, and any attempts to modify the returned

which is actually a Set as per definition. However, the iteration order is aligned with the actual column order in which columns are returned (similar to LinkedHashSet). Additionally, there are case-insensitive lookup rules, which make it difficult to argue that getColumnNames is actually a List.

Lukas Eder · Answer 2 · Wed Apr 28 2021 22:11:08 GMT+0800 (China Standard Time)

I hadn't seen the "unique" bit. What is the usefulness of this collection being a Set of case insensitive column names?

I mean the relevant information is in getColumnMetadatas() if anyone actually needs reliable information about column names. Let's say, someone builds a tool (e.g. an editor) and wants to display column names dynamically. The user runs this query:

select 1 as "AA", 2 as "Aa", 3 as i, 4 as "aA", 5 as "aa", 6 as "AA";

The getColumnNames() method is quite useless in this case, but those users would use getColumnMetadatas() anyway, because they need type information as well.
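As a rough sketch of what goes wrong, the following applies one possible reading of "unique, case-insensitive, in column order" to the six labels of that query. The class and method names are hypothetical, and the rule (first occurrence wins, compared after lower-casing) is an assumption rather than something the contract spells out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ColumnNameCollapse {
    // Labels of the six result columns, in select-list order.
    static final List<String> LABELS = Arrays.asList("AA", "Aa", "i", "aA", "aa", "AA");

    // Assumed rule: keep the first label seen for each lower-cased key,
    // preserving the order in which the columns appear.
    static List<String> caseInsensitiveUnique(List<String> labels) {
        Map<String, String> firstSeen = new LinkedHashMap<>();
        for (String label : labels) {
            firstSeen.putIfAbsent(label.toLowerCase(Locale.ROOT), label);
        }
        return new ArrayList<>(firstSeen.values());
    }
}
```

Under that rule, six columns collapse into two names, so a tool that displays result columns cannot rely on such a collection.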

So, what is the method good for? I wouldn't mind if it were removed entirely...

Mark Paluch · Answer 3 · Wed Apr 28 2021 22:36:11 GMT+0800 (China Standard Time)

It is mostly used as shortcut. You can iterate over it in the order of column name appearance and call getColumnNames().contains(…) to avoid looping over getColumnMetadatas() to case-insensitively check the names of the individual columns.

because they need type information as well.

It depends on what you're doing. If types are driven by the database, either Row.get(…, Object.class) is fine, or said type information can be used.
If you have a conversion framework underneath, then basically the type metadata becomes unused because you try to convert the retrieved value into the type that is specified by a target property. Finally, Row.get(…, Integer.class) renders the type information unused, too, because someone is explicitly asking for a target type.

Lukas Eder · Answer 4 · Thu Apr 29 2021 00:11:36 GMT+0800 (China Standard Time)

It is mostly used as shortcut.

But who needs to use that shortcut? SPI consumers? Probably not, they need the "real thing". Users of R2DBC-the-API? They're not the target audience of this SPI.

You can iterate over it in the order of column name appearance

Alternatively: getColumnMetadatas().stream().map(m -> m.getName()).toList(), once getColumnMetadatas() returns a List

and call getColumnNames().contains(…) to avoid looping over getColumnMetadatas() to case-insensitively check the names of the individual columns.

But that is in violation of the Collection.contains() contract, which reads (emphasis mine):

"Returns true if this collection contains the specified element. More formally, returns true if and only if this collection contains at least one element e such that Objects.equals(o, e)."

(I know that SortedSet regrettably violates this contract too)
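A small sketch of that mismatch, using a TreeSet ordered by String.CASE_INSENSITIVE_ORDER as a stand-in for a case-insensitive name collection. The class and helper names are hypothetical; the drivers may implement their lookup differently.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Objects;
import java.util.TreeSet;

final class ContainsContractCheck {
    // Membership exactly as Collection.contains() documents it:
    // true only if some element e satisfies Objects.equals(o, e).
    static boolean containsPerContract(Collection<String> names, Object o) {
        for (String e : names) {
            if (Objects.equals(o, e)) {
                return true;
            }
        }
        return false;
    }

    // A sorted set whose comparator ignores case; its contains() consults
    // the comparator instead of equals().
    static Collection<String> caseInsensitiveNames(String... names) {
        TreeSet<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(names));
        return set;
    }
}
```

For caseInsensitiveNames("ID", "Name"), calling contains("id") on the set returns true, whereas containsPerContract(..., "id") returns false because no element equals "id".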

Besides, getColumnMetadata(name) != null also executes such a check.

I find it to be a confusing distraction, not really adding much value. It seems like a leftover from earlier iterations of the SPI... This bug here is another hint that the Collection type is not a good fit: r2dbc/r2dbc-mssql#200. I could report more, e.g. iterator() isn't implemented correctly in MssqlRowMetadata. And in fact, the current implementation doesn't implement the Set semantics you've mentioned (using the mssql driver):

System.out.println(
    Flux.from(cf.create())
        .flatMap(c -> Mono.from(c.createStatement("select 1 as a, 2 as a, 3 as a").execute()))
        .flatMap(it -> it.map((r, m) -> m.getColumnNames()))
        .blockFirst()
);

It yields:

MssqlRowMetadata [a, a, a]

Same with the H2, MariaDB, Oracle, PostgreSQL drivers. Everyone (including yourself 😉) overlooked this uniqueness requirement in the contract. It seems unnecessary and complicated to implement correctly. So, the status quo is effectively returning a List<String>.

In any case, something should be fixed here:

Either the implementations
Or the contract (probably the path of least resistance right now)
Or the method is removed

Mark Paluch · Answer 5 · Thu Apr 29 2021 14:19:12 GMT+0800 (China Standard Time)

Alternatively: getColumnMetadatas().stream().map(m -> m.getName()).toList(), once getColumnMetadatas() returns a List

That creates quite some GC pressure and CPU overhead so we're happy staying away from that and using good old for-loops. There's another aspect here. For drivers, it's pretty easy to reason about metadata retention while a consumer has to call getColumnMetadatas() on each mapped row since the consumer cannot reliably cache the metadata outcome. The result is increased memory pressure.
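For comparison, here is a sketch of both ways to extract the names from a List of column descriptors. Column is a hypothetical stand-in type, and the sketch makes no claim about the actual cost difference; it only shows that the two forms produce the same result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class ColumnNameExtraction {
    // Hypothetical stand-in for a column descriptor.
    static final class Column {
        private final String name;

        Column(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Stream pipeline; Collectors.toList() keeps this compatible with Java 8,
    // whereas Stream.toList() requires Java 16.
    static List<String> namesViaStream(List<Column> columns) {
        return columns.stream().map(Column::getName).collect(Collectors.toList());
    }

    // Indexed loop over the List, with the result pre-sized to avoid resizing.
    static List<String> namesViaLoop(List<Column> columns) {
        List<String> names = new ArrayList<>(columns.size());
        for (int i = 0; i < columns.size(); i++) {
            names.add(columns.get(i).getName());
        }
        return names;
    }
}
```

The indexed loop only becomes available once the method returns a List; with an Iterable, callers are limited to the iterator.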

getColumnMetadata(name) != null

getColumnMetadata(…) throws an exception if the column isn't present (both by index and by name, to remain consistent with Row.get(…)).

The discussed constraints make it somewhat inconvenient to come up with something reasonable that doesn't bring us into the business of specifying own collection-like interfaces.

It would make sense to address the impedance mismatch for contains without imposing needs for iterating on the client side before we remove getColumnNames(). I agree that iterating over getColumnNames() is as good as iterating over getColumnMetadatas() and extracting the name.

From that perspective, I suggest deprecating getColumnNames() for removal with 1.0 and introducing a contains(String) method (for now even as a default method) that basically corresponds with getColumnNames().contains(…). Everything else can be achieved through getColumnMetadatas().
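One way such a default method could look is sketched below. ColumnLookup and NamedColumn are hypothetical stand-ins for the SPI types, and the case-insensitive comparison via equalsIgnoreCase is an assumption about the intended lookup rules, not the specified behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a column descriptor.
interface NamedColumn {
    String getName();
}

// Hypothetical stand-in for the row metadata contract.
interface ColumnLookup {
    List<? extends NamedColumn> getColumnMetadatas();

    // Default implementation: a plain loop over the metadata list.
    // Assumption: names are compared case-insensitively.
    default boolean contains(String columnName) {
        for (NamedColumn column : getColumnMetadatas()) {
            if (column.getName().equalsIgnoreCase(columnName)) {
                return true;
            }
        }
        return false;
    }

    // Convenience factory for this sketch only.
    static ColumnLookup of(String... names) {
        List<NamedColumn> columns = new ArrayList<>();
        for (String name : names) {
            columns.add(() -> name);
        }
        List<NamedColumn> view = Collections.unmodifiableList(columns);
        return () -> view;
    }
}
```

With ColumnLookup.of("ID", "name"), contains("id") returns true and contains("missing") returns false, without relying on an exception to signal absence. Drivers could override the default with a faster lookup.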

I also suggest following your recommendation to let getColumnMetadatas return an unmodifiable List.

Lukas Eder · Answer 6 · Thu Apr 29 2021 14:48:04 GMT+0800 (China Standard Time)

There's another aspect here. For drivers, it's pretty easy to reason about metadata retention while a consumer has to call getColumnMetadatas() on each mapped row since the consumer cannot reliably cache the metadata outcome. The result is increased memory pressure.

If that were a valid concern, then offering a shortcut to column names is insufficient. We'd also need a shortcut to Java types, types, precisions, scales, nullabilities, etc. But it again feels like a problem solved at the wrong place.
Doesn't Reactor offer primitives for caching (or whatever they call it) the first encounter of an item and zipping it with the rest?

In any case, performance concerns can be addressed most easily to some extent by making getColumnMetadatas() return a List rather than an Iterable.

But as you say yourself, the main driver for a getColumnNames():Collection<String> method seems to have been to offer a contains(String):boolean or hasColumnMetadata(String):boolean method, which is not unreasonable, specifically because of that exception that is thrown by getColumnMetadata(String), which I overlooked.

So, if that's going to be the decision here, then I won't report more issues like r2dbc/r2dbc-mssql#200, or offer PRs for those...