ClickHouse / clickhouse-java

Java client and JDBC driver for ClickHouse

Home Page: https://clickhouse.com

[HELP] 0.3.2 pre-release for public testing

zhicwu opened this issue

Background

v0.3.2 was a minor release scheduled to ship months ago, but it has turned into a complete rewrite, mainly for two reasons:

  1. decoupling (see #570 for details)

  2. switching the data format to RowBinary to fix issues and improve performance
    Benchmark results...

    0.3.2-test1...
    • clickhouse-grpc-jdbc and clickhouse-http-jdbc are the new JDBC driver (0.3.2) using the RowBinary data format
    • clickhouse-jdbc is the old JDBC driver (0.3.1-patch) based on TabSeparated
    • clickhouse-native-jdbc is ClickHouse-Native-JDBC 2.6.0
      Benchmark settings: thread=1, sampleSize=100000, fetchSize=10000, mode=throughput (ops/s).
      [benchmark chart]
    0.3.2-test3...

    Unlike the previous round of testing, the ClickHouse container was re-created a few minutes before benchmarking each driver.

    • Single thread
      • Comparison
        [throughput comparison chart]
        Note: HttpClient is async (it uses more than one thread at runtime); gRPC uses gzip (why?), which is slower than lz4.
      • VM utilization
        [VM utilization chart]
        Note: on the client side, the new driver consumes less memory and CPU than the others, BUT CPU is higher on the server side (due to the overhead of the HTTP protocol?).
    • 4 threads
      • Comparison
        [throughput comparison chart]

      • VM utilization
        [VM utilization chart]

    0.3.2...

    [benchmark chart]

    Query performance is similar to what was shown in 0.3.2-test3, so this time we focus only on insertion.
    [insertion benchmark chart]
    Note: gRPC does not support LZ4 compression, so we use GZIP in the test.

    • Single thread
      [insertion benchmark chart]
    • 4 threads
      [insertion benchmark chart]

0.3.2-test1, 0.3.2-test2, and 0.3.2-test3 are pre-releases for public testing.

Downloads

Maven dependency:

<dependency>
    <!-- will stop using group id "ru.yandex.clickhouse" starting from 0.4.0  -->
    <groupId>com.clickhouse</groupId>
    <!-- or clickhouse-grpc-client to use gRPC client  -->
    <artifactId>clickhouse-http-client</artifactId>
    <version>0.3.2-test3</version>
</dependency>

To download JDBC drivers:

Package | Size | Legacy | New | HTTP | gRPC | Remark
clickhouse-jdbc-0.3.2-all.jar | 18.6MB | Y | Y | Y | Y | Both old and new JDBC drivers (besides netty, okhttp is included as well)
clickhouse-jdbc-0.3.2-http.jar | 756KB | N | Y | Y | N | New JDBC driver with only HTTP support
clickhouse-jdbc-0.3.2-grpc.jar | 17.3MB | N | Y | N | Y | New JDBC driver with only gRPC support (only netty; okhttp is excluded)
clickhouse-jdbc-0.3.2-shaded.jar | 2.8MB | Y | Y | Y | N | Both old and new JDBC drivers

Note: the first two are recommended. gRPC support is experimental, so you'd better stick with HTTP.

Known Issues

  • the new driver (com.clickhouse.jdbc.ClickHouseDriver) does not work with ClickHouse versions before 21.3
  • java.io.IOException: HTTP/1.1 header parser received no bytes when using JDK 11+ and http_connection_provider is set to HTTP_CLIENT
  • RESOURCE_EXHAUSTED: Compressed gRPC message exceeds maximum size - increase max_inbound_message_size to resolve
  • select 1 format JSON works over HTTP but not gRPC, because the gRPC client is not aware of the response format
  • insert into table values(?, ?) is slow in batch mode - try insert into table select c2, c3 from input('c1 String, c2 UInt8, c3 Nullable(UInt32)') instead (see the sketch after this list)
  • the use_time_zone and use_server_time_zone_for_dates properties do not work
  • no tables/indexes show up under the jdbc(*) database
  • roaringbitmap is not included in the shaded jar
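
A minimal sketch of the batch-insert workaround from the list above, assuming a hypothetical table mytable(c1 String, c2 UInt8, c3 Nullable(UInt32)); the URL and row values are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchInsertViaInput {
        public static void main(String[] args) throws Exception {
            // write through the input() table function instead of "values(?, ?, ?)"
            String sql = "insert into mytable select c1, c2, c3 "
                    + "from input('c1 String, c2 UInt8, c3 Nullable(UInt32)')";
            try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
                    PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "a");  // c1
                ps.setInt(2, 1);       // c2
                ps.setObject(3, null); // c3 is nullable
                ps.addBatch();         // repeat set*/addBatch per row
                ps.executeBatch();
            }
        }
    }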

Key Changes

  • The Java client and JDBC driver are now in separate modules, with JPMS support
  • Switched the data format from TabSeparated to RowBinary
  • Support for more data types, including Date32, Geo types, and mixed use of nested types
  • The JDBC connection URL now supports abbreviation, protocol, and an optional port
    • jdbc:ch://localhost is the same as jdbc:clickhouse:http://localhost:8123
    • jdbc:ch:grpc://localhost/db is the same as jdbc:clickhouse:grpc://localhost:9100/db
  • The new JDBC driver class is com.clickhouse.jdbc.ClickHouseDriver (ru.yandex.clickhouse.ClickHouseDriver will be removed starting from 0.4.0)
  • JDBC connection properties are simplified (see the sketch after this list)
    • use custom_http_headers and custom_http_params for customization - these won't work for the gRPC client
    • jdbcCompliant (defaults to true) to support fake transactions and standard synchronous UPDATE and DELETE statements
    • typeMappings to customize type mapping (e.g. DateTime=java.lang.String,DateTime32=java.lang.String)
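
For illustration, a hedged sketch of opening a connection with the abbreviated URL and the simplified properties; the property values are made-up examples, not recommendations:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class NewDriverConnect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // passed through to the HTTP interface; ignored by the gRPC client
            props.setProperty("custom_http_params", "max_execution_time=120");
            // disable fake transactions and synchronous UPDATE/DELETE emulation
            props.setProperty("jdbcCompliant", "false");
            // jdbc:ch://localhost is the same as jdbc:clickhouse:http://localhost:8123
            try (Connection conn = DriverManager.getConnection("jdbc:ch://localhost", props)) {
                System.out.println(conn.getMetaData().getDriverVersion());
            }
        }
    }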

Some more details can be found at #736, #747, #769, and #777.

Hello, @zhicwu
I tried the new driver in IntelliJ DataGrip and noticed that the DateTime format is different from that in version 0.3.1.

  • SELECT now()
    0.3.1-patch - 2021-12-01 19:27:12
    0.3.2-test1 - 2021-12-01T19:27:12

  • SELECT now('Europe/Paris')
    0.3.1-patch - 2021-12-01 17:27:15
    0.3.2-test1 - 2021-12-01T17:27:15+01:00

Thank you for testing the new driver and reporting the issue here. I'll try to get it fixed in the next build. Actually, I like the way DBeaver handles LocalDateTime and especially OffsetDateTime:
[screenshot: DBeaver rendering LocalDateTime and OffsetDateTime]

Update:
we cannot change the behavior of DataGrip, but you can update the connection properties by changing typeMappings in 0.3.2-test3 as a workaround:
[screenshot: DataGrip connection properties with typeMappings]
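
For reference, the same workaround expressed as a connection URL instead of a screenshot - a guess based on the typeMappings property described under Key Changes above:

    jdbc:ch://localhost:8123?typeMappings=DateTime=java.lang.String,DateTime32=java.lang.String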

This is one of the crucial parts of the usage:

  • Are clients simply calling getString?
  • Are clients calling different methods depending on the type of the column reported? If so, which methods do they use?
    • getObject(int)
    • getObject(int, Class<?>) -- which class?
    • getTime(int)
    • getDate(int)
    • getTimestamp(int)


Looks like it's a combination of getObject() followed by converting LocalDateTime (timestamp without time zone) / OffsetDateTime (timestamp with time zone) to string. DBeaver, on the other hand, has a display issue - submitted dbeaver/dbeaver#14772 to track its status.
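
A small sketch of that retrieval pattern, assuming the driver supports the standard JDBC getObject(int, Class) conversions; the URL is a placeholder:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.time.LocalDateTime;
    import java.time.OffsetDateTime;

    public class DateTimeRetrieval {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:ch://localhost");
                    Statement stmt = conn.createStatement();
                    ResultSet rs = stmt.executeQuery("select now(), now('Europe/Paris')")) {
                while (rs.next()) {
                    // timestamp without time zone
                    LocalDateTime ldt = rs.getObject(1, LocalDateTime.class);
                    // timestamp with time zone
                    OffsetDateTime odt = rs.getObject(2, OffsetDateTime.class);
                    System.out.println(ldt + " / " + odt);
                }
            }
        }
    }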

It seems that the Java client currently relies on thread pools underneath to implement its async API, rather than being fully async at its core. Is this tentative or intended?

Also, if I make a request to a node and then try to make another request with the same client to a different node, then depending on the timing, the former call may fail since the second request may close the underlying connection too early. In that regard, I'm not sure what the intended usage pattern of the Java client API is.

For instance, if I want to make a few requests to ClickHouse in the context of an incoming HTTP call, then what am I supposed to reuse across the requests to a ClickHouse node? ClickHouseClient? ClickHouseRequest?

UPDATE: I had a look at ClickHouseConnectionImpl, and it seems appropriate to hold a ClickHouseRequest (a reference to a ClickHouseClient is held by it).

Thanks @dynaxis, these are all good points.

It seems that the Java client currently relies on thread pools underneath to implement its async API, rather than being fully async at its core. Is this tentative or intended?

Unfortunately this is intended, because the JDBC driver is built on top of the client, meaning we prefer minimal dependencies and we still need to support JDK 8. I hope we can find somewhere in the middle - a compact library serving the very basic functions for both JDBC and R2DBC drivers.

For instance, if I want to make a few requests to ClickHouse in the context of an incoming HTTP call, then what am I supposed to reuse across the requests to a ClickHouse node? ClickHouseClient? ClickHouseRequest?

Yes, you can reuse ClickHouseRequest. Each time you call its execute()/send() method, it will create a sealed copy for the execution, similar to a copy-on-write data structure for thread safety. On the other hand, ClickHouseClient is responsible for handling protocol-specific details like how to execute a request and get a response. Taking HTTP as an example, depending on whether the concrete HTTP connection (e.g. HttpURLConnection) is reusable, it may create a new connection for each request or simply reuse the same one.
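
A hedged sketch of that usage pattern with the new Java client; the class and method names below are my reading of the 0.3.2 client API, so treat the exact signatures as assumptions:

    import com.clickhouse.client.ClickHouseClient;
    import com.clickhouse.client.ClickHouseNode;
    import com.clickhouse.client.ClickHouseProtocol;
    import com.clickhouse.client.ClickHouseRequest;
    import com.clickhouse.client.ClickHouseResponse;

    public class RequestReuse {
        public static void main(String[] args) throws Exception {
            ClickHouseNode server = ClickHouseNode.of("localhost", ClickHouseProtocol.HTTP, 8123, "default");
            try (ClickHouseClient client = ClickHouseClient.newInstance(ClickHouseProtocol.HTTP)) {
                // the same request object can be reused - each execute() seals a copy
                ClickHouseRequest<?> request = client.connect(server);
                try (ClickHouseResponse resp = request.query("select 1").execute().get()) {
                    resp.records().forEach(r -> System.out.println(r.getValue(0).asString()));
                }
                try (ClickHouseResponse resp = request.query("select 2").execute().get()) {
                    resp.records().forEach(r -> System.out.println(r.getValue(0).asString()));
                }
            } // the client is closed once here, after all requests completed
        }
    }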

  • Building the current master version using mvn install leads to ca. 6 failures. Will investigate further.
  • Throwing exceptions in setAutoCommit(false) breaks the Metabase driver. However, I understand that we would like to inform the client early on that we do not support transactions. I will see what I can do about it.

Did you mean the version on the master branch? We should test the one on the develop branch, and fake transactions should work.

The DatabaseMetaData behavior is a breaking change. Perhaps the new behavior is "more correct", but it is a change users need to be aware of. Compare the results of this program:

    // import ru.yandex.clickhouse.* for 0.3.1, or com.clickhouse.jdbc.* for 0.3.2
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;

    public static void main(String[] args) throws Exception {
        String url = "jdbc:clickhouse://localhost:8123/default";
        ClickHouseDataSource dataSource = new ClickHouseDataSource(url);
        try (ClickHouseConnection conn = dataSource.getConnection()) {
            DatabaseMetaData meta = conn.getMetaData();
            ResultSet rs = meta.getTables("foo", null, "%",
                    new String[] {"TABLE", "VIEW", "FOREIGN TABLE", "MATERIALIZED VIEW"});
            ResultSetMetaData rsMeta = rs.getMetaData();
            while (rs.next()) {
                for (int i = 0; i < rsMeta.getColumnCount(); i++) {
                    System.out.println(rsMeta.getColumnName(i + 1) + ": " + rs.getString(i + 1));
                }
            }
        }
    }

I'm not sure this is the right place to comment on the current pre-release. Anyway, how about making ClickHouseRequest.getClient() public? I'm injecting a ClickHouseRequest into every invocation of HTTP request handlers, and it's cumbersome to pass a ClickHouseClient around together with it.

The DatabaseMetaData behavior is a breaking change. Perhaps the new behavior is "more correct", but it is a change users need to be aware of. Compare the results of this program:

Yes, I added some missing parts and also made changes to make the driver more JDBC-compliant.

I'll try to emphasize the change in the release notes - all in all, 0.3.2 is not a drop-in replacement for the previous version, despite its version number (just trying to stick with the roadmap).

Apart from that, as you may have noticed, more table types were added. I was kind of hoping DBeaver could support that as SQuirreL SQL does (see dbeaver/dbeaver#14773).

I'm not sure this is the right place to comment on the current pre-release. Anyway, how about making ClickHouseRequest.getClient() public? I'm injecting a ClickHouseRequest into every invocation of HTTP request handlers, and it's cumbersome to pass a ClickHouseClient around together with it.

Yes, it's the right place, and thanks for the feedback.

how about making ClickHouseRequest.getClient() public?

I thought about this a few times as well. In the very beginning, there was a Context object as glue connecting almost everything (client, config, request, and response), but I removed it later as it became too heavy and complex. After that, I made request.getClient public for convenience, and then reverted the change, as the combination of copy and the execute()/send() methods was sufficient for my problems.

What's your use case for making the method public? If it's about issuing more queries, you can try something like baseRequest.copy().query("sql").execute()?

@zhicwu I hope to create and close a client around every call to a method where the actual queries are performed. That is, the reference is required to close the client as a cleanup step.

My intention is to pick a new ClickHouse node for every incoming HTTP request for the purpose of load balancing, where possibly multiple queries are performed. For that, I need to create a new client for each incoming HTTP request.

If you are uncomfortable with making getClient public, it's enough to have a public method on ClickHouseRequest for closing the associated client.

I guess you may consider a request holding a client to be an implementation detail, but to me at least, it seems like a detail that is unlikely to change.

Building, Testing

Thanks @zhicwu for helping me to get the compile / tests running again. For reference:

  1. Remove any generated sources in the clickhouse-jdbc directory.
  2. Create a ~/.m2/toolchains.xml like this:
<toolchains>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>11</version>
    </provides>
    <configuration>
      <jdkHome>/usr/lib/jvm/java-11-openjdk</jdkHome>
    </configuration>
  </toolchain>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>8</version>
    </provides>
    <configuration>
      <jdkHome>/usr/lib/jvm/java-8-jdk</jdkHome>
    </configuration>
  </toolchain>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>17</version>
    </provides>
    <configuration>
      <jdkHome>/usr/lib/jvm/java-17-openjdk</jdkHome>
    </configuration>
  </toolchain>
</toolchains>
  3. Run mvn -Drelease clean verify

DatabaseMetaData

I created issue #778 to review the use. I think there aren't many clients relying on this information, but we should not make stuff up.

My intention is to pick a new ClickHouse node for every incoming HTTP request for the purpose of load balancing, where possibly multiple queries are performed. For that, I need to create a new client for each incoming HTTP request.

Thanks for the explanation. Are those multiple queries issued in the same thread or in separate ones? This is an example for the former case; the latter can be tackled with chained futures in a similar way (see the sketch below). IMO, it's safer and easier to close the client once from the outside, instead of passing the responsibility to the request(s). It sounds like we would soon need to maintain a reference counter to ensure the client can only be closed after all relevant requests have completed. Maybe it's for a similar reason that in JDBC Statement does not have a method to return Connection, even when we know they have a connection :)
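
For the separate-thread case, a sketch only: it chains the second query onto the CompletableFuture returned by execute(), reusing the client API names from the snippet earlier in this thread, and assumes ClickHouseResponse.close() throws no checked exception:

    import java.util.concurrent.CompletableFuture;

    import com.clickhouse.client.ClickHouseClient;
    import com.clickhouse.client.ClickHouseNode;
    import com.clickhouse.client.ClickHouseProtocol;
    import com.clickhouse.client.ClickHouseRequest;

    public class ChainedQueries {
        public static void main(String[] args) throws Exception {
            ClickHouseNode server = ClickHouseNode.of("localhost", ClickHouseProtocol.HTTP, 8123, "default");
            ClickHouseClient client = ClickHouseClient.newInstance(ClickHouseProtocol.HTTP);
            ClickHouseRequest<?> request = client.connect(server);
            CompletableFuture<Void> done = request.query("select 1").execute()
                    .thenCompose(first -> {
                        first.close(); // done with the first response
                        return request.copy().query("select 2").execute();
                    })
                    .thenAccept(second -> second.close());
            done.join();
            client.close(); // closed once, from the outside, after all requests completed
        }
    }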

Three more items I can think of for the discussion:

  1. load balancing - ClickHouseCluster is probably a bad example, but I hope it helps you understand that instead of connecting to a specific ClickHouseNode, you can connect to a function that returns one. I'll rewrite the class in 0.3.3, but please feel free to share your thoughts.

  2. closing client - as you probably noticed, calling client.close() may not close the underlying connection. This really depends on two things: 1) whether we want to reuse the connection (see here); 2) how we manage the underlying connections (e.g. HttpURLConnection and HttpClient are managed differently).

  3. use clickhouse-http-client - if you think the API in clickhouse-client creates unnecessary overhead, you may consider skipping it and using clickhouse-http-client directly. I'm not joking: if you're seeking more control for better performance, you may want to create the HTTP connection yourself and pipe the ClickHouse response to your caller (see the sketch below).
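
For the last point, a bare-bones sketch that bypasses the client library entirely and talks to ClickHouse's well-known HTTP interface on port 8123 directly (shown with plain HttpURLConnection; clickhouse-http-client wraps something along these lines):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RawHttpQuery {
        public static void main(String[] args) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL("http://localhost:8123/").openConnection();
            conn.setDoOutput(true); // POST the query in the request body
            try (OutputStream out = conn.getOutputStream()) {
                out.write("select 1 format JSON".getBytes(StandardCharsets.UTF_8));
            }
            // pipe the response straight through to the caller (stdout here)
            byte[] buf = new byte[8192];
            try (InputStream in = conn.getInputStream()) {
                for (int n; (n = in.read(buf)) != -1; ) {
                    System.out.write(buf, 0, n);
                }
            }
        }
    }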

@zhicwu I got your point about not wanting to expose the client reference held in a request. Actually, the method performing multiple queries is like a Spring MVC handler (I'm currently using Micronaut), and I wanted to inject a ClickHouseRequest into the method while avoiding situations where the handler doesn't properly clean up the relevant ClickHouseClient.

So I'm turning to another implementation strategy where the ClickHouseClient is held separately from the ClickHouseRequest and properly closed after completion of the handlers, using an AOP mechanism.

I have a few points to make on the items you listed above:

  1. load balancing

    I'm in the course of leveraging ClickHouseCluster in my code. Initially it was a bit cumbersome to create one from configuration strings, and the fact that it is stateful (i.e. used for tracking whether each node is healthy) bothered me a bit. But in conclusion, I now think the design is reasonable as a low-level API.

  2. closing client

    As I noted in my previous comment in this issue, AbstractClient tries very hard to make organized use of the underlying connection with an RW lock, but it seems flawed (?). While a request is about to be sent, the underlying connection might still be closed abruptly by another thread invoking connect to another node. So the client seems designed to be thread-safe, but in the important case of sharing it around and connecting to different nodes, it actually is not.

    I'm not sure we really need to reuse the same client instance when connecting to a different node. To me, it's more natural to have request objects hold their connection. In that case, it might be a bit weird to call it a request; "connection" might be a better name. But I might be missing some important points here.

  3. I like the idea of programming against a common client interface. For instance, I might write another, fully async client, and it could easily be switched in.

So I'm turning to another implementation strategy where the ClickHouseClient is held separately from the ClickHouseRequest and properly closed after completion of the handlers, using an AOP mechanism.

Sorry for the inconvenience at your end. I hope we can refine the API in 0.3.3 and mitigate situations like this.

I'm in the course of leveraging ClickHouseCluster in my code. Initially it was a bit cumbersome to create one from configuration strings, and the fact that it is stateful (i.e. used for tracking whether each node is healthy) bothered me a bit. But in conclusion, I now think the design is reasonable as a low-level API.

In 0.3.2 I removed the public modifier of the class as it was buggy and incomplete. I'll rewrite it in 0.3.3 and will need to test it against multiple nodes (in the same cluster or not).

As I noted in my previous comment in this issue, AbstractClient tries very hard to make organized use of the underlying connection with an RW lock, but it seems flawed (?). While a request is about to be sent, the underlying connection might still be closed abruptly by another thread invoking connect to another node. So the client seems designed to be thread-safe, but in the important case of sharing it around and connecting to different nodes, it actually is not.

Very true. To be honest, I'm still struggling with whether a client should be able to connect to different nodes. In a cluster environment, each time you call client.execute, it may pick a different node for execution, sometimes for load balancing and sometimes for fail-over, regardless of whether you're using the exact same request object.

I'm not sure we really need to reuse the same client instance when connecting to a different node. To me, it's more natural to have request objects hold their connection. In that case, it might be a bit weird to call it a request; "connection" might be a better name. But I might be missing some important points here.

The simpler the better. Let me see what I can do in 0.3.3 - maybe support both scenarios?