ClickHouse / clickhouse-java

Java client and JDBC driver for ClickHouse

Home Page: https://clickhouse.com

[HELP] 0.3.2 pre-release for public testing

zhicwu opened this issue

Background

v0.3.2 was a minor release scheduled to ship months ago, but it has turned into a complete rewrite, mainly for two reasons:

  1. decoupling (see #570 for details)

  2. switching the data format to RowBinary to fix issues and improve performance
    Benchmark results...

    0.3.2-test1...
    • clickhouse-grpc-jdbc and clickhouse-http-jdbc are the new JDBC driver (0.3.2) using the RowBinary data format
    • clickhouse-jdbc is the old JDBC driver (0.3.1-patch) based on TabSeparated
    • clickhouse-native-jdbc is ClickHouse-Native-JDBC 2.6.0
      Benchmark settings: thread=1, sampleSize=100000, fetchSize=10000, mode=throughput (ops/s).
      [benchmark chart]
    0.3.2-test3...

    Unlike the previous round of testing, the ClickHouse container was re-created a few minutes before benchmarking each driver.

    • Single thread
      • Comparison
        [throughput comparison chart]
        Note: HttpClient is async (it uses more than one thread at runtime); gRPC uses gzip (why?), which is slower than lz4.
      • VM utilization
        [VM utilization chart]
        Note: on the client side, the new driver consumes less memory and CPU than the others, BUT CPU is higher on the server side (due to the overhead of the HTTP protocol?).
    • 4 threads
      • Comparison
        [throughput comparison chart]

      • VM utilization
        [VM utilization chart]

    0.3.2...

    [benchmark chart]

    Query performance is similar to what was shown in 0.3.2-test3, so this time we focus only on insertion.
    [insertion benchmark chart]
    Note: gRPC does not support LZ4 compression, so we use GZIP in the test.

    • Single thread
      [insertion benchmark chart]
    • 4 threads
      [insertion benchmark chart]

0.3.2-test1, 0.3.2-test2, and 0.3.2-test3 are pre-releases for public testing.

Downloads

Maven dependency:

<dependency>
    <!-- will stop using group id "ru.yandex.clickhouse" starting from 0.4.0  -->
    <groupId>com.clickhouse</groupId>
    <!-- or clickhouse-grpc-client to use gRPC client  -->
    <artifactId>clickhouse-http-client</artifactId>
    <version>0.3.2-test3</version>
</dependency>

To download JDBC drivers:

Package | Size | Legacy | New | HTTP | gRPC | Remark
clickhouse-jdbc-0.3.2-all.jar | 18.6MB | Y | Y | Y | Y | Both old and new JDBC drivers (besides netty, okhttp is included as well)
clickhouse-jdbc-0.3.2-http.jar | 756KB | N | Y | Y | N | New JDBC driver with only HTTP support
clickhouse-jdbc-0.3.2-grpc.jar | 17.3MB | N | Y | N | Y | New JDBC driver with only gRPC support (only netty; okhttp is excluded)
clickhouse-jdbc-0.3.2-shaded.jar | 2.8MB | Y | Y | Y | N | Both old and new JDBC drivers

Note: the first two are recommended. gRPC support is experimental, so you'd better stick with HTTP.

Known Issues

  • the new driver (com.clickhouse.jdbc.ClickHouseDriver) does not work with ClickHouse versions before 21.3
  • java.io.IOException: HTTP/1.1 header parser received no bytes when using JDK 11+ and http_connection_provider is set to HTTP_CLIENT
  • RESOURCE_EXHAUSTED: Compressed gRPC message exceeds maximum size - increase max_inbound_message_size to resolve
  • select 1 format JSON works over HTTP but not gRPC, because the gRPC client is not aware of the response format
  • insert into table values(?, ?) is slow in batch mode - try insert into table select c2, c3 from input('c1 String, c2 UInt8, c3 Nullable(UInt32)') instead (see the sketch after this list)
  • the use_time_zone and use_server_time_zone_for_dates properties do not work
  • no tables/indexes show up under the jdbc(*) database
  • roaringbitmap is not included in the shaded jar
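
A minimal sketch of the batch-insert workaround from the list above, assuming a hypothetical table mytable(c1 String, c2 UInt8, c3 Nullable(UInt32)); the URL and row values are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchInsertViaInput {
        public static void main(String[] args) throws Exception {
            // write through the input() table function instead of "values(?, ?, ?)"
            String sql = "insert into mytable select c1, c2, c3 "
                    + "from input('c1 String, c2 UInt8, c3 Nullable(UInt32)')";
            try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
                    PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "a");  // c1
                ps.setInt(2, 1);       // c2
                ps.setObject(3, null); // c3 is nullable
                ps.addBatch();         // repeat set*/addBatch per row
                ps.executeBatch();
            }
        }
    }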

Key Changes

  • The Java client and JDBC driver are now in separate modules, with JPMS support
  • Switched the data format from TabSeparated to RowBinary
  • Support for more data types, including Date32, Geo types, and mixed use of nested types
  • The JDBC connection URL now supports abbreviation, protocol, and an optional port
    • jdbc:ch://localhost is the same as jdbc:clickhouse:http://localhost:8123
    • jdbc:ch:grpc://localhost/db is the same as jdbc:clickhouse:grpc://localhost:9100/db
  • The new JDBC driver class is com.clickhouse.jdbc.ClickHouseDriver (ru.yandex.clickhouse.ClickHouseDriver will be removed starting from 0.4.0)
  • JDBC connection properties are simplified (see the sketch after this list)
    • use custom_http_headers and custom_http_params for customization - these won't work for the gRPC client
    • jdbcCompliant (defaults to true) to support fake transactions and standard synchronous UPDATE and DELETE statements
    • typeMappings to customize type mapping (e.g. DateTime=java.lang.String,DateTime32=java.lang.String)
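
For illustration, a hedged sketch of opening a connection with the abbreviated URL and the simplified properties; the property values are made-up examples, not recommendations:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class NewDriverConnect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // passed through to the HTTP interface; ignored by the gRPC client
            props.setProperty("custom_http_params", "max_execution_time=120");
            // disable fake transactions and synchronous UPDATE/DELETE emulation
            props.setProperty("jdbcCompliant", "false");
            // jdbc:ch://localhost is the same as jdbc:clickhouse:http://localhost:8123
            try (Connection conn = DriverManager.getConnection("jdbc:ch://localhost", props)) {
                System.out.println(conn.getMetaData().getDriverVersion());
            }
        }
    }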

Some more details can be found at #736, #747, #769, and #777.

Hello, @zhicwu
I tried the new driver in IntelliJ DataGrip and noticed that the DateTime format is different from that in version 0.3.1.

  • SELECT now()
    0.3.1-patch - 2021-12-01 19:27:12
    0.3.2-test1 - 2021-12-01T19:27:12

  • SELECT now('Europe/Paris')
    0.3.1-patch - 2021-12-01 17:27:15
    0.3.2-test1 - 2021-12-01T17:27:15+01:00

Thank you for testing the new driver and reporting the issue here. I'll try to get it fixed in the next build. Actually, I like the way DBeaver handles LocalDateTime and especially OffsetDateTime:
[screenshot: DBeaver rendering LocalDateTime and OffsetDateTime]

Update:
we cannot change the behavior of DataGrip, but you can update the connection properties by changing typeMappings in 0.3.2-test3 as a workaround:
[screenshot: DataGrip connection properties with typeMappings]
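
For reference, the same workaround expressed as a connection URL instead of a screenshot - a guess based on the typeMappings property described under Key Changes above:

    jdbc:ch://localhost:8123?typeMappings=DateTime=java.lang.String,DateTime32=java.lang.String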

This is one of the crucial parts of the usage:

  • Are clients simply calling getString?
  • Are clients calling different methods depending on the type of the column reported? If so, which methods do they use?
    • getObject(int)
    • getObject(int, Class<?>) -- which class?
    • getTime(int)
    • getDate(int)
    • getTimestamp(int)


Looks like it's a combination of getObject() followed by converting LocalDateTime (timestamp without time zone) / OffsetDateTime (timestamp with time zone) to string. DBeaver, on the other hand, has a display issue - submitted dbeaver/dbeaver#14772 to track its status.
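
A small sketch of that retrieval pattern, assuming the driver supports the standard JDBC getObject(int, Class) conversions; the URL is a placeholder:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.time.LocalDateTime;
    import java.time.OffsetDateTime;

    public class DateTimeRetrieval {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:ch://localhost");
                    Statement stmt = conn.createStatement();
                    ResultSet rs = stmt.executeQuery("select now(), now('Europe/Paris')")) {
                while (rs.next()) {
                    // timestamp without time zone
                    LocalDateTime ldt = rs.getObject(1, LocalDateTime.class);
                    // timestamp with time zone
                    OffsetDateTime odt = rs.getObject(2, OffsetDateTime.class);
                    System.out.println(ldt + " / " + odt);
                }
            }
        }
    }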

It seems that the Java client currently relies on thread pools underneath to implement its async API, rather than being fully async at its core. Is this tentative or intended?

Also, if I make a request to a node and then try to make another request with the same client to a different node, then depending on the timing, the former call may fail since the second request may close the underlying connection too early. In that regard, I'm not sure what the intended usage pattern of the Java client API is.

For instance, if I want to make a few requests to ClickHouse in the context of an incoming HTTP call, then what am I supposed to reuse across the requests to a ClickHouse node? ClickHouseClient? ClickHouseRequest?

UPDATE: I had a look at ClickHouseConnectionImpl, and it seems appropriate to hold a ClickHouseRequest (a reference to a ClickHouseClient is held by it).

Thanks @dynaxis, these are all good points.

It seems that the Java client currently relies on thread pools underneath to implement its async API, rather than being fully async at its core. Is this tentative or intended?

Unfortunately this is intended, because the JDBC driver is built on top of the client, meaning we prefer minimal dependencies and we still need to support JDK 8. I hope we can find somewhere in the middle - a compact library serving the very basic functions for both JDBC and R2DBC drivers.

For instance, if I want to make a few requests to ClickHouse in the context of an incoming HTTP call, then what am I supposed to reuse across the requests to a ClickHouse node? ClickHouseClient? ClickHouseRequest?

Yes, you can reuse ClickHouseRequest. Each time you call its execute()/send() method, it will create a sealed copy for the execution, similar to a copy-on-write data structure for thread safety. On the other hand, ClickHouseClient is responsible for handling protocol-specific details like how to execute a request and get a response. Taking HTTP as an example, depending on whether the concrete HTTP connection (e.g. HttpURLConnection) is reusable, it may create a new connection for each request or simply reuse the same one.
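
A hedged sketch of that usage pattern with the new Java client; the class and method names below are my reading of the 0.3.2 client API, so treat the exact signatures as assumptions:

    import com.clickhouse.client.ClickHouseClient;
    import com.clickhouse.client.ClickHouseNode;
    import com.clickhouse.client.ClickHouseProtocol;
    import com.clickhouse.client.ClickHouseRequest;
    import com.clickhouse.client.ClickHouseResponse;

    public class RequestReuse {
        public static void main(String[] args) throws Exception {
            ClickHouseNode server = ClickHouseNode.of("localhost", ClickHouseProtocol.HTTP, 8123, "default");
            try (ClickHouseClient client = ClickHouseClient.newInstance(ClickHouseProtocol.HTTP)) {
                // the same request object can be reused - each execute() seals a copy
                ClickHouseRequest<?> request = client.connect(server);
                try (ClickHouseResponse resp = request.query("select 1").execute().get()) {
                    resp.records().forEach(r -> System.out.println(r.getValue(0).asString()));
                }
                try (ClickHouseResponse resp = request.query("select 2").execute().get()) {
                    resp.records().forEach(r -> System.out.println(r.getValue(0).asString()));
                }
            } // the client is closed once here, after all requests completed
        }
    }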

  • Building the current master version using mvn install leads to ca. 6 failures. Will investigate further.
  • Throwing exceptions in setAutoCommit(false) breaks the Metabase driver. However, I understand that we would like to inform the client early on that we do not support transactions. I will see what I can do about it.

Did you mean the version on the master branch? We should test the one on the develop branch, and fake transactions should work.

The DatabaseMetaData behavior is a breaking change. Perhaps the new behavior is "more correct", but it is a change users need to be aware of. Compare the results of this program:

    // import ru.yandex.clickhouse.* for 0.3.1, or com.clickhouse.jdbc.* for 0.3.2
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;

    public static void main(String[] args) throws Exception {
        String url = "jdbc:clickhouse://localhost:8123/default";
        ClickHouseDataSource dataSource = new ClickHouseDataSource(url);
        try (ClickHouseConnection conn = dataSource.getConnection()) {
            DatabaseMetaData meta = conn.getMetaData();
            ResultSet rs = meta.getTables("foo", null, "%",
                    new String[] {"TABLE", "VIEW", "FOREIGN TABLE", "MATERIALIZED VIEW"});
            ResultSetMetaData rsMeta = rs.getMetaData();
            while (rs.next()) {
                for (int i = 0; i < rsMeta.getColumnCount(); i++) {
                    System.out.println(rsMeta.getColumnName(i + 1) + ": " + rs.getString(i + 1));
                }
            }
        }
    }

I'm not sure this is the right place to comment on the current pre-release. Anyway, how about making ClickHouseRequest.getClient() public? I'm injecting a ClickHouseRequest into every invocation of HTTP request handlers, and it's cumbersome to pass a ClickHouseClient around together with it.

The DatabaseMetaData behavior is a breaking change. Perhaps the new behavior is "more correct", but it is a change users need to be aware of. Compare the results of this program:

Yes, I added some missing parts and also made changes to make the driver more JDBC-compliant.

I'll try to emphasize the change in the release notes - all in all, 0.3.2 is not a drop-in replacement for the previous version, despite its version number (just trying to stick with the roadmap).

Apart from that, as you may have noticed, more table types were added. I was kind of hoping DBeaver could support that as SQuirreL SQL does (see dbeaver/dbeaver#14773).

I'm not sure this is the right place to comment on the current pre-release. Anyway, how about making ClickHouseRequest.getClient() public? I'm injecting a ClickHouseRequest into every invocation of HTTP request handlers, and it's cumbersome to pass a ClickHouseClient around together with it.

Yes, it's the right place, and thanks for the feedback.

how about making ClickHouseRequest.getClient() public?

I thought about this a few times as well. In the very beginning, there was a Context object as glue connecting almost everything (client, config, request, and response), but I removed it later as it became too heavy and complex. After that, I made request.getClient public for convenience, and then reverted the change, as the combination of copy and the execute()/send() methods was sufficient for my problems.

What's your use case for making the method public? If it's about issuing more queries, you can try something like baseRequest.copy().query("sql").execute()?

@zhicwu I hope to create and close a client around every call to a method where the actual queries are performed. That is, the reference is required to close the client as a cleanup step.

My intention is to pick a new ClickHouse node for every incoming HTTP request for the purpose of load balancing, where possibly multiple queries are performed. For that, I need to create a new client for each incoming HTTP request.

If you are uncomfortable with making getClient public, it's enough to have a public method on ClickHouseRequest for closing the associated client.

I guess you may consider a request holding a client to be an implementation detail, but to me at least, it seems like a detail that is unlikely to change.

Building, Testing

Thanks @zhicwu for helping me to get the compile / tests running again. For reference:

  1. Remove any generated sources in the clickhouse-jdbc directory.
  2. Create a ~/.m2/toolchains.xml like this:
<toolchains>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>11</version>
    </provides>
    <configuration>
      <jdkHome>/usr/lib/jvm/java-11-openjdk</jdkHome>
    </configuration>
  </toolchain>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>8</version>
    </provides>
    <configuration>
      <jdkHome>/usr/lib/jvm/java-8-jdk</jdkHome>
    </configuration>
  </toolchain>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>17</version>
    </provides>
    <configuration>
      <jdkHome>/usr/lib/jvm/java-17-openjdk</jdkHome>
    </configuration>
  </toolchain>
</toolchains>
  3. Run mvn -Drelease clean verify

DatabaseMetaData

I created issue #778 to review the use. I think there aren't many clients relying on this information, but we should not make stuff up.

My intention is to pick a new ClickHouse node for every incoming HTTP request for the purpose of load balancing, where possibly multiple queries are performed. For that, I need to create a new client for each incoming HTTP request.

Thanks for the explanation. Are those multiple queries issued in the same thread or in separate ones? This is an example for the former case; the latter can be tackled with chained futures in a similar way (see the sketch below). IMO, it's safer and easier to close the client once from the outside, instead of passing the responsibility to the request(s). It sounds like we would soon need to maintain a reference counter to ensure the client can only be closed after all relevant requests have completed. Maybe it's for a similar reason that in JDBC Statement does not have a method to return Connection, even when we know they have a connection :)
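
For the separate-thread case, a sketch only: it chains the second query onto the CompletableFuture returned by execute(), reusing the client API names from the snippet earlier in this thread, and assumes ClickHouseResponse.close() throws no checked exception:

    import java.util.concurrent.CompletableFuture;

    import com.clickhouse.client.ClickHouseClient;
    import com.clickhouse.client.ClickHouseNode;
    import com.clickhouse.client.ClickHouseProtocol;
    import com.clickhouse.client.ClickHouseRequest;

    public class ChainedQueries {
        public static void main(String[] args) throws Exception {
            ClickHouseNode server = ClickHouseNode.of("localhost", ClickHouseProtocol.HTTP, 8123, "default");
            ClickHouseClient client = ClickHouseClient.newInstance(ClickHouseProtocol.HTTP);
            ClickHouseRequest<?> request = client.connect(server);
            CompletableFuture<Void> done = request.query("select 1").execute()
                    .thenCompose(first -> {
                        first.close(); // done with the first response
                        return request.copy().query("select 2").execute();
                    })
                    .thenAccept(second -> second.close());
            done.join();
            client.close(); // closed once, from the outside, after all requests completed
        }
    }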

Three more items I can think of for the discussion:

  1. load balancing - ClickHouseCluster is probably a bad example, but I hope it helps you understand that instead of connecting to a specific ClickHouseNode, you can connect to a function that returns one. I'll rewrite the class in 0.3.3, but please feel free to share your thoughts.

  2. closing client - as you probably noticed, calling client.close() may not close the underlying connection. This really depends on two things: 1) whether we want to reuse the connection (see here); 2) how we manage the underlying connections (e.g. HttpURLConnection and HttpClient are managed differently).

  3. use clickhouse-http-client - if you think the API in clickhouse-client creates unnecessary overhead, you may consider skipping it and using clickhouse-http-client directly. I'm not joking: if you're seeking more control for better performance, you may want to create the HTTP connection yourself and pipe the ClickHouse response to your caller (see the sketch below).
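
For the last point, a bare-bones sketch that bypasses the client library entirely and talks to ClickHouse's well-known HTTP interface on port 8123 directly (shown with plain HttpURLConnection; clickhouse-http-client wraps something along these lines):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RawHttpQuery {
        public static void main(String[] args) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL("http://localhost:8123/").openConnection();
            conn.setDoOutput(true); // POST the query in the request body
            try (OutputStream out = conn.getOutputStream()) {
                out.write("select 1 format JSON".getBytes(StandardCharsets.UTF_8));
            }
            // pipe the response straight through to the caller (stdout here)
            byte[] buf = new byte[8192];
            try (InputStream in = conn.getInputStream()) {
                for (int n; (n = in.read(buf)) != -1; ) {
                    System.out.write(buf, 0, n);
                }
            }
        }
    }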

@zhicwu I got your point about not wanting to expose the client reference held in a request. Actually, the method performing multiple queries is like a Spring MVC handler (I'm currently using Micronaut), and I wanted to inject a ClickHouseRequest into the method while avoiding situations where the handler doesn't properly clean up the relevant ClickHouseClient.

So I'm turning to another implementation strategy where the ClickHouseClient is held separately from the ClickHouseRequest and properly closed after completion of the handlers, using an AOP mechanism.

I have a few points to make on the items you listed above:

  1. load balancing

    I'm in the course of leveraging ClickHouseCluster in my code. Initially it was a bit cumbersome to create one from configuration strings, and the fact that it is stateful (i.e. used for tracking whether each node is healthy) bothered me a bit. But in conclusion, I now think the design is reasonable as a low-level API.

  2. closing client

    As I noted in my previous comment in this issue, AbstractClient tries very hard to make organized use of the underlying connection with an RW lock, but it seems flawed (?). While a request is about to be sent, the underlying connection might still be closed abruptly by another thread invoking connect to another node. So the client seems designed to be thread-safe, but in the important case of sharing it around and connecting to different nodes, it actually is not.

    I'm not sure we really need to reuse the same client instance when connecting to a different node. To me, it's more natural to have request objects hold their connection. In that case, it might be a bit weird to call it a request; "connection" might be a better name. But I might be missing some important points here.

  3. I like the idea of programming against a common client interface. For instance, I might write another, fully async client, and it could easily be switched in.

So I'm turning to another implementation strategy where the ClickHouseClient is held separately from the ClickHouseRequest and properly closed after completion of the handlers, using an AOP mechanism.

Sorry for the inconvenience at your end. I hope we can refine the API in 0.3.3 and mitigate situations like this.

I'm in the course of leveraging ClickHouseCluster in my code. Initially it was a bit cumbersome to create one from configuration strings, and the fact that it is stateful (i.e. used for tracking whether each node is healthy) bothered me a bit. But in conclusion, I now think the design is reasonable as a low-level API.

In 0.3.2 I removed the public modifier of the class as it was buggy and incomplete. I'll rewrite it in 0.3.3 and will need to test it against multiple nodes (in the same cluster or not).

As I noted in my previous comment in this issue, AbstractClient tries very hard to make organized use of the underlying connection with an RW lock, but it seems flawed (?). While a request is about to be sent, the underlying connection might still be closed abruptly by another thread invoking connect to another node. So the client seems designed to be thread-safe, but in the important case of sharing it around and connecting to different nodes, it actually is not.

Very true. To be honest, I'm still struggling with whether a client should be able to connect to different nodes. In a cluster environment, each time you call client.execute, it may pick a different node for execution, sometimes for load balancing and sometimes for fail-over, regardless of whether you're using the exact same request object.

I'm not sure we really need to reuse the same client instance when connecting to a different node. To me, it's more natural to have request objects hold their connection. In that case, it might be a bit weird to call it a request; "connection" might be a better name. But I might be missing some important points here.

The simpler the better. Let me see what I can do in 0.3.3 - maybe support both scenarios?