camille-sql

Run SQL over your Maven artifacts.

Slides from the talk "Talking SQL to Strangers".

What?

camille-sql allows you to explore Maven artifacts you have on your local hard drive.

Run the server:

$ bin/camille-server
Artifacts repository path: /Users/<user>/.m2/repository/
Running server on localhost:26727
...

The server understands PostreSQL wire protocol, so you can connect to it using standard psql client:

$ PGPASSWORD=nopass psql "host=localhost port=26727 sslmode=disable"
psql (12.2, server 9.5.0)
Type "help" for help.

camille=>

As you can see, psql is absolutely sure it talks to PostgreSQL version 9.5.0.

Now you have access to 2 tables: artifacts and versions. You can run any read-only SQL query: the server supports projections, filtering, grouping, joins, agg functions, sub-queries etc (pretty much all of SQL99).

Basic queries:

camille=> select * from artifacts limit 6;
    uid     |         group_id         |       artifact_id        |        name         |                   url
------------+--------------------------+--------------------------+---------------------+------------------------------------------
 3227713579 | alandipert               | desiderata               | desiderata          | https://github.com/alandipert/desiderata
 3382955103 | aopalliance              | aopalliance              | AOP alliance        | http://aopalliance.sourceforge.net
 1507835947 | asm                      | asm-parent               | ASM                 | http://asm.objectweb.org/
 226341444  | backport-util-concurrent | backport-util-concurrent | Backport of JSR 166 | http://backport-jsr166.sourceforge.net/
 1712481681 | biz.aQute                | bndlib                   | BND Library         | http://www.aQute.biz/Code/Bnd
 2280883480 | biz.aQute.bnd            | biz.aQute.bndlib         | biz.aQute.bndlib    | https://bnd.bndtools.org/
(6 rows)

camille=> select * from versions where filesize > 10000 limit 5;
    uid     | version | filesize |      last_modified      |                   sha1
------------+---------+----------+-------------------------+------------------------------------------
 3345961009 | 1.3.2   | 337129   | 2019-07-04 23:36:26.464 | ff84d15cfeb0825935a170d7908fbfae00498050
 1053708643 | 1.0.1   | 26514    | 2019-07-04 23:23:20.322 | 49c100caf72d658aca8e58bd74a4ba90fa2b0d70
 2740841946 | 1.6.5   | 1034049  | 2019-07-05 05:37:10.953 | 7d18faf23df1a5c3a43613952e0e8a182664564b
 925895164  | 0.4.4   | 42645    | 2020-02-01 06:45:59.599 | 2522f7f1b4bab169a2540406eb3eb71f7d6e3003
 136773645  | 1.9     | 263965   | 2019-07-04 23:25:30.09  | 9ce04e34240f674bc72680f8b843b1457383161a
(5 rows)

Something more complicated:

camille=>
SELECT group_id, COUNT(*) AS n_files
FROM artifacts
LEFT JOIN versions ON artifacts.uid=versions.uid
GROUP BY group_id
ORDER BY n_files DESC
LIMIT 10;
         group_id         | n_files
--------------------------+---------
 org.apache.flink         | 391
 org.apache.maven         | 245
 org.codehaus.plexus      | 186
 org.apache.hadoop        | 121
 org.apache.maven.doxia   | 108
 org.apache.maven.plugins | 82
 io.netty                 | 67
 org.apache.maven.shared  | 65
 org.apache.lucene        | 64
 org.apache.commons       | 62
(10 rows)

Why?

The project is mainly done out of pure curiosity:

figure out how does low-level PostgreSQL transport protocol (pgwire) look like
check on practice how simple or hard would it be to implement pgwire as a Netty codec
implement simple enough but not trivial example of defining relational algebra system using Apache Calcite
it's just fun and looks cool

Implementation Details

Netty to run async I/O server
Custom "codec" to encode/decode pgwire messages (see pgwire package). The tricky part of the codec is that very first message has a different structure compared to all following messages (from PostgreSQL documentation: because of purely historical reasons). Channel initializer creates pipeline with PgwireStartupMessageDecoder that will eventually remove itself after the first message is succesfully processed.
Server handler cycles over incomming SQL queries, decoding queries from bytes protocol and serializing result set into a proper sequence of messages (row descriptor -> row data -> command complete).
"Database" that actually executes query is implemented in m2sql package. It exposes JDBC connection, so the server uses standard java.sql interface when talking to it (see documentation for Apache Avatica library).
Apache Calcite is used for query parsing, query planning, query optimizaiton. High-level API is used to declare catalog structure, tables, schemas, relations and scanning logic.

More details in the deck.

Optimizations

"Precision is the difference between a butcher and a surgeon" (tm)

Implemented optimization:

prune unused fields: if "filesize" is not queried, we don't need to waste cpu/ram to calculate it
push-down filtering predicates: jump into a subfolder if prefix is known

Work in progress:

optimized join of artifacts and versions: we can walk files tree once to retrieve all the information we need

Contributions

This is the project made for fun. Feel free to implement whatever feature you want and just drop a PR here ;) See TODO list below if you need ideas on what could be helpful (or what is critically missing).

TODO

License

camille-sql is licensed under the MIT license, available at MIT and also in the LICENSE file.

kachayev / camille-sql