Columnar Formats for Schemaless LSM-based Document Stores

This code is made available in accordance with VLDB's availability requirements for publications.

The published code is based on Apache AsterixDB and contains modifications to support the columnar format AMAX as well as the code generation framework.

This is only a prototype; a cleaner, reviewed version will be made available soon at https://github.com/apache/asterixdb

Installation Requirements

For a smoother installation experience, you first need to install the following (a minimal sketch of these steps follows the list):

  • Maven
    Apache Maven is needed to compile AsterixDB. You need Maven 3.3.9 or newer.
  • Ansible
    The link above shows how to use Ansible to deploy an AsterixDB cluster. Ansible is the easiest way to use the code in this repository, as one of its YAML files has been modified to accommodate GraalVM.
  • GraalVM
    You need GraalVM for the code generation part. If you already have Java installed, it is most likely HotSpot rather than GraalVM. GraalVM is required to compile the code generated from SQL++ queries; with HotSpot, the generated code runs in interpreter mode only.
    * Please download GraalVM v21.3, as this is the version we used and know works :-)
    ** You only need to download and unzip GraalVM. You do NOT have to make it your default JVM.
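
As a rough sketch of these steps on Linux x64 (the GraalVM release URL and archive name below are assumptions based on the graalvm-ce-builds release naming; verify them at https://github.com/graalvm/graalvm-ce-builds/releases before running):

$ mvn -v             # should report Maven 3.3.9 or newer
$ ansible --version  # confirm Ansible is installed
$ mkdir -p ~/vms && cd ~/vms
$ wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-21.3.2/graalvm-ce-java11-linux-amd64-21.3.2.tar.gz
$ tar -xzf graalvm-ce-java11-linux-amd64-21.3.2.tar.gz
$ ls ~/vms/graalvm-ce-java11-21.3.2   # this path becomes <PATH_TO_GRAALVM> below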

Compiling and Configuring AsterixDB

First, clone this repo:

$ git clone https://github.com/wailyk/column.git

Then, compile AsterixDB:

$ cd column/asterixdb
$ mvn clean package -DskipTests

Now, copy the compiled AsterixDB package to your HOME folder:

$ cp asterix-server/target/asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip ~/

Unzip asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip:

$ cd ~
$ unzip asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip

Good! Now you need to modify one file. Using your favorite editor, open:

apache-asterixdb-0.9.8-SNAPSHOT/opt/ansible/yaml/instance_start.yml

You need to change <PATH_TO_GRAALVM> to the location of GraalVM, for example:

  environment:
    JAVA_HOME: /home/wail/vms/graalvm-ce-java11-21.3.2

*Make sure you do that for the two commented YAML blocks (i.e., the ones under [ncs] and [cc]).

Save and close instance_start.yml
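
If you prefer to make this change from the command line, a one-liner like the following should work (a sketch assuming GNU sed and the example GraalVM path above; it only replaces the path, so you still need to uncomment the two YAML blocks as described):

$ sed -i 's|<PATH_TO_GRAALVM>|/home/wail/vms/graalvm-ce-java11-21.3.2|g' \
    ~/apache-asterixdb-0.9.8-SNAPSHOT/opt/ansible/yaml/instance_start.yml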

Running AsterixDB

Run the following commands:

$ cd ~/apache-asterixdb-0.9.8-SNAPSHOT/opt/ansible/bin
$ ./deploy.sh

Make sure you can see the folder asterixdb_column under your HOME folder. This confirms that AsterixDB was deployed successfully on your machine.

Start AsterixDB:

$ ./start.sh

Using your favorite browser, open the following link: http://localhost:19006. You should see AsterixDB's web console.
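
If you prefer the command line, you can also check that the cluster is up through AsterixDB's HTTP query API (served on port 19002 in the default configuration); the statement below is just an arbitrary liveness probe:

$ curl -s http://localhost:19002/query/service --data-urlencode 'statement=SELECT 1;'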

Start Using AsterixDB with the Columnar Format and Code Generation

DDL

Perfect! You can now start using AsterixDB. Let's load some data and try a few queries. First, run the following DDLs:

DROP DATAVERSE ColumnTest IF EXISTS;
CREATE DATAVERSE ColumnTest;
USE ColumnTest;

CREATE TYPE OpenType AS {
  uid: uuid
};

-- Create a Dataset using AsterixDB's default row format
CREATE DATASET RowDataset (OpenType)
PRIMARY KEY uid AUTOGENERATED;

-- Create a Dataset using our proposed columnar format
CREATE DATASET ColumnDataset (OpenType)
PRIMARY KEY uid AUTOGENERATED WITH {
  -- Tells AsterixDB to store the records as columns
  "dataset-format":{"format":"column"}
};

We also provide a sample dataset with the code. It is a smaller version of the sensors dataset from the paper. Let's load the data into the two declared datasets, RowDataset and ColumnDataset:

USE ColumnTest;
LOAD DATASET RowDataset USING localfs (
  ("path" = "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"),
  ("format" = "json")
);

LOAD DATASET ColumnDataset USING localfs (
  ("path" = "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"),
  ("format" = "json")
);

**Make sure to change <HOME> to the location of your HOME folder. For example, mine is /home/wail, so the full path is:

localhost:///home/wail/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json

QUERY

After loading the datasets, let us run the following queries:

Q1:

Row

USE ColumnTest;
SELECT VALUE COUNT(*)
FROM RowDataset;

Column

USE ColumnTest;
SELECT VALUE COUNT(*)
FROM ColumnDataset;

Column with Code Generation

USE ColumnTest;
SET `compiler.codegen` "true";
SELECT VALUE COUNT(*)
FROM ColumnDataset;

Q2:

Row

USE ColumnTest;
SELECT VALUE MAX(r.temp)
FROM RowDataset s, s.readings r;

Column

USE ColumnTest;
SELECT VALUE MAX(r.temp)
FROM ColumnDataset s, s.readings r;

Column with Code Generation

USE ColumnTest;
-- Enable code generation
SET `compiler.codegen` "true";
SELECT VALUE MAX(r.temp)
FROM ColumnDataset s, s.readings r;

Query Plans

From the web console, you can inspect query plans. To get a detailed plan, change PLAN FORMAT to STRING; the dropdown menu for changing the plan format is located above the query box (where you write queries).

After executing a query, click PLAN, located in the result box, to show the query plan.
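
You can also request plans through the HTTP query API. The sketch below assumes the query service accepts the plan-format and optimized-logical-plan parameters documented for recent AsterixDB releases; support may vary in this prototype:

$ curl -s http://localhost:19002/query/service \
    --data-urlencode 'statement=USE ColumnTest; SET `compiler.codegen` "true"; SELECT VALUE MAX(r.temp) FROM ColumnDataset s, s.readings r;' \
    --data-urlencode 'plan-format=STRING' \
    --data-urlencode 'optimized-logical-plan=true'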

For example, Q2's plan with the columnar format + code generation looks like:

distribute result [$$45]
-- DISTRIBUTE_RESULT  |UNPARTITIONED|
  exchange
  -- ONE_TO_ONE_EXCHANGE  |UNPARTITIONED|
    aggregate [$$45] <- [agg-global-sql-max($$48)]
    -- AGGREGATE  |UNPARTITIONED|
      exchange
      -- RANDOM_MERGE_EXCHANGE  |PARTITIONED|
         data-scan [$$48]<-[$$44, $$s, $$46, $$r, $$41, $$48] <- ColumnTest.ColumnDataset code >>
          01| //reader0: {readings:[{temp:any}]}
          02| function ColumnDataset0Func (cursor, resultWriter, reader0) {
          03|    var0 = NULL;
          04|    while (cursor.next()) {
          05|       reader0.next();
          06|       while (!reader0.isEndOfArray()) {
          07|          var1 = reader0.getValue();
          08|          var0 = var0 /\ var1;
          09|          reader0.next();
          10|       }
          11|    }
          12|    append(resultWriter, var0);
          13|    flush(resultWriter);
          14| }

          -- DATASOURCE_SCAN  |PARTITIONED|
          exchange
          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
            empty-tuple-source
            -- EMPTY_TUPLE_SOURCE  |PARTITIONED|

Reproducing the Experimental Results

Unfortunately, due to redistribution restrictions, we cannot share most of the datasets used in our experiments. For example, Twitter restricts republishing data collected from its API. Additionally, due to their sizes, we cannot host the datasets for download. However, for the synthetic sensors dataset, we provide a sample that can be used to reproduce the sensors queries. We also added a flag that scales the data loaded into AsterixDB, which can mimic the results reported in the paper. To use it, add the following configuration to the LOAD statement:

-- Scale factor = 4000
("expression" = "repeat,4000")

Thus, to load the data, you can do:

USE ColumnTest;
LOAD DATASET RowDataset USING localfs (
  ("path" = "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"),
  ("format" = "json"),
  ("expression" = "repeat,4000")
);
LOAD DATASET ColumnDataset USING localfs (
  ("path" = "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"),
  ("format" = "json"),
  ("expression" = "repeat,4000")
);

A scale factor of 4000 produces a dataset of roughly the same size as the one used in our experiments.
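
As a quick sanity check (a verification step we added here, not part of the paper's workflow), you can confirm that the scaling took effect; the count should equal the number of records in the sensors.json sample times 4000:

USE ColumnTest;
-- Expect: (record count of the sensors.json sample) * 4000
SELECT VALUE COUNT(*) FROM ColumnDataset;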

Evaluating Ingestion Performance

To test ingestion performance, you can use an AsterixDB file feed, as in the following example for the ColumnDataset:

DROP DATAVERSE ColumnTest IF EXISTS;
CREATE DATAVERSE ColumnTest;
USE ColumnTest;

CREATE TYPE OpenType AS {
  uid: uuid
};

-- Create type for feed
CREATE TYPE FeedType AS { };

-- Create a Dataset using our proposed columnar format
CREATE DATASET ColumnDataset (OpenType)
PRIMARY KEY uid AUTOGENERATED WITH {
  -- Tells AsterixDB to store the records as columns
  "dataset-format":{"format":"column"}
};

-- Create a Feed
DROP FEED Feed0 IF EXISTS;
CREATE FEED Feed0 WITH {
  "adapter-name": "localfs",
  "type-name": "FeedType",
  "path": "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json",
  "format": "json",
  -- Scale data
  "expression": "repeat,2000"
};

-- Create a second feed
DROP FEED Feed1 IF EXISTS;
CREATE FEED Feed1 WITH {
  "adapter-name": "localfs",
  "type-name": "FeedType",
  "path": "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json",
  "format": "json",
  -- Scale data
  "expression": "repeat,2000"
};

-- Connect Feed0 to ColumnDataset
CONNECT FEED Feed0 TO DATASET ColumnDataset;
-- Connect Feed1 to ColumnDataset
CONNECT FEED Feed1 TO DATASET ColumnDataset;
-- Start both feeds (start ingesting the data)
START FEED Feed0;
START FEED Feed1;

*As in LOAD, make sure to change <HOME> to the location of your <HOME> folder.

**Note that we use two feeds to ingest the data, similar to the setup used in the paper. You can connect more or fewer feeds as you see fit.

***The previous DDL is for the ColumnDataset. You can do the same for the RowDataset by simply removing "dataset-format":{"format":"column"} from the CREATE DATASET statement (see the sketch below).
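
For instance, the row-format variant would look like the following sketch (derived from the DDL above; run it in a fresh dataverse, or stop the column feeds first, so each feed targets one dataset at a time):

USE ColumnTest;
-- Same open type, but without the "dataset-format" option:
-- AsterixDB stores the records in its default row format
CREATE DATASET RowDataset (OpenType)
PRIMARY KEY uid AUTOGENERATED;

CONNECT FEED Feed0 TO DATASET RowDataset;
CONNECT FEED Feed1 TO DATASET RowDataset;
START FEED Feed0;
START FEED Feed1;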

To measure the time:

Feeds in AsterixDB run in the background, so we need to monitor the log to know when ingestion has finished. To do so, run the following command:

$ tail -f ~/asterixdb_column/logs/nc-1.log | grep "Total ingestion time"

After ingestion

After ingestion completes, you need to ensure everything is flushed to disk by running the following commands:

For the RowDataset:

$ curl http://localhost:19002/connector\?datasetName\=RowDataset\&dataverseName\=ColumnTest

For the ColumnDataset:

$ curl http://localhost:19002/connector\?datasetName\=ColumnDataset\&dataverseName\=ColumnTest
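
To compare the on-disk footprint of the two formats after flushing, a rough check is to look at the size of the instance directory (the per-iodevice storage layout below is an assumption; adjust the path to whatever you find under asterixdb_column):

$ du -sh ~/asterixdb_column                # whole instance, including logs
$ du -sh ~/asterixdb_column/*/storage      # per-iodevice storage, if present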

Query Sensors

After ingesting/loading the data, you can run the queries for the sensors data:

-- Q1
USE ColumnTest;
-- SET `compiler.codegen` "true";
SELECT VALUE COUNT(*)
FROM <DATASET>;

-- Q2
USE ColumnTest;
-- SET `compiler.codegen` "true";
SELECT VALUE MAX(r.temp)
FROM <DATASET> s, s.readings r;

-- Q3
USE ColumnTest;
-- SET `compiler.codegen` "true";
SELECT sid , max_temp
FROM <DATASET> s , s.readings as r
GROUP BY s.sensor_id as sid
WITH max_temp as MAX(r.temp)
ORDER BY max_temp DESC
LIMIT 10;

-- Q4
USE ColumnTest;
-- SET `compiler.codegen` "true";
SELECT sid , max_temp
FROM <DATASET> s , s.readings as r
WHERE s.report_time > 1556496000000
AND   s.report_time < 1556496000000 + 24 * 60 * 60 * 1000
GROUP BY s.sensor_id as sid
WITH max_temp as MAX(r.temp)
ORDER BY max_temp DESC
LIMIT 10;

*You need to replace <DATASET> with either RowDataset or ColumnDataset.

** To enable code generation, uncomment the following flag by removing the leading --:

-- SET `compiler.codegen` "true";

Stopping/Removing AsterixDB

To stop the cluster, run:

$ ./stop.sh

To delete the cluster, run:

$ ./erase.sh

AsterixDB's Documentation

For more information about AsterixDB, refer to the AsterixDB Documentation at https://asterixdb.apache.org.

Asterix Internal Language (AIL)

The implementation of the language (AIL) used in our code generation is a fork of Truffle's SimpleLanguage.

Issues?

Please email me (my email is in the paper :-)), or open an issue here on GitHub. Again, please remember that this is unreviewed code; it is only a research prototype and NOT meant for production use.
