trinodb / tpch

Port of TPC-H dbgen to Java

Column names do not obey the TPC-H specification

kokosing opened this issue · comments

The TPC-H specification prefixes every column name with an abbreviation of its table name. For example, the nation columns are named:

NATION Table Layout

    Column Name    Datatype Requirements      Comment
    N_NATIONKEY    identifier                 25 nations are populated
    N_NAME         fixed text, size 25
    N_REGIONKEY    identifier                 Foreign Key to R_REGIONKEY
    N_COMMENT      variable text, size 152

Primary Key: N_NATIONKEY

airlift tpch defines nation columns as:

    NATION_KEY("nationkey", TpchColumnTypes.IDENTIFIER) {
        public long getIdentifier(Nation nation) {
            return nation.getNationKey();
        }
    },
    NAME("name", TpchColumnTypes.varchar(25L)) {
        public String getString(Nation nation) {
            return nation.getName();
        }
    },
    REGION_KEY("regionkey", TpchColumnTypes.IDENTIFIER) {
        public long getIdentifier(Nation nation) {
            return nation.getRegionKey();
        }
    },
    COMMENT("comment", TpchColumnTypes.varchar(152L)) {
        public String getString(Nation nation) {
            return nation.getComment();
        }
    };

Note the missing n_ prefix in the column names.

As a result, TPC-H queries cannot simply be generated and then executed in Presto; every column name has to be modified first.

Specification file: http://cs.fit.edu/~pbernhar/teaching/databases/tpch.pdf
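To illustrate the renaming described above, here is a rough sketch of the workaround a user needs today: stripping the spec prefixes from a generated query before running it. The class and method names are hypothetical, and it assumes the unprefixed names are exactly the spec names minus the prefix, as with the nation columns shown above. The regex is naive and would also rewrite matching text inside string literals or aliases.

```java
import java.util.regex.Pattern;

public class PrefixStripSketch {
    // The eight TPC-H column prefixes: p_, s_, ps_, c_, o_, l_, n_, r_.
    private static final Pattern SPEC_PREFIX =
            Pattern.compile("\\b(?:ps|[pscolnr])_(\\w+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: drops the spec prefix from every identifier that carries one.
    static String toUnprefixed(String sql) {
        return SPEC_PREFIX.matcher(sql).replaceAll("$1");
    }

    public static void main(String[] args) {
        String specQuery = "SELECT n_name, n_regionkey FROM nation WHERE n_nationkey = 7";
        // prints: SELECT name, regionkey FROM nation WHERE nationkey = 7
        System.out.println(toUnprefixed(specQuery));
    }
}
```

Having the generator expose the spec names directly would make this kind of text rewriting unnecessary.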

I hit this as well when working on TPC-H queries. I'd like to test a property of Presto against all of those queries, and I can't do that without fixing this. I'll be glad to provide a fix. @martint please let me know if there are any reasons not to add the prefixes, or anything else that needs consideration.

It's nice to have names without the prefixes for convenience (when writing queries by hand, etc.), but it also makes sense for them to be as defined by the spec. Maybe we can tag each field with its original name and the user-friendly name. In Presto, we could add an option in the connector to switch between the two modes (e.g. strict vs. non-strict).

@dain, any thoughts?

@martint To do that we'd have to define each column in the tpch generator twice, right?

Personally, I think having the names consistent with the tests and the spec pays off more than having the (admittedly nice) noise-free names. I imagine that after writing an ad-hoc query to test something, devs might be discouraged from reusing it as a test because of the prefix amendments it would need.

Not necessarily. We could do it in one of two ways:

  1. Extend the enum class to take the original name and the user-friendly name:

    NATION_KEY("n_nationkey", "nationkey", TpchColumnTypes.IDENTIFIER)

  2. If all names are consistent, just have the original prefixed name here and strip the prefix in the connector if non-strict mode is selected.
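A minimal sketch of the two options above, side by side. Everything here is illustrative: `NationColumn` is a stripped-down stand-in for the real column enum (types and value getters omitted), and `getColumnName(boolean)` and `connectorName` are hypothetical names, not existing API.

```java
public class ColumnNamingSketch {
    // Option 1: each constant carries both the spec name and the user-friendly name.
    enum NationColumn {
        NATION_KEY("n_nationkey", "nationkey"),
        NAME("n_name", "name"),
        REGION_KEY("n_regionkey", "regionkey"),
        COMMENT("n_comment", "comment");

        private final String specName;
        private final String simplifiedName;

        NationColumn(String specName, String simplifiedName) {
            this.specName = specName;
            this.simplifiedName = simplifiedName;
        }

        // Hypothetical accessor: strict mode returns the name from the spec.
        String getColumnName(boolean strict) {
            return strict ? specName : simplifiedName;
        }
    }

    // Option 2: the generator only knows the spec name; the connector strips
    // the table prefix when non-strict mode is selected.
    static String connectorName(String specName, String tablePrefix, boolean strict) {
        if (strict || !specName.startsWith(tablePrefix)) {
            return specName;
        }
        return specName.substring(tablePrefix.length());
    }

    public static void main(String[] args) {
        for (NationColumn column : NationColumn.values()) {
            System.out.println(column.getColumnName(true) + " / " + column.getColumnName(false));
        }
        System.out.println(connectorName("n_comment", "n_", false));
    }
}
```

Option 1 keeps both names explicit at the cost of listing every column twice; option 2 avoids the duplication but only works if every column in a table shares the same prefix.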

You can add a constant prefix to each TpchColumn implementation and then have one getter for the unprefixed name and another for the prefixed name.
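Roughly, the constant-prefix idea could look like the sketch below. The `PrefixedColumn` interface and its method names are made up for illustration and are not the actual TpchColumn API; only the naming part is modeled.

```java
public class PrefixedColumnSketch {
    // Hypothetical interface standing in for the naming part of a column type.
    interface PrefixedColumn {
        String getPrefix();               // e.g. "n_" for every nation column
        String getSimplifiedColumnName(); // e.g. "nationkey"

        // The spec name is derived, so each column is still declared only once.
        default String getColumnName() {
            return getPrefix() + getSimplifiedColumnName();
        }
    }

    enum NationColumn implements PrefixedColumn {
        NATION_KEY("nationkey"),
        NAME("name"),
        REGION_KEY("regionkey"),
        COMMENT("comment");

        // One constant per table instead of one prefixed string per column.
        private static final String PREFIX = "n_";

        private final String simplifiedName;

        NationColumn(String simplifiedName) {
            this.simplifiedName = simplifiedName;
        }

        @Override
        public String getPrefix() {
            return PREFIX;
        }

        @Override
        public String getSimplifiedColumnName() {
            return simplifiedName;
        }
    }

    public static void main(String[] args) {
        for (NationColumn column : NationColumn.values()) {
            System.out.println(column.getColumnName() + " / " + column.getSimplifiedColumnName());
        }
    }
}
```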

As for how to handle this in Presto, you could add hidden columns to alias the names. (BTW, I'm fine with the proposal for a strict mode in the connector.)

I'd avoid using hidden columns. They can mess up a bunch of things like physical properties, describe, etc.