- Generates a `create table` statement for AWS Athena from an avro schema.
- The statement can be used to create an Athena table from (partitioned) avro files.
- See the example in `run_example.py`. It works in `python3.7` with `avro-python3==1.9.1`.
- The partition statement is optional. It should be of the form `PARTITIONED BY (year string, month string, day string)`. If you don't have partitions, set `partition_statement = ''`.
- It is assumed that the outermost schema type is `Record`.
- Aliases in the avro schema are allowed.
- The schema tree is analyzed recursively, so trees of arbitrary depth are allowed.

Taking the following standard example schema as input:
{
"namespace": "com.linkedin.haivvreo",
"name": "test_serializer",
"type": "record",
"fields": [
{ "name":"string1", "type":"string" },
{ "name":"int1", "type":"int" },
{ "name":"tinyint1", "type":"int" },
{ "name":"smallint1", "type":"int" },
{ "name":"bigint1", "type":"long" },
{ "name":"boolean1", "type":"boolean" },
{ "name":"float1", "type":"float" },
{ "name":"double1", "type":"double" },
{ "name":"list1", "type":{"type":"array", "items":"string"} },
{ "name":"map1", "type":{"type":"map", "values":"int"} },
{ "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [
{ "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },
{ "name":"union1", "type":["float", "boolean", "string"] },
{ "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },
{ "name":"nullableint", "type":["int", "null"] },
{ "name":"bytes1", "type":"bytes" },
{ "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }
] }
we obtain the following Athena `create table` output:
CREATE EXTERNAL TABLE IF NOT EXISTS
`my_database`.`my_table`
(`string1` string, `int1` int, `tinyint1` int, `smallint1` int, `bigint1` bigint, `boolean1` boolean, `float1` float, `double1` double, `list1` array<string>, `map1` map<string,int>, `struct1` struct<sint:int,sboolean:boolean,sstring:string>, `union1` float, `enum1` string, `nullableint` int, `bytes1` bytes, `fixed1` string)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{"namespace": "com.linkedin.haivvreo", "name": "test_serializer", "type": "record", "fields": [{"name": "string1", "type": "string"}, {"name": "int1", "type": "int"}, {"name": "tinyint1", "type": "int"}, {"name": "smallint1", "type": "int"}, {"name": "bigint1", "type": "long"}, {"name": "boolean1", "type": "boolean"}, {"name": "float1", "type": "float"}, {"name": "double1", "type": "double"}, {"name": "list1", "type": {"type": "array", "items": "string"}}, {"name": "map1", "type": {"type": "map", "values": "int"}}, {"name": "struct1", "type": {"type": "record", "name": "struct1_name", "fields": [{"name": "sInt", "type": "int"}, {"name": "sBoolean", "type": "boolean"}, {"name": "sString", "type": "string"}]}}, {"name": "union1", "type": ["float", "boolean", "string"]}, {"name": "enum1", "type": {"type": "enum", "name": "enum1_values", "symbols": ["BLUE", "RED", "GREEN"]}}, {"name": "nullableint", "type": ["int", "null"]}, {"name": "bytes1", "type": "bytes"}, {"name": "fixed1", "type": {"type": "fixed", "name": "threebytes", "size": 3}}]}')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my_bucket/my_folder/'
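
To make the layout of the statement above easier to follow, here is a minimal sketch of how its parts could be put together, including the optional partition clause. The function name `build_create_table_statement` and its parameters are hypothetical and are not taken from this repository; the column list is passed in ready-made.

```python
import json


def build_create_table_statement(database, table, columns, schema, location,
                                 partition_statement=""):
    """Assemble an Athena CREATE EXTERNAL TABLE statement for avro data.

    Hypothetical helper for illustration only. `columns` is the already
    translated column list, `schema` is the avro schema as a dict.
    """
    parts = [
        "CREATE EXTERNAL TABLE IF NOT EXISTS",
        f"`{database}`.`{table}`",
        f"({columns})",
    ]
    # An empty partition statement is simply left out of the output.
    if partition_statement:
        parts.append(partition_statement)
    parts += [
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'",
        f"WITH SERDEPROPERTIES ('avro.schema.literal'='{json.dumps(schema)}')",
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'",
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'",
        f"LOCATION '{location}'",
    ]
    return "\n".join(parts)
```

Note that this sketch does not escape single quotes inside the schema literal; a schema containing them (for example in a `doc` string) would need extra handling.
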
- For a `Primitive`, `long` is translated to `bigint`. `Array` becomes `array<>`. `Map` becomes `map<>`. `Record` becomes `struct<>`.
- For a `Union`, the first non-`null` type is chosen. `Enum` and `Fixed` each become `string`.

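
The translation rules above can be sketched as a small recursive function. This is an illustration only, not the code used in this repository: it works on the schema as plain parsed JSON (dicts, lists, strings) rather than on `avro` schema objects, and the names `athena_type` and `athena_columns` are hypothetical. Primitives other than `long` are passed through unchanged, mirroring the example output above.

```python
# Avro primitive -> Athena type. Only `long` changes its name;
# `bytes` is kept as-is to mirror the example output in this README.
PRIMITIVES = {
    "string": "string", "int": "int", "long": "bigint", "boolean": "boolean",
    "float": "float", "double": "double", "bytes": "bytes",
}


def athena_type(avro_type):
    """Translate one avro type (parsed JSON) into an Athena type string."""
    if isinstance(avro_type, list):  # union: take the first non-null branch
        non_null = [t for t in avro_type if t != "null"]
        if not non_null:
            raise ValueError("union contains only null")
        return athena_type(non_null[0])
    if isinstance(avro_type, str):  # primitive given by name
        if avro_type not in PRIMITIVES:
            raise ValueError(f"unsupported type: {avro_type}")
        return PRIMITIVES[avro_type]
    kind = avro_type["type"]
    if kind == "array":
        return f"array<{athena_type(avro_type['items'])}>"
    if kind == "map":  # avro map keys are always strings
        return f"map<string,{athena_type(avro_type['values'])}>"
    if kind == "record":
        # Struct field names are lowercased, as in the example output.
        fields = ",".join(
            f"{f['name'].lower()}:{athena_type(f['type'])}"
            for f in avro_type["fields"]
        )
        return f"struct<{fields}>"
    if kind in ("enum", "fixed"):
        return "string"
    return athena_type(kind)  # e.g. {"type": "string"}


def athena_columns(record_schema):
    """Build the column list for the outermost record."""
    return ", ".join(
        f"`{f['name']}` {athena_type(f['type'])}"
        for f in record_schema["fields"]
    )
```

Because records, arrays, and maps call `athena_type` on their children, nested schemas of any depth are handled by the same few rules.
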
- Unions with multiple primitive types: an example is `["float", "boolean", "string"]`. In general, this kind of data type might cause trouble in Athena. Our schema creator will pick the first element, `float`, from the union. Consider changing such multi-type unions to `string`.
- Infer the schema directly from `.avro` data files: see `util/avro_file_schema_parser.py` for how to do that from a single file.
  - Useful if, for whatever reason, the schema of the files is not available.
  - Useful if Athena does not like the schema for some reason, even though it works with other systems. We have seen cases where inferring the schema from a data file solves the problem. In this case, in `create_athena_table_statement_from_avsc`, replace `parse_literal_schema_from_file` by `infer_schema_from_avro_file`.
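
Inferring the schema from a data file works because every avro object container file stores its writer schema in the file header, under the metadata key `avro.schema`. The following is a minimal standard-library sketch of reading that header. It is not the implementation in `util/avro_file_schema_parser.py` (which is not shown here), and the function name `read_embedded_schema` is hypothetical.

```python
import json

MAGIC = b"Obj\x01"  # first four bytes of every avro object container file


def _read_long(stream):
    """Decode one zigzag-encoded, variable-length avro long."""
    shift, raw = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of avro header")
        raw |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:  # high bit clear: last byte of the number
            return (raw >> 1) ^ -(raw & 1)
        shift += 7


def _read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of avro header")
    return data


def read_embedded_schema(stream):
    """Return the writer schema stored in an avro container file header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an avro object container file")
    meta = {}
    while True:  # the metadata is an avro map, written as a series of blocks
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a byte size follows, unused here
            _read_long(stream)
            count = -count
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            meta[key] = _read_bytes(stream)
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

The function expects a file opened in binary mode, for example `open(path, "rb")`, and only reads the header, so it does not need to decompress or decode any data blocks.
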