quux00 / hive-json-schema

Tool to generate a Hive schema from a JSON example doc

Support multiple JSONs

strelec opened this issue · comments

I find your project interesting; however, I think it lacks a key feature: the ability to deduce a schema from multiple JSON documents, one per line. You would then compute the "greatest common denominator" of all of them.

This removes a layer of human intervention (putting all the keys in one document). For the implementation details, you can check out this project:

https://github.com/strelec/hive-serde-gen
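To make the proposal concrete, here is a minimal Python sketch of one way to merge a schema across several JSON documents, one per line. It is not taken from either project: the function names (`infer`, `merge`, `render`, `schema_from_lines`), the type mapping (JSON integers to `int`, floats to `double`), and the fallback to `string` on conflicting types are all assumptions made for illustration. It reads the "greatest common denominator" as a union of the keys seen in any document.

```python
import json

def infer(value):
    """Describe one parsed JSON value: str for primitives, dict for structs, [elem] for arrays."""
    if isinstance(value, bool):  # check bool before int, since bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        elem = None
        for item in value:
            elem = merge(elem, infer(item))
        return [elem]
    if isinstance(value, dict):
        return {key: infer(val) for key, val in value.items()}
    return None  # JSON null: type not known yet

def merge(a, b):
    """Combine two descriptions, keeping every key seen in either one."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, val in b.items():
            merged[key] = merge(merged.get(key), val)
        return merged
    if isinstance(a, list) and isinstance(b, list):
        return [merge(a[0], b[0])]
    if a == b:
        return a
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"int", "double"}:
        return "double"  # assumed widening rule
    return "string"  # assumed fallback for conflicting types

def render(schema):
    """Render a description as a Hive-style type string."""
    if isinstance(schema, dict):
        fields = ",".join(f"{key}:{render(val)}" for key, val in schema.items())
        return f"struct<{fields}>"
    if isinstance(schema, list):
        return f"array<{render(schema[0])}>"
    return schema or "string"  # unknown types default to string (assumption)

def schema_from_lines(lines):
    """Merge the schemas of all non-empty lines, each holding one JSON document."""
    schema = None
    for line in lines:
        if line.strip():
            schema = merge(schema, infer(json.loads(line)))
    return render(schema)

docs = [
    '{"id": 1, "user": {"name": "a"}}',
    '{"id": 2.5, "tags": ["x"], "user": {"age": 3}}',
]
print(schema_from_lines(docs))
# prints: struct<id:double,user:struct<name:string,age:int>,tags:array<string>>
```

Neither document alone contains all of `id`, `user.name`, `user.age`, and `tags`, but the merged result does, which is the manual step the proposal aims to remove.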

Hi strelec - nice project. I also like how yours balances the angle brackets. That can definitely make for easier reading.

Your proposal to deduce the schema from multiple JSON docs is definitely useful, but ultimately dangerous, I think. There is no guarantee that your (presumably uncurated) set of docs actually covers all possible fields for your JSON structure, in which case you would generate a schema that worked initially but might fail some time later when doing a large query.
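The coverage concern can be illustrated with a small Python sketch. It compares the key paths seen in a sample of documents against those in a document that arrives later. The helper `key_paths`, the dotted path notation, and the sample data are hypothetical and exist only for this example; what Hive actually does with such unexpected fields depends on the SerDe in use and is not modeled here.

```python
import json

def key_paths(value, prefix=""):
    """Collect dotted paths of every key in a parsed JSON value ('[]' marks array elements)."""
    paths = set()
    if isinstance(value, dict):
        for key, val in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(val, path)
    elif isinstance(value, list):
        for item in value:
            paths |= key_paths(item, prefix + "[]")
    return paths

# Documents used to infer the schema (hypothetical sample).
sample = ['{"id": 1, "user": {"name": "a"}}', '{"id": 2}']
seen = set()
for line in sample:
    seen |= key_paths(json.loads(line))

# A document that shows up later, with fields the sample never contained.
later = '{"id": 3, "user": {"name": "b", "email": "c"}, "geo": {"lat": 1.0}}'
missing = sorted(key_paths(json.loads(later)) - seen)
print(missing)
# prints: ['geo', 'geo.lat', 'user.email']
```

A schema inferred from the sample alone would have no columns for `geo` or `user.email`, and nothing in the sample signals that they exist.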

I prefer the model where people take the time to curate a known-good JSON doc. This is basically the equivalent of doing the work to define your schema, just by example rather than via a formal schema.

I don't plan to add this feature. If someone wanted to contribute it, I would likely accept it, but I would put a strong caution against it on the README.

I now have a better version written in Scala, which is a JVM language, so this issue can be closed.

https://github.com/strelec/hive-serde-schema-gen

If you have any suggestions, they are very much appreciated. I value your opinion a lot.