julianpeeters / avrohugger

Generate Scala case class definitions from Avro schemas

Schema evolution

imarios opened this issue · comments

I was wondering how schema evolution can be handled. For example, will I be able to read an earlier serialized version (written with an older Avro Schema) into a new case class (generated with a later version of the same Schema)?

Yes, with respect to evolution, the Scala class will behave just as a Java class would (Note that there are nevertheless valid and invalid Schema evolutions).

Here is a very good and detailed blog post about that topic:
http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Although I've had some issues with removing fields.

@mariussoutier Thanks!! I read it and it was a great read! However, from his text it's not clear to me how the generated classes evolve. I know that the APIs for the serialized binaries have ways to resolve evolutionary changes, but I am not exactly sure how that would be resolved in the generated (for example) case classes.

@julianpeeters thanks for the reply! I think I need to create some toy examples and experiment a bit to understand this properly. Essentially I want most schema evolution issues to be caught at compile time (if this is possible). For example, what would happen if I try to deserialize into case class Person(a: String, b: String, j: Int) when the data were serialized using case class Person(a: String, j: Int)? I want to test these scenarios out and see how they behave. Any input on this is greatly appreciated. Thanks again!

Ah I see, as he mentions, the ordering of the fields does not matter in Avro, so the generated classes should not have a problem. Especially since the Avro contract requires an empty constructor and mutable fields. I will also do some more testing in the near future with regards to that.

It is quite hard to catch these type errors at compile time. In general, if you want everything to always work without always updating consumers or producers first, you need to guarantee full compatibility, which pretty much boils down to:

  • Backwards compatibility - Ensure that every new field has a default value.
  • Forwards compatibility - Never remove a field that doesn't have a default value. This is the one that matters most for specific records.

This is an oversimplification, so make sure to test the cases you're looking to support. I recently came across this talk, which gives a good Avro overview (jump to minute 20 or so):
https://www.youtube.com/watch?v=GfJZ7duV_MM

In my CI pipeline, as part of my schema build, I fire up docker containers for my dev/qa/prod Confluent Schema Registries, which bootstrap themselves with the schemas from each environment, and then use the REST APIs to test compatibility with my branch. I already had most of the infra in place to do this easily; there are likely simpler ways if you're starting from a clean slate (and especially if you aren't using Schema Registry/Kafka).

I think there are command line tools to test compatibility, but I couldn't find them with a quick search.
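
FWIW, Avro itself ships a programmatic check in org.apache.avro.SchemaValidatorBuilder that can verify whether one schema version can read data written with another. A rough sketch (the Person schemas here are just invented for illustration, not anything from this thread):

import org.apache.avro.{Schema, SchemaValidationException, SchemaValidatorBuilder}
import java.util.Collections

object CompatCheck extends App {
  val oldSchema = new Schema.Parser().parse(
    """{"type":"record","name":"Person","fields":[
      |  {"name":"a","type":"string"},
      |  {"name":"j","type":"int"}]}""".stripMargin)
  val newSchema = new Schema.Parser().parse(
    """{"type":"record","name":"Person","fields":[
      |  {"name":"a","type":"string"},
      |  {"name":"b","type":"string","default":""},
      |  {"name":"j","type":"int"}]}""".stripMargin)

  // canReadStrategy: the schema being validated (newSchema) must be able to
  // read data written with the existing schema(s), i.e. backward compatibility
  val validator = new SchemaValidatorBuilder().canReadStrategy().validateAll()
  try {
    validator.validate(newSchema, Collections.singletonList(oldSchema))
    println("backward compatible")
  } catch {
    case e: SchemaValidationException => println(s"incompatible: ${e.getMessage}")
  }
}

Swapping in canBeReadStrategy() checks the forward direction, and mutualReadStrategy() checks both.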

@ppearcy thanks for the reply! I was wondering if something similar to what the SBT team is doing can help here sbt-datatype ... not sure, but what they do looks interesting.

I haven't delved too deeply into evolution myself, but in addition to Paul's advice re default values, here are a couple more rules for schema evolution:

  • Can add or change (or remove?) the namespace - as long as there are no fields whose types are records within unions.
  • Can't change the type of a field.

But it's worth noting that if you run into an impossible evolution, you can (always?) deserialize into a GenericRecord and map the fields into your new case class by hand.
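
That fallback might look roughly like this, assuming a hypothetical new case class Person with made-up field names:

import org.apache.avro.generic.GenericRecord

object GenericFallback {
  // hypothetical new case class generated from the evolved schema
  case class Person(a: String, b: String, j: Int)

  // map a GenericRecord (read with the old writer schema) into the new case class by hand
  def fromGeneric(rec: GenericRecord): Person =
    Person(
      a = rec.get("a").toString,          // Avro strings come back as Utf8, hence toString
      b = "",                             // field not present in the old data, so supply it manually
      j = rec.get("j").asInstanceOf[Int]
    )
}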

So your example should deserialize as long as you define the new case class such that the new field has a default value.

case class Person(a: String, b: String = "", j: Int)
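
One detail worth noting: during deserialization it's the "default" declared in the reader schema (the .avsc the case class is generated from) that Avro's schema resolution uses to fill in the missing field. A rough sketch of that resolution using Avro's generic API, with Person schemas invented to match the example:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import java.io.ByteArrayOutputStream

object EvolutionDemo extends App {
  // writer (old) and reader (new) schemas; the new field "b" carries a default
  val writerSchema = new Schema.Parser().parse(
    """{"type":"record","name":"Person","fields":[
      |  {"name":"a","type":"string"},
      |  {"name":"j","type":"int"}]}""".stripMargin)
  val readerSchema = new Schema.Parser().parse(
    """{"type":"record","name":"Person","fields":[
      |  {"name":"a","type":"string"},
      |  {"name":"b","type":"string","default":""},
      |  {"name":"j","type":"int"}]}""".stripMargin)

  // serialize one record with the old schema
  val oldRecord = new GenericData.Record(writerSchema)
  oldRecord.put("a", "alice")
  oldRecord.put("j", 42)
  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](writerSchema).write(oldRecord, encoder)
  encoder.flush()

  // deserialize with both schemas: resolution fills in "b" from the reader's default
  val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
  val evolved = new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, decoder)
  println(evolved) // prints something like {"a": "alice", "b": "", "j": 42}
}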

Interesting links indeed. I'm on the Kleppmann/Confluent bandwagon as well, but it's interesting that Eric Sammer's Rocana uses a fairly basic, unchanging schema, handling everything as an event (also around the 20 minute mark or so): https://www.youtube.com/watch?v=lYbyjF4a4uU.

Oh no, don't get me started on event design/structure :)

Sorry for digressing into a side chat, but I had my head down in structuring events for a couple of months and didn't have many people to nerd out with about it.

Once you choose to use a map, any contract about what is there becomes external to the schema. That can definitely make sense in many cases, and it's a necessary fallback if you truly don't know what you are going to receive. As long as the policy is to only ever add new key/value pairs it can be relatively safe, but you lose the ability for anyone to easily understand what an event may or may not contain without going to the data itself. It will also inevitably lead to the same data being inconsistently named across the k/v pairs unless there is a separate process. Pretty much all the same arguments of dynamic vs. static languages seem to apply here.

I prefer a nested event structure with a few different core records that every event is guaranteed to have (action, source, actor), then add on extra entity records for the various dimensions involved and the actual target of the action, and ensure these are kept consistent. It has a very positive impact on code compactness/re-use, too, if you have Builder objects for each of these sub-components of an event.
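
Purely to illustrate the shape (all names here are invented), something like:

// hypothetical rendering of that event layout as case classes
case class Action(name: String)
case class Source(service: String, host: String)
case class Actor(id: String, kind: String)
case class Entity(kind: String, id: String, attributes: Map[String, String] = Map.empty)

case class Event(
  action: Action,                  // core records every event is guaranteed to have
  source: Source,
  actor: Actor,
  entities: List[Entity] = Nil     // extra entity records for the dimensions/target involved
)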

I'm dealing mostly with user activity and trying to keep producers/consumers of data implicitly coordinated. If I had more raw metric data, I might lean towards a more k/v centric approach, though.

I found this thread very helpful, perhaps some excerpts should be posted in a section on the README?

Sure! But I've got blinders about which parts were useful, and whether or not the README is getting too long for newcomers. Would you mind opening up a rough PR, or an issue with some quick notes?

Part of me feels that this is out of scope and may be best as a blog post somewhere, but I'm all for making Avro easier for people to use.

I like your idea of having this in a blog post, with a link to it under something along the lines of a Further Reading section.