AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Generic data structures to hold a parsed EBCDIC record

yruslan opened this issue

Background

Currently, the only way to extract data from an EBCDIC file using Cobrix is through Spark. While Spark is great for big data workflows, other workflows would benefit from a more generic way to parse EBCDIC data.

Feature

Create generic data structures for holding parsed EBCDIC record data.

I'm interested in this feature too. Let's discuss the best way to do this. Ideally, it could be done together with the effort to output standard formats such as JSON or CSV from Cobrix.

Absolutely, this is exactly the idea. I can create a design document on Google Drive (for instance) so we can comment on it and come up with a design that takes requirements from both sides into account.

Looking forward to the collaboration.

Revisiting this. I was looking through the code and wondering whether we could:

  • move the RowExtractors object to cobol-parser, or potentially to a new intermediate package, say cobol-reader. It seems the Spark-specific part comes only at the end of each method, so we could return a generic Scala object and create a thin wrapper in the spark-cobol package that simply converts the fields to Row (see the sketch after this list)
  • move the readers and iterators to cobol-parser/cobol-reader
  • let the spark-cobol package simply convert the generic Scala rows to Spark rows
  • create other converters either in the cobol-parser/cobol-reader or as separate packages to output to JSON and CSV (for non-nested structs)
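
A rough sketch of that split, using hypothetical names rather than the existing RowExtractors API: the generic part (cobol-parser / cobol-reader) returns plain Scala values, and the spark-cobol wrapper is the only place Spark's Row appears.

```scala
import org.apache.spark.sql.Row

// Generic side (would live in cobol-parser / cobol-reader): one record as plain
// Scala values, with COBOL groups represented as nested Seqs. The body is a
// placeholder; real decoding would follow the copybook layout.
object GenericRecordExtractor {
  def extractRecord(recordBytes: Array[Byte]): Seq[Any] =
    Seq("ACCOUNT-1", BigDecimal("1250.75"), Seq("SEGMENT-A", 42))
}

// Spark side (spark-cobol): a thin wrapper, the only Spark-specific step.
object SparkRowConverter {
  def toRow(values: Seq[Any]): Row =
    Row.fromSeq(values.map {
      case group: Seq[_] => Row.fromSeq(group) // nested group -> nested Row
      case primitive     => primitive
    })
}
```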

That all sounds reasonable, and it looks like a logical evolution of the project. It would be great if we could decrease coupling and make the project more modular and reusable for other frameworks. The only thing we should keep in mind, I think, is that we need to preserve performance.

I'm wondering whether cobol-reader could provide generic methods that allow conversion to Spark's Row in the reader as a single step, without an intermediate data structure. I'm also thinking of adding performance tests to make sure performance won't degrade after the rearrangement.
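
For illustration, one way to get that single step is to parameterize the reader by a small handler that builds the target row type directly; the trait and object names below are hypothetical, not an existing Cobrix API.

```scala
import org.apache.spark.sql.Row

// Hypothetical handler abstraction: the reader hands decoded field values straight
// to the handler, so no intermediate record structure is materialized.
trait RecordHandler[T] {
  def create(values: Array[Any]): T
}

// Spark flavour: decoded values become a Spark Row in one step.
object SparkRowHandler extends RecordHandler[Row] {
  override def create(values: Array[Any]): Row = Row.fromSeq(values.toSeq)
}

// Framework-free flavour: decoded values stay as a plain Scala Seq.
object SeqHandler extends RecordHandler[Seq[Any]] {
  override def create(values: Array[Any]): Seq[Any] = values.toSeq
}
```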

@yruslan, take a look at https://github.com/tr11/cobrix/tree/refactor-readers, it does what I mentioned in #184 (comment).
With these changes we have:

  • parser, reader, spark packages. Each handles one part of the process
  • the row conversion happens as a single step with no intermediate structures
  • it's possible to call the reader directly and extract data as a nested Seq or Array.
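
To make the last point concrete, here is what a record extracted as a nested Seq might look like and how a non-Spark consumer could walk it; the values and the flattening helper are made up for illustration.

```scala
object NestedRecordDemo {
  def main(args: Array[String]): Unit = {
    // A record with a scalar field, a group, and a repeated (OCCURS-like) group.
    val record: Seq[Any] =
      Seq("0001", Seq("John", "Smith"), Seq(Seq("2020-01-01", 100.0), Seq("2020-02-01", 250.0)))

    // Flatten nested groups into leaf values, e.g. to build a CSV line.
    def leaves(value: Any): Seq[Any] = value match {
      case group: Seq[_] => group.flatMap(leaves)
      case scalar        => Seq(scalar)
    }

    println(leaves(record).mkString(","))
    // 0001,John,Smith,2020-01-01,100.0,2020-02-01,250.0
  }
}
```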

I should have a JSON record builder at some point in the next couple of weeks. Probably as an example app, since there's no need to add extra dependencies to the reader.

I've looked through the changes and it looks perfect! I'll give it a more thorough look tomorrow; I might have a couple of questions.

I added a test with a potential JSON, XML, and CSV implementation for record builders. A few questions I have:

  1. Should the reader package be merged into the parser? There are no extra dependencies and it's unlikely anyone would use the parser without a reader.
  2. Is there a need to create a serializers package similar to the spark-cobol package? The test I mentioned uses a Map[String, Any] to hold the data and pass it to jackson-databind. Maybe an example could suffice instead of replicating a lot of what's done on the Spark side.
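
For reference, a minimal sketch of that approach, assuming jackson-module-scala is on the classpath (field names and values are made up):

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object JsonRecordExample {
  def main(args: Array[String]): Unit = {
    // A parsed record held as Map[String, Any]; nested groups are nested Maps.
    val record: Map[String, Any] = Map(
      "COMPANY_NAME" -> "ABCD Ltd.",
      "COMPANY_ID"   -> "0039887123",
      "ADDRESS"      -> Map("STREET" -> "74 Lawn ave.", "CITY" -> "New York")
    )

    // jackson-databind with the Scala module turns the Map directly into JSON.
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    println(mapper.writeValueAsString(record))
    // {"COMPANY_NAME":"ABCD Ltd.","COMPANY_ID":"0039887123","ADDRESS":{...}}
  }
}
```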

Looking at it...

  1. Should the reader package be merged into the parser? There are no extra dependencies and it's unlikely anyone would use the parser without a reader.

Yes, I agree. It can be in the same module but in different packages. If for any reason we'd like to split them later, we can do it at any time.

  2. Is there a need to create a serializers package similar to the spark-cobol package? The test I mentioned uses a Map[String, Any] to hold the data and pass it to jackson-databind. Maybe an example could suffice instead of replicating a lot of what's done on the Spark side.

I think yes here as well. Direct conversion to JSON, XML, and CSV could be very useful as something supported by the library rather than just an example, even if it is a very small module. I expect it could be used to bridge EBCDIC-encoded IBM MQ messages to other messaging systems.

I'll let you go through it first and will merge the parser and reader modules after you're done.

For 2, what do you think of cobol-serializer as the name for the new package? I can set up the Reader classes akin to what's done on the Spark side right now, and then we can think about how to pass options to them.

cobol-serializer seems alright, but it gives the impression that something can be serialized to Cobol. Maybe cobol-converters? That implies Cobol data can be converted to various output formats. The fact that serializers are used for the conversion can be considered a technical detail. What do you think?

Good point, cobol-converters it is!

Would it be useful to create a PR with these changes for comments and suggestions?

👍 Of course

Sorry for the late response.