AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Generic data structures to hold a parsed EBCDIC record

yruslan opened this issue

Background

Currently, the only way to extract data from an EBCDIC file using Cobrix is through Spark. While Spark is great for big data workflows, other workflows would benefit from a more generic way to parse EBCDIC data.

Feature

Create generic data structures for holding parsed EBCDIC record data.

I'm interested in this feature too. Let's discuss the best way to do this. Ideally, it could be done together with the effort to output standard formats such as JSON or CSV from Cobrix.

Absolutely, this is exactly the idea. I can create a design document on Google Drive (for instance) so we can comment on it and come up with a design that takes requirements from both sides into account.

Looking forward to the collaboration.

Revisiting this. I was looking through the code and wondering whether we could:

  • move the RowExtractors object to cobol-parser, or potentially to a new intermediate package, say cobol-reader. It seems the Spark-specific part comes only at the end of each method, so we could return a generic Scala object and create a thin wrapper in the spark-cobol package that simply converts the fields to Row (see the sketch after this list)
  • move the readers and iterators to cobol-parser/cobol-reader
  • let the spark-cobol package simply convert the generic Scala rows to Spark rows
  • create other converters either in the cobol-parser/cobol-reader or as separate packages to output to JSON and CSV (for non-nested structs)
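
A rough sketch of that split, using hypothetical names rather than the existing RowExtractors API: the generic part (cobol-parser / cobol-reader) returns plain Scala values, and the spark-cobol wrapper is the only place Spark's Row appears.

```scala
import org.apache.spark.sql.Row

// Generic side (would live in cobol-parser / cobol-reader): one record as plain
// Scala values, with COBOL groups represented as nested Seqs. The body is a
// placeholder; real decoding would follow the copybook layout.
object GenericRecordExtractor {
  def extractRecord(recordBytes: Array[Byte]): Seq[Any] =
    Seq("ACCOUNT-1", BigDecimal("1250.75"), Seq("SEGMENT-A", 42))
}

// Spark side (spark-cobol): a thin wrapper, the only Spark-specific step.
object SparkRowConverter {
  def toRow(values: Seq[Any]): Row =
    Row.fromSeq(values.map {
      case group: Seq[_] => Row.fromSeq(group) // nested group -> nested Row
      case primitive     => primitive
    })
}
```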

That all sounds reasonable, and it looks like a logical evolution of the project. It would be great if we could decrease coupling and make the project more modular and reusable for other frameworks. The only thing we should keep in mind, I think, is that we need to preserve performance.

I'm wondering whether cobol-reader could provide generic methods that allow conversion to Spark's Row in the reader as a single step, without an intermediate data structure. I'm also thinking of adding performance tests to make sure performance won't degrade after the rearrangement.
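
For illustration, one way to get that single step is to parameterize the reader by a small handler that builds the target row type directly; the trait and object names below are hypothetical, not an existing Cobrix API.

```scala
import org.apache.spark.sql.Row

// Hypothetical handler abstraction: the reader hands decoded field values straight
// to the handler, so no intermediate record structure is materialized.
trait RecordHandler[T] {
  def create(values: Array[Any]): T
}

// Spark flavour: decoded values become a Spark Row in one step.
object SparkRowHandler extends RecordHandler[Row] {
  override def create(values: Array[Any]): Row = Row.fromSeq(values.toSeq)
}

// Framework-free flavour: decoded values stay as a plain Scala Seq.
object SeqHandler extends RecordHandler[Seq[Any]] {
  override def create(values: Array[Any]): Seq[Any] = values.toSeq
}
```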

@yruslan, take a look at https://github.com/tr11/cobrix/tree/refactor-readers, it does what I mentioned in #184 (comment).
With these changes we have:

  • parser, reader, spark packages. Each handles one part of the process
  • the row conversion happens as a single step with no intermediate structures
  • it's possible to call the reader directly and extract data as a nested Seq or Array.
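
To make the last point concrete, here is what a record extracted as a nested Seq might look like and how a non-Spark consumer could walk it; the values and the flattening helper are made up for illustration.

```scala
object NestedRecordDemo {
  def main(args: Array[String]): Unit = {
    // A record with a scalar field, a group, and a repeated (OCCURS-like) group.
    val record: Seq[Any] =
      Seq("0001", Seq("John", "Smith"), Seq(Seq("2020-01-01", 100.0), Seq("2020-02-01", 250.0)))

    // Flatten nested groups into leaf values, e.g. to build a CSV line.
    def leaves(value: Any): Seq[Any] = value match {
      case group: Seq[_] => group.flatMap(leaves)
      case scalar        => Seq(scalar)
    }

    println(leaves(record).mkString(","))
    // 0001,John,Smith,2020-01-01,100.0,2020-02-01,250.0
  }
}
```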

I should have a JSON record builder at some point in the next couple of weeks. Probably as an example app, since there's no need to add extra dependencies to the reader.

I've looked through the changes and it looks perfect! I'll give it a more thorough look tomorrow; I might have a couple of questions.

I added a test with a potential JSON, XML, and CSV implementation for record builders. A few questions I have:

  1. Should the reader package be merged into the parser? There are no extra dependencies and it's unlikely anyone would use the parser without a reader.
  2. Is there a need to create a serializers package similar to the spark-cobol package? The test I mentioned uses a Map[String, Any] to hold the data and pass it to jackson-databind. Maybe an example could suffice instead of replicating a lot of what's done on the Spark side.
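
For reference, a minimal sketch of that approach, assuming jackson-module-scala is on the classpath (field names and values are made up):

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object JsonRecordExample {
  def main(args: Array[String]): Unit = {
    // A parsed record held as Map[String, Any]; nested groups are nested Maps.
    val record: Map[String, Any] = Map(
      "COMPANY_NAME" -> "ABCD Ltd.",
      "COMPANY_ID"   -> "0039887123",
      "ADDRESS"      -> Map("STREET" -> "74 Lawn ave.", "CITY" -> "New York")
    )

    // jackson-databind with the Scala module turns the Map directly into JSON.
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    println(mapper.writeValueAsString(record))
    // {"COMPANY_NAME":"ABCD Ltd.","COMPANY_ID":"0039887123","ADDRESS":{...}}
  }
}
```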

Looking at it...

  1. Should the reader package be merged into the parser? There are no extra dependencies and it's unlikely anyone would use the parser without a reader.

Yes, I agree. It can be in the same module but in different packages. If for any reason we'd like to split them later, we can do it at any time.

  2. Is there a need to create a serializers package similar to the spark-cobol package? The test I mentioned uses a Map[String, Any] to hold the data and pass it to jackson-databind. Maybe an example could suffice instead of replicating a lot of what's done on the Spark side.

I think yes here as well. Direct conversion to JSON, XML, and CSV could be very useful as something supported by the library rather than just an example, even if it is a very small module. I expect it could be used to bridge EBCDIC-encoded IBM MQ messages to other messaging systems.

I'll let you go through it first and will merge the parser and reader modules after you're done.

For 2, what do you think of cobol-serializer as the name for the new package? I can set up the Reader classes akin to what's done on the Spark side right now, and then we can think about how to pass options to them.

cobol-serializer seems alright, but it gives the impression that something can be serialized to Cobol. Maybe cobol-converters? That implies Cobol data can be converted to various output formats. The fact that serializers are used for the conversion can be considered a technical detail. What do you think?

Good point, cobol-converters it is!

Would it be useful to create a PR with these changes for comments and suggestions?

👍 Of course

Sorry for the late response.