common-workflow-language / cwljava

Java SDK for the Common Workflow Language standards

Selecting an implementation

kellrott opened this issue · comments

I've refactored both WIP branches into more organized projects: #4 for the branch by @denis-yuen and #5 for the branch by @pgrosu.
#4 compiles, but it looks like #5 has issues with method overriding (I had a similar issue with the Scala attempt):

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.3:compile (default-compile) on project core: Compilation failure
[ERROR] /Users/ellrott/workspaces/cwljava/core/src/main/java/io/cwl/schema/OutputRecordSchema.java:[51,30] getfields() in io.cwl.schema.OutputRecordSchema cannot override getfields() in io.cwl.schema.RecordSchema
[ERROR] return type io.cwl.schema.OutputRecordField[] is not compatible with io.cwl.schema.RecordField[]
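The error above is Java's covariant-override rule at work: an overriding method may narrow its return type only if the new return type is a subtype of the old one. The sketch below borrows the class names from the error message but is not the generated cwljava source; it shows the version that compiles. If OutputRecordField did not extend RecordField, javac would emit exactly the "cannot override" error quoted above.

```java
// Minimal sketch of Java's covariant-return rule behind the compile error.
// Names mirror the error message but this is illustrative, not generated code.
class RecordField {}
class OutputRecordField extends RecordField {} // the subtype relation that makes the override legal

class RecordSchema {
    RecordField[] getFields() { return new RecordField[0]; }
}

class OutputRecordSchema extends RecordSchema {
    // Legal only because OutputRecordField[] is a subtype of RecordField[]
    // (Java arrays are covariant in their element type). Without the
    // `extends RecordField` above, this override would not compile.
    @Override
    OutputRecordField[] getFields() { return new OutputRecordField[2]; }
}

public class CovariantDemo {
    public static void main(String[] args) {
        RecordSchema s = new OutputRecordSchema();
        System.out.println(s.getFields().length); // prints 2
    }
}
```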

I'll take a look at the branch for #4. I may move it into a new branch inside the repo (from your organization) so I can work on it without forking it (I saw some build errors on Travis CI to resolve). I've added you as a collaborator on this project, so you shouldn't need to fork in the future.

FYI, I've fixed the tests and the Travis CI build with #4.
I saw that #5 was going to develop and #4 to master; are we going with HubFlow? https://datasift.github.io/gitflow/IntroducingGitFlow.html 👍

Until we decide which way we're going, we should keep both in different branches. It'll make it easier to do development and comparisons. The assignment of one to master and the other to develop was arbitrary.

I'd prefer to stick closer to the existing software (i.e. the Avro tools) rather than be responsible for our own build. However, I do like that @pgrosu included a build project to make the code generation part of the source tree.

@pgrosu's version is pretty streamlined, but the pure Avro version does include some additional builder methods and some additional attached metadata about the schema, which will enable easier introspection and object unpacking.
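For readers unfamiliar with the difference: Avro's code generator emits a fluent builder alongside each class. A minimal hand-written sketch of that style (the real generated InputParameter has many more fields plus validation; this is only an illustration):

```java
// Hand-written sketch of the builder style that Avro's generator emits.
// Illustrative only; not the generated cwljava InputParameter.
class InputParameter {
    private final String id;

    private InputParameter(String id) { this.id = id; }

    String getId() { return id; }

    static Builder newBuilder() { return new Builder(); }

    static class Builder {
        private String id;
        // Each setter returns the builder so calls can be chained.
        Builder setId(String id) { this.id = id; return this; }
        InputParameter build() { return new InputParameter(id); }
    }
}

public class BuilderDemo {
    public static void main(String[] args) {
        InputParameter p = InputParameter.newBuilder().setId("reference").build();
        System.out.println(p.getId()); // prints reference
    }
}
```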

For comparison, you can look at
https://github.com/common-workflow-language/cwljava/blob/develop/core/src/main/java/io/cwl/avro/InputParameter.java
vs
https://github.com/common-workflow-language/cwljava/blob/develop/WIP/paul/sdk/InputParameter.java

Hi Kyle (@kellrott),

Thank you for all the great help on this exciting project! Here are just a few things:

  1. It seems like you encountered the same issue I previously discovered. I submitted the following request last week on the CWL group e-mail list – titled "Request a minor change for OutputRecordField definition" – and Peter (@tetron) is looking into it:

https://groups.google.com/forum/#!topic/common-workflow-language/ZgU0W3_eahY

My preference - as you also share - is to always have a seamless and streamlined connection between the design and implementation, since then the SDKs will follow naturally. The SDK can also be used to test the design – as you also encountered - which makes the process a nice biconditional.

In your pull request, I recommended a few small fixes to the compile-and-run.sh script to address this, since the definition now resides in Process.yml. I also copied the AUTHORS.TXT file into the schema-build directory. After these changes, launching ./compile-and-run.sh works fine. You can also compile manually by going to the core/src/main/java folder and compiling any file, such as:

javac io/cwl/schema/OutputRecordField.java

Or for a more complete test, this can be done via a script that performs the following:

#!/bin/bash
for JavaFile in io/cwl/schema/*.java
do
  echo "Compiling: $JavaFile"
  javac "$JavaFile"
done

After these fixes, what other things can we improve upon? Folks can jump in here with recommendations, so we can hash out the process to the community's preference. This now leads me to point 2.

  2. Based on the strong demand from the CWL community for a Java SDK implementation, the version I put together fulfills the following three key goals:
  • The Java SDK should be a Java-only implementation, including YAML document parsing and SDK code generation.
  • All necessary CWL design components should be available in the SDK, such as namespaces.
  • The SDK should have as close to a 1-to-1 mapping to the CWL design as possible, so that one can go back and forth seamlessly.

Before we perform the merge, it might be good now as a community - i.e. @mr-c, @tetron, @mdmiller53, us here, and anyone else I forgot who wants to jump in - to start deciding which path to take. My preference is the following:

  1. Wait until @tetron looks at the implementation of the OutputRecordField definition in Process.yml. Based on that, we can then discuss and address how the SDK should operate.

  2. After the above is decided upon, we might want more folks to look/test before the merge. Do we want a voting system before merging?

Having been at companies that put together software frameworks, I think the earlier we address issues we find, the easier and clearer things will be down the road.

Let me know what you think.

Thank you,
Paul

Hi,

My preference generally is to re-use maintained code where possible unless a compelling reason not to comes up and to break up multi-functional components into "small sharp tools."

Here, there are four main tasks that I think our two approaches demonstrate:

  1. converting from the CWL specification (schema salad yml) to standard avro json schema
  2. compiling the schema into Java classes
  3. converting from actual CWL tools and workflows to standard avro json documents
  4. deserializing those documents into instances of Java classes

I rely upon cwltool to perform 1 and 3, Avro tools to perform 2, and a bit of custom code in the SDK built on Gson to perform 4. The advantage of this approach is that we get the full benefit of being in the Avro ecosystem (an Avro plugin for Maven, auto-generated classes that will improve as Avro evolves, and the rest of the Avro tooling for free), while cwltool and Avro tools are maintained for "free".
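The "Avro plugin for Maven" piece can be wired into the build roughly like this. A hedged sketch: the avro-maven-plugin and its schema goal are real, but the version and directory paths here are illustrative, not taken from cwljava's actual pom.xml.

```xml
<!-- Step 2 of the list above: compile Avro schemas into Java classes at
     build time. Version and paths are illustrative. -->
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.7.7</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```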

The disadvantage of creating a tool that goes directly from the CWL specification to Java classes is that only a developer that is familiar with yaml/CWL/schema salad and Java will be able to maintain and evolve that tool.

That said, it's not entirely clear to me that we have to choose. As long as we agree on an interface, we could just have both approaches.

We can't really 'agree on an interface', because there is very little that can be done with Avro-generated code (unless we stack in a lot of post-processing, but that kind of defeats the purpose). The question is: would @pgrosu move his generator closer to what the Avro one creates?

I'm hoping we can get started on the file parser soon, which means we should settle this quickly. But maybe we should hear from a few other people like @ntijanic, @tetron and @mr-c.

Can someone generate an API diff or point me to some javadocs to view side by side?

Most of you are probably aware that GA4GH is moving off of Avro to protocol buffers (which have a JSON serialization). I'm not advocating the same change for CWL, due to the power of our support for linked data, but this gives me pause about Avro.

Of course I want to leverage as much existing infrastructure and tooling as possible. I will look into this further.


Michael R. Crusoe CWL Community Engineer crusoe@ucdavis.edu
mcrusoe@msu.edu
Common Workflow Language project University of California, Davis
https://impactstory.org/MichaelRCrusoe http://twitter.com/biocrusoe

I probably misinterpreted something about the scope of what people want to accomplish.

@kellrott I think in its simplest form, you'd only need two methods. One to serialize (not in my list so far) CWL documents and one to de-serialize (3 and 4 in my list) CWL documents.

For a file parser, are you referring to something other than de-serialization (i.e. creating instances of the generated classes in memory)?

Yes, @denis-yuen, the first step is to make sure we have methods to de-serialize CWL documents into the classes generated by the schema. If you have something in the works, awesome.
Once that is done, we can start building the Java cwltool to open the documents with an input and start building the command lines.
I had done some work to get the Scala CWL test to run under the CWL compliance test (and fail ;-) ), but once we have that working, we can start unit testing command-line generation compliance.
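The two-method surface discussed above (one serialize, one de-serialize) could be pinned down as a small interface shared by both generators. A toy sketch with hypothetical names, using a fake one-field "document" instead of real CWL, just to illustrate the round-trip contract:

```java
// Toy sketch of a shared serializer interface; names are hypothetical and
// the "CWL document" here is a single id field, not real CWL.
interface CWLSerializer<T> {
    String serialize(T doc);
    T deserialize(String json);
}

public class InterfaceDemo {
    // Stand-in for a schema-generated class.
    static class Tool {
        final String id;
        Tool(String id) { this.id = id; }
    }

    public static void main(String[] args) {
        CWLSerializer<Tool> s = new CWLSerializer<Tool>() {
            public String serialize(Tool t) {
                return "{\"id\":\"" + t.id + "\"}";
            }
            public Tool deserialize(String json) {
                // Naive extraction, sufficient for the toy format above.
                String id = json.replaceAll(".*\"id\":\"([^\"]*)\".*", "$1");
                return new Tool(id);
            }
        };
        // Round-trip: object -> JSON -> object.
        Tool t = s.deserialize(s.serialize(new Tool("echo")));
        System.out.println(t.id); // prints echo
    }
}
```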

Let's have a zoom.us call with interested parties sometime early next week. If you want to participate please fill out this Doodle poll:

http://doodle.com/poll/xpuw24ky6p7y8i7h

Narrowly, @kellrott: deserialization (of CommandLineTools) is tested here: https://github.com/common-workflow-language/cwljava/blob/develop/core/src/test/java/CWLClientTest.java#L75

More broadly, I think I better understand from the call that there's more than one topic here. I was envisioning a rather small SDK to read and write CWL files. Somewhat like how https://commons.apache.org/proper/commons-csv/ can read and write CSV files. I do have the opinion there that we should rely upon publicly available tooling as much as possible to convert between file formats.

From the call, I understand that the broader idea is to essentially reimplement cwltool to pass the conformance tests for running workflows and other functions. I don't have much of an opinion on that wider discussion.

@denis-yuen the larger goal would be a complete implementation, including example tool runner and library that could be used in JVM projects, like https://github.com/broadinstitute/cromwell
But the project should be put into coherent sub-modules, and one of them could just be the core classes and basic IO. Then the tool runner could be another sub-module.

That makes sense.

@mr-c, I completed the JavaDoc for my SDK implementation, with the SDK included. People can download the zipped file at the following link:

pgrosu-CWL_JavaDoc_and_SDK.zip

When you unzip the files, you will see the following two folders:

  • JavaDoc
  • SDK

If you proceed to the JavaDoc folder and double-click on the index.html file, the whole JavaDoc for the SDK will be shown in a browser.

@denis-yuen, @kellrott, et al.: I sort of expected new enhancement requests to come with time. Thus I built my implementation to provide high versatility for making easy additions to the code, which I am hoping we can do together as a community. As you noticed, it is highly modular, and the end goal of having something similar to Cromwell (from the Broad) is practical. Since it is a Java-only implementation, a Scala SDK will follow naturally.

Thank you for a great meeting today, and for having this specific call next week. I put my availability on the Doodle document.

In case I forgot anything, please let me know.

Thanks,
Paul

hi all,

A lot of great work. Just a couple of observations from a webapp Java workflow implementation I did a few years ago here at ISB. Although there was an underlying model, I generated the equivalent Java (essentially) bean classes by hand. My 'language' was captured by Spring XML configuration files that were remarkably similar to the JSON Avro CWL schema (I would have happily used CWL if it existed at the time), and these would instantiate my Java beans. To actually gather parameters for the workflows, I would serialize the Java beans using XStream's JSON option. This could be shipped client-side and used by JavaScript to create forms for that workflow. When the user specified parameter values, those would be persisted in the JSON client-side, then shipped back to the server, where the values were merged into the server-side saved JSON-persisted workflow. These could then be re-instantiated into the Java beans and the workflow run server-side as Java.

Hi Michael,

Those are awesome suggestions! That brings back nice memories of a discussion we had on the flexibility of XML cross-references and their connections to graph theory. We should be able to stream these settings and UI instantiations very easily, especially with JavaScript integration. The whole ecosystem should be modular to allow components to talk to each other (i.e. distributed engine(s), client(s), SDK framework, etc.).

With a flexible SDK I can even imagine workflows with the ability to be paused (checkpointed) in the middle of processing and later restarted, with future implementations. I posted this on the GA4GH containers list a while ago, where the flow of inputs, outputs and processing always goes through a SinkBuffer so as to be able to pause the processing, load-balance, and continue the processing on another node or at a later time.

The more ideas the better, and I hope a lot of folks join this call tomorrow.

Paul

@pgrosu As a likely consumer of a java/scala based SDK I just wanted to comment on the java-vs-scala angle. If I were you all I wouldn't worry about maintaining both and just go with the java one. As a scala zealot it'd be trivial to write a little wrapper to scala it up.

My concern would be that unless the scala sdk was written by hand instead of autogenerated that it wouldn't be very scala-y anyways (e.g. vars, nulls, etc).

@pgrosu
I won't be able to make the call, but I'd be willing at some point to give a demo of my workflow implementation. It indeed had the possibility of pausing. I imagine people have looked at the workflow implementation Pegasus; it has many cool runtime features and flexibility.

@mdmiller53, no worries, and apologies for replying so late, as it has been a busy past few days. Please, you are more than welcome to :) I'm not sure if we have looked at Pegasus, but looking over the APIs across the languages, it's nice to see the same consistent names.

@geoffjentry Absolutely, less is more, and the more we simplify the core common functionality, the easier it will be to port. I think an SDK specification document will help with getting consistent functionality across languages for their CWL SDKs. If you prefer, I can start one based on the implementation, or you can jump in :) It'll be fun to work with everyone together on the project.