codemeta / codemeta

Minimal metadata schemas for science software and code, in JSON-LD

Home Page:https://codemeta.github.io/

Linking source code to software applications (entrypoints and service endpoints aka SaaS) with regard for interface type

proycon opened this issue · comments

The aim of this proposal is to:

  • explicitly specify the interface type(s) provided by software
  • make an explicit distinction and explicit link between software and software as a service
  • allow linking to software instances (services) from the source code metadata
  • relate SoftwareSourceCode and SoftwareApplication in both directions
  • make 'entry points' and 'service endpoints' explicit
  • settle some ambiguous terms

Linking source code to application entrypoint and service endpoints

In #198, #229 and #246 it was discussed and subsequently decided to add
hasSourceCode to schema.org and codemeta; a good idea. I would propose we
also add a property that is the exact and unambiguous reverse of this. I suggest
providesApplication = @reverse hasSourceCode. There is also
targetProduct (#267), which has the same
domain and range, but there seems to be a lot of confusion about what
targetProduct means exactly. schema.org defines it as: "Target Operating System / Product to which the code applies. If applies to several versions, just the product name can be used."
That definition is too vague, and there is conflicting information in #267, #246 and #198.

Various aspects of what I propose here affect schema.org directly but I thought
it better to pass this through the codemeta community first.

The providesApplication property would allow explicitly linking from the
source code metadata to software applications. This makes two things possible:

  1. This would provide a better means of expressing entry points for software, as I proposed earlier in #183 back in 2018. An entry point here is simply defined
    as an executable provided by a source code, each of which can be considered a schema:SoftwareApplication in their own right.
  2. Linking source code to service instances where the application is running (service endpoints), each typically associated with a URL. Here the range would be:
    * schema:WebAPI (as proposed in schemaorg/schemaorg#1423 and worked out in schemaorg/schemaorg#2635) - Emphasis here is on the machine interface. The existing schema:EntryPoint also has a place in what they proposed here. Their proposal also covers linking to formal specifications (like OpenAPI/swagger).
    * schema:WebApplication - Emphasis here is on the human interface (web UI).
    * schema:WebPage - Emphasis here is on the human interface.
    * The domain of schema:hasSourceCode would also need to be extended to include all three of these.

I think we can use providesApplication to cover both cases, but alternatively we could envision two properties (providesApplication vs providesService?)
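
To illustrate the reverse relation discussed above, here is a minimal Python sketch. It assumes hasSourceCode sits in the schema.org namespace (as this proposal suggests, not as a settled fact), and it treats providesApplication as the proposed, not yet adopted, term. The helper invert_has_source_code and the identifier urn:example:frog-cli are hypothetical and exist only for this example. JSON-LD allows a term to be declared with @reverse in the context, which is what the CONTEXT dictionary shows; the helper then derives the same reverse view by hand from application records that point at their source code.

```python
import json

# Assumption: hasSourceCode is in the schema.org namespace, and
# providesApplication is the proposed reverse term (not an adopted property).
CONTEXT = {
    "schema": "https://schema.org/",
    "hasSourceCode": {"@id": "schema:hasSourceCode", "@type": "@id"},
    "providesApplication": {"@reverse": "schema:hasSourceCode"},
}


def invert_has_source_code(applications):
    """Group application identifiers by the source code they point to."""
    provided = {}
    for app in applications:
        source = app.get("hasSourceCode")
        if source is not None:
            provided.setdefault(source, []).append(app["@id"])
    return provided


# Application records that link to their source code (forward direction).
apps = [
    {
        "@id": "https://webservices.cls.ru.nl/frog",
        "@type": "WebApplication",
        "hasSourceCode": "https://github.com/LanguageMachines/frog",
    },
    {
        "@id": "urn:example:frog-cli",  # hypothetical identifier
        "@type": "CommandLineApplication",
        "hasSourceCode": "https://github.com/LanguageMachines/frog",
    },
]

# Reverse direction: which applications does each source repository provide?
provided = invert_has_source_code(apps)
print(json.dumps(provided, indent=2))
```

Declaring the term as a reverse in the context means no second, independent property has to be kept in sync: both directions describe the same underlying statement.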

Interface type

A software application offers one or more interfaces through which users or
machines can interact with it. I'd like to make this information explicit. When
using providesApplication with schema:WebAPI/WebApplication/WebPage it
is already implied. For the more generic schema:SoftwareApplication it is
not. The specific types schema:MobileApplication, schema:VideoGame and
aforementioned schema:WebApplication already exist, but other interface
types are not covered yet. We could extend these with:

  • CommandLineApplication (command line interfaces)
  • DesktopApplication (Desktop GUIs)
  • TerminalApplication (Text UIs, think of vim,mutt and ncurses-based tools etc)
  • SoftwareDaemon (Software running as a daemon providing some kind of service over a network or local socket, think e.g. of ntpd, crond), this would be more generic than WebApplication (or WebAPI).
  • SoftwareLibrary (APIs, think of libraries, either in the form of shared-objects/dll/dylib or in the form of modules for interpreted languages like Python)

More specific types can be envisioned (relates to #256):

  • NotebookApplication (more specific form of WebApplication) - For Jupyter Notebooks and comparable technologies. Characterised by a mixture of text and code, often used in data science. May or may not be tied to a specific URL where an interactive instance is available (e.g. a link to Binder/Colab).
  • SoftwareImage - A software application in some kind of image form (such as an OCI container (e.g. Docker)), that typically ships the software with all its immediate dependency context. May or may not be tied to a specific url where the image is obtained (e.g. Docker Hub). Here the provided interface is relevant for operators (in a DevOps context) seeking to deploy the software in an infrastructure.
  • SoftwarePackage - The Software in some packaged form (e.g. for a particular linux distribution, homebrew, a Python wheel, etc). The difference between this and SoftwareImage would be that this packages only the software, and not its dependency context, the dependency context is assumed to be explicitly expressed in the package but is obtained from other packages within the same packaging context (whatever package distribution method that may be).

Alternatively, we could have an interfaceType property like I suggested in
#183, but since there already seems to be precedent in schema.org for doing it
with types, that might be the best route to follow.

An important point to consider is that a software application, even implemented
in a single executable, may provide multiple types. But assigning multiple
types is not an obstacle, correct me if I'm wrong, so that should be covered already.
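
As a small illustration of the point about multiple types: in JSON-LD the @type of a node may be a single string or a list of strings, so a consumer has to handle both. The Python sketch below shows one way to do that. The helper names as_type_list and provides_interface and the application "example-tool" are hypothetical, and the type names are the ones proposed in this issue rather than existing schema.org types.

```python
def as_type_list(node):
    """Return the node's @type as a list, whether it was a string or a list."""
    types = node.get("@type", [])
    return [types] if isinstance(types, str) else list(types)


def provides_interface(node, interface_type):
    """Check whether a node declares the given (proposed) interface type."""
    return interface_type in as_type_list(node)


# Hypothetical application that is both a command-line tool and a daemon.
multi = {
    "@type": ["CommandLineApplication", "SoftwareDaemon"],
    "name": "example-tool",
}
# An application with a single type, given as a plain string.
single = {"@type": "SoftwareLibrary", "name": "example-lib"}

print(as_type_list(multi))
print(as_type_list(single))
print(provides_interface(multi, "SoftwareDaemon"))
```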

Executable Name

In order to express entry points explicitly, it's important to list the exact
executable names, which are not necessarily identical to the name.
Alternatively, one may argue that schema:identifier suffices for this.

There is already a schema:executableLibraryName property (used in a
documentation context on APIReference). That could be reused for the
proposed SoftwareLibrary. But a more generic executableName would need
to be introduced for the others, and there's no real reason not to use that for
libraries as well. The executableName would be defined as that which is
runnable (within a certain runtimePlatform context); it should not contain
platform-specific extensions like .exe, .so, .dylib, .dll but just
the name portion. For software libraries on platforms like Python it would
correspond to the top-level module name that can be imported.
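
The rule about platform-specific extensions can be sketched in a few lines of Python. This is only an illustration of the proposed definition, not part of any existing tool: the function name executable_name is hypothetical, and the suffix list contains just the four extensions named above.

```python
from pathlib import PurePath

# The platform-specific extensions named in the proposal.
PLATFORM_SUFFIXES = {".exe", ".so", ".dylib", ".dll"}


def executable_name(filename):
    """Derive an executableName: the file name without a platform-specific extension.

    Versioned shared objects such as "libfrog.so.2" are not handled here;
    they would need an extra rule.
    """
    path = PurePath(filename)
    if path.suffix.lower() in PLATFORM_SUFFIXES:
        return path.stem
    return path.name


for candidate in ["frog.exe", "libfrog.so", "libfrog.dylib", "frog-service"]:
    print(candidate, "->", executable_name(candidate))
```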

Such a property may also make sense directly on SoftwareSourceCode,
allowing for a more succinct expression rather than needing to go via
providesApplication and the corresponding SoftwareApplication subtypes.

Example

Consider the following example of a SoftwareSourceCode instance where the
codebase provides various interface types. (This software actually exists
though in reality it's not a single codebase that provides all these interfaces,
it's split into multiple repositories, but it would be conceivable someone does
it like this):

{
    "@type": "SoftwareSourceCode",
    "name": "Frog",
    "codeRepository": "https://github.com/LanguageMachines/frog",
    ...,
    "providesApplication": [
        {
            "type": "CommandLineApplication",
            "executableName": "frog",
            "name": "Frog",
            "runtimePlatform": "Linux"
        },
        {
            "type": "SoftwareLibrary",
            "executableName": "libfrog",
            "name": "Frog Library",
            "runtimePlatform": "Linux"
        },
        {
            "type": "SoftwareLibrary",
            "executableName": "frog",
            "name": "Frog Python Binding",
            "runtimePlatform": "Python"
        },
        {
            "type": "WebAPI",
            "provider": "Radboud Universiteit Nijmegen",
            "endpointUrl": "https://webservices.cls.ru.nl/frog",
            "endpointDescription": "https://webservices.cls.ru.nl/frog",
            "conformsTo": "https://clam.readthedocs.io/en/stable/",
            "documentation": "https://webservices.cls.ru.nl/frog/info",
            "contentType": "application/xml"
        },
        {
            "type": "WebApplication",
            "executableName": "frog-service",
            "provider": "Radboud Universiteit Nijmegen",
            "url": "https://webservices.cls.ru.nl/frog"
        }
    ]
}

Conclusion

I've tried to tie together some existing loose ends in this proposal, reusing
as much of the existing codemeta/schema vocabulary as possible and linking with
other existing proposals, keeping the amount of newly introduced vocabulary to
a minimum.

What this subsequently allows is expressing software metadata from multiple
perspectives. One may start with a codemeta.json and the source code as a
basis and produce a complete tree of software applications and service
instances that are provided by the source code. In a research context, there's
often a single institute bringing a web demo of a certain piece of research
software online. It makes sense to be able to accommodate this
metadata directly in the codemeta.json in the source code root.

Moreover, this enables conversion of entrypoint metadata already present in
e.g. Python setup.py, to codemeta/schema.
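
A rough Python sketch of such a conversion follows, assuming the entry points are given as console_scripts strings of the form "name = module:function", as used in setup.py. The function console_scripts_to_applications and the module paths in the sample input are hypothetical, and CommandLineApplication and executableName are the terms proposed in this issue, not existing schema.org vocabulary.

```python
def console_scripts_to_applications(console_scripts):
    """Turn console_scripts specs ("name = module:function") into application nodes."""
    applications = []
    for spec in console_scripts:
        # The part before "=" is the name of the installed executable.
        name = spec.split("=", 1)[0].strip()
        applications.append({
            "@type": "CommandLineApplication",  # proposed type
            "name": name,
            "executableName": name,  # proposed property
        })
    return applications


# Hypothetical entry points; the module paths are made up for illustration.
scripts = ["frog = frog.cli:main", "frog-service=frog.service:run"]
entry_points = console_scripts_to_applications(scripts)
for application in entry_points:
    print(application)
```

The resulting list could then be attached to the SoftwareSourceCode node under providesApplication, or under targetProduct if that property is reused instead.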

For those who take the other perspective and express metadata as WebAPI or
WebPage or WebApplication first and foremost, this provides the means
to explicitly link it to the source code.

Apologies for the long post but I wanted to make sure to sketch a complete
picture, I'd be appreciative of any feedback. Most of this is probably more for
schema.org than codemeta but I wanted to discuss it here first and see what you
suggest. I'd also like to poke @dgarijo in this because I see he's been doing
some excellent work on formalizing things in the Software Description Ontology
and we have some overlap there (this touches upon #229 and #256).

@proycon thanks for the ping.
I like the proposed categorization of software applications, although adding providesApplication does not convince me because of its overlap with targetProduct. In the end, it would confuse people to have two properties that are so similar. I also realized that there is no definition for this property above, so I don't think I know the difference myself.

I also like the part on entrypoints, which is crucial for usability. Sure, we should not go into describing all methods of an API, because that's what the specs do. But capturing metadata and examples at a basic level is crucial for discovery. In this regard, the work done by CWL is quite nice, although there is a risk of quickly running into a rabbit hole of complexity.

Thanks for your reaction! If we indeed agree that targetProduct refers to the software product created by the source code (as @cboettig wrote in #267), then I agree we can simply use that. There is still ambiguity caused by schema.org's description, "Target Operating System / Product to which the code applies. If applies to several versions, just the product name can be used.", which might lead to an interpretation more akin to runtimePlatform.

@proycon, yes, I think it can be misinterpreted. However, I think this could be clarified with documentation and examples.

What do you think is the best way of moving this forward? I think there are valuable things in this proposal to consider creating a profile for schema.org. Maybe this is too much for the current scope of codemeta.

This issue has been sitting for nearly two weeks without much interaction. Maybe we can create a separate repository and contribute towards defining application types, examples and minimal description for using them? I would like to incorporate some of this in the software description ontology, or the json-ld representation we create for software.

yes, I think it can be misinterpreted. However, I think this could be clarified with documentation and examples.

Agreed, I think reusing and clarifying targetProduct as you suggested is the best way forward.

This issue has been sitting for nearly two weeks without much interaction. Maybe we can create a separate repository and contribute towards defining application types, examples and minimal description for using them? I would like to incorporate some of this in the software description ontology, or the json-ld representation we create for software.

Yes, I understand both codemeta and schema.org are fairly slow moving targets. I hoped to get some input from the wider codemeta community to see if the route I was suggesting was shared by enough people, but I do agree with moving forward as soon as possible and getting some actual definitions, examples and descriptions out there. We have practical applications we (CLARIAH project) want to realize with this in a mere matter of months. In fact, I already made a first attempt/draft today at formalizing some things (see CLARIAH/tool-discovery@f03b425). You're probably more experienced on this than I am so I'm very open to collaborating on this, we could indeed create a separate shared repository. Do you have a suggestion? Do we then go for an issue + pull request at schema.org directly after?

It would also be nice to align this with the (long) anticipated codemeta 3 release.

@proycon regarding logistics:

  • I would create a separate repo (I don't mind where it lives, ideally we can transfer it to codemeta when it's ready for review). Maybe we can call it the "software_types" profile.
  • We should probably create a w3id for the profile (I can do that) to get the right context and content negotiation.
  • The profile should (IMO) only be in JSON-LD + examples + doc (which is how it was done for codemeta).

Regarding the software_types profile:

For each type, I know a few efforts that tackle part of its representation. Hydra for APIs (https://www.hydra-cg.com/spec/latest/core/), function ontology / CWL (or even a little of software description ontology) for command line invocations, etc. The challenge is to avoid reinventing them, and summarize them in something that would be simple enough for a schema.org representation. Ideally, each type should have at least 1 unique property that motivates its addition. And several use cases.

I do not see the PR to schema.org happening unless there is a significant community behind it. But this initial discussion could serve to identify additional gaps that could lead to an addition in codemeta. And, until then, at least it would be helpful for us as an additional profile that we could support if interested (I see myself adding an additional export in somef to support extended types).

Thoughts?

  • I would create a separate repo (I don't mind where it lives, ideally we can transfer it to codemeta when it's ready for review). Maybe we can call it the "software_types" profile.

Sounds good to me! "software_types" covers the idea well so makes sense as a name. I don't care much where it lives either. Perhaps at your KnowledgeCaptureAndDiscovery group since it's close to your core business? I could also put it under CLARIAH alternatively. Transferring it to codemeta eventually would be good if the community agrees.

  • We should probably create a w3id for the profile (I can do that) to get the right context and content negotiation.

Good idea

  • The profile should (IMO) only be in JSON-LD + examples + doc (which is how it was done for codemeta).

Ok, so no RDF schema yet? I was looking a bit at how schema.org was handling things where all of the pending extensions seem to be accompanied by a turtle file.

Regarding the software_types profile:

For each type, I know a few efforts that tackle part of its representation. Hydra for APIs (https://www.hydra-cg.com/spec/latest/core/), function ontology / CWL (or even a little of software description ontology) for command line invocations, etc. The challenge is to avoid reinventing them, and summarize them in something that would be simple enough for a schema.org representation. Ideally, each type should have at least 1 unique property that motivates its addition. And several use cases.

Hydra looks promising. Are there already existing efforts mapping that to/from OpenAPI/Swagger? It'd indeed be good to reuse as much as possible and introduce as little new vocabulary as possible. Simplicity is also a concern; we don't want to force any unnecessary complexity on users.

I do not see the PR to schema.org happening unless there is a significant community behind it. But this initial discussion could serve to identify additional gaps that could lead to an addition in codemeta. And, until then, at least it would be helpful for us as an additional profile that we could support if interested (I see myself adding an additional export in somef to support extended types).

Yes, I was also worried a PR to schema.org might take too long so that sounds like a good approach, having it as an extra vocabulary that people can use. I think it's important it comes with practical implementation quickly, so it can be put to the test immediately. Your suggestion to implement it in somef sounds great; I'll do the same for codemetapy and the new codemeta-harvester which in turn leverages these tools, so then we already cover quite some ground.

@proycon all right, I will create a repo and invite you so we can iterate.

@proycon I created https://github.com/SoftwareUnderstanding/software_types and added you as a maintainer. I do not have much time this week, but I would like to get to this next week. In the meantime, please commit your initial proposal, as I would like to preserve authorship as much as possible.

@dgarijo Thanks! I have committed the initial proposal as it stands, we can continue from there next week!

I have implemented the ideas from this issue in codemetapy and released a major new version today (2.0). It uses targetProduct to link source code to application instances and uses the extra types from https://github.com/SoftwareUnderstanding/software_types to describe their types.

Additionally, I just released two new tools based on codemetapy for which this functionality was needed. I also opened a pull request for their inclusion on the website (codemeta/codemeta.github.io#39):

  • codemeta-harvester - A wrapper around codemetapy and other tools, provides a full automatic conversion pipeline to codemeta
  • codemeta-server - A webservice/webapplication to search and browse codemeta (offers a SPARQL endpoint etc.).

A live demo of this ensemble of codemeta software can currently be found here: https://tools.dev.clariah.nl/ (mind that it's still in development).

Very cool @proycon! Do you annotate the link between target product and source code manually or automatically? I have not seen this captured in repo readmes, so I was wondering how it's done.

@dgarijo Thanks! The link itself is captured manually (as software source code in a source repo by definition can't know exhaustively when/where it is deployed); codemeta-harvester reads a very simple YAML configuration file that points to the software source code and any service instances (see https://github.com/proycon/codemeta-harvester#usage-harvesting-metadata-for-various-projects). The end result is a codemeta file where the targetProduct is filled.
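
The end result described above, a codemeta description whose targetProduct lists the known service instances, can be sketched in Python as follows. This is not codemeta-harvester's implementation or its configuration format: the function add_target_products and the shape of the instances list are assumptions made purely for illustration.

```python
import copy


def add_target_products(codemeta, service_instances):
    """Return a copy of `codemeta` with each service instance appended to targetProduct."""
    result = copy.deepcopy(codemeta)
    products = result.setdefault("targetProduct", [])
    for instance in service_instances:
        products.append({
            "@type": instance["type"],
            "url": instance["url"],
            "provider": instance["provider"],
        })
    return result


# Source code metadata, as it might be harvested from the repository.
source = {
    "@type": "SoftwareSourceCode",
    "name": "Frog",
    "codeRepository": "https://github.com/LanguageMachines/frog",
}
# Manually curated service instances (hypothetical input structure).
instances = [
    {
        "type": "WebApplication",
        "url": "https://webservices.cls.ru.nl/frog",
        "provider": "Radboud Universiteit Nijmegen",
    },
]

described = add_target_products(source, instances)
print(described["targetProduct"])
```

Keeping the instance list outside the repository matches the point made above: the source code itself cannot know where it is deployed, so that knowledge is supplied separately and merged in.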

Hi all, I have added this issue as a discussion point towards the v4.0 release.
I find it useful to have a property that is the @reverse of hasSourceCode.
In v3.0 we will be adding hasSourceCode, and I would like more participants to either back providesApplication or offer a counter-proposal for the semantics of this reverse property.

Thank you @proycon for this proposal.

This issue is tagged v4.0.
We will be reviewing PRs for v4.0 by January 10th 2024.

Let me know if you want to propose something for v4.0.