CrossR / rLog

A CLI tool for logging scientific workflows.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

rLog

Build Status

rLog is a tool for logging the output of commands for scientific workflows.

I find myself very often running some long-running command to produce simulation output or similar, where the output depends very heavily on the current state of my environment.

Having the automated ability to run a set of commands, and store their output means that I can log out not only the result of a command, but the state of the environment at the time of the simulation.

For example:

  • echo out environment variables to check library versions.
  • git commands to get the current commit and diff of the repository.
  • Dumping out the current contents of a configuration file.
  • Status of input files.

Configuration

Configuration is achieved with a config.json in ~/.config/rLog/:

{
    "outputPath": "~/rLogOut",
    "logCommands": [],
    "linkCommands": [],
    "loadLocalCommands": false,
    "valuesToLog": ["@LOG@: "]
}

Where:

  • outputPath is the folder log files will be stored in.
  • logCommands is the additional logging commands to run.
  • linkCommands is the commands to run when calling rLog link.
  • localLocalCommands enables the ability to use a per project config file. This will cause rLog to load a local config.json, if there is one in the current directory. It will also attempt to get one from the git root, if one exists. Any values in the logCommands and valuesToLog will be combined, but the outputPath will be overridden. The override order is Current directory > Project > Global, so the current folder config has the highest priority.
  • valuesToLog is a list of strings that appear at the start of output that is especially important to log. For example echo "@LOG@: Cosmics disabled" will cause Cosmics disabled to appear in the metadata log file separately, so it is more easily found. These values can be printed in anyway, as long as they appear in the terminal stdout/stderr. This must be explicitly enabled by passing -p or --parse to the CLI, since it can slow down commands considerably.

Command variables

Both logCommands and linkCommands can contain variables which will dynamically be replaced.

Variable Replacement Command
${CWD} The directory rLog was opened in. Both
${GITROOT} The git root of the ${CWD} if applicable. If not, this command will not run. Both
${OUTPUTFILE} The passed output file(s). If multiple files are given, the command is invoked multiple times. Link
${OUTPUTFILES} The full list of passed files. Link
${COMMAND} The full passed command. Run

Example Config

{
    "outputPath": "~/rLogOut",
    "logCommands": [
       "git -C ~/git/simulator status",
       "git -C ~/git/simulator diff",
       "git -C ${GITROOT} status",
       "git -C ${GITROOT} diff",
       "env",
       "bash ~/scripts/parseConfigFiles.sh ${COMMAND}"
    ],
    "linkCommands": [
       "cksum ${OUTPUTFILE}",
       "bash ~/scripts/parseOutputFile.sh ${OUTPUTFILE}",
       "echo ${OUTPUTFILES} >> ~/logs/totalOutputFiles.txt"
    ],
    "loadLocalCommands": false,
    "valuesToLog": ["@LOG@: "]
}

In this example, when I use rLog I'll get the current status of my simulator, as well as the state of the git repository, both of which can play a big part in why an output looks the way it does.

The environment is fully logged out, to pick up the library versions and such that are setup with environment variables, and a script is ran on the full command that will parse the config file out and store the status of the config file that was used.

After running, the files are passed through a quick checksum, parsed further and also stored in a secondary log file.

Usage

Usage should just be rLog -- commandToRun, to run commandToRun and have it logged to the location outlined in the config file, with the logCommands ran to produce metadata for that command.

Running rLog --help will show the command line flags and their usage.

rLog link -- output1.txt can be used to link to the most recent run. Link in this context will create a symlink to a new output file, which will run any of the linkCommands, as well as giving an easy way of accessing the command output and metadata output.

rLog genconfig will generate a default config file in ~/.config/rLog (or in a custom location if --config-path is given.) This can be useful for making multiple configs, so that aliases can be setup for certain workflows. That is, using something like alias pLog=rLog --config-path ~/configs/pConfig.json means pLog can now be used to automatically use a project specific config, if a project spans multiple repositories, so using a project local config is not an option.

ToDo

Stuff to do:

  • Make a release, so I have a binary.
  • Return the ran commands status code.
  • Implement the project specific config.
  • Add command to get a link to the most recent config (so then I can do do_simulation.sh and then rLog link and have a link to the log file in the data location too).
  • Add a variable syntax to the commands (so I can define dynamic commands, that depend on the env/command ran, not just static ones).
  • Add a search/list command for listing the ran commands.
  • Check if I can get this running on Windows (not a priority since none of my workflow that benefits from this is on Windows...).

Thanks

Thanks to both

For their existing CLI apps that I used heavily to help fix things I couldn't get working! Especially around making a static binary.

About

A CLI tool for logging scientific workflows.


Languages

Language:OCaml 86.3%Language:C++ 6.6%Language:Shell 6.2%Language:Dockerfile 0.8%