kellyjonbrazil / jello

CLI tool to filter JSON and JSON Lines data with Python syntax. (Similar to jq)

Feature request: support for other input formats (like raw string, yaml, csv,...)

g-v-egidy opened this issue

Currently jello can only read JSON and JSON Lines.

Sometimes it would be helpful for me to be optionally able to directly read other formats too. For example yaml, toml, csv, raw strings,...

For many of these there are cli converters available, but sometimes these converters lack options or have complicated dependencies. So it would be nice to have that option integrated in jello.

Suggested option: -p <input format> Pipe input format name.

[Edit:] maybe better -F <input format> because "Pipe" suggests that it would just work for stdin. But it should of course also work for reading from a file with -f.

This could maybe be implemented using python-benedict as suggested in #62. benedict already has importers for many formats that would make sense in this context: https://github.com/fabiocaccamo/python-benedict#io-methods benedict uses several libraries for this, and jello could of course also call those libraries directly without going through benedict.

If you don't like this idea because the preferred way is to use command-line converters piped in front of jello, ok. But then please at least consider adding a raw string input mode that stores the whole pipe input in _ as a string. That would allow the user to write Python code to further parse/convert the input.

Interesting idea. So this does change jello from being only a JSON manipulation tool to something more general. I do need to think about this a bit. I'll check out python-benedict. I think I looked at it in the past but I haven't taken a look in a while.

I have added the -R (raw input) option in dev which allows you to skip the string-to-dict/list conversion so _ is just a raw string. That way you can import benedict or anything else to manipulate the data. You can add the import to your initialization file so it's always available via -i and you won't need to import in the query.

Let me know if you are able to test and this works for you. I think this is a happy medium so we don't add too many dependencies into jello but it still allows the flexibility of using other libraries.

Here is an example of loading YAML data into jello:

% jello -R 'import yaml;_ = yaml.safe_load(_)' -f values.yaml 
{
  "var1": "value1",
  "var2": "value2",
  "text": "Here a text\nthat i like to write like this on multiple line.\nIt will be an HTML text so i’ll add <br> for line return.\nand here i finish"
}
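
The same pattern should work for formats that have no ready-made library. Below is a minimal sketch of a helper that turns simple key=value text into a dict. The function name parse_key_values is hypothetical and not part of jello, and the raw input is simulated with a plain string so the snippet runs on its own:

```python
def parse_key_values(raw):
    """Turn 'key=value' lines into a dict, skipping blanks and # comments."""
    result = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # no '=' on this line, so there is nothing to store
        result[key.strip()] = value.strip()
    return result


# Stand-in for the raw string that -R would place in _ (assumption for this demo).
raw_input_text = "var1=value1\n# a comment\nvar2 = value2\n"
parsed = parse_key_values(raw_input_text)
print(parsed)
```

If a helper like this were defined in the initialization file, the query itself could presumably shrink to something like _ = parse_key_values(_).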

So it sounds to me like the issue for you is mostly dependencies, not the scope of jello or similar, correct?

I also consider dependencies an important issue: better to have jello running as it is now than not being able to run a fancier jello at all because some library is not packaged for your distribution.

But how about using optional dependencies then?

  • drop the idea of using benedict for this, jello would directly use the required libraries when necessary
  • add the -F <input format> option
  • the python module required for parsing a format is only imported when the data format is actually used. So when using just json no other libraries are imported by the code at all, so no issue if the libraries are not installed.

This is what it could look like (without error handling and niceties for brevity):

def load_yaml(data):
    import yaml
    yaml_dict = yaml.safe_load(data)
    return yaml_dict

def load_toml(data):
    import toml
    toml_dict = toml.loads(data)
    return toml_dict

[...]

    if opts.format == "yaml":
        data = load_yaml(data)
    elif opts.format == "toml":
        data = load_toml(data)
    elif opts.format == "raw":
        data = str(data).rstrip('\r\n')
    else:
        data = load_json(data)

The advantage would be that the jello query will contain less boilerplate for importing and converting.
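
For illustration, the snippet above could be fleshed out into something runnable along these lines. This is only a sketch under assumptions: the names LOADERS and load_input are made up here, load_json is a plain json.loads stand-in that does not handle JSON Lines, and unlike the if/elif chain above an unknown format raises an error instead of falling back to JSON:

```python
import json


def load_json(text):
    # Simplified stand-in; jello's real JSON / JSON Lines handling is not shown here.
    return json.loads(text)


def load_yaml(text):
    import yaml  # PyYAML, imported only when YAML input is requested
    return yaml.safe_load(text)


def load_toml(text):
    import toml  # third-party package, imported only when TOML input is requested
    return toml.loads(text)


def load_raw(text):
    return str(text).rstrip("\r\n")


# Hypothetical table mapping the -F value to its loader.
LOADERS = {"json": load_json, "yaml": load_yaml, "toml": load_toml, "raw": load_raw}


def load_input(text, fmt="json"):
    try:
        loader = LOADERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported input format: {fmt!r}") from None
    try:
        return loader(text)
    except ImportError as exc:
        # The optional library is missing; report which one instead of a bare traceback.
        raise RuntimeError(
            f"input format {fmt!r} needs the optional module {exc.name!r}"
        ) from exc


print(load_input('{"a": 1}'))
print(repr(load_input("hello\r\n", "raw")))
```

With only the standard library installed, the json and raw formats keep working, and asking for yaml or toml fails with a message naming the missing module.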

If you think this is a good alternative I can offer to implement this and raise a pull request for you to review.

This could work. I was a little concerned with the standard library's YAML support. That's why I went with ruamel.yaml (https://pypi.org/project/ruamel.yaml/) instead in jc. Also, TOML is not in the standard library until Python 3.11, though we could vendor in the tomli library (https://pypi.org/project/tomli/) like I did for jc. Python's CSV support is pretty good, so that could be added pretty easily.

But then we are getting into a gray area where we are re-implementing the functionality of jc. The more "unixy" way would be to do something like this:

$ cat file.yaml | jc --yaml | jello

So as you can see, I feel a bit conflicted about this. :)

I thought that if it were just 3 to 5 lines of Python code for each of the load_* functions, the code duplication with jc would be ok.

But I only took a short look at the libraries and haven't yet done a deeper investigation into things like Python version compatibility, licenses and similar. You have done this for jc. And things like vendoring in a TOML library do sound like a lot more code duplication and additional long-term maintenance than just 5 lines of code.

So I think in this case it is better to pipe things through jc then.

Thank you for implementing the raw string mode and discussing these options with me.