iesahin / xvc

A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)

`xvc data`

iesahin opened this issue

This is a tracking issue for xvc data and subcommands.

  • The purpose is to add metadata and labels to files and to run queries on them.

  • xvc data label KEY=VALUE <targets>: Add a label to a set of file targets. Each invocation adds a single label, so a file may end up with multiple labels. Labeling may be recorded as an event, like store events. There may be implicit labels, like updated=<timestamp> for the metadata we track with XvcMetadata.

  • xvc data attach <text-file> <targets>: Attach a file containing metadata (or labels, annotations, searchable text) to targets. If the text file is JSON, YAML, or TOML, it can be parsed as labels. It must have a definite structure though. A dictionary in any of these formats is OK.

  • xvc data query --select [path, name, label,...] --from <targets> --where QUERY: This lists the requested information about the targets that satisfy the query.

  • QUERY can be a complex query that supports the AND, OR, and NOT operators, parentheses, ==, and ~= (regex match). For numeric values, we can also add numerical operations.

  • --select and --from are optional in xvc data query. It's possible to write xvc data query QUERY to run a query over all data.

  • xvc data query --name gives the query a name so that it can be used later as a target with --from. E.g., xvc data query --where 'class ~= .*berry' --name berries saves the query, and xvc data query --select name --from berries runs it later.

  • xvc data operations can be run on query results by supplying an additional --query option that runs the query first, e.g. xvc data move --query berries berries/

  • We also need an xvc data/file export <targets> command to copy a set of files outside the workspace. This can be used to create directories that contain subsets of the data.

  • We may also need move, copy, and remove commands in xvc data/file to make subsets of the dataset and update its metadata.

  • xvc data commands will run xvc file operations after running the query. xvc file won't operate on queries or metadata;
    xvc data will. That is the difference between the two commands.
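
The labeling behavior described above (one KEY=VALUE per call, several labels per file over time) could be sketched as follows. This is only an illustration under assumptions: the `Labels` type and the `parse_label` and `label_targets` functions are hypothetical names, not part of xvc, and the in-memory map stands in for whatever store xvc would actually use.

```rust
use std::collections::BTreeMap;

/// Labels attached to one file: key -> value.
/// Hypothetical type for illustration, not an xvc type.
pub type Labels = BTreeMap<String, String>;

/// Parse a single `KEY=VALUE` argument.
/// Splits on the first `=`, so the value itself may contain `=`.
pub fn parse_label(arg: &str) -> Result<(String, String), String> {
    match arg.split_once('=') {
        Some((key, value)) if !key.trim().is_empty() => {
            Ok((key.trim().to_string(), value.trim().to_string()))
        }
        _ => Err(format!("expected KEY=VALUE, got `{arg}`")),
    }
}

/// Add one label to every target path.
/// Assumption: a repeated key overwrites the previous value for that file.
pub fn label_targets(
    store: &mut BTreeMap<String, Labels>,
    arg: &str,
    targets: &[&str],
) -> Result<(), String> {
    let (key, value) = parse_label(arg)?;
    for target in targets {
        store
            .entry(target.to_string())
            .or_default()
            .insert(key.clone(), value.clone());
    }
    Ok(())
}
```

Calling `label_targets` twice with different keys on the same path leaves that path with two labels, which matches the "multiple labels for each file" idea. Whether a repeated key should overwrite or be recorded as a new event is an open design choice that this sketch does not settle.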

xvc data query

  • --select should have some implicit columns. ['path', 'filename', 'created', 'updated', ...]. labels will show all labels.

  • --from should accept files, directories, and globs. It will walk the selected targets similarly to xvc file list, and run the query on these elements.

  • --where accepts a single query. The query language can be:

    • JMESPath with its Rust implementation. https://docs.rs/jmespath/latest/jmespath/
    • JSONPath
    • jaq can be embedded as a query language. In this case, JSON documents can be more complex, but I'm not sure if this complexity is necessary.
    • A simple homemade query language similar to Bash / Python or something.
      • labels IN [strawberry, blueberry]
      • created < 2020-12-12 12:12:12
      • changed > 2022-10-31 00:00:00
      • name ~= .*berry
      • path *= images/*/*berry.png
      • (labels ~= .*berry) AND (created >= 2020-12-12 00:00:00)
        I think this last option may be more flexible. It can contain queries in SQL, JQL, or some other query language with other operators.
      • (attached JAQ '.[][key] == value')
        An MVP version can be built with field OP value, and other features (parentheses, logical operators, etc.) can be added later.
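
A minimal sketch of the field OP value MVP mentioned above might look like the following. All names here (`Op`, `Condition`, `parse_condition`, `eval_condition`) are hypothetical, and the sketch makes several assumptions: the three parts are separated by whitespace, every field value is stored as a string, and comparison is lexicographic. That works for timestamps written as YYYY-MM-DD HH:MM:SS, but not for plain numbers. The ~= and *= operators are left out because they would need a regex or glob crate.

```rust
use std::collections::BTreeMap;

/// Comparison operators supported by this MVP sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op {
    Eq,
    Ne,
    Lt,
    Le,
    Gt,
    Ge,
}

/// One parsed `field OP value` condition.
#[derive(Debug, PartialEq)]
pub struct Condition {
    pub field: String,
    pub op: Op,
    pub value: String,
}

/// Parse `field OP value`. The field and operator are single
/// whitespace-delimited tokens; everything after the operator is the value,
/// so values such as `2020-12-12 00:00:00` may contain spaces.
pub fn parse_condition(query: &str) -> Option<Condition> {
    let (field, rest) = query.trim().split_once(char::is_whitespace)?;
    let (op, value) = rest.trim_start().split_once(char::is_whitespace)?;
    let op = match op {
        "==" => Op::Eq,
        "!=" => Op::Ne,
        "<" => Op::Lt,
        "<=" => Op::Le,
        ">" => Op::Gt,
        ">=" => Op::Ge,
        _ => return None, // unsupported operator, e.g. `~=`
    };
    let value = value.trim();
    if value.is_empty() {
        return None;
    }
    Some(Condition {
        field: field.to_string(),
        op,
        value: value.to_string(),
    })
}

/// Evaluate a condition against one record (field name -> string value).
/// A record without the field never matches.
pub fn eval_condition(cond: &Condition, record: &BTreeMap<String, String>) -> bool {
    let actual = match record.get(&cond.field) {
        Some(v) => v,
        None => return false,
    };
    // Lexicographic string comparison; see the assumptions in the lead-in.
    let ord = actual.as_str().cmp(cond.value.as_str());
    match cond.op {
        Op::Eq => ord.is_eq(),
        Op::Ne => ord.is_ne(),
        Op::Lt => ord.is_lt(),
        Op::Le => ord.is_le(),
        Op::Gt => ord.is_gt(),
        Op::Ge => ord.is_ge(),
    }
}
```

Parentheses and AND / OR / NOT could later be layered on top by combining `Condition` values in an expression tree, without changing how a single condition is evaluated.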