`xvc data`
iesahin opened this issue · comments
This is a tracking issue for `xvc data` and subcommands.
- The goal is to add metadata and labels to files and to run queries on them.
- `xvc data label KEY=VALUE <targets>`: Add a label to a set of file targets. This adds a single label, but a file may end up with multiple labels. Labeling may be an event like store events. There may be implicit labels, like `updated=<timestamp>`, for the metadata we track with `XvcMetadata`.
- `xvc data attach <text-file> <targets>`: Attach a file containing metadata (or labels, annotations, searchable text) to targets. If the text file is JSON, YAML, or TOML, it can be parsed as `label`s. It must have a definite structure though. A dictionary in any of these formats is OK.
- `xvc data query --select [path, name, label,...] --from <targets> --where QUERY`: List the requested info about the targets that satisfy the query.
- `QUERY` can be a complex query that supports the `AND`, `OR`, `NOT` operators, `()`, and the `==`, `=~` (regex match) operators. For numerics, we can also add numerical operations.
- `--select` and `--from` are optional in `xvc data query`. It's possible to write `xvc data query QUERY` to run a query over all data.
- `xvc data query --name` can be used to give a name to the query and run it later as a target with `--from`, e.g. `xvc data query --where 'class ~= .*berry' --name berries` can be used later as `xvc data query --select name --from berries`.
- `xvc data` operations can be run on the data selected by a query by supplying an additional `--query` option that runs the query, e.g. `xvc data move --query berries berries/`.
- We also need a `xvc data/file export <targets>` command to copy a set of files outside of the workspace. This can be used to create directories that contain subsets of data.
- We may also need `move`, `copy`, `remove` commands in `xvc data/file` to make subsets of the dataset and update its metadata.
- `xvc data` commands will run `xvc file` operations after running the query. The difference between the two is that `xvc file` won't operate with queries or metadata, while `xvc data` will.
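To make the `label` and `attach` ideas above concrete, here is a minimal sketch in Python, not a proposal for the actual implementation. `LabelStore`, `parse_label`, and `labels_from_json` are hypothetical names invented for illustration, the store is in-memory only, and only the JSON case of `attach` is shown.

```python
import json


def parse_label(arg: str) -> tuple[str, str]:
    """Split a KEY=VALUE argument at the first '='."""
    key, sep, value = arg.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {arg!r}")
    return key, value


def labels_from_json(text: str) -> dict:
    """Parse an attached JSON document; only a top-level dictionary is accepted."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("attached metadata must be a dictionary")
    return doc


class LabelStore:
    """Hypothetical in-memory store mapping each target path to its labels."""

    def __init__(self) -> None:
        self.labels: dict[str, dict] = {}

    def label(self, arg: str, targets: list[str]) -> None:
        # Mirrors `xvc data label KEY=VALUE <targets>`: one label, many targets.
        key, value = parse_label(arg)
        for target in targets:
            self.labels.setdefault(target, {})[key] = value

    def attach(self, text: str, targets: list[str]) -> None:
        # Mirrors `xvc data attach <text-file> <targets>` for a JSON dictionary:
        # every top-level key becomes a label on each target.
        parsed = labels_from_json(text)
        for target in targets:
            self.labels.setdefault(target, {}).update(parsed)
```

Calling `label("class=strawberry", ["a.png", "b.png"])` and then `attach('{"split": "train"}', ["a.png"])` would leave `a.png` with both labels and `b.png` with only `class`. Rejecting anything other than a top-level dictionary is one way to read the "definite structure" requirement.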
`xvc data query`

- `--select` should have some implicit columns: `['path', 'filename', 'created', 'updated', ...]`. `labels` will show all labels.
- `--from` should accept files, directories, and globs. It will walk and select targets similarly to `xvc file list`, and run the query on these elements.
- `--where` accepts a single query. The query language can be:
  - JMESPath with its Rust implementation. https://docs.rs/jmespath/latest/jmespath/
  - JSONPath
  - jaq can be embedded as a query language. In this case, JSON documents can be more complex, but I'm not sure if this complexity is necessary.
  - A simple homemade query language similar to Bash / Python or something:
    - `labels IN [strawberry, blueberry]`
    - `created < 2020-12-12 12:12:12`
    - `changed > 2022-10-31 00:00:00`
    - `name ~= .*berry`
    - `path *= images/*/*berry.png`
    - `(labels ~= .*berry) AND (created >= 2020-12-12 00:00:00)`

I think this last option may be more flexible. It can contain queries in SQL or JQL or some other QL with other operators, e.g. `(attached JAQ '.[][key] == value')`. An MVP version can be built with `field OP value`, and other features (parens, logical ops, etc.) can be added later.
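As a rough illustration of the `field OP value` MVP, the following Python sketch evaluates a single clause against one record. Everything here is an assumption made for the example: the `matches` function, the record layout (a plain dictionary with `name`, `path`, `labels`, `created`), full-match semantics for `~=`, and shell-style globbing for `*=` where `*` also matches `/`. Parentheses and `AND`/`OR`/`NOT` are deliberately left out.

```python
import operator
import re
from datetime import datetime
from fnmatch import fnmatchcase

# Two-character operators come first so ">=" is not read as ">".
_CLAUSE = re.compile(r"^\s*(\w+)\s*(~=|\*=|==|>=|<=|<|>|\bIN\b)\s*(.+?)\s*$")
_COMPARE = {
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def _coerce(text):
    """Interpret a value as a timestamp or a number when possible, else a string."""
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text


def matches(record: dict, clause: str) -> bool:
    """Evaluate one `field OP value` clause against a record (hypothetical layout)."""
    m = _CLAUSE.match(clause)
    if m is None:
        raise ValueError(f"cannot parse clause: {clause!r}")
    field, op, raw = m.groups()
    value = record.get(field)
    if value is None:
        return False  # a missing field never matches
    # Multi-valued fields such as labels match if any single value matches.
    values = value if isinstance(value, list) else [value]
    if op == "IN":
        wanted = {item.strip() for item in raw.strip("[]").split(",")}
        return any(v in wanted for v in values)
    if op == "~=":
        return any(re.fullmatch(raw, str(v)) is not None for v in values)
    if op == "*=":
        return any(fnmatchcase(str(v), raw) for v in values)
    target = _coerce(raw)
    return any(_COMPARE[op](v, target) for v in values)
```

With a record such as `{"name": "strawberry", "labels": ["strawberry", "red"]}`, both `matches(record, "name ~= .*berry")` and `matches(record, "labels IN [strawberry, blueberry]")` hold. Keeping each clause a standalone predicate means logical operators could later be layered on top by combining the boolean results.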