Oxen
Oxen helps you version your machine learning datasets like you version your code.
Versioning datasets can be slow and painful
Built from the ground up for speed, Oxen is 10-100x faster than using git or other tooling built on top of git.
The shift to Software 2.0 is underway: lines of code are being replaced with machine learning models and large datasets. Software is already complex without machine learning in the mix. We need better tooling to keep track of changes as data and models evolve over time.
Features
- Fast (10-100x faster than existing tools)
- Easy to learn (same commands as git)
- Index lots of files (millions of images? no problem)
- Handles large files (images, videos, audio, text, parquet, arrow, json, models, etc.)
- Native DataFrame processing (oxen df command for data exploration)
- Tracks changes over time (never worry about losing the state of your data)
- Collaborate with your team (sync to an oxen-server)
- Better data visualization on OxenHub
Sign up here for more information and to stay updated on the progress.
Why the name Oxen?
"Oxen"
Overview
No need to learn a new paradigm.
The Oxen Command Line Interface (CLI) mirrors git in many ways, so if you are comfortable versioning code with git, it will be straightforward to version your datasets with Oxen.
Watch as we commit hundreds of thousands of images to an Oxen repository in a matter of seconds
Installation
For Mac Users
$ brew tap Oxen-AI/oxen
$ brew install oxen
For other platforms follow the installation instructions.
Basic Commands
Here is a quick overview of common commands translated to Oxen.
Setup User
For your commit history, you will have to set up your local Oxen user name and email. This is what will show up in oxen log or in the OxenHub dashboard for who changed what.
$ oxen config --name "YOUR_NAME" --email "YOUR_EMAIL"
Create Local Repository
First, create a new directory, navigate into it, and run
$ oxen init
Stage Data
You can stage the changes you want to commit with the oxen add command, giving it a full file path or directory.
$ oxen add images/
View Status
To see what data is tracked, staged, or not yet added to the repository you can use the status command.
Note: since we are dealing with large datasets with many files, status rolls up the changes and summarizes them for you.
$ oxen status
On branch main -> e76dd52a4fc13a6f
Directories to be committed
added: images with added 8108 files
Files to be committed:
new file: images/000000000042.jpg
new file: images/000000000074.jpg
new file: images/000000000109.jpg
new file: images/000000000307.jpg
new file: images/000000000309.jpg
new file: images/000000000394.jpg
new file: images/000000000400.jpg
new file: images/000000000443.jpg
new file: images/000000000490.jpg
new file: images/000000000575.jpg
... and 8098 others
Untracked Directories
(use "oxen add <dir>..." to update what will be committed)
annotations/ (3 items)
You can always paginate through the changes with the -s (skip) and -l (limit) params on the status command. Run oxen status --help for more info.
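The roll-up and pagination described above can be pictured with a small Python sketch. This is not Oxen's implementation: the helper name summarize_staged and its skip and limit parameters are hypothetical, chosen only to mirror the -s and -l flags.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize_staged(paths, skip=0, limit=10):
    """Roll staged file paths up into per-directory counts plus one page of files.

    Hypothetical helper for illustration only; not part of Oxen.
    """
    ordered = sorted(paths)
    # Count each file under its immediate parent directory.
    per_dir = Counter(str(PurePosixPath(p).parent) for p in ordered)
    # One "page" of individual files, like skip/limit on a status listing.
    page = ordered[skip:skip + limit]
    remaining = max(len(ordered) - skip - len(page), 0)
    return dict(per_dir), page, remaining


# Made-up file names that follow the numbering pattern in the listing above.
staged = [f"images/{i:012d}.jpg" for i in range(8108)]
dirs, page, remaining = summarize_staged(staged, skip=0, limit=10)
print(dirs)        # per-directory totals
print(len(page))   # number of files shown on this page
print(remaining)   # files not shown on this page
```

Summarizing per directory keeps the output readable when a single add touches thousands of files, while skip and limit still let you inspect individual entries.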
Commit Changes
To commit the changes that are staged with a message you can use
$ oxen commit -m "Some informative commit message"
Log
You can see the history of changes on your current branch by running:
$ oxen log
commit 6b958e268656b0c5
Author: Ox
Date: Fri, 21 Oct 2022 16:08:39 -0700
adding 10,000 training images
commit e76dd52a4fc13a6f
Author: Ox
Date: Fri, 21 Oct 2022 16:05:22 -0700
Initialized Repo
Reverting To Commit
If you ever want to change your working directory to a point in your commit history, supply the commit id from your history to the checkout command.
$ oxen checkout COMMIT_ID
Restore Working Directory
The restore command comes in handy if you made some changes locally and want to revert them. For example, use it if you accidentally delete, modify, or stage a file that you did not intend to.
$ oxen restore path/to/file.txt
Restore defaults to restoring the files to the current HEAD. For more detailed options, as well as how to unstage files, refer to the restore documentation.
Advanced Features
Oxen has many more advanced features such as computing diffs between tabular data as well as convenient DataFrame manipulation through the oxen df command.
Feel free to skip down to the more advanced features.
Sharing Data and Collaboration
There are two ways you can collaborate on your data with Oxen.
- Using the OxenHub platform
- Self-hosting using the oxen-server binary
The easiest route is to sign up for an account on OxenHub and sync your data to a repository there.
Create an account
Visit https://www.oxen.ai/register to register
Your Repositories
From your home page, you can view your repositories and create a new repository.
Setup Authorization
You will notice on the side panel you have access to your API Key. In order to push data to your repository you will need to copy this key and set it up in your user config. This saves your key in ~/.oxen/user_config.toml with one key per host if you ever need to push to multiple hosts.
$ oxen config --auth hub.oxen.ai YOUR_API_KEY
$ cat ~/.oxen/user_config.toml
Create Remote Repository
Pick a name and give your repository a description. Repositories can be public for anyone to view, or private just for you and your company.
Push your data
Once you have created a repository, you will see a URL you can push your data to in the format https://hub.oxen.ai/<username>/<repo_name>
From the data repository that you created above you can simply add the remote and push.
$ oxen set-remote origin https://hub.oxen.ai/<username>/<repo_name>
$ oxen push origin main
Now you can set up your training job or another collaborator on your team to use your data by cloning it and pulling the branch you want.
$ oxen clone https://hub.oxen.ai/<username>/<repo_name>
$ cd <repo_name>
$ oxen pull origin main
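For a training job, the clone and pull steps above could be wrapped in a small Python helper. This is a sketch under assumptions: fetch_dataset is a hypothetical name, the URL in the dry run is a placeholder, and the helper only shells out to the two commands shown above.

```python
import subprocess
from urllib.parse import urlparse


def fetch_dataset(repo_url, branch="main", run=subprocess.run):
    """Clone an Oxen repository and pull one branch, using the commands shown above.

    `run` is injectable so the steps can be dry-run without Oxen installed.
    Hypothetical helper for illustration only.
    """
    # The steps above cd into <repo_name>, the last segment of the URL path.
    repo_dir = urlparse(repo_url).path.rstrip("/").split("/")[-1]
    run(["oxen", "clone", repo_url], check=True)
    run(["oxen", "pull", "origin", branch], cwd=repo_dir, check=True)
    return repo_dir


# Dry run: record the commands instead of executing them.
calls = []
repo_dir = fetch_dataset(
    "https://hub.oxen.ai/some_user/some_repo",  # placeholder URL
    run=lambda cmd, **kwargs: calls.append((cmd, kwargs.get("cwd"))),
)
print(repo_dir)
for cmd, cwd in calls:
    print(cwd, " ".join(cmd))
```

Passing the default run (subprocess.run) executes the real commands, and check=True makes a failed clone or pull raise instead of silently continuing.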
Self Hosting
Oxen enables self-hosting with the oxen-server binary. You do not get any of the UI features of the hub, but this is a nice option to kick the tires or set up internal infrastructure. Some teams set up a server instance in their local network and use it simply as backup and version control; others set it up in the cloud to enable sharing across data centers.
You can read more about self-hosting here.
Diving Deeper
Data Point Level Version Control
Oxen is smart about what file types you are adding. For example, if you track a tabular data file (with an extension .csv, .tsv, .parquet, .arrow, .jsonl, or .ndjson) Oxen will index and keep track of each row of data.
$ oxen add annotations/train.csv
$ oxen commit -m "adding rows and rows of data"
Under the hood, Oxen will detect the data schema and hash every row of content. This allows us to build a content addressable DataFrame to track the changes to the rows and columns over time. To learn more about the power of indexing DataFrames check out the data point level version control documentation.
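To make the idea of hashing every row concrete, here is a minimal Python sketch. It is not Oxen's code: the choice of SHA-256, the field separator, and the helper name hash_rows are all assumptions made for illustration.

```python
import csv
import hashlib
import io


def hash_rows(csv_text):
    """Return (schema, {row_hash: row}) for a CSV string.

    Illustrative only: SHA-256 and the unit-separator join are assumptions,
    not Oxen's actual hashing scheme.
    """
    reader = csv.reader(io.StringIO(csv_text))
    schema = next(reader)  # the first line holds the column names
    index = {}
    for row in reader:
        # Join fields with a control character that is unlikely to occur in data.
        digest = hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()
        index[digest] = row
    return schema, index


# Made-up sample data with one duplicated row.
data = "file,label\nimages/a.jpg,cat\nimages/b.jpg,dog\nimages/a.jpg,cat\n"
schema, index = hash_rows(data)
print(schema)      # column names
print(len(index))  # identical rows share one hash, so duplicates collapse
```

Because each row is keyed by its content, two versions of a file can be compared by comparing sets of hashes rather than re-reading every value.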
Oxen also has some handy command line tooling for Exploratory Data Analysis with DataFrames. The oxen df command lets you easily view, slice, and modify the data.
$ oxen df annotations/train.csv
shape: (10000, 6)
+-------------------------+-------+--------+--------+--------+--------+
| file                    | label | min_x  | min_y  | width  | height |
| ---                     | ---   | ---    | ---    | ---    | ---    |
| str                     | str   | f64    | f64    | f64    | f64    |
+-------------------------+-------+--------+--------+--------+--------+
| images/000000128154.jpg | cat   | 0.0    | 19.27  | 130.79 | 129.58 |
| images/000000544590.jpg | cat   | 9.75   | 13.49  | 214.25 | 188.35 |
| images/000000000581.jpg | dog   | 49.37  | 67.79  | 74.29  | 116.08 |
| images/000000236841.jpg | cat   | 115.21 | 96.65  | 93.87  | 42.29  |
| ...                     | ...   | ...    | ...    | ...    | ...    |
| images/000000257301.jpg | dog   | 84.85  | 161.09 | 33.1   | 51.26  |
| images/000000130399.jpg | dog   | 51.63  | 157.14 | 53.13  | 29.75  |
| images/000000215471.jpg | cat   | 126.18 | 71.95  | 36.19  | 47.81  |
| images/000000251246.jpg | cat   | 58.23  | 13.27  | 90.79  | 97.32  |
+-------------------------+-------+--------+--------+--------+--------+
To learn more about what you can do with tabular data in Oxen you can reference the documentation here
Integrating Labeling Tools
For most supervised learning projects you will have some sort of annotation or labeling workflow. There are some popular open-source tools such as Label Studio for labeling data that can integrate with an Oxen workflow.
For an example of integrating Oxen into your Label Studio workflow, check out our Oxen Annotation Documentation.
Diff
If you want to see the differences between your file and the conflicting file, you can use the oxen diff command.
Oxen knows how to compare text files as well as tabular data between commits. Currently, you must specify the path to the file whose changes you want to compare.
If the file is tabular data, oxen diff will show you the rows that were added or removed.
$ oxen df annotations/data.csv --add_row 'images/my_cat.jpg,cat,0,0,0,0' -o annotations/data.csv
$ oxen diff annotations/data.csv
Added Rows
+-------------------+-------+-------+-------+-------+--------+
| file              | label | min_x | min_y | width | height |
+-------------------+-------+-------+-------+-------+--------+
| images/my_cat.jpg | cat   | 0     | 0     | 0     | 0      |
+-------------------+-------+-------+-------+-------+--------+
1 Rows x 6 Columns
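The added and removed rows reported above can be thought of as a set difference over row values. The following Python sketch shows that idea only; diff_rows is a hypothetical helper, not how Oxen computes its diff.

```python
def diff_rows(old_rows, new_rows):
    """Return (added, removed) rows, comparing whole rows by value.

    A sketch of the idea only; not Oxen's implementation.
    """
    old_set = {tuple(r) for r in old_rows}
    new_set = {tuple(r) for r in new_rows}
    # Preserve the original ordering of rows in each result list.
    added = [r for r in new_rows if tuple(r) not in old_set]
    removed = [r for r in old_rows if tuple(r) not in new_set]
    return added, removed


# The first row is made up; the second matches the row appended above.
old = [["images/a.jpg", "cat", "0", "0", "10", "10"]]
new = old + [["images/my_cat.jpg", "cat", "0", "0", "0", "0"]]
added, removed = diff_rows(old, new)
print(added)
print(removed)
```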
If the tabular data schema has changed, oxen diff will flag it and show you the columns that were added.
$ oxen df annotations/data.csv --add_col 'is_fluffy:unknown:str' -o annotations/data.csv
$ oxen diff annotations/data.csv
Added Cols
shape: (10001, 1)
+-----------+
| is_fluffy |
| ---       |
| str       |
+-----------+
| unknown   |
| unknown   |
| unknown   |
| unknown   |
| ...       |
| unknown   |
| unknown   |
| unknown   |
| unknown   |
+-----------+
Schema has changed
Old
+------+-------+-------+-------+-------+--------+
| file | label | min_x | min_y | width | height |
| --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 |
+------+-------+-------+-------+-------+--------+
Current
+------+-------+-------+-------+-------+--------+-----------+
| file | label | min_x | min_y | width | height | is_fluffy |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 | str |
+------+-------+-------+-------+-------+--------+-----------+
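A schema change like the one above amounts to comparing two mappings from column name to type. Here is a minimal Python sketch, assuming a schema is represented as a plain dict; diff_schema is a hypothetical helper, not part of Oxen.

```python
def diff_schema(old_schema, new_schema):
    """Compare two {column: dtype} mappings.

    Returns (added, removed, retyped). Hypothetical helper for illustration.
    """
    added = {c: t for c, t in new_schema.items() if c not in old_schema}
    removed = {c: t for c, t in old_schema.items() if c not in new_schema}
    # Columns present in both schemas whose type differs.
    retyped = {
        c: (old_schema[c], new_schema[c])
        for c in old_schema
        if c in new_schema and old_schema[c] != new_schema[c]
    }
    return added, removed, retyped


old = {"file": "str", "label": "str", "min_x": "f64", "min_y": "f64",
       "width": "f64", "height": "f64"}
new = dict(old, is_fluffy="str")
added, removed, retyped = diff_schema(old, new)
print(added)
print(removed)
print(retyped)
```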
If the file is any other type of text data, it will simply show you the added and removed lines.
$ oxen diff path/to/file.txt
i
+here
am a text file that
+I am modifying
-la-dee-da
+la-doo-da
+another line
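A line-based diff in this style can be reproduced with Python's standard difflib module. This sketch shows the general technique, not Oxen's diff algorithm; the sample strings are made up and line_diff is a hypothetical helper.

```python
import difflib


def line_diff(old_text, new_text):
    """Return the lines of a diff, each prefixed with '+', '-', or ' '."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    out = []
    for line in difflib.ndiff(old_lines, new_lines):
        # ndiff also emits '? ' hint lines for similar lines; skip them.
        if line.startswith("? "):
            continue
        # ndiff prefixes are two characters wide; keep only the marker.
        out.append(line[0] + line[2:])
    return out


old = "am a text file that\nla-dee-da\n"
new = "am a text file that\nI am modifying\nla-doo-da\n"
for line in line_diff(old, new):
    print(line)
```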
Branching
Branches are used to augment the dataset and run experiments with different subsets, transformations, or extensions of the data. The main branch is the default branch when you start an Oxen repository. Use different branches while you run your experiments, and when you are confident in a dataset, merge it back into the main branch.
You can create a new branch with
$ oxen checkout -b branch_name
Switch back to main
$ oxen checkout main
and delete the branch again
$ oxen branch -d branch_name
If you want to make the branch available to others, make sure to push it to a remote
$ oxen push origin branch_name
To see all the available branches you have locally run
$ oxen branch -a
Pulling New Changes
To update your local repository to the latest changes, run
$ oxen pull origin branch_name
Again you can specify the remote and the branch name you would like to pull
Merging the changes
If you feel confident in your changes, you can check out the main branch again, then merge your changes in.
$ oxen checkout main
$ oxen merge branch_name
If there are conflicts, Oxen will flag them and you will need to add and commit the files again in a separate commit. Oxen currently does not add any modifications to your working file, just flags it as conflicting. If you simply want to take your version, just add and commit again.
$ oxen add file/with/conflict.jpg
$ oxen commit -m "fixing conflict"
Dealing With Merge Conflicts
Oxen currently has three ways to deal with merge conflicts.
- Take the other person's changes with oxen checkout file/with/conflict.jpg --theirs, then add and commit.
- Take the changes in your current working directory (simply add and commit again).
- Combine tabular data with oxen checkout file/with/conflict.csv --combine
If you use the --combine flag, Oxen will concatenate the data frames and deduplicate them based on the row values.
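The effect of combining can be sketched in Python as a concatenation followed by dropping exact duplicate rows. This is an illustration under assumptions: combine_rows is a hypothetical helper, the sample rows are made up, and keeping first-seen order is a choice made here, not documented Oxen behavior.

```python
def combine_rows(ours, theirs):
    """Concatenate two lists of rows and drop exact duplicates.

    Keeps the first occurrence of each row. Illustration only, not Oxen's code.
    """
    seen = set()
    combined = []
    for row in list(ours) + list(theirs):
        key = tuple(row)  # rows are compared on all of their values
        if key not in seen:
            seen.add(key)
            combined.append(row)
    return combined


ours = [["images/a.jpg", "cat"], ["images/b.jpg", "dog"]]
theirs = [["images/b.jpg", "dog"], ["images/c.jpg", "cat"]]
merged = combine_rows(ours, theirs)
for row in merged:
    print(row)
```

Rows that differ in any column are both kept, so two people editing the same logical row in different ways will still need a manual review after combining.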
Support
If you have any questions, comments, suggestions, or just want to get in contact with the team, feel free to email us at hello@oxen.ai