datacoon / datadifflib

Python library to track changes and generate deltas for JSON, CSV and BSON files.

datadiff: a library and command-line tool to compare JSON, CSV and BSON data files and to create and apply changes between dataset versions

datadifflib is a Python 3 library that tracks changes between two versions of a dataset and produces a delta file describing those changes. It supports the JSON, BSON and CSV file formats and can produce delta files for each of them.

Documentation

Documentation is built automatically and can be found on Read the Docs.

Features

  • As simple as possible
  • Minimalistic memory footprint
  • File formats supported: BSON, JSON, CSV

Limitations

  • Only JSON files are supported for generating and applying delta files
  • Limited support for very large files (100 GB+); the largest files tested are 5 GB
  • Files are read twice to generate a delta: the first pass builds an index, and the second pass extracts added, deleted and changed records
  • The library and tool know nothing about whether a patch is applicable to a given dataset. You have to manage version control of your datasets yourself
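
The two-pass approach mentioned in the limitations can be illustrated with a minimal sketch. This is not the library's implementation: the function names, the use of SHA-256 digests as index values, and the reader callables standing in for "open the file again" are all assumptions made for illustration.

```python
import hashlib
import json


def record_digest(record):
    """Return a stable hash of a record (keys sorted so field order is ignored)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_index(records, key):
    """Map each unique key to a digest instead of keeping whole records in memory."""
    return {rec[key]: record_digest(rec) for rec in records}


def two_pass_diff(read_left, read_right, key):
    """Hypothetical helper: read_left/read_right return a fresh iterator per call."""
    # Pass 1: index both versions by key, storing only digests.
    left_index = build_index(read_left(), key)
    right_index = build_index(read_right(), key)

    # Pass 2: read again and extract the full records that differ.
    added, changed, deleted = [], [], []
    for rec in read_right():
        k = rec[key]
        if k not in left_index:
            added.append(rec)
        elif left_index[k] != right_index[k]:
            changed.append(rec)
    for rec in read_left():
        if rec[key] not in right_index:
            deleted.append(rec)
    return added, changed, deleted


old = [{"regnum": "1", "name": "A"}, {"regnum": "2", "name": "B"}]
new = [{"regnum": "2", "name": "B2"}, {"regnum": "3", "name": "C"}]
added, changed, deleted = two_pass_diff(lambda: iter(old), lambda: iter(new), "regnum")
```

Keeping only digests in the index is one way to hold memory use down, at the cost of reading each file a second time to recover the full records.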

Command-line tool

Usage: datadiffcli.py [OPTIONS] COMMAND [ARGS]...

Options:
--help Show this message and exit.

Commands:

  • compare - Compares records in two files by unique key and reports whether changes exist
  • delta - Generates a delta file
  • patch - Applies a patch from a delta file

Examples

Compare two versions of the same dataset, using the 'regnum' field as the unique key in each version

python datadiffcli.py compare regnum reestrgp_2018.json reestrgp_2019.json

Generate a delta file after comparing two versions of the same dataset, using the 'regnum' field as the unique key

python datadiffcli.py delta regnum reestrgp_2018.json reestrgp_2019.json reestrgp_delta.json

Apply the delta file to the original dataset to produce the updated dataset

python datadiffcli.py patch reestrgp_2018.json reestrgp_delta.json reestrgp_proc.json
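
The commands above rely on the key field ('regnum' in these examples) being unique within each dataset. A quick pre-check along the following lines may help before running a comparison. It is a sketch only: `find_duplicate_keys` is a hypothetical helper, not part of datadifflib, and it works on records already loaded into memory rather than assuming a particular file layout.

```python
from collections import Counter


def find_duplicate_keys(records, key):
    """Return (duplicated key values, number of records missing the key)."""
    counts = Counter()
    missing = 0
    for rec in records:
        if key in rec:
            counts[rec[key]] += 1
        else:
            missing += 1
    # Counter keeps insertion order, so duplicates are listed in first-seen order.
    duplicates = [k for k, n in counts.items() if n > 1]
    return duplicates, missing


records = [
    {"regnum": "1", "name": "A"},
    {"regnum": "2", "name": "B"},
    {"regnum": "1", "name": "A again"},
    {"name": "no key"},
]
duplicates, missing = find_duplicate_keys(records, "regnum")
```

If either result is non-empty or non-zero, the chosen field is probably not a safe unique key for that dataset.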

How to use the library

Generate a report on changes between the 'reestrgp_2018.json' and 'reestrgp_2019.json' versions of a dataset with unique key 'regnum':

>>> from datadiff.diff import jsondiff
>>> key = 'regnum'
>>> left = 'reestrgp_2018.json'
>>> right = 'reestrgp_2019.json'
>>> report = jsondiff(key, left, right)

Generate a delta file between two versions of the dataset:

>>> from datadiff.delta import json_delta
>>> key = 'regnum'
>>> left = 'reestrgp_2018.json'
>>> right = 'reestrgp_2019.json'
>>> outfile = 'reestrgp_delta.json'
>>> json_delta(key, left, right, outfile, difftype='full')

Apply the patch to the first version of the dataset:

>>> from datadiff.delta import apply_json_delta
>>> key = 'regnum'
>>> dataset = 'reestrgp_2018.json'
>>> delta = 'reestrgp_delta.json'
>>> outfile = 'reestrgp_proc.json'
>>> apply_json_delta(key, dataset, delta, outfile)

Patch file format

The patch file is quite simple: it is a serialized JSON structure. Each record in the 'records' field has the following fields:

  • mode - 'a' for add, 'c' for change and 'd' for delete
  • uniqkey - unique key of the selected record
  • obj - the object value taken from the original or the compared dataset file

The unique key is stored outside 'obj' because in the future 'obj' may hold a patch to the selected record rather than the record itself
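
To make the format concrete, here is a small in-memory sketch of a delta with the fields listed above, together with one way such a delta could be applied. Only 'records', 'mode', 'uniqkey' and 'obj' come from the description above. The sample values, the absence of other top-level fields, and the `apply_delta_in_memory` helper are assumptions for illustration and are not the library's `apply_json_delta`.

```python
def apply_delta_in_memory(records, delta, key):
    """Apply a delta dict (shape assumed from the field list above) to a list of records."""
    by_key = {rec[key]: rec for rec in records}  # dicts preserve insertion order
    for entry in delta["records"]:
        mode, uniqkey = entry["mode"], entry["uniqkey"]
        if mode == "d":
            by_key.pop(uniqkey, None)      # delete: drop the record if present
        elif mode in ("a", "c"):
            by_key[uniqkey] = entry["obj"]  # add or change: store the full object
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return list(by_key.values())


# Hypothetical sample delta: one added, one changed and one deleted record.
delta = {
    "records": [
        {"mode": "a", "uniqkey": "3", "obj": {"regnum": "3", "name": "C"}},
        {"mode": "c", "uniqkey": "2", "obj": {"regnum": "2", "name": "B2"}},
        {"mode": "d", "uniqkey": "1", "obj": {"regnum": "1", "name": "A"}},
    ]
}
old = [{"regnum": "1", "name": "A"}, {"regnum": "2", "name": "B"}]
patched = apply_delta_in_memory(old, delta, "regnum")
```

Because 'uniqkey' sits next to 'obj' rather than inside it, the lookup in this sketch would keep working even if 'obj' later carried a partial patch instead of the whole record.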

License: MIT License