varlen / noni

Minimalistic database anonymization tool

noni

Noni is a data anonymization tool that creates an anonymized database from an existing database by generating synthetic data.

About

Its main use case is to create secure development databases from existing data, for example when a company wants to provide a database to third parties for development purposes without disclosing the real data.

It is composed of:

  • A database spec extractor that builds a specification file from the data characteristics and database structure
  • A database builder that takes the specification file as input, creates the tables, and generates similar data

Currently, only Postgres databases without custom types are supported, but support for other SQL implementations can be added.

Setup

Noni requires an external HTTP API that provides semantic classification. One such provider is a SATO fork, which is available here. Download the pretrained model and follow the installation instructions to run it.

To simplify dependency management, since SATO uses an older Python version, an Open Container Image is available in this repository. It avoids the need to install a specific Python version and create a virtual environment. For more information on SATO, see the original paper here.

A single-command installation is planned.

Usage

Noni consists of two main Python applications: the extractor and the generator.

⚗️ Extractor

The extractor loads the database connection information from environment variables. See the scripts/extract.sh script for a reference on how to run the extraction.

✨ Generator

The connection string for the output database must be in the OUTPUT_DATABASE_URL environment variable.

To run the generator, run the main.py script from the command line, passing the JSON file produced by the extractor plus the --data and --structure parameters. These flags control whether the data and/or the database structure are written during this run of the generator.

cd noni/generator
python main.py ../extractor/output.json --structure --data
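
The same invocation can be scripted. The following is a minimal sketch, not part of Noni itself: the helper names build_generator_command and build_generator_env are hypothetical, and only main.py, --structure, --data and OUTPUT_DATABASE_URL come from the description above. The connection string in the usage note is a placeholder.

```python
import os
import sys


def build_generator_command(spec_path, structure=True, data=True):
    """Build the argv list for the generator (hypothetical helper).

    spec_path is the JSON file produced by the extractor.
    """
    cmd = [sys.executable, "main.py", spec_path]
    if structure:
        cmd.append("--structure")  # write the database structure
    if data:
        cmd.append("--data")  # write the generated data
    return cmd


def build_generator_env(output_database_url, base_env=None):
    """Return a copy of the environment with OUTPUT_DATABASE_URL set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OUTPUT_DATABASE_URL"] = output_database_url
    return env
```

To launch the generator with these helpers, pass both results to subprocess.run, for example subprocess.run(build_generator_command("../extractor/output.json"), env=build_generator_env("postgresql://user:password@localhost:5432/anonymized"), cwd="noni/generator", check=True). Omitting one of the flags, for instance data=False, should write only the structure.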

💡 Improvements and Ideas

SATO can be replaced with any API that receives CSV files and returns a JSON list of the semantic types of the columns, as long as the returned types are constrained to the type78 list of types.
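
As an illustration of that contract, here is a minimal sketch of the core of such a replacement: CSV text in, JSON list of one semantic type per column out. Everything in it is an assumption made for the example: the function names, the crude rules in classify_column, the presence of a header row in the CSV, and ALLOWED_TYPES, which is only a small illustrative subset standing in for the full type78 list. The HTTP layer and the exact request and response format that Noni expects are not shown.

```python
import csv
import io
import json

# Illustrative subset only (assumption); a real service must use the
# full type78 list of types.
ALLOWED_TYPES = {"name", "city", "country", "age", "year", "description"}


def classify_column(values):
    """Hypothetical stand-in for a real classifier: crude numeric rules."""
    non_empty = [v.strip() for v in values if v.strip()]
    if non_empty and all(v.isdecimal() for v in non_empty):
        numbers = [int(v) for v in non_empty]
        if all(1000 <= n <= 2999 for n in numbers):
            return "year"
        if all(0 <= n <= 120 for n in numbers):
            return "age"
    return "description"


def classify_csv(csv_text):
    """Return a JSON list with one semantic type per CSV column.

    Assumes the first row is a header and is not part of the data.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return json.dumps([])
    header, body = rows[0], rows[1:]
    types = []
    for index in range(len(header)):
        column = [row[index] for row in body if index < len(row)]
        semantic_type = classify_column(column)
        # Enforce the constraint: never return a type outside the list.
        if semantic_type not in ALLOWED_TYPES:
            raise ValueError(f"type not allowed: {semantic_type}")
        types.append(semantic_type)
    return json.dumps(types)
```

The membership check is the part that matters for compatibility: whatever classifier sits behind the API, its output is rejected unless every returned type is in the allowed list.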

License: MIT License


Languages: PLpgSQL 51.3%, Python 47.2%, Shell 1.4%, Batchfile 0.2%