apicrafter / metacrafter

Metadata and data identification tool and Python library. Identifies PII, common identifiers, language specific identifiers. Fully customizable and flexible rules

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Metacrafter

Python command line tool and python engine to label table fields and fields in data files. It could help to find meaningful data in your tables and data files or to find Personal identifable information (PII).

Installation

To install Python library use pip install metacrafter via pip or python setup.py install

Features

Metacrafter is a rule based tool that helps to label fields of the tables in databases. It scans table and finds person names, surnames, midnames, PII data, basic identifiers like UUID/GUID. These rules written as .yaml files and could be easily extended.

File formats supported:

  • CSV
  • JSON lines
  • JSON (array of records)
  • BSON
  • Parquet
  • XML

Databases support:

  • Any SQL database supported by SQLAlchemy
  • NoSQL databases:
    • MongoDB

Metacrafter key features:

  • 111 labeling rules
  • all labels metadata collected into Metacrafter registry public repository
  • 312 date detection rules/patterns, date detection using qddate, "quick and dirty" date detection library
  • extendable set of rules using PyParsing, exact text match and validation functions
  • support any database supported by SQLAlchemy
  • advanced context and language management. You could apply only rules relevant to certain data of choosen language
  • built-in API server
  • commercial support and additional rules available

Command line examples

File analysis examples

# Scan CSV file
$ metacrafter scan-file --format short somefile.csv

# Scan CSV file with delimiter ';' and windows-1251 encoding
$ metacrafter scan-file --format short --encoding windows-1251 --delimiter ';' somefile.csv

# Scan JSON lines file, output results as stats table to file file
$ metacrafter scan-file --format stats -o somefile_result.json somefile.jsonl

Result example of 'full' type of formatting

key               ftype    tags    matches                                                                datatype_url
----------------  -------  ------  ---------------------------------------------------------------------  ----------------------------------------------------------
Domain            str              fqdn 99.90                                                             https://registry.apicrafter.io/datatype/fqdn
Primary domain    str              fqdn 100.00                                                            https://registry.apicrafter.io/datatype/fqdn
Name              str              name 100.00                                                            https://registry.apicrafter.io/datatype/name
Domain type       str      dict
Organization      str
Status            str      dict
Region            str      dict    rusregion 22.95                                                        https://registry.apicrafter.io/datatype/rusregion
GovSystem         str      dict
HTTP Support      str      dict    boolean 100.00                                                         https://registry.apicrafter.io/datatype/boolean
HTTPS Support     str      dict    boolean 100.00                                                         https://registry.apicrafter.io/datatype/boolean
Statuscode        str      dict
Is archived       str      empty
Archives          str      empty
Archive priority  str      dict
Archive Strategy  str      dict
ASN               str              asn 93.77                                                              https://registry.apicrafter.io/datatype/asn
ASN Country code  str      dict    countrycode_alpha2 100.00,countrycode_alpha2 100.00,languagetag 99.56  https://registry.apicrafter.io/datatype/countrycode_alpha2
IPs               str              ipv4 96.28                                                             https://registry.apicrafter.io/datatype/ipv4
GovType           str      dict



Database analysis examples

# Scan MongoDB database 'fns', save results as result.json and format output as 'stats'
$ metacrafter scan-mongodb --dbname fns -o result.json -f full

# Scan Postgres database 'dbname', with schema 'public'.
$ metacrafter scan-db --schema public --connstr postgresql+psycopg2://username:password@127.0.0.1:15432/dbname

Rules

All rules described as YAML files and by default rules loaded from directory 'rules' or from list of directories provided in .metacrafter file with YAML format

All rules could be applied to fields or data .

Compare engines defined in match parameter in rule description:

  • text - scan text for exact match to one of text values. Text values delimited by comma (',')
  • ppr - scan text for PyParsing. PyParsing rule defined as Python code with PyParsing objects like Word(nums, exact=4)
  • func - scan text using Python function provided. Function shoud accept one string parameter and shoud return True or False

How to write rules

Function (func)

Example Russian administrative legal act/law matched by custom function

  runpabyfunc:
    key: runpa
    name: Russian legal act / law
    maxlen: 500
    minlen: 3
    priority: 1
    match: func
    type: data
    rule: metacrafter.rules.ru.gov.is_ru_law

Exact text match (text)

Example midname matching by exact field name

  midname:
    key: person_midname
    name: Person midname by known
    rule: midname,secondname,middlename,mid_name,middle_name
    type: field
    match: text

PyParsing rule (ppr)

Example Russian cadastral number

  rukadastr:
    key: rukadastr
    name: Russian land territory cadastral identifier
    rule: Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=6, max=7) + Literal(':').suppress() + Word(nums, min=1, max=6)
    maxlen: 20
    minlen: 12
    priority: 1
    match: ppr
    type: data

Detailed stats

Rule types:

  • field based rules 146
  • data based rules 102

Context:

  • common 47
  • companies 15
  • crypto 3
  • datetime 29
  • finances 5
  • geo 58
  • government 19
  • identifiers 3
  • industry 2
  • internet 18
  • medical 6
  • objectids 3
  • persons 19
  • pii 16
  • science 2
  • software 1
  • values 1
  • vehicles 1

Language:

  • common 100
  • de 4
  • en 24
  • es 1
  • fr 11
  • ru 108

Data/time patterns (qddate): 312

Commercial support

Please write ibegtin@apicrafter.io or ivan@begtin.tech to request beta access to commercial API. Commercial API support 195 fields and data rules and provided with dedicated support.

About

Metadata and data identification tool and Python library. Identifies PII, common identifiers, language specific identifiers. Fully customizable and flexible rules

License:Apache License 2.0


Languages

Language:Python 100.0%