ML Confusables Generator

A pair of confusables is a pair of characters that could be used in spoofing attacks because they look alike (for example ‘ν’ and ‘v’). The wide range of characters supported by Unicode creates room for such attacks. The security mechanisms described in UTS #39 use confusable data (https://www.unicode.org/Public/security/latest/confusables.txt) to combat them. The purpose of this project is to identify novel pairs of confusables using representation learning and custom distance metrics.
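
To make the definition concrete, the sketch below uses Python's standard unicodedata module to show that the two characters in the example are distinct code points, and then parses one line written in the `source ; target ; type # comment` layout used by confusables.txt. The sample line and the helper name parse_confusable_line are illustrative assumptions and are not part of this project.

```python
import unicodedata

# The two look-alike characters from the example, written as escapes
# so the difference is visible in source form.
greek_nu = "\u03bd"
latin_v = "\u0076"

for ch in (greek_nu, latin_v):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")


def _hex_sequence_to_text(hex_seq):
    # "0041 0301" -> the two-character string made of those code points.
    return "".join(chr(int(cp, 16)) for cp in hex_seq.split())


def parse_confusable_line(line):
    """Parse one data line in the confusables.txt layout.

    Returns a (source, target) tuple of strings, or None for blank or
    comment-only lines. Hypothetical helper for illustration only.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    return _hex_sequence_to_text(fields[0]), _hex_sequence_to_text(fields[1])


# Illustrative line in the file's layout (assumed, not copied from the file).
sample = "03BD ;\t0076 ;\tMA\t# GREEK SMALL LETTER NU -> LATIN SMALL LETTER V"
pair = parse_confusable_line(sample)
```

Here `pair` holds the Greek letter as the source and the Latin letter as the target, which is the direction a spoof-detection mechanism would map before comparing strings.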

Getting Started

Prerequisite

Installation

Download and install Docker.
Clone this repository with git clone and cd into it.
Make sure all submodules are updated: git submodule update --init --recursive.

Launch Jupyter Notebook Container

In the project source folder, run ./scripts/start.sh.
In any browser, go to localhost:8888.
Copy the token from the terminal into the browser to access Jupyter Notebook.

Launch Command Line Environment Container

In the project source folder, run ./scripts/start_cli.sh.
Inside the container, execute the setup script ./scripts/setup.sh.

Interactive Shell in Running Container

Run docker ps to get container id/name.
Run docker exec -it [CONTAINER_NAME/ID] /bin/bash.

Exit Docker Container

In the Jupyter Notebook terminal, press Ctrl+C.
In the command-line interface, run exit.

Usage

Han Script Confusable Generation

From link, download the full_data.zip file (pre-generated images) and unzip it in the data/ directory.
From link, download full_data_triplet1.0_meta.tsv and full_data_triplet1.0_vec.tsv (pre-generated embeddings and labels) into the embeddings/ directory.

Create representation clustering object:

from rep_cls import RepresentationClustering
rc = RepresentationClustering(embedding_file='embeddings/full_data_triplet1.0_vec.tsv',
                              label_file='embeddings/full_data_triplet1.0_meta.tsv',
                              img_dir='data/full_data/')

Generate confusables for a specific character:

rc.get_confusables_for_char('褢')
>>> ['裹', '裏', '裛', '裏']
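
One way a lookup like this can work: each character has an embedding vector, and the candidates are the characters whose vectors lie closest to the query's. The following is a minimal standard-library sketch of that idea, not the RepresentationClustering implementation. It assumes the _vec.tsv file holds one tab-separated vector per row and the _meta.tsv file holds one label per row in the same order. The function names and toy data are hypothetical.

```python
import csv
import io
import math


def load_embeddings(vec_file, meta_file):
    """Read row-aligned vectors and labels from two file-like objects (assumed layout)."""
    vectors = [
        [float(value) for value in row]
        for row in csv.reader(vec_file, delimiter="\t")
        if row
    ]
    labels = [line.strip() for line in meta_file if line.strip()]
    if len(vectors) != len(labels):
        raise ValueError("vector and label counts differ")
    return dict(zip(labels, vectors))


def nearest_chars(embeddings, char, k=3):
    """Return the k labels closest to `char` by Euclidean distance."""
    query = embeddings[char]
    scored = [
        (math.dist(query, vec), label)
        for label, vec in embeddings.items()
        if label != char
    ]
    return [label for _, label in sorted(scored)[:k]]


# Toy two-dimensional data standing in for the real embedding files.
toy_vec = io.StringIO("0.0\t0.0\n0.1\t0.0\n5.0\t5.0\n0.0\t0.3\n")
toy_meta = io.StringIO("A\nB\nC\nD\n")
toy = load_embeddings(toy_vec, toy_meta)
neighbours = nearest_chars(toy, "A", k=2)  # ['B', 'D']
```

A brute-force scan like this is linear in the number of characters per query. For roughly twenty thousand code points that is usually acceptable, but a clustering step or a spatial index would reduce repeated work.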

Full Walk-through

Check main.ipynb.

Pre-trained CNN model

From link, download TripletTransferTF (pre-trained model) folder into ckpts/ directory.

Source file generation

To regenerate source files, in source/ directory, run python generate_source_file.py.
To check how the source file is selected, see source/Radical-stroke_Index_Analysis.ipynb.

Repo Contents

Main Components

main.ipynb: Notebook for setting up, building and deploying the confusable detector. Also serves as a tutorial script.
vis_gen.py: Contains VisualGenerator, class for generating visualization of characters.
rep_gen.py: Contains RepresentationGenerator, class for generating representations (embeddings) used for clustering.
rep_cls.py: Contains RepresentationClustering, class for clustering representations and finding confusables.
distance_metrics.py: Contains Distance, a factory class that defines distance metrics for different image formats. Also contains the enumeration class ImgFormat.
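
The factory-plus-enumeration arrangement described for distance_metrics.py can be pictured with the sketch below. It is not the project's code: the PixelFormat enum, the make_distance function and both metrics are hypothetical stand-ins, and images are plain lists of pixel values so that the example needs only the standard library.

```python
from enum import Enum


class PixelFormat(Enum):
    """Hypothetical image formats; the project's ImgFormat members may differ."""
    GRAYSCALE = "grayscale"
    BINARY = "binary"


def _mean_abs_diff(a, b):
    # Average absolute pixel difference, suited to grayscale intensities.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def _mismatch_ratio(a, b):
    # Fraction of pixels that differ, suited to 0/1 images.
    return sum(x != y for x, y in zip(a, b)) / len(a)


_METRICS = {
    PixelFormat.GRAYSCALE: _mean_abs_diff,
    PixelFormat.BINARY: _mismatch_ratio,
}


def make_distance(fmt):
    """Return a distance function for `fmt` that also validates image sizes."""
    metric = _METRICS[fmt]

    def distance(a, b):
        if len(a) != len(b) or not a:
            raise ValueError("images must be non-empty and the same size")
        return metric(a, b)

    return distance


gray = make_distance(PixelFormat.GRAYSCALE)
binary = make_distance(PixelFormat.BINARY)
gray_d = gray([0, 128, 255, 64], [0, 120, 255, 64])  # 8 / 4 = 2.0
binary_d = binary([0, 1, 1, 0], [0, 1, 0, 0])        # 1 / 4 = 0.25
```

Keying the metrics on an enum keeps the choice of metric in one place, so callers pass a format rather than a function name.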

CNN Model Training Scripts

configs/sample_config.ini: Sample configuration for model training. To start your own training procedure, create new configuration file following the same format.
custom_train.py: Contains ModelTrainer, class that executes training procedure.
dataset_builder.py: Contains DatasetBuilder, class that invokes data pre-processing functions for TensorFlow dataset generation.
model_builder.py: Contains ModelBuilder, class that creates and initializes TensorFlow models.
data_preprocessing.py: Image pre-processing functions.
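
A training configuration in INI format can be read with Python's standard configparser module. The sketch below is only an illustration: the section and key names (MODEL, TRAINING, embedding_size and so on) are made up for the example and are not taken from configs/sample_config.ini, which remains the reference for the real format.

```python
import configparser

# Hypothetical configuration text; the real keys live in configs/sample_config.ini.
EXAMPLE_INI = """
[MODEL]
name = ExampleNet
embedding_size = 64

[TRAINING]
batch_size = 32
learning_rate = 0.001
shuffle = yes
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

# Typed getters convert the stored strings and raise ValueError on bad input.
model_name = config.get("MODEL", "name")
embedding_size = config.getint("MODEL", "embedding_size")
batch_size = config.getint("TRAINING", "batch_size")
learning_rate = config.getfloat("TRAINING", "learning_rate")
shuffle = config.getboolean("TRAINING", "shuffle")

# `fallback` is returned when the option is absent from the file.
epochs = config.getint("TRAINING", "epochs", fallback=10)
```

To load an actual file instead of inline text, `config.read(path)` takes a path and returns the list of files it managed to parse.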

Dataset Source File

source/Radical-stroke_Index_Analysis.ipynb: Jupyter Notebook for radical-stroke analysis and dataset selection.
source/generate_source_file.py: Contains functions that produce the same result as the Jupyter Notebook file.
source/charset_*k.txt: Selected Unicode code points.
source/randset_*k.txt: Randomly selected Unicode code points.
source/full_dataset.txt: Full dataset containing 21028 code points, used for clustering.
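
Files that list code points are easy to turn back into characters. The sketch below assumes one code point per line, written either as bare hexadecimal or with a U+ prefix. That layout is an assumption, so check the actual files under source/ before relying on it. The helper name read_code_points is hypothetical.

```python
import io


def read_code_points(lines):
    """Yield one character per code point line, skipping blank lines.

    Accepts 'U+4E00' or '4E00' (assumed layout, not confirmed by the project).
    """
    for raw in lines:
        token = raw.strip()
        if not token:
            continue
        if token.upper().startswith("U+"):
            token = token[2:]
        yield chr(int(token, 16))


# Toy input standing in for a file opened with open(path, encoding="utf-8").
toy_file = io.StringIO("U+4E00\n4E8C\n\nu+4e09\n")
chars = list(read_code_points(toy_file))  # ['一', '二', '三']
```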

Shell Scripts

All scripts are expected to be run from the base directory. For example, run ./scripts/start.sh instead of ./start.sh.

scripts/start.sh: Launch a Docker container with Jupyter Notebook.
scripts/start_cli.sh: Launch a Docker container with bash.
scripts/setup.sh: Run inside the container to set up the environment and install all packages.
scripts/install_fonts.sh: Installs required fonts; invoked by setup.sh.
scripts/download_*.sh: Scripts for downloading pre-established data, model or embeddings from Google Drive.

Unit Tests

*_test.py: Run python [MODULE]_test.py for all the unit tests for [MODULE].py.

Utility functions (in `utils.py`)

calculate_from_path: Calculate distance between the two images specified by file path.
train_test_split: Split dataset (already created) into training and testing datasets.

Placeholder Folders

data/: Default visualization directory.
ckpts/: Default model directory.
embeddings/: Default embedding directory.

Testing

All tests are expected to be run in the CLI container setup.

Run All Unit Tests

In root folder, run python -m unittest discover -s . -p '*_test.py'.

Run Individual Unit Test

In root folder, run python [MODULE]_test.py.

Copyright & Licenses

The project is released under LICENSE.

A CLA is required to contribute to this project - please refer to the CONTRIBUTING.md file (or start a Pull Request) for more information.
