String Embedding

In this project, we design and implement a deep learning model that maps strings to real-valued vectors while preserving their neighborhood relations. Specifically, if the edit distance between two strings x and y is small, the L2 distance between their embeddings should also be small. With this model, expensive edit-distance computations can be replaced by cheaper L2-distance computations, which speeds up string similarity search.
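To make the two distances concrete, here is a minimal sketch in plain Python. It is not code from this repository: the names `edit_distance`, `l2_distance`, and `nearest` are hypothetical helpers chosen for illustration, and the learned embedding model itself is not reproduced. The sketch only shows why edit distance is the expensive side (a quadratic dynamic program per pair) and L2 distance the cheap side (one pass over two vectors).

```python
import math


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the classic O(len(x) * len(y)) dynamic program."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def l2_distance(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query, items, dist, k=1):
    """Brute-force k-nearest-neighbor search under an arbitrary distance."""
    return sorted(items, key=lambda item: dist(query, item))[:k]
```

For example, `nearest("sitten", ["kitten", "apple", "banana"], edit_distance)` scans every candidate with the quadratic routine. Once strings have been embedded, the same `nearest` call could be run over vectors with `l2_distance` instead, which is the substitution the model is meant to enable.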

Start training

python main.py --dataset word --nt 1000 --nq 1000 --epochs 20 --save-split --recall

optional arguments:

  -h, --help            show this help message and exit
  --dataset             dataset name which is under folder ./data/
  --nt                  # of training samples
  --nr                  # of generated training samples
  --nq                  # of query items
  --nb                  # of base items
  --k                   # sampling threshold
  --epochs              # of epochs
  --shuffle-seed        seed for shuffle
  --batch-size          batch size for sgd
  --test-batch-size     batch size for test
  --channel CHANNEL     # of channels
  --embed-dim           output dimension
  --save-model          save cnn model
  --save-split          save split data folder
  --save-embed          save embedding
  --random-train        generate random training samples and replace
  --random-append-train generate random training samples and append
  --embed-dir           embedding save location
  --recall              print recall
  --embed EMBED         embedding method
  --maxl MAXL           max length of strings
  --no-cuda             disables GPU training
