akdel / torch2vec

A PyTorch implementation of Doc2Vec (distributed memory) with similarity measure.

torch2vec - beta

Installation

Dependencies

torch2vec requires:

  • Python (>= 3.6)
  • torch (>=1.6.0)
  • numpy
  • tqdm
  • pandas
  • scikit-learn

User Installation

  1. Clone the repository: git clone https://github.com/DeviantPadam/torch2vec.git
  2. Go to the repository directory: cd torch2vec/
  3. Install the package: pip install -U .
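
Before installing, it can help to confirm that the dependencies listed above are present. The following is a minimal sketch using only the standard library; the helper names missing_dependencies and python_is_supported are hypothetical and not part of torch2vec. It checks importability only, not the torch >= 1.6.0 version constraint.

```python
import importlib.util
import sys

# Import names for the dependencies listed above.
# Note: scikit-learn is imported under the name "sklearn".
REQUIRED_MODULES = ["torch", "numpy", "tqdm", "pandas", "sklearn"]


def missing_dependencies(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 6)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python >= 3.6:", python_is_supported())
    print("Missing modules:", missing_dependencies() or "none")
```

Any name reported as missing can be installed with pip before running step 3.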

User Instructions

Data preprocessing

  • Make sure your data follows the format shown in example_data/example.csv.
  • Import the modules:
    from torch2vec.data import DataPreparation
    from torch2vec.torch2vec import DM
  • Load your data for preprocessing:
    data = DataPreparation(corpus_path='example_data/example.csv', vocab_size=vocab_size)
    vocab_size: (optional) restricts the vocabulary size; less frequent words are dropped.
  • Build the vocabulary with data.vocab_builder()
  • Get the document ids, context words, and target words (with their noise words) for later use:
    doc, context, target_noise_ids = data.get_data(window_size, num_noise_words)
    window_size: the number of surrounding words used as context.
    num_noise_words: the number of noise words drawn for negative sampling.
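
To make window_size and num_noise_words concrete, here is an illustrative sketch of how training examples for a distributed-memory model can be built. This is not torch2vec's implementation: the function dm_examples is hypothetical, it assumes window_size counts words on each side of the target, and it draws noise words uniformly, whereas Doc2Vec implementations commonly sample from a smoothed word-frequency distribution.

```python
import random


def dm_examples(docs, window_size, num_noise_words, seed=0):
    """Build (doc_id, context_ids, target_noise_ids) triples.

    docs: list of documents, each a list of integer word ids.
    The first entry of target_noise_ids is the true target word;
    the remaining entries are randomly drawn noise words.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    examples = []
    for doc_id, words in enumerate(docs):
        # Only positions with a full window on both sides are used.
        for i in range(window_size, len(words) - window_size):
            context = words[i - window_size:i] + words[i + 1:i + 1 + window_size]
            target = words[i]
            # Assumption: uniform sampling over all words except the target.
            candidates = [w for w in vocab if w != target]
            noise = [rng.choice(candidates) for _ in range(num_noise_words)]
            examples.append((doc_id, context, [target] + noise))
    return examples


docs = [[0, 1, 2, 3, 4], [2, 3, 4, 5, 6, 7]]
for doc_id, context, target_noise_ids in dm_examples(docs, window_size=2, num_noise_words=3):
    print(doc_id, context, target_noise_ids)
```

Each triple pairs a document id with the context around one position and the ids the model must score, which mirrors the three values returned by data.get_data.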

Training

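  • Initialize the model:
    model = DM(vec_dim=100, num_docs=len(data), num_words=data.vocab_size)
    vec_dim: dimensionality of the document vectors.
    num_docs, num_words: the number of documents and the vocabulary size, taken here from data.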

  • Train the model:
    model.fit(doc_ids=doc, context=context, target_noise_ids=target_noise_ids, epochs=5, batch_size=1000, num_workers=2)
    doc_ids, context, target_noise_ids: the values returned by data.get_data.
    epochs: number of training epochs.
    batch_size: number of examples per batch.
    num_workers: (default=1) number of concurrently running workers (max=os.cpu_count()).

  • Map your real document ids to the document embeddings and, optionally, save the model:
    model.save_model(ids=data.document_ids, file_name='weights')
    file_name: (optional) if None, the model is not saved.

  • Get the ids of similar documents:
    model.similar_docs('doc_id', topk=10, use='torch')
    topk: (default=10) number of similar documents to return.
    use: 'torch' or 'sklearn' (default='torch').
    returns: the ids and cosine similarity scores of the topk most similar documents (only the ids if use='sklearn').

  • If the model was saved (it is stored as a .npy file), it can be reused without retraining:
    from torch2vec.torch2vec import LoadModel
    model = LoadModel(path='weights.npy')
    Then query it as before: model.similar_docs('doc_id', topk=10, use='torch')
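
For intuition about what a top-k cosine similarity lookup computes, here is a small pure-Python sketch. It is not the library's code: top_k_similar and the toy embeddings are hypothetical, the vectors are assumed to be non-zero, and excluding the query document from its own results is an assumption rather than documented behaviour of similar_docs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(embeddings, query_id, topk=10):
    """Return up to `topk` (doc_id, score) pairs, most similar first.

    embeddings: dict mapping document id -> embedding vector.
    The query document itself is left out of the results.
    """
    query = embeddings[query_id]
    scored = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in embeddings.items()
        if doc_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topk]


# Toy 2-dimensional embeddings, for illustration only.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print(top_k_similar(embeddings, "a", topk=2))
```

With real embeddings the same ranking would normally be done with vectorized torch or scikit-learn operations, which is what the use argument selects between.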

References

Special thanks to Luc for helping and motivating me.

About

License: GNU General Public License v3.0

