torch2vec - beta

A PyTorch implementation of Doc2Vec (distributed memory) with a document similarity measure.

Installation

torch2vec requires:

Make sure your data follows the format shown in example_data/example.csv.
Import the modules:
from torch2vec.data import DataPreparation
from torch2vec.torch2vec import DM
Now load your data for preprocessing.
data = DataPreparation(corpus_path='example_data/example.csv', vocab_size=vocab_size)
vocab_size: (optional) restricts the vocabulary size; less frequent words are dropped.
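As a rough illustration of what restricting the vocabulary means, the sketch below keeps only the vocab_size most frequent words and drops the rest. The helper name restrict_vocab and the lowercase, whitespace-based tokenization are assumptions made for this example; torch2vec's own preprocessing may differ.

```python
from collections import Counter


def restrict_vocab(documents, vocab_size=None):
    """Return the set of words kept after limiting the vocabulary.

    Hypothetical helper for illustration, not part of torch2vec.
    """
    counts = Counter(word for doc in documents for word in doc.lower().split())
    # most_common(None) returns every word, so vocab_size=None means "no limit".
    return {word for word, _ in counts.most_common(vocab_size)}


docs = ["the cat sat", "the cat ran", "the dog barked"]
kept = restrict_vocab(docs, vocab_size=2)
print(kept)
```

With vocab_size=2 only the two most frequent words survive; every other word would be dropped from the vocabulary.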
Now build the vocabulary by calling data.vocab_builder()
Now get the document ids, context words, and target/noise word ids for training:
doc, context, target_noise_ids = data.get_data(window_size,num_noise_words)
window_size: number of surrounding words used as context.
num_noise_words: number of words drawn by negative sampling.
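To make the two parameters concrete, here is a minimal sketch of how context windows and noise words could be produced for a single tokenized document. It is not torch2vec's implementation: make_examples is a hypothetical helper, window_size is assumed to count words on each side of the target, and noise words are drawn uniformly from the vocabulary (real implementations often sample from a smoothed frequency distribution instead).

```python
import random


def make_examples(doc_tokens, window_size, num_noise_words, vocab, seed=0):
    """Yield (context, target, noise) triples for one tokenized document.

    Hypothetical helper: assumes window_size counts words on each side
    and that noise words are drawn uniformly from the vocabulary.
    """
    rng = random.Random(seed)
    # Skip positions that do not have a full window on both sides.
    for i in range(window_size, len(doc_tokens) - window_size):
        target = doc_tokens[i]
        context = doc_tokens[i - window_size:i] + doc_tokens[i + 1:i + 1 + window_size]
        # Noise words are negative samples, so the true target is excluded.
        candidates = [w for w in vocab if w != target]
        noise = rng.sample(candidates, num_noise_words)
        yield context, target, noise


tokens = ["the", "quick", "brown", "fox", "jumps"]
vocab = sorted(set(tokens) | {"lazy", "dog"})
examples = list(make_examples(tokens, window_size=1, num_noise_words=2, vocab=vocab))
for context, target, noise in examples:
    print(context, target, noise)
```

Each triple pairs the surrounding words with the word to predict, plus num_noise_words incorrect words that the model learns to score lower than the target.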

Initialize the model
model = DM(vec_dim=100,num_docs=len(data),num_words=data.vocab_size)
vec_dim: dimensionality of the document vectors
Now train the model
model.fit(doc_ids=doc,context=context,target_noise_ids=target_noise_ids,epochs=5,batch_size=1000,num_workers=2)
doc_ids,context,target_noise_ids: can be obtained using data.get_data
epochs: number of epochs
batch_size: batch size
num_workers: (default=1) number of concurrently running workers (max=os.cpu_count())
Now map your real document ids to the document embeddings and, optionally, save the model:
model.save_model(ids=data.document_ids,file_name='weights')
file_name: (optional) if None, the model is not saved.
Now get similar document ids
model.similar_docs('doc_id',topk=10,use='torch')
topk: (default=10) number of similar documents to return
use: 'torch' or 'sklearn' (default='torch')
returns: the similar ids and the cosine similarity scores of the topk documents (only the similar ids if use='sklearn').
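The idea behind the similarity lookup can be sketched in plain Python: rank all other documents by the cosine similarity of their vectors to the query document and keep the best topk. The dictionary layout, the helper names, and the choice to exclude the query document itself are assumptions for this example, not a description of how torch2vec stores or scores embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k_similar(embeddings, query_id, topk=10):
    """Return (doc_id, score) pairs for the topk most similar documents.

    embeddings: dict mapping document id -> vector (hypothetical layout).
    """
    query = embeddings[query_id]
    scores = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in embeddings.items()
        if doc_id != query_id  # assumption: the query itself is not returned
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topk]


embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
result = top_k_similar(embeddings, "a", topk=2)
print(result)
```

Here document "b" ranks first because its vector points in nearly the same direction as "a", while "d" points the opposite way and is ranked last.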
If the model was saved (stored as a .npy file), it can be reused without retraining:
from torch2vec.torch2vec import LoadModel
model = LoadModel(path='weights.npy')
Reusing: model.similar_docs('doc_id',topk=10,use='torch')

GNU General Public License v3.0
