A PyTorch implementation of Doc2Vec (distributed memory) with similarity measure.
torch2vec requires:
- Python (>= 3.6)
- torch (>=1.6.0)
- numpy
- tqdm
- pandas
- scikit-learn
- Clone the repository
git clone https://github.com/DeviantPadam/torch2vec.git
- Go to repository directory
cd torch2vec/
- Run
pip install -U .
- Make sure your data is in the correct format as mentioned in example_data/example.csv.
- Import modules
from torch2vec.data import DataPreparation
from torch2vec.torch2vec import DM
- Now load your data for preprocessing.
data = DataPreparation(corpus_path='example_data/example.csv', vocab_size=vocab_size)
vocab_size
: (optional) restricts the vocabulary size (less frequent words are dropped).
- Now create the vocabulary using
data.vocab_builder()
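To illustrate what restricting the vocabulary means, the sketch below keeps only the `vocab_size` most frequent words of a toy corpus. It is not torch2vec's implementation: `build_vocab` is a hypothetical helper, and the lowercasing and whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def build_vocab(documents, vocab_size=None):
    """Map the most frequent words to integer ids (hypothetical helper)."""
    # Assumption: lowercase, whitespace-separated tokens.
    counts = Counter(word for doc in documents for word in doc.lower().split())
    # most_common(None) returns every word; otherwise only the top vocab_size.
    kept = counts.most_common(vocab_size)
    return {word: idx for idx, (word, _) in enumerate(kept)}

docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab = build_vocab(docs, vocab_size=3)   # rarer words such as "cat" are dropped
full_vocab = build_vocab(docs)            # no restriction: all 7 distinct words
```

With `vocab_size=3`, only "the", "sat" and "on" survive; every other word in the toy corpus occurs once and is dropped.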
- Now get the doc_ids, context words, target words for further use
doc, context, target_noise_ids = data.get_data(window_size,num_noise_words)
window_size
: is the number of surrounding words.
num_noise_words
: is the number of noise words drawn for negative sampling.
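The two parameters above can be pictured with a small, library-independent sketch. It assumes that `window_size` counts words on each side of the target and that noise words are drawn uniformly; both are assumptions for illustration, and torch2vec's `get_data` may differ (word2vec-style samplers often use a smoothed unigram distribution instead). `context_windows` and `sample_noise` are hypothetical names.

```python
import random

def context_windows(tokens, window_size):
    """Yield (context, target) pairs with window_size words on each side."""
    for i in range(window_size, len(tokens) - window_size):
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        yield context, tokens[i]

def sample_noise(vocab, target, num_noise_words, rng):
    """Draw noise words uniformly from the vocabulary, excluding the target."""
    candidates = [word for word in vocab if word != target]
    return [rng.choice(candidates) for _ in range(num_noise_words)]

tokens = "the quick brown fox jumps".split()
rng = random.Random(0)  # seeded only to make the example repeatable

pairs = list(context_windows(tokens, window_size=1))
noise = sample_noise(sorted(set(tokens)), target="quick", num_noise_words=2, rng=rng)
```

Here the first pair is the context `['the', 'brown']` with target `'quick'`, and `noise` holds two words other than the target.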
- Initialize the model
model = DM(vec_dim=100,num_docs=len(data),num_words=data.vocab_size)
vec_dim
: dimensionality of the document vectors
- Now train the model
model.fit(doc_ids=doc,context=context,target_noise_ids=target_noise_ids,epochs=5,batch_size=1000,num_workers=2)
doc_ids,context,target_noise_ids
: can be obtained using data.get_data
epochs
: number of epochs
batch_size
: batch size
num_workers
: (default=1) number of concurrently running workers (max=os.cpu_count())
- Now map your real document ids to the doc embeddings and save the model (optional)
model.save_model(ids=data.document_ids,file_name='weights')
file_name
: (optional) if None, the model is not saved.
- Now get similar document ids
model.similar_docs('doc_id',topk=10,use='torch')
topk
: (default=10) number of similar documents to return
use
: 'torch' or 'sklearn' (default='torch')
returns: similar ids and cosine similarity scores of the topk elements (only similar ids if use='sklearn').
- If the model was saved (stored as a .npy file), it can be reused without training using
from torch2vec.torch2vec import LoadModel
model = LoadModel(path='weights.npy')
Reusing:
model.similar_docs('doc_id',topk=10,use='torch')
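For readers curious how a top-k cosine similarity lookup works in principle, here is a minimal sketch in plain Python. It does not reproduce torch2vec's `similar_docs`: the function names, the toy embeddings, and the choice to exclude the query document from the results are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k_similar(embeddings, query_id, topk=10):
    """Return up to topk (doc_id, score) pairs, most similar first."""
    query = embeddings[query_id]
    scores = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in embeddings.items()
        if doc_id != query_id  # assumption: the query itself is excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topk]

# Toy 2-dimensional document embeddings (made-up values).
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
result = top_k_similar(embeddings, "a", topk=2)
```

In this toy setup, document "b" ranks first because its vector points in almost the same direction as "a", while "c" is orthogonal to "a" and scores zero.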
- Distributed Representations of Sentences and Documents, Quoc V. Le, Tomas Mikolov
- https://github.com/inejc/paragraph-vectors
- Notes on Noise Contrastive Estimation and Negative Sampling, C. Dyer
- Document Embedding with Paragraph Vectors, Andrew M. Dai, Christopher Olah, Quoc V. Le