Spec2Vec workflow -- part 2
florian-huber opened this issue · comments
Florian Huber commented
[edit 28/04] adapted to recent updates.
I now roughly divide our current iomega/spec2vec workflow into separate parts, which I will work on one at a time to see how much of each part is already in place and what is still missing.
- Pre-processing (see #141) -- Import spectra from mgf and do cleaning, filtering, selection...
- Calculate Spec2Vec similarities -- Convert spectrums to documents, train word2vec model, calculate similarities.
- Calculate reference similarities -- Convert inchi/smiles to molecular fingerprints, calculate similarities based on fingerprints.
- Library matching -- Query a few specifically selected spectrums against a large reference set of spectrums.
2. Calculate Spec2Vec similarities
Code
# Start with cleaned, filtered set of spectra --> reference_spectrums_positive
import gensim
from matchms import calculate_scores
from matchms.similarity.spec2vec import Spec2Vec, SpectrumDocument
from matchms.similarity.spec2vec.Spec2Vec_utils import EpochLogger # not yet part of repo
documents = [SpectrumDocument(s) for s in reference_spectrums_positive]
# Create and train model
learning_rate_initial = 0.025
learning_rate_decay = 0.00025
iterations = 15
min_alpha = learning_rate_initial - iterations * learning_rate_decay
if min_alpha < 0:
    min_alpha = 0
epochlogger = EpochLogger(iterations)
# Note: gensim >= 4.0 renames `size` to `vector_size` and `iter` to `epochs`
model = gensim.models.Word2Vec([d.words for d in documents], sg=0, negative=5,
                               size=200, window=300, min_count=1, workers=4,
                               iter=iterations, alpha=learning_rate_initial,
                               min_alpha=min_alpha, seed=321, compute_loss=True,
                               callbacks=[epochlogger])
# Save trained model
model.save('model_spec2vec.model')
# Calculate similarity matrix (just testing...)
spec2vec = Spec2Vec(model=model, documents=documents)
similarities = calculate_scores(documents[:10], documents[:10], spec2vec).scores
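The `min_alpha` computation above can be read as a per-epoch decay: gensim lowers the learning rate linearly from `alpha` to `min_alpha` over training, so with these settings the rate drops by roughly `learning_rate_decay` per epoch. Below is a minimal sketch of that arithmetic in plain Python. The helper name `linear_learning_rates` is made up for illustration and is not part of matchms or gensim, and the per-epoch values are only nominal, since gensim updates the rate continuously during an epoch.

```python
def linear_learning_rates(initial, decay, epochs):
    """Return (min_alpha, nominal learning rate at the start of each epoch).

    Hypothetical helper for illustration only; it mirrors the min_alpha
    computation from the snippet above, including the clamp at 0.
    """
    min_alpha = max(initial - epochs * decay, 0.0)
    # Linear interpolation from `initial` towards `min_alpha` across all epochs
    rates = [initial - (initial - min_alpha) * epoch / epochs
             for epoch in range(epochs)]
    return min_alpha, rates


# Same settings as in the training snippet above
min_alpha, rates = linear_learning_rates(0.025, 0.00025, 15)
for epoch, rate in enumerate(rates, start=1):
    print("Epoch {:2d}: learning rate ~ {:.5f}".format(epoch, rate))
```

If `iterations * learning_rate_decay` exceeds the initial rate, `min_alpha` is clamped to 0 and the per-epoch step becomes `learning_rate_initial / iterations` instead.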
What's still missing
- Epoch-logger -- depending on the number of epochs, the number of documents, and the size of the documents, model training can take a while. Some type of logger is really needed to see that the process is working fine. For now I just used the following:
from gensim.models.callbacks import CallbackAny2Vec
class EpochLogger(CallbackAny2Vec):
    """Callback to log information about training progress.
    Used to keep track of gensim model training"""
    def __init__(self, num_of_epochs):
        self.epoch = 0
        self.num_of_epochs = num_of_epochs
        self.loss = 0
    def on_epoch_end(self, model):
        """Return progress of model training"""
        loss = model.get_latest_training_loss()
        # loss_now = loss - self.loss_to_be_subed
        print('\r',
              ' Epoch ' + str(self.epoch+1) + ' of ' + str(self.num_of_epochs) + '.',
              end="")
        print('Change in loss after epoch {}: {}'.format(self.epoch+1, loss - self.loss))
        self.epoch += 1
        self.loss = loss
- Possibility to return spectrum vectors (I tried adding this in #165).
- Test to assess if models trained elsewhere can be imported and used properly (in the past that caused issues!).
- Add word weighting to spectrum vector calculation (done --> #165)
- Switch to array based similarity score calculation (#163).
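To make the last three items more concrete, here is a rough sketch of what weighted spectrum vectors and an array based similarity calculation could look like with plain NumPy. It is not the matchms implementation: the function names, the toy word vectors, and the choice of weights (something like peak intensities) are all assumptions for illustration, and the sketch assumes no spectrum vector is all zeros.

```python
import numpy as np


def spectrum_vector(words, weights, word_vectors):
    """Weighted sum of the word vectors of one document (hypothetical helper)."""
    vectors = np.array([word_vectors[word] for word in words], dtype=float)
    weights = np.asarray(weights, dtype=float).reshape(-1, 1)
    return (vectors * weights).sum(axis=0)


def cosine_similarity_matrix(vectors_a, vectors_b):
    """All-vs-all cosine similarities, computed on arrays instead of in loops."""
    a = np.asarray(vectors_a, dtype=float)
    b = np.asarray(vectors_b, dtype=float)
    # Normalize each row to unit length (assumes no zero vectors)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# Toy 2-dimensional "word vectors"; a trained model would supply these instead
toy_word_vectors = {
    "peak@100.0": np.array([1.0, 0.0]),
    "peak@200.0": np.array([0.0, 1.0]),
    "peak@300.0": np.array([1.0, 1.0]),
}

# Two toy documents with made-up weights
vector_1 = spectrum_vector(["peak@100.0", "peak@200.0"], [1.0, 0.5], toy_word_vectors)
vector_2 = spectrum_vector(["peak@300.0"], [1.0], toy_word_vectors)

spectrum_vectors = np.vstack([vector_1, vector_2])
similarity_matrix = cosine_similarity_matrix(spectrum_vectors, spectrum_vectors)
```

Keeping the spectrum vectors as one array means they can be returned to the user directly, and the full similarity matrix comes from a single matrix product rather than a pairwise loop.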
Jurriaan H. Spaaks commented
@florian-huber Am I right in saying this issue can be closed? As far as I can tell we did everything except the logger. If you still think that is needed, can you give it a dedicated issue? Thanks!
Florian Huber commented
Thanks @jspaaks. I will indeed close this issue (and move the logger to a separate, more specific issue).