Spec2Vec workflow -- part 2

florian-huber opened this issue 4 years ago · comments

[edit 28/04] adapted to recent updates.

I have roughly divided our current iomega/spec2vec workflow into four parts. I will work on each part separately to see how much of it is already in place and what is still missing.

1. Pre-processing (see #141) -- Import spectra from mgf and do cleaning, filtering, selection...
2. Calculate Spec2Vec similarities -- Convert spectra to documents, train a word2vec model, calculate similarities.
3. Calculate reference similarities -- Convert InChI/SMILES to molecular fingerprints, calculate similarities based on the fingerprints.
4. Library matching -- Query a few specifically selected spectra against a large reference set of spectra.

2. Calculate Spec2Vec similarities

Code

# Start with cleaned, filtered set of spectra --> reference_spectrums_positive
import gensim
from matchms import calculate_scores
from matchms.similarity.spec2vec import Spec2Vec, SpectrumDocument
from matchms.similarity.spec2vec.Spec2Vec_utils import EpochLogger  # not yet part of repo

documents = [SpectrumDocument(s) for s in reference_spectrums_positive]

# Create and train model
learning_rate_initial = 0.025
learning_rate_decay = 0.00025
iterations = 15

# Final learning rate after all iterations, clamped so it never goes below 0
min_alpha = max(learning_rate_initial - iterations * learning_rate_decay, 0)

epochlogger = EpochLogger(iterations)
# Note: written for gensim 3.x; in gensim >= 4.0, `size` is `vector_size` and `iter` is `epochs`
model = gensim.models.Word2Vec([d.words for d in documents], sg=0, negative=5,
                               size=200, window=300, min_count=1, workers=4,
                               iter=iterations, alpha=learning_rate_initial,
                               min_alpha=min_alpha, seed=321, compute_loss=True,
                               callbacks=[epochlogger])

# Save trained model
model.save('model_spec2vec.model')

# Calculate similarity matrix (just testing...)
spec2vec = Spec2Vec(model=model, documents=documents)
similarities = calculate_scores(documents[:10], documents[:10], spec2vec).scores
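
To make the `min_alpha` computation above easier to follow, here is a minimal sketch of the learning-rate schedule it implies. It assumes the learning rate decays linearly from `alpha` to `min_alpha` over the epochs, which is my understanding of gensim's behaviour. The helper names `min_alpha_for` and `alpha_at_epoch_start` are made up for illustration and are not part of gensim or matchms.

```python
def min_alpha_for(alpha_initial, decay_per_epoch, epochs):
    """Final learning rate after `epochs` decay steps, clamped at 0."""
    return max(alpha_initial - epochs * decay_per_epoch, 0.0)


def alpha_at_epoch_start(epoch, alpha_initial, min_alpha, epochs):
    """Learning rate at the start of a 0-indexed epoch, assuming linear decay."""
    progress = epoch / epochs
    return alpha_initial - (alpha_initial - min_alpha) * progress


# Same values as in the training code above
alpha_initial, decay, epochs = 0.025, 0.00025, 15
min_alpha = min_alpha_for(alpha_initial, decay, epochs)
schedule = [alpha_at_epoch_start(e, alpha_initial, min_alpha, epochs)
            for e in range(epochs)]

for epoch, alpha in enumerate(schedule, start=1):
    print("epoch {:2d}: alpha = {:.5f}".format(epoch, alpha))
```

Under that assumption, each epoch starts with a learning rate that is `learning_rate_decay` lower than the previous one, unless the clamp at 0 kicks in.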

What's still missing

Epoch logger -- Depending on the number of epochs, the number of documents, and the size of the documents, model training can take a while, so some kind of logger is needed to confirm that training is progressing. For now I just used the following:

from gensim.models.callbacks import CallbackAny2Vec


class EpochLogger(CallbackAny2Vec):
    """Callback to log information about training progress.
    Used to keep track of gensim model training."""

    def __init__(self, num_of_epochs):
        self.epoch = 0
        self.num_of_epochs = num_of_epochs
        self.loss = 0  # loss reported at the end of the previous epoch

    def on_epoch_end(self, model):
        """Print the progress of model training."""
        # The reported loss is treated as a running total, so the previous
        # value is subtracted to get the change during this epoch.
        loss = model.get_latest_training_loss()
        print('Epoch {} of {}. Change in loss after epoch {}: {}'.format(
            self.epoch + 1, self.num_of_epochs, self.epoch + 1, loss - self.loss))
        self.epoch += 1
        self.loss = loss
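
The logger above treats the value from `get_latest_training_loss()` as a running total and reports the difference between epochs. As a rough sketch of how that logic could be checked without running a full training, the variant below records the per-epoch change instead of printing it. `LossTracker` and `FakeModel` are hypothetical names; `FakeModel` is only a stand-in that replays made-up cumulative losses, not a gensim class.

```python
class LossTracker:
    """Records the per-epoch change of a cumulative training loss."""

    def __init__(self):
        self.previous_loss = 0.0
        self.epoch_losses = []

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()
        self.epoch_losses.append(cumulative - self.previous_loss)
        self.previous_loss = cumulative


class FakeModel:
    """Hypothetical stand-in for a trained model; replays cumulative losses."""

    def __init__(self, cumulative_losses):
        self._losses = iter(cumulative_losses)

    def get_latest_training_loss(self):
        return next(self._losses)


tracker = LossTracker()
model = FakeModel([100.0, 180.0, 240.0])  # made-up numbers
for _ in range(3):
    tracker.on_epoch_end(model)

print(tracker.epoch_losses)
```

Keeping the losses in a list would also make it possible to plot or inspect them after training, rather than only seeing them scroll by.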

Possibility to return spectrum vectors (I tried adding this in #165).
Test to assess whether models trained elsewhere can be imported and used properly (this caused issues in the past!).
Add word weighting to the spectrum vector calculation (done --> #165).
Switch to array-based similarity score calculation (#163).
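
To illustrate the spectrum-vector, word-weighting and array-based points above, here is a minimal numpy sketch. It assumes a spectrum vector is the weighted sum of its word vectors and that similarity is the cosine between spectrum vectors. The word names, the weights and the functions `document_vector` and `cosine_similarity_matrix` are made up for illustration; they are not the matchms implementation.

```python
import numpy as np


def document_vector(words, weights, word_vectors):
    """Weighted sum of the word vectors of one document."""
    vectors = np.array([word_vectors[w] for w in words])
    return np.asarray(weights) @ vectors


def cosine_similarity_matrix(vectors_a, vectors_b):
    """All-vs-all cosine similarities between two stacks of row vectors.

    Assumes no row is an all-zero vector (that would divide by zero).
    """
    a = vectors_a / np.linalg.norm(vectors_a, axis=1, keepdims=True)
    b = vectors_b / np.linalg.norm(vectors_b, axis=1, keepdims=True)
    return a @ b.T


# Toy 2-D word vectors (hypothetical; a trained model would supply these)
word_vectors = {
    "peak@100.0": np.array([1.0, 0.0]),
    "peak@200.0": np.array([0.0, 1.0]),
}
# Each document: (words, weights), e.g. weights derived from peak intensities
docs = [
    (["peak@100.0", "peak@200.0"], [1.0, 1.0]),
    (["peak@100.0"], [0.5]),
    (["peak@200.0"], [2.0]),
]

vectors = np.array([document_vector(w, wt, word_vectors) for w, wt in docs])
similarities = cosine_similarity_matrix(vectors, vectors)
print(similarities)
```

Computing all scores with one matrix product avoids a Python loop over every pair of spectra, which is the main appeal of the array-based approach.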

Jurriaan H. Spaaks · Answer 1 · Mon May 18 2020 18:50:18 GMT+0800 (China Standard Time)

@florian-huber Am I right in saying this issue can be closed? As far as I can tell we did everything except the logger. If you still think that is needed, can you give it a dedicated issue? Thanks!

Florian Huber · Answer 2 · Mon May 18 2020 18:53:26 GMT+0800 (China Standard Time)

Thanks @jspaaks. I will indeed close this issue (and move the logger to a separate, more specific issue).