batra-mlp-lab / visdial-challenge-starter-pytorch

Starter code in PyTorch for the Visual Dialog challenge

Home Page: https://visualdialog.org/challenge/2019


Need suggestions about embeddings

shubhamagarwal92 opened this issue · comments

I am trying to use ELMo embeddings from AllenNLP and need some suggestions.

In the for loop of __getitem__, before the question is converted to indices, I also save the raw (tokenized) question:

dialog[i]["raw_question"] = dialog[i]["question"] # Tokenized

which can then be converted to char_ids and ELMo embeddings:

        ques_char_ids = batch_to_ids([dialog_round["raw_question"] for dialog_round in dialog])
        ques_elmo_emb = self._elmo_wrapper(ques_char_ids)


    def _elmo_wrapper(self, char_ids, max_sequence_length=None):
        """
        Run ELMo on a batch of character ids and pad/truncate the output
        to a fixed sequence length.

        Refer: https://github.com/allenai/allennlp/issues/2659

        Parameters
        ----------
        char_ids : torch.Tensor
            Character ids of the raw sequences, as returned by ``batch_to_ids``.

        Returns
        -------
        torch.Tensor
            ELMo representations padded (or truncated) to ``max_sequence_length``.
        """
        if not max_sequence_length:
            max_sequence_length = self.config["max_sequence_length"]
        # Also tried wrapping this in torch.no_grad() / calling .requires_grad_(False);
        # detaching keeps the embeddings fixed either way.
        elmo_seq = self.elmo(char_ids)["elmo_representations"][0].detach()
        batch_size, timesteps, emb_dim = elmo_seq.size()
        if timesteps > max_sequence_length:
            # Truncate to the maximum sequence length.
            elmo_emb = elmo_seq[:, :max_sequence_length, :]
        else:
            # Pad with zeros up to the maximum sequence length.
            zeros_size = max_sequence_length - timesteps
            zeros = torch.zeros(batch_size, zeros_size, emb_dim).type_as(elmo_seq)
            elmo_emb = torch.cat([elmo_seq, zeros], 1)

        return elmo_emb
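
For reference, batch_to_ids and self.elmo come from AllenNLP. A rough sketch of how the module is set up in the reader's __init__ (the class name, config keys, and file paths below are placeholders, not part of the starter code):

    from allennlp.modules.elmo import Elmo, batch_to_ids

    class DialogsReaderWithElmo:
        """Sketch only: constructing the ELMo module inside the dataset reader."""

        def __init__(self, config):
            self.config = config
            # Placeholder paths -- point these at the ELMo options/weights you downloaded.
            options_file = "elmo_options.json"
            weight_file = "elmo_weights.hdf5"
            # One output representation, no dropout; requires_grad=False since the
            # embeddings are used fixed (they are detached in _elmo_wrapper anyway).
            self.elmo = Elmo(
                options_file,
                weight_file,
                num_output_representations=1,
                dropout=0.0,
                requires_grad=False,
            )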

However, training becomes very slow. Do you have any experience with ELMo, and can you suggest why this is happening?

I think one possible workaround is to extract and save the embeddings as a pre-processing step. Could you please share your data generation scripts?

I have used ELMo in the past, but have mostly pre-extracted the embeddings and used them fixed. Consider it analogous to extracting FC7 features from an image using some CNN. I would recommend extracting them all to an H5 file and writing a separate reader class, or incorporating them in DialogsReader.

Refer to this to see how embeddings for the whole dataset can be extracted to an H5 file. You will only need to do it once:

https://allenai.github.io/allennlp-docs/api/allennlp.commands.elmo.html

The input file is previously tokenized, whitespace separated text, one sentence per line.

You could use something like " ".join(word_tokenize(question)) and dump the lists of questions, answers, and captions to separate text files, then run AllenNLP's command on those text files to get the H5 files.
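
A rough sketch of the dumping step (the JSON layout assumed here is VisDial v1.0; file and path names are illustrative):

    import json
    from nltk.tokenize import word_tokenize  # requires nltk's "punkt" tokenizer data

    # Assumed path and JSON layout (VisDial v1.0) -- adjust to your setup.
    with open("data/visdial_1.0_train.json") as f:
        data = json.load(f)["data"]

    # Questions and answers are flat lists of strings in VisDial v1.0.
    for name in ("questions", "answers"):
        with open("{}_train.txt".format(name), "w") as out:
            for sentence in data[name]:
                out.write(" ".join(word_tokenize(sentence)) + "\n")

    # Captions live inside each dialog entry.
    with open("captions_train.txt", "w") as out:
        for dialog in data["dialogs"]:
            out.write(" ".join(word_tokenize(dialog["caption"])) + "\n")

    # Then run AllenNLP's ELMo command on each text file (see the linked docs),
    # e.g. something along the lines of:
    #   allennlp elmo questions_train.txt questions_train_elmo.h5 --average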

Since the tokenization strategy is the same (word_tokenize), the number of tokens per sequence will match the pre-processing done on the fly.

I think it makes sense to go down this path because the tokenization strategy (word_tokenize) may not change too much (unless you do something fancy like Byte-Pair Encoding). Even if it does, it's fairly straightforward to generate similar H5 files (and easily switch their usage through configs, say). Consider it analogous to having different sets of features from different types of CNNs / detectors.
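
As a sketch of such a reader (assuming the H5 layout written by AllenNLP's elmo command, with one dataset per input line keyed by its line number; the class and file names are illustrative, not from the starter code):

    import h5py
    import torch

    class ElmoEmbeddingsReader:
        """Sketch: read pre-extracted ELMo embeddings, one dataset per sentence."""

        def __init__(self, h5_path):
            # Keep the file handle open for random access during training.
            self._h5 = h5py.File(h5_path, "r")

        def __getitem__(self, index):
            # Assumed layout: each sentence's embeddings are stored under the
            # line number (as a string) of that sentence in the dumped text file.
            return torch.from_numpy(self._h5[str(index)][()])

    # Hypothetical usage:
    #   question_elmo = ElmoEmbeddingsReader("questions_train_elmo.h5")
    #   emb = question_elmo[42]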

I think one possible workaround is to extract and save the embeddings as a pre-processing step. Could you please share your data generation scripts?

We pre-process nothing beforehand (except the image features); all the pre-processing code is right in the reader. I guess you got that by now, since this question is from a while back. The issue is quite dated and I am not sure how I missed this thread, sorry about that :(

I am a bit confused: you say "We pre-process nothing beforehand (except the image features)", but on the other hand you have pre-extracted the embeddings! Could you share your dataset reader for this, particularly the part where you manage the history?

I am a bit confused: you say "We pre-process nothing beforehand (except the image features)", but on the other hand you have pre-extracted the embeddings!

Putting my response from our mail here for completeness:

ELMo embeddings are not supported in this public release; pre-extracting fixed embeddings, however, should work fine and give a performance gain. Think of it as extracting image features from a pre-trained CNN/detector. :-)