Hironsan / anago

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

Home Page: https://anago.herokuapp.com/

OOM with IndexedSlices Conversion

WenYanger opened this issue

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow/Keras version: 1.8.0 & 2.1.6
  • Python version: 3.5.0
  • Anago Version: 1.0.6

Describe the problem

An OOM (out of memory) error occurred with the following warning:

UserWarning: Converting sparse IndexedSlices to a dense Tensor with 577296600 elements. This may consume a large amount of memory.

However, a highly similar dataset runs fine with the same code. The size and format of the data are the same; it looks like this:

Text:   [['AAA', 'BBB', 'CCC'], ['AAA', 'BBB', 'CCC', 'DDD']]
Label: [['1', '0', '1'], ['1', '0', '1', '0']]

I wonder which step in my code (or data) leads to this warning, because the other, similar dataset never raised it ~ T.T

An article on Stack Overflow says this warning is caused by the TensorFlow function tf.gather(). Maybe that is the issue?

https://stackoverflow.com/questions/45882401/how-to-deal-with-userwarning-converting-sparse-indexedslices-to-a-dense-tensor
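
For what it's worth, 577,296,600 = 1,924,322 × 300, so the dense tensor in the warning matches a roughly 1.9M-word vocabulary with 300-dimensional embeddings, i.e. about 2.3 GB as float32. Below is a minimal TF 1.x sketch (not anago's code; the sizes are my assumption based on that arithmetic) of how the gradient of tf.gather becomes an IndexedSlices and gets densified:

import tensorflow as tf

# Hypothetical sizes chosen so that vocab_size * embedding_dim equals the
# 577,296,600 elements reported in the warning above.
vocab_size, embedding_dim = 1924322, 300

embeddings = tf.get_variable('embeddings', [vocab_size, embedding_dim])
ids = tf.placeholder(tf.int32, [None])

looked_up = tf.gather(embeddings, ids)   # embedding lookup
loss = tf.reduce_sum(looked_up)

# The gradient w.r.t. the full embedding matrix is a tf.IndexedSlices.
grad, = tf.gradients(loss, [embeddings])

# Any consumer that needs a dense tensor (tf.convert_to_tensor here, or an
# optimizer op without sparse support) triggers the "Converting sparse
# IndexedSlices to a dense Tensor" warning and materializes ~2.3 GB.
dense_grad = tf.convert_to_tensor(grad)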

Source code / logs

import os
import pickle

import gensim
import numpy as np

print('Loading Data')
# Keyword lists and tokenized corpus, pickled elsewhere.
corpus_k = pickle.load(open('../data/keywords_cleaned_100.pkl', 'rb'))
corpus_c = pickle.load(open('../data/corpus_cleaned_100.pkl', 'rb'))


# Build binary labels: '1' if the token also appears in the keyword list, else '0'.
if os.path.exists('../data/y_keyword_retrival.pkl'):
    y = pickle.load(open('../data/y_keyword_retrival.pkl', 'rb'))
else:
    y = []
    for i in range(corpus_c.shape[0]):
        if i % 1000 == 0: print(i)
        t1 = corpus_k[i]
        t2 = corpus_c[i]

        s1 = set(t1)
        l = []
        for word in t2:
            if word in s1:
                l.append('1')
            else:
                l.append('0')
        y.append(l)
    y = np.array(y)
    pickle.dump(y, open('../data/y_keyword_retrival.pkl', 'wb+'))

# Pre-trained word2vec embeddings (300-dimensional).
vec = gensim.models.word2vec.Word2Vec.load('../data/w2v_0428')

weights_file = './model_weights.h5'
params_file = './params.json'
preprocessor_file = './preprocessor.json'

print('Train Test Split ... ')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(corpus_c, y, test_size=0.1, random_state=42)

print('Training ... ')
import anago
# Bi-LSTM-CRF sequence labeler; character features disabled.
model = anago.Sequence(
    word_lstm_size=300,
    word_embedding_dim=300,
    embeddings=vec,
    use_char=False
)
model.fit(X_train, y_train, batch_size=256, epochs=5)
s = model.score(X_test, y_test)
model.save(weights_file, params_file, preprocessor_file)

Same issue, any hints?

No idea, bro.

If this is still relevant to you: I found the cause for my problem, which involved very long documents and very long words. The current code pads sequences and tokens to the longest sequence and token within the current batch, so if, for example, one token in a batch has length 1000, every token in that batch gets padded to that length, which can increase the required memory allocation heavily.
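
A tiny, made-up illustration of the effect (the batch contents and shapes below are hypothetical):

# One outlier token of length 1000 in the batch ...
batch = [['a', 'b'], ['c' * 1000, 'd']]
max_token_len = max(len(w) for sent in batch for w in sent)   # -> 1000

# ... forces the character input tensor to roughly
# (batch_size, max_seq_len, max_token_len) = (2, 2, 1000)
# instead of (2, 2, 1), a 1000x increase caused by a single token.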

One solution is to change the padding code in the package; a simpler one is to pre-process your data so that sequences are at most X tokens long and tokens at most Y characters long, as in the sketch below.
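
A rough sketch of the pre-processing option (MAX_SEQ_LEN, MAX_TOKEN_LEN and truncate are hypothetical names, not part of anago):

MAX_SEQ_LEN = 200     # keep at most 200 tokens per sentence
MAX_TOKEN_LEN = 30    # keep at most 30 characters per token

def truncate(sentences, labels):
    # Cut every sentence to MAX_SEQ_LEN tokens and every token to
    # MAX_TOKEN_LEN characters, keeping labels aligned with the tokens.
    X, y = [], []
    for words, tags in zip(sentences, labels):
        X.append([w[:MAX_TOKEN_LEN] for w in words[:MAX_SEQ_LEN]])
        y.append(tags[:MAX_SEQ_LEN])
    return X, y

X_train, y_train = truncate(X_train, y_train)
model.fit(X_train, y_train, batch_size=256, epochs=5)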