IlyaGusev / rnnmorph

Morphological analyzer for Russian and English languages based on neural networks and dictionary-lookup systems.

Implementation in a loop clogs up memory

molokanov50 opened this issue · comments

I need to determine the grammatical case of terms in the texts of a large dataset. I found that memory usage grows by roughly 0.3 to 0.7 MB on virtually every call of
forms = predictor.predict(terms).
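
For reference, a minimal sketch of how this growth can be observed is below. It uses psutil only to read the process RSS (psutil is not part of rnnmorph), and the predictor is created as in the README:

import os

import psutil
from rnnmorph.predictor import RNNMorphPredictor

predictor = RNNMorphPredictor(language="ru")
process = psutil.Process(os.getpid())

terms = "мама мыла раму".split()
for _ in range(100):
    rss_before = process.memory_info().rss
    forms = predictor.predict(terms)
    rss_after = process.memory_info().rss
    # RSS delta (MB) caused by a single predict() call
    print((rss_after - rss_before) / (1024 * 1024))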
Consider a simple example:

import re

def findCase(termNumber, text):  # find the grammatical case of the term with the given index in the text
    terms = text.split()
    forms = predictor.predict(terms)  # predictor is the RNNMorphPredictor instance created above
    myTag = forms[termNumber].tag
    parts = re.split('\\|', myTag)
    for part in parts:
        subparts = re.split('=', part)
        if len(subparts) < 2:
            continue
        if subparts[0] == 'Case':
            return subparts[1].upper()
    return 'UNDEF'

Then, given a collection of texts, I can run:

myDict = {}  # text index -> grammatical case of its first term
for i, text in enumerate(texts):
    myDict[i] = findCase(0, text)

I have 12,500 texts with an average length of about 700 characters each. Processing the whole dataset required an extra 1.5 GB of memory, which I attribute to the predictor.predict(terms) calls.
It looks as if my local variable forms stays in memory after the function returns, or is your RNNMorphPredictor model perhaps being trained further (and therefore growing) in this scenario? How can I free this memory?
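
One thing I considered was releasing the result explicitly and forcing a garbage-collection pass after each call; a sketch of that is below (standard-library gc only), although I am not sure it addresses whatever is actually retaining the memory:

import gc

myDict = {}
for i, text in enumerate(texts):
    myDict[i] = findCase(0, text)
    gc.collect()  # explicitly run the garbage collector after every predict() call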

Update: the length of each individual text makes no obvious difference. I reduced the input texts to 10 tokens (roughly 80 characters) each, and memory usage is the same: 1.5 GB per 12,500 texts. This makes my question even more pressing.
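
For completeness, a batched variant of the same processing is sketched below. It assumes the predictor exposes a predict_sentences method that accepts a list of token lists and returns one list of forms per text; I take that name as an assumption, so please correct me if the actual API differs. I mention it because issuing one predict call per text may itself be part of the problem:

# Hypothetical batched variant: one call for all texts instead of one call per text.
# Assumes predictor.predict_sentences(list_of_token_lists) exists and returns a list
# of per-sentence form lists.
sentences = [text.split() for text in texts]
all_forms = predictor.predict_sentences(sentences)

myDict = {}
for i, forms in enumerate(all_forms):
    myTag = forms[0].tag  # tag of the first term, as in findCase(0, ...)
    case = 'UNDEF'
    for part in myTag.split('|'):
        subparts = part.split('=')
        if len(subparts) == 2 and subparts[0] == 'Case':
            case = subparts[1].upper()
            break
    myDict[i] = case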