lingpy / linse

A Python library for the manipulation of linguistic sequences.

transform functionalities

LinguList opened this issue

Transform (or manipulate) functions make another sequence out of a given sequence:

  • lingpy.sequence.soundclasses.syllabify (infers syllable boundaries and inserts them in the form of +)
  • lingpy.sequence.soundclasses.get_all_ngrams (a quite useful NLP function and a classic example of sequence manipulation, but it also occurs in sequence.ngrams, so it is duplicated (!))
  • lingpy.sequence.soundclasses.tokens2morphemes

And maybe some of the ngram functions, but they are also rather specific, I think.
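To make the idea of "another sequence out of a given sequence" concrete, here is a minimal sketch of one such transformation: splitting a token list into chunks at the + boundary marker mentioned above. The name split_on_marker and its marker parameter are hypothetical, chosen for illustration only; this is not how lingpy implements tokens2morphemes or syllabify.

```python
from typing import List, Sequence


def split_on_marker(tokens: Sequence[str], marker: str = "+") -> List[List[str]]:
    """Split a token sequence into chunks at every occurrence of `marker`.

    Hypothetical helper for illustration; not the lingpy implementation.
    """
    chunks: List[List[str]] = [[]]
    for token in tokens:
        if token == marker:
            # A boundary marker closes the current chunk and opens a new one.
            chunks.append([])
        else:
            chunks[-1].append(token)
    # Drop empty chunks caused by leading, trailing, or doubled markers.
    return [chunk for chunk in chunks if chunk]


print(split_on_marker("h a n + t a".split()))
# [['h', 'a', 'n'], ['t', 'a']]
```

The input is left untouched and a new nested list is returned, which is what would make a function like this a transform rather than an in-place edit.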

Regarding ngrams, I'm not sure this is needed, considering how short it is to implement:

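def ngrams(seq):
    """Yield every contiguous subsequence of seq, longest first."""
    # i + 1 is the ngram length, counting down from len(seq) to 1
    for i in reversed(range(len(seq))):
        # j is the start index of each ngram of that length
        for j in range(len(seq) - i):
            yield seq[j:j + i + 1]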
             
> list(ngrams(list('abcdefg')))
[['a', 'b', 'c', 'd', 'e', 'f', 'g'], ['a', 'b', 'c', 'd', 'e', 'f'], ['b', 'c', 'd', 'e', 'f', 'g'], ['a', 'b', 'c', 'd', 'e'], ['b', 'c', 'd', 'e', 'f'], ['c', 'd', 'e', 'f', 'g'], ['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'e'], ['c', 'd', 'e', 'f'], ['d', 'e', 'f', 'g'], ['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['d', 'e', 'f'], ['e', 'f', 'g'], ['a', 'b'], ['b', 'c'], ['c', 'd'], ['d', 'e'], ['e', 'f'], ['f', 'g'], ['a'], ['b'], ['c'], ['d'], ['e'], ['f'], ['g']]

get_all_posngrams seems a lot more powerful. So I'd rather just not add such a function here.

Just thought about the ngram functions. They are basically all easy to implement, including bigrams, trigrams, and the like. They are not strictly needed right now; it would rather be handy to have them somewhere for developing new experiments and algorithms. If needed, one could add them to a dedicated ngram module of linse, I think, since ngrams are a recognizably distinct kind of manipulation.
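As a rough illustration of how little code fixed-order ngrams take, here is a minimal sketch. The names ngrams_of_order, bigrams and trigrams are hypothetical and used only for this example; they are not taken from linse, they do no padding of the sequence edges, and they may well behave differently from the lingpy functions.

```python
from typing import Iterator, List, Sequence, Tuple


def ngrams_of_order(seq: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield all contiguous ngrams of length n, left to right.

    Hypothetical helper for illustration; no edge padding is applied.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence shorter than n yields nothing, since the range is empty.
    for start in range(len(seq) - n + 1):
        yield tuple(seq[start:start + n])


def bigrams(seq: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return all ngrams of length 2."""
    return list(ngrams_of_order(seq, 2))


def trigrams(seq: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return all ngrams of length 3."""
    return list(ngrams_of_order(seq, 3))


print(bigrams(list("abcd")))
# [('a', 'b'), ('b', 'c'), ('c', 'd')]
print(trigrams(list("abcd")))
# [('a', 'b', 'c'), ('b', 'c', 'd')]
```

Tuples are used here so the ngrams are hashable and can be counted directly, for example with collections.Counter; returning lists, as in the snippet earlier in this thread, would work just as well.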

So in my opinion, we can drop this for the time being and mark this closed.