mikahama / uralicNLP

An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. It also supports some non-Uralic languages such as Spanish, French, Arabic, Swedish, Norwegian, Russian and English.

Home Page: http://uralicnlp.com/

Refactor to make a more generalized cg3 library?

reynoldsnlp opened this issue

I am working on a similar project for Russian. I just got around to trying to implement vislcg3 in Python, and I found your repo.

Just as both of our modules depend on the hfst Python module, I was planning on using or making a separate module for cg3 that my project would depend on.

Would you be interested in refactoring your code to split it into two modules? One would be a more general Python implementation of cg3 subprocessing, and then your uralicNLP and my udar could both simply depend on that module for the cg3 parts of our projects. (Currently, if I understand correctly, your cg3.py combines downloading models, checking your online service, and the actual work of calling the subprocess to process input with a grammar.)

Let me know if you're interested.

Hi, I am thinking of making it a bit easier to bring your own models into UralicNLP. Currently, there is some documentation here: https://github.com/mikahama/uralicNLP/wiki/Models. Essentially, you can copy your cg and hfst files into one of the model folders uralicNLP uses and then use them in exactly the same way as the downloaded models.

You might be interested in running uralicApi.download("rus") as a starting point. This downloads the Russian FSTs and CG published by Giellatekno.
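For reference, here is a minimal sketch of what that starting point might look like, assuming uralicNLP is installed and that the downloaded Russian model includes a CG. The helper names `tokenize` and `disambiguate_sentence` are made up for this illustration; the `Cg3` class and its `disambiguate` method follow the usage shown in the project README.

```python
from typing import List


def tokenize(sentence: str) -> List[str]:
    """Naive whitespace tokenizer (illustrative only)."""
    return sentence.split()


def disambiguate_sentence(sentence: str, lang: str = "rus", download: bool = False):
    """Run the downloaded FST + CG pipeline on one sentence.

    uralicNLP is imported lazily so that this file can be loaded
    even on a machine where the library is not installed.
    """
    from uralicNLP import uralicApi
    from uralicNLP.cg3 import Cg3

    if download:
        # Fetches the models for `lang` (needs network access).
        uralicApi.download(lang)

    cg = Cg3(lang)
    # Per the README, this returns (word form, list of readings) pairs.
    return cg.disambiguate(tokenize(sentence))


# Example call (not executed here, since it needs the models):
# for word, readings in disambiguate_sentence("Я читаю книгу", download=True):
#     print(word, [(r.lemma, r.morphology) for r in readings])
```

The `download` flag is off by default so that repeated calls reuse the models already on disk.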

Would this work for you, or are you interested in lower-level access to the cg output?

It's great to know that you can access any of the Giellatekno FSTs directly. However, my particular FST has several unique features that don't really have parallels with the others, so I don't think it makes much sense to mash them together.

Now that I look at it more carefully, the CG3 part of this will be trivial to implement as a subprocess, so I don't think I'm going to work on it as a separate module, either.
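A rough sketch of that subprocess approach, assuming the `vislcg3` binary is on the PATH and that the input is already in CG stream format. The function names `build_cg3_command` and `run_filter` are hypothetical, not part of uralicNLP or udar; the only vislcg3 option used is `--grammar`.

```python
import subprocess
from typing import List, Sequence


def build_cg3_command(grammar_path: str, executable: str = "vislcg3") -> List[str]:
    """Build the argument list for applying one grammar to stdin."""
    return [executable, "--grammar", grammar_path]


def run_filter(cmd: Sequence[str], text: str) -> str:
    """Pipe `text` through `cmd` and return its standard output.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        cmd,
        input=text,
        capture_output=True,
        encoding="utf-8",  # implies text mode for stdin/stdout
        check=True,
    )
    return result.stdout


# Example call (assumes vislcg3 and a grammar file are available):
# cohorts = '"<слово>"\n\t"слово" N Neu Inan Sg Nom\n\t"слово" N Neu Inan Sg Acc\n'
# print(run_filter(build_cg3_command("disambiguator.cg3"), cohorts))
```

Keeping the command builder separate from the generic pipe helper means the same `run_filter` can also be reused for other stream tools in the pipeline.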

In any case, I am very happy to have found your project. Good luck!