Removing the automatic use of spaces as word separators

Question

thetruejacob opened this issue 3 years ago · comments

Hi, I'm continuing this topic from here.

TL;DR: I would like to use word beam search (WBS) for writing systems that do not use spaces as word separators.

Could you point me in the right direction: where in the code does a space get inserted before moving on to predicting the next word? I assumed that if this piece of code were removed, I would get what I want: one long string of words, with spaces treated as just another character. I believe I already get this behavior with the normal beam search and best path decoders: the output looks right (but is of course less accurate), while the WBS decoder is more accurate but introduces unnecessary spaces.

Is there a way to use spaces only as a character, indistinguishable from any other character? Why do they keep getting introduced when moving on to the next beam?
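For reference, here is a minimal sketch of best path decoding, which has no notion of words, so a space is just one more label. This is not taken from the project; `bestPathDecode`, the matrix layout, and the convention that the blank is the last label are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Best path (greedy) CTC decoding: take the highest-scoring label at each
// time step, collapse repeated labels, then drop the blank label.
// Assumptions: mat[t][c] is the score of label c at time step t, each row has
// chars.size() + 1 entries, and the blank is the last label.
std::u32string bestPathDecode(const std::vector<std::vector<double>>& mat,
                              const std::u32string& chars)
{
    const std::size_t blank = chars.size();
    std::u32string result;
    std::size_t prev = blank;

    for (const auto& row : mat) {
        if (row.empty()) continue;  // skip malformed time steps

        // argmax over the labels of this time step
        std::size_t best = 0;
        for (std::size_t c = 1; c < row.size(); ++c)
            if (row[c] > row[best]) best = c;

        // Emit only when the label changes and is a real character.
        // A space in `chars` gets no special treatment here.
        if (best != prev && best != blank)
            result += chars[best];
        prev = best;
    }
    return result;
}
```

Because the space is only an entry in `chars`, it appears in the output only when the network itself predicts it.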

This would definitely help with my use case, the Thai language, and I'm sure many other users who want to adapt it to similar languages (Thai, Lao, Tibetan, Burmese, Mongolian, etc.) would also appreciate it.

Thank you again for your continued maintenance on this project.

Harald Scheidl · Answer 1 · Tue Mar 29 2022 03:12:29 GMT+0800 (China Standard Time)

Hi, it's been quite a while since I worked on that codebase. To give you detailed instructions, I would have to go back to the code and debug through all the files myself, and I do not have the time for that.

As I pointed out in the previous thread, I think changes will be needed in both the language model and the beam classes. I did not see an "easy fix" back then, so it really means stepping through the code with a debugger, seeing what's going on, and then adapting the code. Use a good C++ IDE like Visual Studio and start at main.cpp.

Basically, the changes you want should be possible, because it just means allowing a new word to start as soon as the previous word is finished. But much of the code currently relies on the space character, so it might take some work to change this.
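To illustrate the idea of allowing a new word as soon as the previous one is finished, here is a rough sketch with a toy prefix tree. It is not the project's code: `TrieNode`, `Trie`, and `nextCharsNoSeparator` are hypothetical names, and the real language model and beam classes will look different.

```cpp
#include <map>
#include <memory>
#include <set>
#include <string>

// Toy prefix tree over Unicode code points (hypothetical, for illustration).
struct TrieNode {
    std::map<char32_t, std::unique_ptr<TrieNode>> children;
    bool isWord = false;  // true if the path from the root spells a word
};

class Trie {
public:
    void insert(const std::u32string& word) {
        TrieNode* node = &root_;
        for (char32_t c : word) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->isWord = true;
    }

    // Returns the node reached by following `prefix`, or nullptr if the
    // prefix does not occur in the dictionary.
    const TrieNode* find(const std::u32string& prefix) const {
        const TrieNode* node = &root_;
        for (char32_t c : prefix) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node;
    }

    const TrieNode& root() const { return root_; }

private:
    TrieNode root_;
};

// Characters allowed after the partial word `prefix` when there is no
// separator character: either continue the current word, or, if the prefix
// is already a complete word, start any new word directly.
std::set<char32_t> nextCharsNoSeparator(const Trie& trie,
                                        const std::u32string& prefix)
{
    std::set<char32_t> allowed;
    const TrieNode* node = trie.find(prefix);
    if (!node) return allowed;

    for (const auto& kv : node->children) allowed.insert(kv.first);
    if (node->isWord)
        for (const auto& kv : trie.root().children) allowed.insert(kv.first);
    return allowed;
}
```

With a dictionary containing "ab", "abc" and "d", the prefix "ab" may be followed by 'c' (continuing to "abc") or by 'a' or 'd' (starting a new word). A beam extended this way would also have to remember whether a character continued the current word or started a new one, since the same character can do both; that bookkeeping is left out here.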