New output style request of pycantonese.characters_to_jyutping
hgneng opened this issue · comments
Feature you are interested in and your specific question(s):
I would like a new output style for pycantonese.characters_to_jyutping, something like this:
>>> pycantonese.characters_to_jyutping('香港人講廣東話', style='single_jyutping_list')
['hoeng1', 'gong2', 'jan4', 'gong2', 'gwong2', 'dung1', 'waa2']
What you are trying to accomplish with this feature or functionality:
I want to get a simple character-to-Jyutping list. The original output style is a bit hard to use for further processing.
Additional context:
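Currently, I use the following code to achieve this:
import re
from typing import List

import pycantonese

def _cantonese_character_to_jyutping(text: str) -> List[str]:
    jyutpings = pycantonese.characters_to_jyutping(text)
    ret = []
    for word in jyutpings:
        # word is a (characters, jyutping) pair; keep only the Jyutping string
        jyutpingWord = word[1]
        # split the concatenated Jyutping into syllables (letters + tone digits)
        ret.extend(re.findall(r'[a-zA-Z]+[0-9]+', jyutpingWord))
    return ret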
Hello! It appears that the output of the characters_to_jyutping function provides additional information (word segmentation, plus the Chinese/Cantonese characters for the given word segmentation) that you're not interested in, and that it's just a few lines of code of your own (which you already have figured out) to post-process the result of characters_to_jyutping for what you want. So I'm not sure if it's worth adding options to characters_to_jyutping as you've suggested.
Alternatively, to get what you would like, combining characters_to_jyutping with the implemented parse_jyutping function would also work (so that you don't have to do regex parsing on your own to break up the Jyutping string by syllables):
In [1]: import pycantonese
In [2]: pycantonese.__version__
Out[2]: '3.4.0'
In [3]: result = []
In [4]: for _, jyutpings in pycantonese.characters_to_jyutping('香港人講廣東話'):
   ...:     for jp in pycantonese.parse_jyutping(jyutpings):
   ...:         result.append(str(jp))
   ...:
In [5]: result
Out[5]: ['hoeng1', 'gong2', 'jan4', 'gong2', 'gwong2', 'dung1', 'waa2']
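If the flattening is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch, not part of pycantonese: flatten_jyutping is a hypothetical name, it works on (word, jyutping) pairs of the shape that characters_to_jyutping returns, and it assumes the Jyutping entry may be None for a word with no known romanization. The sample pairs are written by hand for illustration, so the word segmentation shown is an assumption rather than actual library output.

```python
import re
from typing import List, Optional, Sequence, Tuple

# One Jyutping syllable: lowercase letters followed by a tone digit (1-6).
_SYLLABLE = re.compile(r"[a-z]+[1-6]")


def flatten_jyutping(pairs: Sequence[Tuple[str, Optional[str]]]) -> List[str]:
    """Flatten (word, jyutping) pairs into a flat list of syllables.

    Hypothetical helper: `pairs` is assumed to have the shape returned by
    characters_to_jyutping, with None allowed as the Jyutping entry.
    """
    syllables: List[str] = []
    for _word, jyutping in pairs:
        if jyutping is None:
            # Skip words without a romanization (assumed possible).
            continue
        syllables.extend(_SYLLABLE.findall(jyutping))
    return syllables


# Hand-written sample input; the segmentation here is illustrative only.
pairs = [
    ("香港人", "hoeng1gong2jan4"),
    ("講", "gong2"),
    ("廣東話", "gwong2dung1waa2"),
]
print(flatten_jyutping(pairs))
```

With the real library, the call would presumably look like flatten_jyutping(pycantonese.characters_to_jyutping(text)). Using parse_jyutping as shown above instead of the regex is the stricter choice, since it validates each syllable.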
Thank you for your reply. It doesn't matter whether the new output style is supported. I'm just not very familiar with Python, and it cost me a few extra minutes to ask an AI how to do it. In fact, after some more investigation, I found that I need the original style with word segmentation.