.cha file word segmentation

tnwh6921 opened this issue 3 years ago

Hello, may I please know whether it is possible to word-segment a .cha file or, better still, a ZIP archive containing .cha files? Thank you very much!

Jackson L. Lee · Answer 1 · Wed Sep 22 2021 23:21:28 GMT+0800 (China Standard Time)

Hello! If your CHAT data contains unsegmented Cantonese text, you can iterate through its utterances: read in your custom data (a ZIP archive, a directory of .cha files, or a single .cha file), then loop through the utterances as demonstrated in this tutorial. Each utterance should contain the unsegmented Cantonese text string, to which you can apply PyCantonese functions such as segment.
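A rough, untested sketch of that loop might look like the following. The names clean_main_tier and segment_chat are made up for illustration and are not part of PyCantonese. It also assumes that each utterance's `tiers` dictionary maps the participant code to the raw main-tier string, and that the string may still carry a CHAT terminator or time mark that you'd want to strip before segmenting.

```python
import re

# CHAT time marks look like \x15<start>_<end>\x15 and are not part of the text.
_TIME_MARK = re.compile(r"\x15\d+_\d+\x15")


def clean_main_tier(text):
    """Strip any time mark and a trailing CHAT terminator (. ? !) from `text`."""
    text = _TIME_MARK.sub("", text).strip()
    return re.sub(r"\s*[.?!]$", "", text)


def segment_chat(path):
    """Yield (participant, words) for each utterance found under `path`.

    `path` may be a single .cha file, a directory of .cha files, or a ZIP
    archive, which is what pycantonese.read_chat accepts.
    """
    # Imported here so that clean_main_tier stays usable without PyCantonese.
    import pycantonese

    corpus = pycantonese.read_chat(path)
    for utterance in corpus.utterances():
        # Assumption: `tiers` maps the participant code (e.g. "CHI") to the
        # raw, unsegmented main-tier string.
        raw = utterance.tiers.get(utterance.participant, "")
        yield utterance.participant, pycantonese.segment(clean_main_tier(raw))
```

You would then loop over something like segment_chat("my_data.zip"), where my_data.zip is a placeholder for your own file. If your transcripts use other CHAT annotations on the main tier, the cleaning step would need to handle those too.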

(Relatedly, I'm working on a general parsing function that takes Cantonese text data -- please see #30.)

tnwh6921 · Answer 2 · Thu Sep 23 2021 09:19:41 GMT+0800 (China Standard Time)

Thank you for your reply.

I am very excited to know about the new function! May I please confirm if, with the new function, the input would also have to be utterances instead of a .cha file or zip folder?

Thank you again!

Jackson L. Lee · Answer 3 · Thu Sep 23 2021 10:38:21 GMT+0800 (China Standard Time)

With the new parse_text function, you can keep your Cantonese text data in a plain text file (.txt is perfectly fine, and no CHAT formatting is needed), then read in the text file and pass the text string to parse_text. I haven't tested it yet, but I'd imagine something like the following:

import pycantonese

# Suppose you have data.txt with your Cantonese text.
with open("data.txt", encoding="utf-8") as f:
    # `f` is a file object for a plain text file,
    # so f.read() returns the entire file's text as a string.
    corpus = pycantonese.parse_text(f.read())
# Then do whatever you'd like with the `corpus` object.

In this hypothetical code snippet, because f.read() is a string, parse_text would attempt simple utterance-level segmentation by the punctuation marks {"，", "！", "。"} as well as the EOL character "\n". This is the case of "input 1: a plain string" described in #30 (comment). If you'd like more control over what counts as an utterance, you'd have to do your own munging and pass in a list of strings (= the case of "input 2: a list of strings" in #30 (comment)).
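For the "list of strings" route, the munging could be as simple as the sketch below. The helper name split_utterances and its default delimiters are my own choices for illustration, not anything PyCantonese provides. Here a comma does not end an utterance, but "。", "！", "？" and line breaks do.

```python
import re


def split_utterances(text, delimiters="。！？"):
    """Split `text` into utterances, keeping final punctuation attached.

    An utterance ends after a run of characters from `delimiters` or at a
    line break. Empty pieces are dropped.
    """
    chars = re.escape(delimiters)
    # A run of non-delimiter, non-newline characters, then any delimiters.
    pieces = re.findall(rf"[^{chars}\n]+[{chars}]*", text)
    return [piece.strip() for piece in pieces if piece.strip()]


utterances = split_utterances("你好，我係學生。今日好熱！\n聽日見")
# utterances == ["你好，我係學生。", "今日好熱！", "聽日見"]
```

The resulting list is what you would pass to parse_text in place of the plain string, so that each list item is treated as one utterance.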

tnwh6921 · Answer 4 · Thu Sep 23 2021 13:43:58 GMT+0800 (China Standard Time)

I see. Thank you very much!

Jackson L. Lee · Answer 5 · Wed Dec 29 2021 05:37:17 GMT+0800 (China Standard Time)

The new parse_text function has just been released alongside v3.4.0. More docs here.