jacksonllee / pycantonese

Cantonese Linguistics and NLP

Home Page: https://pycantonese.org

.cha file word segmentation

tnwh6921 opened this issue

Hello, may I please know if it would be possible to word segment a .cha file or, better still, a zip folder containing .cha files? Thank you very much!

Hello! If your CHAT data contains unsegmented Cantonese text, you can read it in (as a ZIP archive, a directory of .cha files, or a single .cha file) and then loop through the utterances, as demonstrated in the tutorial. Each utterance should contain an unsegmented Cantonese text string, to which you can apply PyCantonese functions such as segment.
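To make the loop-and-segment idea concrete, here is a minimal sketch. It does not use PyCantonese's own CHAT reader, whose exact calls are not shown in this thread; instead it picks out main-tier lines (those starting with `*`) by hand, which is a simplification that ignores continuation lines and other CHAT details. The helper name `segment_chat_text` and the sample data are hypothetical, and if PyCantonese is not installed the sketch falls back to a stand-in that splits text into single characters.

```python
import re

try:
    # The real PyCantonese word segmenter, if the package is installed.
    from pycantonese import segment
except ImportError:
    # Stand-in so the sketch still runs: one token per character.
    def segment(text):
        return list(text)


def segment_chat_text(chat_text):
    """Return (participant, words) pairs for each main-tier line in CHAT text.

    Hypothetical helper: a hand-rolled, simplified reading of CHAT lines.
    """
    results = []
    for line in chat_text.splitlines():
        # Main tiers look like "*XXA:<tab>utterance"; header lines (@)
        # and dependent tiers (%) do not match and are skipped.
        match = re.match(r"\*(\w+):\s*(.*)", line)
        if match is None:
            continue
        participant, utterance = match.groups()
        # Assumption: the utterance is unsegmented, so any spaces are dropped.
        unsegmented = "".join(utterance.split())
        results.append((participant, segment(unsegmented)))
    return results


# Made-up sample data standing in for the contents of a .cha file.
sample = "@Begin\n*XXA:\t廣東話好難學\n%com:\ta comment\n*XXB:\t你食咗飯未\n@End\n"
segmented = segment_chat_text(sample)
for participant, words in segmented:
    print(participant, words)
```

For a ZIP archive or a directory, the same function could be applied to the text of each .cha file in turn.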

(Relatedly, I'm working on a general parsing function that takes Cantonese text data -- please see #30.)

Thank you for your reply.

I am very excited to know about the new function! May I please confirm if, with the new function, the input would also have to be utterances instead of a .cha file or zip folder?

Thank you again!

With the new parse_text function, you can have your Cantonese text data in a plain text file (.txt is perfectly fine, and no CHAT formatting is needed), then read in the text file and pass the text string to parse_text. I haven't tested it yet, but I'd imagine something like the following:

import pycantonese

# Suppose you have data.txt with your Cantonese text.
with open("data.txt", encoding="utf-8") as f:
    # `f` is a file object for a plain text file, so f.read()
    # returns the entire file's text as a single string.
    corpus = pycantonese.parse_text(f.read())
    # Then do whatever you'd like with the `corpus` object.

In this hypothetical code snippet, because f.read() is a string, parse_text would attempt simple utterance-level segmentation by the punctuation marks {",", "!", "。"} as well as the EOL character "\n". This is the case of "input 1: a plain string" described in #30 (comment). If you'd like more control over what counts as an utterance, you'd have to do your own munging and pass in a list of strings instead (= the case of "input 2: a list of strings" in #30 (comment)).
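As an illustration of the "list of strings" route, the sketch below does the munging with the standard library only. The helper name `split_utterances` and the choice of delimiters ("。", "！", "？", plus the newline) are assumptions made for this example, not PyCantonese behavior; here a comma deliberately does not end an utterance.

```python
import re


def split_utterances(text, delimiters="。！？"):
    """Split raw text into utterance strings.

    Hypothetical helper: an utterance ends at any character in
    `delimiters` (which is kept) or at a newline (which is dropped).
    """
    cls = re.escape(delimiters)
    # A run of non-delimiter, non-newline characters, optionally
    # followed by one closing delimiter.
    pattern = f"[^{cls}\\n]+[{cls}]?"
    return [piece.strip() for piece in re.findall(pattern, text) if piece.strip()]


raw = "你好，我係學生。你呢？\n多謝！"
utterances = split_utterances(raw)
print(utterances)

# With PyCantonese installed, the list could then be handed over as the
# "list of strings" input discussed above, e.g.:
#     corpus = pycantonese.parse_text(utterances)
```

Because each list item is passed along as its own utterance, changing `delimiters` is all it takes to try a different notion of utterance boundary.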

I see. Thank you very much!

The new parse_text function has just been released alongside v3.4.0. More docs here.