messense / jieba-rs

The Jieba Chinese Word Segmentation Implemented in Rust


Allow loading a custom HMM model

messense opened this issue

Ref #23

API-wise, Jieba should expose a struct HmmModel as proposed in #92.
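For concreteness, here is a rough sketch of what such a struct could hold (the field names are assumptions, not the actual layout proposed in #92). jieba's HMM segmenter is the standard four-state B/M/E/S model, so a custom model essentially amounts to three probability tables:

use std::collections::HashMap;

// Rough sketch only; field names are assumptions, not the layout from #92.
pub struct HmmModel {
    // log start probability of each of the four B/M/E/S states
    pub start_probs: [f64; 4],
    // log transition probabilities between states
    pub trans_probs: [[f64; 4]; 4],
    // per-state log emission probability of each character
    pub emit_probs: [HashMap<char, f64>; 4],
}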

However, unlike in the PR, it might be better to go further and take the HmmModel (whether owned or borrowed) out of the Jieba struct entirely.

The reason for this is that, looking around at a few (very random, possibly not representative) real-world usages, people seem to be confused about whether to set hmm to true or false and what the effect of that choice is. In fact, it feels like people just think of "Jieba" as a segmentation method, not "Jieba with hmm", "Jieba with additional dictionaries", or "Jieba with the default hmm dictionary".

Semantically, it might be clearer for the non-expert user of Jieba if the API encoded how these options are used.

For example, let's assume the HmmModel was removed from Jieba's construction. You could imagine

trait NoMatchCutter {
    /// Cut the runs of `sentence` that the dictionary could not match,
    /// appending the resulting words to `words` and reusing the caller-provided
    /// scratch buffers (probabilities, back-pointers, state path) to avoid reallocations.
    fn cut_with_allocated_memory<'a>(
        &self,
        sentence: &'a str,
        words: &mut Vec<&'a str>,
        v: &mut Vec<f64>,
        prev: &mut Vec<Option<Status>>,
        path: &mut Vec<Status>,
    );
}

fn cut<'a>(&self, sentence: &'a str, no_match_cutter: Option<&dyn NoMatchCutter>) -> Vec<&'a str> {
   ...
}

The calling code would then pass in an HmmModel, decoupling the HmmModel state from the Jieba dictionary state.
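To make the decoupling concrete, a call site could look roughly like this (sketch only; it assumes HmmModel implements the NoMatchCutter trait above and that cut has grown the Option parameter):

use jieba_rs::Jieba;

// Hypothetical calling code for this variant; the Option argument on cut()
// is the proposed API, not something jieba-rs exposes today.
fn segment<'a>(jieba: &Jieba, hmm: &dyn NoMatchCutter, text: &'a str) -> Vec<&'a str> {
    // The fallback is only consulted for runs the dictionary cannot match;
    // jieba.cut(text, None) would instead emit those runs character by character.
    jieba.cut(text, Some(hmm))
}

Here an HmmModel loaded by the caller would be what gets passed as the &dyn NoMatchCutter, so the model's lifetime is managed entirely outside of Jieba.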

As a variant of the above, you could consider doing

fn cut<'a>(&self, sentence: &'a str) -> Vec<&'a str> { ... }
fn hmm_cut<'a>(&self, sentence: &'a str, hmm_model: &HmmModel) -> Vec<&'a str> { ... }

But you'd have to add an hmm_xxx twin for each of the cut variants.
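Spelled out, that duplication would look roughly like this (illustrative only; none of these methods exist, and the code would live inside the crate next to hmm_cut above):

impl Jieba {
    // Every existing method needs an hmm_ twin that threads the model through.
    fn hmm_cut_all<'a>(&self, sentence: &'a str, hmm_model: &HmmModel) -> Vec<&'a str> {
        todo!()
    }
    fn hmm_cut_for_search<'a>(&self, sentence: &'a str, hmm_model: &HmmModel) -> Vec<&'a str> {
        todo!()
    }
    fn hmm_tag<'a>(&'a self, sentence: &'a str, hmm_model: &HmmModel) -> Vec<Tag> {
        todo!()
    }
    fn hmm_tokenize<'a>(&self, sentence: &'a str, mode: TokenizeMode, hmm_model: &HmmModel) -> Vec<Token<'a>> {
        todo!()
    }
}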

Lastly, you could imagine doing a hybrid where there is a

trait Cutter {
    fn cut<'a>(&self, sentence: &'a str) -> Vec<&'a str>;
    fn cut_all<'a>(&self, sentence: &'a str) -> Vec<&'a str>;
    fn cut_for_search<'a>(&self, sentence: &'a str) -> Vec<&'a str>;
    fn tag<'a>(&'a self, sentence: &'a str) -> Vec<Tag>;
    fn tokenize<'a>(&self, sentence: &'a str, mode: TokenizeMode) -> Vec<Token<'a>>;
}

Then Jieba would implement Cutter but also expose a

fn cut_with_fallback<'a>(&self, sentence: &'a str, fallback: &dyn NoMatchCutter) -> Vec<&'a str> { ... }
fn cut_all_with_fallback<'a>(&self, sentence: &'a str, fallback: &dyn NoMatchCutter) -> Vec<&'a str> { ... }

And you could have a

struct HmmFallbackCutter<'a> {
    jieba: &'a Jieba,
    hmm_model: &'a HmmModel,
}

/// WARNING: This should stay a very thin adapter that is easily constructed.
/// Do not add state or complex initialization, or it will make it difficult for other
/// languages to interface with due to the lifetime bound.
impl<'a> HmmFallbackCutter<'a> {
  pub fn cut<'b>(&self, sentence: &'b str) -> Vec<&'b str> {
    self.jieba.cut_with_fallback(sentence, self.hmm_model)
  }
  ...
}
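Callers who want today's "Jieba with the default HMM" behaviour would then write something like this (sketch only; HmmModel::default() is an assumption about how the default or a custom model gets loaded):

// Hypothetical usage of the proxy pattern.
fn example(sentence: &str) {
    let jieba = Jieba::new();
    let hmm_model = HmmModel::default();

    let cutter = HmmFallbackCutter {
        jieba: &jieba,
        hmm_model: &hmm_model,
    };

    // Same call shape as plain Jieba, but unmatched runs go through the HMM.
    let _words = cutter.cut(sentence);
}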

The last one, with the proxy pattern, feels the "cleanest".
The first one, with the optional NoMatchCutter argument, seems the simplest.

Both seem better than passing a weird "true/false" to the cut() function, where you can't tell at first glance whether hmm is somehow being applied to the dict or only to terms that could not be found in the dict.