messense / jieba-rs

The Jieba Chinese Word Segmentation Implemented in Rust

Make APIs for TFIDF and TextRank that do NOT take a reference to Jieba?

awong-dev opened this issue

I tried implementing a binding of jieba-rs in Elixir here: https://github.com/awong-dev/jieba.

When it came to the TFIDF<'a> and TextRank<'a> structs, it became hard (impossible?) to provide a sensible API to Elixir because the lifetimes of the types require them to be stack-scoped.

In an ideal world, you would want to create a TFIDF/TextRank struct whose lifetime is managed by Elixir, so that you can load data into the TFIDF and TextRank instances once (e.g. via add_stop_word() or even load_dict()) and then use them later as needed.

The current setup, where jieba_rs requires TFIDF and TextRank to be bound to the stack frame they are constructed in, means that any wrapping API has to recreate the two structs on each call to extract_tags(). See my code here:

https://github.com/awong-dev/jieba/blob/main/native/rustler_jieba/src/lib.rs#L232

If there are not many stop words, etc., this is cheap, but if there are a lot, it is very wasteful.

How would you feel about exposing something like

pub struct TFIDFState {
    idf_dict: HashMap<String, f64>,
    median_idf: f64,
    stop_words: BTreeSet<String>,
}

impl TFIDFState {
    pub fn clone(&self) -> Self {...}
}

impl<'a> TFIDF<'a> {
    pub fn new_with_jieba_and_state(jieba: &'a Jieba, state: TFIDFState) -> Self {...}
    pub fn extract_state(self) -> TFIDFState { /* TFIDF data moved into a fresh state. */ }
    ...
}

and something similar for TextRank.

This would allow both TFIDF<'a> and TextRank<'a> to be used as cheap-to-construct, thin facades with lifetimes bound to a jieba instance, without breaking the existing API.
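To make the state/facade split concrete, here is a minimal sketch. It does not use jieba-rs itself: `Segmenter` is a hypothetical whitespace-splitting stand-in for `Jieba`, `candidates()` is an invented helper for illustration, and the `TFIDFState`/`TFIDF` types only mirror the shape proposed above rather than any existing API.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for `jieba_rs::Jieba`; it only splits on whitespace.
pub struct Segmenter;

impl Segmenter {
    pub fn cut<'s>(&self, sentence: &'s str) -> Vec<&'s str> {
        sentence.split_whitespace().collect()
    }
}

/// Owned configuration with no borrowed data, so a binding (e.g. an Elixir
/// resource) can keep it alive independently of any segmenter.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct TFIDFState {
    pub idf_dict: HashMap<String, f64>,
    pub median_idf: f64,
    pub stop_words: BTreeSet<String>,
}

/// Thin facade: borrows the segmenter, owns the state.
pub struct TFIDF<'a> {
    segmenter: &'a Segmenter,
    state: TFIDFState,
}

impl<'a> TFIDF<'a> {
    /// Cheap to construct: the state is moved in, nothing is rebuilt.
    pub fn new_with_state(segmenter: &'a Segmenter, state: TFIDFState) -> Self {
        TFIDF { segmenter, state }
    }

    /// Returns true if the word was not already a stop word.
    pub fn add_stop_word(&mut self, word: &str) -> bool {
        self.state.stop_words.insert(word.to_string())
    }

    /// Consumes the facade and hands the state back to the caller.
    pub fn extract_state(self) -> TFIDFState {
        self.state
    }

    /// Illustration only: the segmented words that are not stop words.
    pub fn candidates<'s>(&self, sentence: &'s str) -> Vec<&'s str> {
        self.segmenter
            .cut(sentence)
            .into_iter()
            .filter(|w| !self.state.stop_words.contains(*w))
            .collect()
    }
}
```

A wrapper would then store only the `TFIDFState` between calls, rebuild the facade around a borrowed segmenter for each call, and call `extract_state()` afterwards if the state was modified.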

Alternatively, if we're willing to break API compat, I wonder if TFIDF and TextRank would be better off NOT binding jieba during construction. If the KeywordExtract method were

fn extract_tags(
    &self,
    jieba: &Jieba,
    sentence: &str,
    top_k: usize,
    allowed_pos: Vec<String>
) -> Vec<Keyword>

where you pass in the wanted jieba on each invocation of extract_tags, we'd avoid the lifetime coupling of both structs entirely and simplify the API.

As a bonus, it would be easy to use one KeywordExtract instance with multiple segmenters, in case you wanted to test behavior with different Jieba dictionaries.
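A rough sketch of that second shape, again without jieba-rs: `Segmenter` is a hypothetical delimiter-based stand-in for `Jieba`, and `FrequencyExtractor` is an invented extractor that ranks by raw word count instead of TF-IDF, purely to keep the example short. The point is only that the extractor carries no lifetime parameter and the segmenter is supplied per call.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for `jieba_rs::Jieba`: splits on one delimiter.
pub struct Segmenter {
    delimiter: char,
}

impl Segmenter {
    pub fn new(delimiter: char) -> Self {
        Segmenter { delimiter }
    }

    pub fn cut<'s>(&self, sentence: &'s str) -> Vec<&'s str> {
        sentence
            .split(self.delimiter)
            .filter(|w| !w.is_empty())
            .collect()
    }
}

/// Extractor with no lifetime parameter: it owns all of its data.
#[derive(Default)]
pub struct FrequencyExtractor {
    stop_words: BTreeSet<String>,
}

impl FrequencyExtractor {
    pub fn add_stop_word(&mut self, word: &str) -> bool {
        self.stop_words.insert(word.to_string())
    }

    /// Top `top_k` non-stop words by count, ties broken alphabetically.
    /// The segmenter is borrowed only for the duration of this call.
    pub fn extract_tags(
        &self,
        segmenter: &Segmenter,
        sentence: &str,
        top_k: usize,
    ) -> Vec<String> {
        let mut counts: HashMap<&str, usize> = HashMap::new();
        for word in segmenter.cut(sentence) {
            if !self.stop_words.contains(word) {
                *counts.entry(word).or_insert(0) += 1;
            }
        }
        let mut ranked: Vec<(&str, usize)> = counts.into_iter().collect();
        ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
        ranked
            .into_iter()
            .take(top_k)
            .map(|(word, _)| word.to_string())
            .collect()
    }
}
```

Because the extractor holds no reference to a segmenter, one configured instance can be stored long-term and called with whichever segmenter is at hand.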

Thoughts?