Making Sardinas-Patterson algorithm implementation more efficient
sts10 opened this issue
I'm proud of Tidy's feature of something I call "Schlinkert pruning", which is based on the Sardinas-Patterson algorithm.
One part of the algorithm requires us to be able to generate a new code Cn for any number n. Here's how it's currently written in Tidy 0.2.77, in `src/display_information/uniquely_decodable.rs`:
```rust
/// Generate c for any number n
fn generate_cn(c: &HashSet<String>, n: usize) -> HashSet<String> {
    if n == 0 {
        return c.to_owned();
    } else {
        let mut cn = HashSet::new();
        // generate c_(n-1)
        let cn_minus_1 = generate_cn(c, n - 1);
        for w1 in c.iter() {
            for w2 in cn_minus_1.iter() {
                if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                    // w2 is a prefix word of w1, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w1[w2.len()..].to_string());
                }
            }
        }
        // Now the other way? Could we clean this up?
        for w1 in cn_minus_1.iter() {
            for w2 in c.iter() {
                if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                    // w2 is a prefix word of w1, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w1[w2.len()..].to_string());
                }
            }
        }
        cn
    }
}
```
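For context, here's how a top-level Sardinas-Patterson check can consume these Cn sets. This is a hedged sketch, not Tidy's actual wrapper, and the function names are illustrative: a code is uniquely decodable iff no Cn (for n ≥ 1) contains a codeword of C. Since every element of every Cn is a suffix of some codeword, there are only finitely many possible Cn sets, so stopping when Cn is empty or repeats guarantees termination.

```rust
use std::collections::HashSet;

/// One step of the recursion: dangling suffixes between C and the
/// previous Cn, in both directions (same logic as the two loops above).
fn next_cn(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = HashSet::new();
    for w1 in c {
        for w2 in prev {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
            if w2.len() > w1.len() && w2.starts_with(w1.as_str()) {
                cn.insert(w2[w1.len()..].to_string());
            }
        }
    }
    cn
}

/// Hypothetical top-level check (illustrative name, not Tidy's API):
/// C is uniquely decodable iff no Cn (n >= 1) contains a codeword of C.
fn is_uniquely_decodable(c: &HashSet<String>) -> bool {
    let mut cn = next_cn(c, c); // C1
    let mut seen: Vec<HashSet<String>> = Vec::new();
    while !cn.is_empty() && !seen.contains(&cn) {
        if cn.iter().any(|w| c.contains(w)) {
            return false; // a dangling suffix is itself a codeword
        }
        seen.push(cn.clone());
        cn = next_cn(c, &cn);
    }
    true
}

fn main() {
    // A prefix-free code is always uniquely decodable
    let prefix_free: HashSet<String> =
        ["00", "01", "10", "11"].iter().map(|s| s.to_string()).collect();
    // "010" = "0"+"10" or "01"+"0", so this code is ambiguous
    let ambiguous: HashSet<String> =
        ["0", "01", "10"].iter().map(|s| s.to_string()).collect();
    println!("{}", is_uniquely_decodable(&prefix_free)); // true
    println!("{}", is_uniquely_decodable(&ambiguous)); // false
}
```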
My question is whether we can safely remove this second `for` loop by moving that `if` into the first `for` loop and still be implementing Sardinas-Patterson correctly:

Optimization idea 1: Move `if` into first double `for` loop
```rust
/// Generate c for any number n
fn generate_cn(c: &HashSet<String>, n: usize) -> HashSet<String> {
    if n == 0 {
        return c.to_owned();
    } else {
        let mut cn = HashSet::new();
        // generate c_(n-1)
        let cn_minus_1 = generate_cn(c, n - 1);
        for w1 in c.iter() {
            for w2 in cn_minus_1.iter() {
                if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                    // w2 is a prefix word of w1, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w1[w2.len()..].to_string());
                } else if w2.len() > w1.len() && w2.starts_with(w1.as_str()) {
                    // w1 is a prefix word of w2, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w2[w1.len()..].to_string());
                }
            }
        }
        cn
    }
}
```
Removing the second loop would speed up these checks of whether a given list is uniquely decodable, something Tidy does often. But I want to be 100% sure I'd still be correctly implementing Sardinas-Patterson.
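One informal argument for idea 1: the two conditions are mutually exclusive (`w1.len() > w2.len()` and `w2.len() > w1.len()` can't both hold for the same pair), and swapping the loop variables in the second original loop yields exactly the `else if` branch, so the merged loop visits the same pairs and inserts the same suffixes. Here's an experiment-style sanity check along those lines — my sketch, not a proof — comparing a single recursion step of both versions over a few codes:

```rust
use std::collections::HashSet;

// Original: two separate double loops
fn cn_two_loops(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = HashSet::new();
    for w1 in c {
        for w2 in prev {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
        }
    }
    for w1 in prev {
        for w2 in c {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
        }
    }
    cn
}

// Idea 1: one double loop with both conditions
fn cn_merged(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = HashSet::new();
    for w1 in c {
        for w2 in prev {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            } else if w2.len() > w1.len() && w2.starts_with(w1.as_str()) {
                cn.insert(w2[w1.len()..].to_string());
            }
        }
    }
    cn
}

fn main() {
    let codes = [
        vec!["0", "01", "10"],
        vec!["1", "011", "01110", "1110", "10011"],
        vec!["a", "b", "abb"],
    ];
    for words in &codes {
        let c: HashSet<String> = words.iter().map(|s| s.to_string()).collect();
        let mut prev = c.clone();
        for _ in 0..6 {
            let a = cn_two_loops(&c, &prev);
            let b = cn_merged(&c, &prev);
            assert_eq!(a, b); // same dangling-suffix set at every step
            prev = a;
        }
    }
    println!("two-loop and merged versions agreed on all steps");
}
```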
Optimization idea 2: Remove second `if` altogether
```rust
/// Generate c for any number n
fn generate_cn(c: &HashSet<String>, n: usize) -> HashSet<String> {
    if n == 0 {
        return c.to_owned();
    } else {
        let mut cn = HashSet::new();
        // generate c_(n-1)
        let cn_minus_1 = generate_cn(c, n - 1);
        for w1 in c.iter() {
            for w2 in cn_minus_1.iter() {
                if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                    // w2 is a prefix word of w1, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w1[w2.len()..].to_string());
                }
            }
        }
        cn
    }
}
```
This is the more radical optimization. I'm still trying to wrap my head around whether this version fulfills Sardinas-Patterson...
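For what it's worth, here's a small experiment suggesting idea 2 drops something the algorithm needs (my sketch, not a proof): the textbook recursion collects dangling suffixes in both directions — codewords with a prefix in C(n-1), and C(n-1) elements with a prefix in C. The code {a, b, abb} is ambiguous ("abb" parses as "abb" or as "a"+"b"+"b"), and catching it seems to require exactly the direction idea 2 removes:

```rust
use std::collections::HashSet;

// One-directional step, as in optimization idea 2: only suffixes of
// codewords whose prefix lies in the previous set
fn cn_one_way(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = HashSet::new();
    for w1 in c {
        for w2 in prev {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
        }
    }
    cn
}

// Both directions, as in the original implementation
fn cn_both_ways(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = cn_one_way(c, prev);
    for w1 in prev {
        for w2 in c {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
        }
    }
    cn
}

fn main() {
    // {a, b, abb} is NOT uniquely decodable: "abb" = "abb" or "a"+"b"+"b"
    let c: HashSet<String> =
        ["a", "b", "abb"].iter().map(|s| s.to_string()).collect();

    // C1 is the same either way: "abb" has codeword prefix "a",
    // leaving the dangling suffix "bb"
    let c1 = cn_both_ways(&c, &c);
    assert!(c1.contains("bb"));

    // Both directions at n=2: "bb" has codeword prefix "b", leaving "b" --
    // itself a codeword, so the full algorithm flags the code as not UD
    let c2_full = cn_both_ways(&c, &c1);
    assert!(c2_full.contains("b"));

    // One direction at n=2: no codeword has "bb" as a prefix, so the set
    // goes empty and idea 2 would (wrongly) declare the code decodable
    let c2_one_way = cn_one_way(&c, &c1);
    assert!(c2_one_way.is_empty());

    println!("idea 2 misses the ambiguity of {{a, b, abb}}");
}
```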
I'm now getting closer to convincing myself that Optimization idea 1 is correct.
Here's a Rust Playground with a little experiment, using example data from this video.
I merged #24. Closing this for now!