Making Sardinas-Patterson algorithm implementation more efficient
sts10 opened this issue
I'm proud of Tidy's feature of something I call "Schlinkert pruning", which is based on the Sardinas-Patterson algorithm.
One part of the algorithm requires us to be able to generate a new code Cn for any number n. Here's how it's currently written in Tidy 0.2.77, in `src/display_information/uniquely_decodable.rs`:
```rust
/// Generate c for any number n
fn generate_cn(c: &HashSet<String>, n: usize) -> HashSet<String> {
    if n == 0 {
        return c.to_owned();
    } else {
        let mut cn = HashSet::new();
        // generate c_(n-1)
        let cn_minus_1 = generate_cn(c, n - 1);
        for w1 in c.iter() {
            for w2 in cn_minus_1.iter() {
                if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                    // w2 is a prefix word of w1, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w1[w2.len()..].to_string());
                }
            }
        }
        // Now the other way? Could we clean this up?
        for w1 in cn_minus_1.iter() {
            for w2 in c.iter() {
                if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                    // w2 is a prefix word of w1, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w1[w2.len()..].to_string());
                }
            }
        }
        cn
    }
}
```
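For context, here's how a top-level Sardinas-Patterson check can consume these Cn sets. This is a hedged sketch, not Tidy's actual wrapper, and the function names are illustrative: a code is uniquely decodable iff no Cn (for n ≥ 1) contains a codeword of C. Since every element of every Cn is a suffix of some codeword, there are only finitely many possible Cn sets, so stopping when Cn is empty or repeats guarantees termination.

```rust
use std::collections::HashSet;

/// One step of the recursion: dangling suffixes between C and the
/// previous Cn, in both directions (same logic as the two loops above).
fn next_cn(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = HashSet::new();
    for w1 in c {
        for w2 in prev {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
            if w2.len() > w1.len() && w2.starts_with(w1.as_str()) {
                cn.insert(w2[w1.len()..].to_string());
            }
        }
    }
    cn
}

/// Hypothetical top-level check (illustrative name, not Tidy's API):
/// C is uniquely decodable iff no Cn (n >= 1) contains a codeword of C.
fn is_uniquely_decodable(c: &HashSet<String>) -> bool {
    let mut cn = next_cn(c, c); // C1
    let mut seen: Vec<HashSet<String>> = Vec::new();
    while !cn.is_empty() && !seen.contains(&cn) {
        if cn.iter().any(|w| c.contains(w)) {
            return false; // a dangling suffix is itself a codeword
        }
        seen.push(cn.clone());
        cn = next_cn(c, &cn);
    }
    true
}

fn main() {
    // A prefix-free code is always uniquely decodable
    let prefix_free: HashSet<String> =
        ["00", "01", "10", "11"].iter().map(|s| s.to_string()).collect();
    // "010" = "0"+"10" or "01"+"0", so this code is ambiguous
    let ambiguous: HashSet<String> =
        ["0", "01", "10"].iter().map(|s| s.to_string()).collect();
    println!("{}", is_uniquely_decodable(&prefix_free)); // true
    println!("{}", is_uniquely_decodable(&ambiguous)); // false
}
```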
My question is whether we can safely remove this second `for` loop by moving that `if` into the first `for` loop and still be implementing Sardinas-Patterson correctly:

Optimization idea 1: Move `if` into first double `for` loop
```rust
/// Generate c for any number n
fn generate_cn(c: &HashSet<String>, n: usize) -> HashSet<String> {
    if n == 0 {
        return c.to_owned();
    } else {
        let mut cn = HashSet::new();
        // generate c_(n-1)
        let cn_minus_1 = generate_cn(c, n - 1);
        for w1 in c.iter() {
            for w2 in cn_minus_1.iter() {
                if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                    // w2 is a prefix word of w1, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w1[w2.len()..].to_string());
                } else if w2.len() > w1.len() && w2.starts_with(w1.as_str()) {
                    // w1 is a prefix word of w2, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w2[w1.len()..].to_string());
                }
            }
        }
        cn
    }
}
```
Removing the second loop would speed up these checks of whether a given list is uniquely decodable, something Tidy does often. But I want to be 100% sure I'd still be correctly implementing Sardinas-Patterson.
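One informal argument for idea 1: the two conditions are mutually exclusive (`w1.len() > w2.len()` and `w2.len() > w1.len()` can't both hold for the same pair), and swapping the loop variables in the second original loop yields exactly the `else if` branch, so the merged loop visits the same pairs and inserts the same suffixes. Here's an experiment-style sanity check along those lines — my sketch, not a proof — comparing a single recursion step of both versions over a few codes:

```rust
use std::collections::HashSet;

// Original: two separate double loops
fn cn_two_loops(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = HashSet::new();
    for w1 in c {
        for w2 in prev {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
        }
    }
    for w1 in prev {
        for w2 in c {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
        }
    }
    cn
}

// Idea 1: one double loop with both conditions
fn cn_merged(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = HashSet::new();
    for w1 in c {
        for w2 in prev {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            } else if w2.len() > w1.len() && w2.starts_with(w1.as_str()) {
                cn.insert(w2[w1.len()..].to_string());
            }
        }
    }
    cn
}

fn main() {
    let codes = [
        vec!["0", "01", "10"],
        vec!["1", "011", "01110", "1110", "10011"],
        vec!["a", "b", "abb"],
    ];
    for words in &codes {
        let c: HashSet<String> = words.iter().map(|s| s.to_string()).collect();
        let mut prev = c.clone();
        for _ in 0..6 {
            let a = cn_two_loops(&c, &prev);
            let b = cn_merged(&c, &prev);
            assert_eq!(a, b); // same dangling-suffix set at every step
            prev = a;
        }
    }
    println!("two-loop and merged versions agreed on all steps");
}
```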
Optimization idea 2: Remove second `if` altogether
```rust
/// Generate c for any number n
fn generate_cn(c: &HashSet<String>, n: usize) -> HashSet<String> {
    if n == 0 {
        return c.to_owned();
    } else {
        let mut cn = HashSet::new();
        // generate c_(n-1)
        let cn_minus_1 = generate_cn(c, n - 1);
        for w1 in c.iter() {
            for w2 in cn_minus_1.iter() {
                if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                    // w2 is a prefix word of w1, so we add the dangling
                    // suffix to a new HashSet called cn
                    cn.insert(w1[w2.len()..].to_string());
                }
            }
        }
        cn
    }
}
```
This is the more radical optimization. I'm still trying to wrap my head around whether this version fulfills Sardinas-Patterson...
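For what it's worth, here's a small experiment suggesting idea 2 drops something the algorithm needs (my sketch, not a proof): the textbook recursion collects dangling suffixes in both directions — codewords with a prefix in C(n-1), and C(n-1) elements with a prefix in C. The code {a, b, abb} is ambiguous ("abb" parses as "abb" or as "a"+"b"+"b"), and catching it seems to require exactly the direction idea 2 removes:

```rust
use std::collections::HashSet;

// One-directional step, as in optimization idea 2: only suffixes of
// codewords whose prefix lies in the previous set
fn cn_one_way(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = HashSet::new();
    for w1 in c {
        for w2 in prev {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
        }
    }
    cn
}

// Both directions, as in the original implementation
fn cn_both_ways(c: &HashSet<String>, prev: &HashSet<String>) -> HashSet<String> {
    let mut cn = cn_one_way(c, prev);
    for w1 in prev {
        for w2 in c {
            if w1.len() > w2.len() && w1.starts_with(w2.as_str()) {
                cn.insert(w1[w2.len()..].to_string());
            }
        }
    }
    cn
}

fn main() {
    // {a, b, abb} is NOT uniquely decodable: "abb" = "abb" or "a"+"b"+"b"
    let c: HashSet<String> =
        ["a", "b", "abb"].iter().map(|s| s.to_string()).collect();

    // C1 is the same either way: "abb" has codeword prefix "a",
    // leaving the dangling suffix "bb"
    let c1 = cn_both_ways(&c, &c);
    assert!(c1.contains("bb"));

    // Both directions at n=2: "bb" has codeword prefix "b", leaving "b" --
    // itself a codeword, so the full algorithm flags the code as not UD
    let c2_full = cn_both_ways(&c, &c1);
    assert!(c2_full.contains("b"));

    // One direction at n=2: no codeword has "bb" as a prefix, so the set
    // goes empty and idea 2 would (wrongly) declare the code decodable
    let c2_one_way = cn_one_way(&c, &c1);
    assert!(c2_one_way.is_empty());

    println!("idea 2 misses the ambiguity of {{a, b, abb}}");
}
```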
I'm now getting closer to convincing myself that Optimization idea 1 is correct.
Here's a Rust Playground with a little experiment, using example data from this video.
I merged #24. Closing this for now!