BloopAI / bloop

bloop is a fast code search engine written in Rust.

Home Page: https://bloop.ai

Query planner doesn't work for multiple alternations in a row

hsqStephenZhang opened this issue · comments

commented

High level details

I added a unit test for plan("[ab][cd][ef]"), a query-parsing test case taken from Google's Code Search project (I know it's pretty old). The query is expected to expand into "ace, acf, ..., bdf", but the test failed: the planner did not inline the alternations into literal query strings.
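To make the expected expansion concrete, here is a minimal standalone sketch that enumerates every literal a sequence of alternations can spell. The `expand` function is a hypothetical helper written for this illustration; it is not part of bloop's planner.

```rust
/// Expand a sequence of alternations into every literal they can spell.
/// Each inner slice is one alternation, e.g. `[ab]` becomes `&["a", "b"]`.
/// Hypothetical helper for illustration only, not bloop's API.
fn expand(alternations: &[&[&str]]) -> Vec<String> {
    // Start with a single empty prefix and extend it one alternation at a time.
    let mut acc = vec![String::new()];
    for alt in alternations {
        let mut next = Vec::with_capacity(acc.len() * alt.len());
        for prefix in &acc {
            for piece in alt.iter() {
                next.push(format!("{}{}", prefix, piece));
            }
        }
        acc = next;
    }
    acc
}

fn main() {
    let literals = expand(&[&["a", "b"], &["c", "d"], &["e", "f"]]);
    // Prints: ace, acf, ade, adf, bce, bcf, bde, bdf
    println!("{}", literals.join(", "));
}
```

Three two-way alternations give 2 × 2 × 2 = 8 literals, which is the full list behind "ace, acf, ..., bdf".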

Current situation

  • how it's insufficient: After reading the code, I found that the current optimizer doesn't handle three or more alternations in a row. I'm not sure whether this is intentional.
  • motivations: I'm learning code search techniques and comparing Google's code search engine with bloop. The boolean query builders differ in some ways, so I'm trying to work out how.
  • why we need to change this: To make the query optimization more complete and cover more cases.

Proposal

I've implemented some changes, and they pass a test I added myself: plan("(a|b)(c|d)(e|f)(g|h)").

In server/bleep/src/query/planner.rs, Fragment::add needs to handle the following match arms:

            // TODO: do we need to handle cases where the children are not all literals?
            (Fragment::Literal(lit), Fragment::Dense(Op::Or, mut rhs)) => {
                if rhs.iter().all(|x| matches!(x, Fragment::Literal(_))) {
                    for x in &mut rhs {
                        if let Fragment::Literal(s) = x {
                            *s = lit.clone() + s;
                        }
                    }
                    Fragment::Dense(Op::Or, rhs)
                } else {
                    Fragment::Dense(
                        Op::And,
                        vec![Fragment::Literal(lit), Fragment::Dense(Op::Or, rhs)],
                    )
                }
            }

            // TODO: avoid explosion by adding some constraints
            // TODO: do we need to handle cases where the children are not all literals?
            (Fragment::Dense(Op::Or, lhs), Fragment::Dense(Op::Or, rhs)) => {
                if lhs.iter().all(|x| matches!(x, Fragment::Literal(_)))
                    && lhs.len() <= 100
                    && rhs.iter().all(|x| matches!(x, Fragment::Literal(_)))
                    && rhs.len() <= 100
                {
                    let mut cross = Vec::with_capacity(lhs.len() * rhs.len());
                    for x in &lhs {
                        for y in &rhs {
                            let x = x.as_literal().unwrap();
                            let y = y.as_literal().unwrap();
                            cross.push(Fragment::Literal(x.to_owned() + y));
                        }
                    }
                    Fragment::Dense(Op::Or, cross)
                } else {
                    Fragment::Dense(
                        Op::And,
                        vec![Fragment::Dense(Op::Or, lhs), Fragment::Dense(Op::Or, rhs)],
                    )
                }
            }

The function optimize::run should also call flatten_or after calling inline.
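The two match arms above can be tried in isolation with a stripped-down model. The sketch below assumes a simplified `Fragment` and `Op` with only the variants that appear in the snippet; the `alt` and `concat` helpers, the `MAX_BRANCHES` constant, and the catch-all `And` fallback are hypothetical and only serve the illustration. bloop's real types and `Fragment::add` have more cases.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Op {
    And,
    Or,
}

// Simplified stand-in for the planner's fragment type (assumption).
#[derive(Debug, Clone, PartialEq)]
enum Fragment {
    Literal(String),
    Dense(Op, Vec<Fragment>),
}

// Cap on the branches of either side, mirroring the `<= 100` guard above.
const MAX_BRANCHES: usize = 100;

impl Fragment {
    fn as_literal(&self) -> Option<&str> {
        match self {
            Fragment::Literal(s) => Some(s.as_str()),
            _ => None,
        }
    }
}

fn all_literals(frags: &[Fragment]) -> bool {
    frags.iter().all(|f| f.as_literal().is_some())
}

/// Build an alternation of literals, e.g. `alt(&["a", "b"])` for `(a|b)`.
fn alt(branches: &[&str]) -> Fragment {
    let items = branches.iter().map(|b| Fragment::Literal((*b).to_string())).collect();
    Fragment::Dense(Op::Or, items)
}

/// Concatenate two fragments, inlining literals when both sides allow it.
fn add(lhs: Fragment, rhs: Fragment) -> Fragment {
    match (lhs, rhs) {
        (Fragment::Literal(a), Fragment::Literal(b)) => Fragment::Literal(a + &b),
        // literal + (x|y) => (literal x | literal y)
        (Fragment::Literal(lit), Fragment::Dense(Op::Or, rhs)) if all_literals(&rhs) => {
            let prefixed = rhs
                .iter()
                .filter_map(|r| r.as_literal())
                .map(|r| Fragment::Literal(format!("{}{}", lit, r)))
                .collect();
            Fragment::Dense(Op::Or, prefixed)
        }
        // (a|b) + (c|d) => (ac|ad|bc|bd), only while both sides stay small
        (Fragment::Dense(Op::Or, lhs), Fragment::Dense(Op::Or, rhs))
            if all_literals(&lhs)
                && all_literals(&rhs)
                && lhs.len() <= MAX_BRANCHES
                && rhs.len() <= MAX_BRANCHES =>
        {
            let mut cross = Vec::with_capacity(lhs.len() * rhs.len());
            for x in lhs.iter().filter_map(|f| f.as_literal()) {
                for y in rhs.iter().filter_map(|f| f.as_literal()) {
                    cross.push(Fragment::Literal(format!("{}{}", x, y)));
                }
            }
            Fragment::Dense(Op::Or, cross)
        }
        // Anything else is kept as a conjunction of the two sides.
        (lhs, rhs) => Fragment::Dense(Op::And, vec![lhs, rhs]),
    }
}

/// Fold a sequence of fragments left to right, starting from an empty literal.
fn concat(parts: Vec<Fragment>) -> Fragment {
    parts.into_iter().fold(Fragment::Literal(String::new()), add)
}

fn main() {
    let plan = concat(vec![
        alt(&["a", "b"]),
        alt(&["c", "d"]),
        alt(&["e", "f"]),
        alt(&["g", "h"]),
    ]);
    if let Fragment::Dense(Op::Or, branches) = &plan {
        println!("{} literal branches", branches.len());
    }
}
```

Because each `add` returns a flat `Or` of literals, the next alternation in the row can be crossed with it again, which is what lets a chain of any length collapse. The cap only bounds each side per step, so the product can still grow quickly over a long chain; that is the explosion the TODO refers to.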

Next steps
Are there any next steps after we implement these changes?

commented

Given a document containing 'acgf', the regex query "[ab][cd][ef]" currently recalls it. This is unexpected, because the query corresponds to a fixed set of trigrams (ace, acf, ..., bdf), and none of them occurs in 'acgf', whose only trigrams are 'acg' and 'cgf'.
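The mismatch can be checked by hand with a small sketch that extracts a document's trigrams and tests them against the expanded literals. The `trigrams` and `could_match` helpers are hypothetical and written only for this illustration; they do not reflect how bloop's index stores or queries trigrams.

```rust
use std::collections::HashSet;

/// Collect every 3-character window of `text` (hypothetical helper).
fn trigrams(text: &str) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// A document can only match if it contains at least one required trigram.
fn could_match(doc: &str, any_of: &[&str]) -> bool {
    let present = trigrams(doc);
    any_of.iter().any(|t| present.contains(*t))
}

fn main() {
    // The eight literals that "[ab][cd][ef]" expands to.
    let expanded = ["ace", "acf", "ade", "adf", "bce", "bcf", "bde", "bdf"];
    println!("acgf  -> {}", could_match("acgf", &expanded));
    println!("xacfy -> {}", could_match("xacfy", &expanded));
}
```

With the fully expanded plan, 'acgf' is filtered out at the trigram stage, while a document such as 'xacfy' (which contains 'acf') remains a candidate.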