ruby / syntax_suggest

Searching for unexpected `end` syntax errors takes a lot of time. Let this gem do it for you!

Performance idea: Partition before executing current uniform indentation search

schneems opened this issue

This is a failing test: https://github.com/zombocom/dead_end/tree/schneems/partition. The file is nine thousand lines and takes a tad over 1 second to parse, which means it hits the timeout.

We are already fairly well optimized for the current algorithm, so to handle arbitrarily large files we will need a different strategy.

The current algorithm takes relatively small steps in the interest of producing a good end result. That takes a long time.

Here's my general idea: we can split the file into multiple large chunks before running the current fine-grained algorithm. At a high level: split the file into two parts and see which holds the syntax error. If we can isolate the problem to only half the file then we've (roughly) cut processing time in half. We can run this partition step a few times.
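
To make that concrete, here is a rough sketch of the naive halving step (the helper name is hypothetical). It deliberately ignores the lexical problem described in the next paragraph, which is exactly what a real version has to account for:

    require "ripper"

    # Narrow the search to whichever half of the file fails to parse.
    # Ripper.sexp returns nil when the source contains a syntax error.
    # Assumes `lines` is an array of strings that keep their trailing newlines.
    def narrow_to_invalid_half(lines)
      mid = lines.length / 2
      top, bottom = lines[0...mid], lines[mid..]

      return top    if Ripper.sexp(top.join).nil?
      return bottom if Ripper.sexp(bottom.join).nil?
      lines # the error only shows up when the halves are combined
    end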

The catch is that some files (such as the one in the failing test) cannot be split without introducing a syntax error, since they start with a class declaration and end with an `end`. To account for this we will need to split in a way that's lexically aware.

For example, on that file I think the algorithm would determine that it can't do much with indentation 0, so it would have to go to the next indentation. There it could see that there are N chunks of kw/end pairs, divide them into N/2, and see if one of those sections holds all of the syntax errors. We could perform this division several times to arrive at a subset of the larger problem, then run the original search algorithm on it.

The challenge is that we will essentially need to build an inverse of the existing algorithm. Instead of starting with a single line and expanding towards indentation zero, we'll start with all the lines and reduce towards the maximum indentation.

The expensive part is checking that code is valid via Ripper.parse. Subdividing large files into smaller files can help us isolate problem sections with fewer parse calls, but we've got to make sure the results are just as good.
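
For reference, a minimal validity check along these lines might look like the sketch below (this is not the gem's actual helper); `Ripper#error?` reports whether the parse hit a syntax error:

    require "ripper"

    # Returns true when the source parses cleanly, false when Ripper hits a
    # syntax error. Every call re-lexes and re-parses the source, which is why
    # minimizing the number of these calls matters.
    def valid_ruby?(source)
      !Ripper.new(source).tap(&:parse).error?
    end

    valid_ruby?("def foo; end")      # => true
    valid_ruby?("def foo; end; end") # => false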

An alternative idea would be to use the existing search/expansion logic to perform more expansions until a set of N blocks is generated, then check all of them at once. Once the problem area of the document is isolated, go back and re-parse only those N blocks with the existing algorithm. (Basically the same idea as partitioning, but we're working from the same direction as the current algorithm, just taking larger steps, which means fewer Ripper.parse calls.) However, we would still need a way to sub-divide the blocks with this process in the terminal case where the syntax error is at indentation zero and the document is massive and all within one kw/end pair.

Partitioning is very difficult to do. In fact that was originally my idea for the search algorithm, but I never got it to work. So here's my new buzzword:

Pool testing.

In the same way that a school might test 30 children with one COVID test kit by testing all of them in a single pool and then going back to re-test only the classes that come up positive, we can check for syntax errors.

Roughly, we can do this with syntax search by cheaply (but inaccurately) building all nodes in the syntax tree using our existing logic (no partitioning needed, battle-tested logic design). Instead of testing one small chunk of code to see if it's invalid, we test many at the same time in one pass. We then iteratively repeat the process until we've narrowed things down to a small subset of code.
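
As a rough sketch of that pooling step (the names are hypothetical and this is not the gem's implementation), a cheap but imprecise pooled check could concatenate many candidate blocks, parse once, and only recurse into the halves of a pool that fails:

    # Assumes each block responds to #to_s with its source and that a
    # valid_ruby? helper like the one above exists. Concatenating blocks is
    # imprecise (blocks can interact), which is why failing pools are split
    # and re-tested rather than trusted outright.
    def find_suspect_blocks(blocks)
      pooled = blocks.map(&:to_s).join("\n")
      return [] if valid_ruby?(pooled)    # whole pool parses; assume all blocks fine
      return blocks if blocks.length == 1 # cannot split further; flag as suspect

      blocks.each_slice((blocks.length / 2.0).ceil)
            .flat_map { |half| find_suspect_blocks(half) }
    end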

Depending on how well the process works, it may replace the existing search. If it's not as accurate, we can still use this process to narrow down our search to a smaller area and then pass it into the original algorithm, which is slower but has better results.

Pool testing: Block generation

One example could be building a tree of possible blocks, then testing all blocks at the same indentation level at once rather than individually.

The hard part is that we need to minimize bookkeeping, or building the tree for large files becomes too slow (even without any calls to Ripper). The existing frontier/search algorithm keeps track of lines that are captured and those that are not. It also checks for when one block consumes/captures another block. These are both expensive checks.

Pyramids of power

One way to look at code when you squint is to see sideways pyramids:

\
 >
/
\
  \
   \
    >
   /
 /
/
\
 \
  >
 /
/
\
  \
   \
    >
   /
 /
/

Inside of that "pyramid" there are many peaks. If we pick one peak, turn it into a block, and iteratively expand it, we will capture everything down to indentation 0. We can then break up the source code and repeat the process above and below the "pyramid".

We can then search by asking "which pyramids are invalid?" Once we find the invalid pyramids, we can partition by indentation level by asking "at what indentation level does this begin to be invalid?"

Once we know the indentation level where it starts we can then segment and look at individual blocks.

Segment/split existing blocks

A challenge to this approach is that blocks become "expanded" and will capture the same content multiple times. For example:

          char.bytesize.times { @indices << index }

expands to:

        .each_char
        .with_index(start) do |char, index|
          char.bytesize.times { @indices << index }
        end

expands to:

      @indices = []

      line
        .each_char
        .with_index(start) do |char, index|
          char.bytesize.times { @indices << index }
        end

expands to:

    def initialize(start, line)
      @indices = []

      line
        .each_char
        .with_index(start) do |char, index|
          char.bytesize.times { @indices << index }
        end
    end

So essentially EVERY block that is expanded in this pyramid will contain the initial line:

          char.bytesize.times { @indices << index }

That same problem is repeated at larger increments/chunks. At every step of the expansion process, we may capture an entire method up and down. For example:

    def ==(other)
      other.is_a?(Location) && start_line == other.start_line &&
        start_char == other.start_char && end_line == other.end_line &&
        end_char == other.end_char
    end

Expands to:

    def initialize(start_line:, start_char:, end_line:, end_char:)
      @start_line = start_line
      @start_char = start_char
      @end_line = end_line
      @end_char = end_char
    end

    def ==(other)
      other.is_a?(Location) && start_line == other.start_line &&
        start_char == other.start_char && end_line == other.end_line &&
        end_char == other.end_char
    end

    def to(other)
      Location.new(
        start_line: start_line,
        start_char: start_char,
        end_line: other.end_line,
        end_char: other.end_char
      )
    end

My theory is that I can use this structure to my advantage. I know that for a given indentation level, the block with the most lines contains ALL other code of that line number or greater (within a given pyramid).

By sorting them I can apply a bsearch to see where the invalid code begins. This will possibly give me a large area with many methods, which doesn't help by itself.
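
A hypothetical sketch of that step, assuming a valid_ruby? helper like the earlier one, and that the blocks within one pyramid at a given indentation level nest inside one another as described:

    # Sorting by line count orders the nested blocks from innermost to
    # outermost. Once a block contains the bad line, every larger block does
    # too, so the valid/invalid boundary is monotonic and bsearch can find it.
    def smallest_invalid_block(blocks)
      sorted = blocks.sort_by { |block| block.lines.count }
      sorted.bsearch { |block| !valid_ruby?(block.to_s) }
    end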

However, I can repeat the process and see which sub-blocks within that block are valid. Going from the above example, if I detect the problem is here:

    def initialize(start_line:, start_char:, end_line:, end_char:)
      def whoops # <=============
      @start_line = start_line
      @start_char = start_char
      @end_line = end_line
      @end_char = end_char
    end

    def ==(other)
      other.is_a?(Location) && start_line == other.start_line &&
        start_char == other.start_char && end_line == other.end_line &&
        end_char == other.end_char
    end

    def to(other)
      Location.new(
        start_line: start_line,
        start_char: start_char,
        end_line: other.end_line,
        end_char: other.end_char
      )
    end

Then working backward I would be able to see that the internal block from the prior expansion is valid:

    def ==(other)
      other.is_a?(Location) && start_line == other.start_line &&
        start_char == other.start_char && end_line == other.end_line &&
        end_char == other.end_char
    end

Using this info I could "split" the known large bad block by removing the known good internal small block.
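
A sketch of that split, with hypothetical names, assuming blocks know their line ranges:

    # Drop the known-good inner block's line range from the larger bad block,
    # leaving a smaller suspect region to feed back into the search.
    def split_out_good(bad_block, good_inner_block)
      good_range = good_inner_block.first_line_number..good_inner_block.last_line_number
      bad_block.lines.reject { |line| good_range.cover?(line.number) }
    end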

Summary

  • Recursively generate "pyramids" and determine which hold syntax errors. We can pool multiple pyramids in a single check if it proves to help performance.
  • Use a bsearch on the problem pyramids to determine what indentation level the problem starts at
  • Sort that level to determine where inside of that level the problem starts at
  • Use the rest of the block to "split" the problem section to isolate the problem

Once we've narrowed down the problem to N small chunks we can call the very expensive document syntax check N times. If we're lucky this process will give us roughly the same results as the existing algorithm and we can be done. If we're unlucky, we can feed these smaller sections into the existing code search to give us decent results.

Recap the recap: Optimize an algorithm around building a tree structure as quickly as possible without calling Ripper. Identify ways we can pool together code in our tree to test large chunks at a time to minimize Ripper calls. Use this info to either narrow down our search space or complete the entire search.

Here's a spike of that idea (on the branch schneems/partition-inside-out) e3c6c9b

It works for the given case but fails on others.

That didn't work. https://github.com/zombocom/dead_end/tree/schneems/block-node-attempt-2 does seem to hold up, though.

It doesn't use any of the weird-ish new ideas here. Instead, it doubles down on existing basics by re-writing the whole algorithm from the ground up.

Improve the quality of block construction

The more accurately we can build code blocks, the less we have to lean on Ripper to confirm/evaluate them. This helps because parsing/validating code is expensive.

  • Old algorithm: Only aware of keyword/end.
  • New algorithm: Aware of keyword/end, (), {}, and []. This means we can make better heuristics for the construction of blocks and for exploration.
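
As a rough illustration of what that lexical awareness buys (a sketch, not the gem's internals), Ripper.lex makes it cheap to measure how far a single line leans open or closed:

    require "ripper"

    OPEN_TOKENS  = [:on_lparen, :on_lbracket, :on_lbrace]
    CLOSE_TOKENS = [:on_rparen, :on_rbracket, :on_rbrace]

    # Positive means the line opens more than it closes, negative the reverse.
    # A real version also needs the lexer state to tell block keywords such as
    # `if`/`while` apart from their modifier forms, so only unambiguous
    # keywords are counted here.
    def lean(line_source)
      Ripper.lex(line_source).reduce(0) do |count, (_, type, token, _)|
        if OPEN_TOKENS.include?(type)
          count + 1
        elsif CLOSE_TOKENS.include?(type)
          count - 1
        elsif type == :on_kw && token == "end"
          count - 1
        elsif type == :on_kw && %w[def class module do begin].include?(token)
          count + 1
        else
          count
        end
      end
    end

    lean("def foo(") # => 2
    lean("end)")     # => -2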

Block creation/evaluation decoupling

As we evaluate blocks to see if they hold a syntax error, and then to see if they hold THE source of all syntax errors, we're having to call the parser a lot.

  • Old algorithm: Evaluated blocks/nodes as they're built.
  • New algorithm: Separates building nodes/blocks from evaluation

With the old method, a tree was never actually built. Instead, groups of lines would be identified as a "block" and then evaluated. If a block was valid source code it would be "hidden" and the process would recursively repeat. This worked (quite well); however, the rules for this evaluation and expansion were difficult to understand and modify. Even though I wrote them, it's hard to understand the second- and third-order effects of changing the old algorithm.

Beyond usability, this outside-in parsing means we somewhat assume all code has an equal chance of containing the syntax error, and we must validate that highly indented code is valid before we can move in to look at code at lower indentation.

Due to improvements in block construction we now have stronger guarantees while evaluating blocks. Instead of starting at the outer indentation and working in, we can go the other direction. We start with the root, which contains the entire document. It is split into several "blocks". We can then inspect and follow bad/invalid source code to its ultimate cause instead of having to recursively evaluate ALL the valid source code as well.

One way to think of it is: Before we were trying to eliminate all the valid code to find the invalid code. Now we can instead "follow" the invalid code to find its ultimate source.
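
A sketch of that inverted direction, assuming each node can report its child blocks and that a validity helper like the earlier one exists (names are illustrative):

    # Starting from the root, keep descending into the first child that fails
    # to parse; when no child is individually invalid, the current node is as
    # tight as this pass can isolate the problem.
    def follow_invalid(node)
      bad_child = node.children.detect { |child| !valid_ruby?(child.to_s) }
      bad_child ? follow_invalid(bad_child) : node
    end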

Normalize states in building and evaluation

  • Old algorithm: Expansion has 2 different "phases" and there is a third phase where code lines are converted into a code "block".
  • New algorithm: At search time there is only one logical container which is a BlockNode. While building the tree there is a single expansion algorithm.

This is somewhat of a quality-of-life improvement from a development perspective. Now it's clearer how the various expansion rules interact, as there is only one set of rules. Because all containers are normalized, a container can carry specialized information about itself (specifically, how it will expand on its next step).

Normalize intersection operations

  • Old algorithm: We manually had to iterate over all blocks in the frontier to check for "capture/engulf" state, and we ignored any overlap/intersect information. This was actually a performance bottleneck in the code that strongly resisted refactoring.
  • New algorithm: In block/tree construction, all interactions generate an intersection of one or more blocks. This information is captured in the structure of the nodes instead of being tracked independently. This allows us to avoid an extra "engulf" check and conceptually leads to a better, more natural data structure.

Better prioritization

In other experiments, I noticed that the existing "frontier" priority queue logic didn't behave as expected. Specifically, this code wouldn't be evaluated in an optimal order:

  def on_args_add(arguments, argument)
    if arguments.parts.empty?

      Args.new(parts: [argument], location: argument.location)
    else

      Args.new(
        parts: arguments.parts << argument,
        location: arguments.location.to(argument.location)
      )
    end
  # missing end here

The reason the current algorithm doesn't do the right thing is that it only looks at the current indentation of a code block. I found that if I fed better information to my algorithm, I could reliably predict what the "next indentation" would be after an expansion. I then baked that information into the expansion process by first sorting by "next indent", then "current indent", then falling back to line number.
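
A hypothetical sketch of that ordering (the attribute names are illustrative, not the gem's exact API):

    # Pick the block whose indentation *after* its next expansion is highest,
    # breaking ties by current indentation and then by document position.
    def next_block_to_expand(frontier)
      frontier.max_by { |block| [block.next_indent, block.current_indent, block.starts_at] }
    end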

The old algorithm really resisted adding this information, since I had made tweaks to other parts of the algorithm to somewhat course-correct for a bad expansion, and those tweaks no longer held true once I fixed how nodes expanded. Also, because the old process searches and evaluates at the same time, it's impossible to tell what the next indentation will be before the prior blocks have been evaluated.

Based on the above optimizations I was able to have strong confidence in my ability to predict the next indentation level in the new algorithm.

Better code "scanning" via storing block nodes as a doubly linked list

  • Old algorithm: The AroundBlockScan uses the document represented as a continuous array so that it can be indexed into. This makes it very easy to scan above/below a given index, but it has the annoying property that you can only ever scan a complete document. If you want to get clever and split the document into multiple subsections, the AroundBlockScan won't let you. This property also meant that we had to guard against empty and hidden lines everywhere we wanted to scan. It made the code messy and somewhat hard to reason about.
  • New algorithm: Above/below are represented as a doubly-linked list on the blocks. This allows us to quickly scan up and down without needing the full document. The other benefit is that we can skip lines or blocks of code without having to guard against their existence (for example, empty lines). While the old search used empty lines, I've found that with improved heuristics they're not needed, and removing them from the core data structure helps to normalize it, which simplifies all other operations. (See the sketch below.)
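
A minimal sketch of that scanning shape, with assumed names rather than the gem's exact API:

    # Each block keeps direct pointers to its neighbours, so a scan can walk up
    # or down from any block without holding the whole document array, and
    # skipped material (such as empty lines) simply never appears in the chain.
    class BlockNode
      attr_accessor :above, :below
      attr_reader :lines

      def initialize(lines)
        @lines = lines
      end

      def each_above
        node = above
        while node
          yield node
          node = node.above
        end
      end
    end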

The combination of all these properties on BlockNode turns it from a tree into a graph data structure. Once built, the structure provides a high-level representation of the domain that can be manipulated with comparative ease.

If you consider that valid syntax can be represented as a tree (an AST), and that a syntax error is caused by ambiguity, it follows that a data structure holding invalid syntax would need to carry more information than a valid tree. So it somewhat makes sense that a graph or graph-like data structure would be a good choice to represent invalid source code.

End results

I'm still working on it, but the improvements needed to all come together, since they all impact one another. I've experimented with each of these in isolation, but since the old algorithm was more "grown" than written with intention, it has a lot of coupled logic.

With the new insights, I started from scratch with some test cases I knew were difficult and somewhat contradictory. I slowly built up data structures that had properties I wanted and then eventually fed them into my test cases. From there I tweaked the new algorithm in much the same way I developed the old algorithm, however with more tools, understanding, and experience.