skvadrik / re2c

Lexer generator for C, C++, Go and Rust.

Home Page: https://re2c.org


Generated file too large

ccleve opened this issue

I've implemented a modified version of the UAX 29 word breaking rules using re2c. It's not that complex, but it does use Unicode character classes, which can be lengthy.

Unfortunately, the generated file is 120,000 lines long, and it has slowed my compile time from 5 seconds to more than a minute.

What, in general, is the right way to reduce the size of this file? I saw there was a simple trick in #399, but this may be less simple.

I've attached the file itself. Excuse the odd formatting; it's an adaptation of an ANTLR file.

uax29.txt

I don't see any simple change that could make this a lot faster, but here are a few ideas:

  • Don't use counted repetition; replace {2,7} with + and check the sequence length as a post-processing step, if needed.

  • Make HYPHENATED_WORD and SOFT_HYPHENATED_WORD one rule defined as (AHLETTER|NUMERIC)+ (HYPHEN | SOFT_HYPHEN) (AHLETTER|NUMERIC)+. If you need to distinguish them, add a tag on only one side of the alternative: HYPHEN | @t SOFT_HYPHEN, and check if the tag is set in the semantic action.

  • To mitigate the problem now, compile this file with -O1 instead of -O2.
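A C-flavored sketch of the second idea (the class definitions and token names here are placeholders, not the real UAX 29 sets; tags require re2c's -T/--tags option, and the tag variable t must be declared in the host code):

```
/*!re2c
    re2c:flags:tags = 1;

    AHLETTER    = [a-zA-Z];
    NUMERIC     = [0-9];
    HYPHEN      = "-";
    SOFT_HYPHEN = "\u00ad";

    (AHLETTER | NUMERIC)+ (HYPHEN | @t SOFT_HYPHEN) (AHLETTER | NUMERIC)+ {
        return t != NULL ? SOFT_HYPHENATED_WORD : HYPHENATED_WORD;
    }
*/
```

Since the tag appears only on the SOFT_HYPHEN branch, t stays NULL when the plain HYPHEN branch matched, so one rule (and one set of DFA states) serves both token kinds.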

In general, there is a technique to split the lexer block into multiple blocks, where the main block checks a non-repeating unique prefix and dispatches to other blocks, which are specific to each lexeme kind and handle repetition. I'm not sure if it can be applied in this case.
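A minimal sketch of that splitting technique, with made-up rules (the first block inspects only a single, non-repeating prefix character and dispatches; the per-kind blocks handle the repetition, so their states are not multiplied together in one big automaton):

```
int lex(const char *YYCURSOR) {
    const char *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;

        [0-9] { goto numeric; }
        [a-z] { goto word; }
        *     { return ERROR; }
    */
numeric:
    /*!re2c
        [0-9]* { return NUMERIC; }
    */
word:
    /*!re2c
        [a-z]* { return WORD; }
    */
}
```

Whether this helps for the UAX 29 rules depends on whether the lexeme kinds actually have distinct prefixes.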

I will have another look later. Are these the rules you are referring to?

Replacing counted repetition got the line count down from 120,000 to 96,000.

Combining the hyphenated terms reduced it from 96,000 to 67,000.

Getting rid of WORD_WITH_EXTENDERS, which are rare, got it to 53,000.

By -O1 and -O2 you mean C compiler flags? Sorry, I should have specified that I'm generating Rust.

Yes, those are the UAX 29 rules. The lexer is a heavily modified version of it.

Thanks for your help on this.

> By -O1 and -O2 you mean C compiler flags? Sorry, I should have specified that I'm generating Rust.

I thought there must be similar optimization options for Rust, and they should have the same effect on compile time, since clang and rustc are both based on LLVM.
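For a Cargo project, the equivalent knob would be the profile's opt-level (shown here for the release profile; this is a general Cargo setting, not something specific to re2c output):

```toml
# Cargo.toml: lower the optimization level so the huge generated
# lexer compiles faster, at some cost in runtime speed.
[profile.release]
opt-level = 1
```

With plain rustc, the same thing is -C opt-level=1.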

> [ ... ] got it to 53,000.

Is this enough to reduce compile time?

> Is this enough to reduce compile time?

Yes, it's tolerable now. Lexing performance is actually pretty good.

Is it possible to analyze to see what's generating most of the code?

> Is it possible to analyze to see what's generating most of the code?

There is no tool to do that, but you can comment out rules (or parts of regular expressions) one at a time and see how it affects output size. It may not be a single rule, but a combination of them. A typical bad case is two overlapping rules that use some form of repetition (e.g. the combination of x{n} and y* where x and y overlap means that the star has to be unfolded at least n times in the automaton). A typical good case is when rules have distinct prefixes, so their "tails" are independent.
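A toy illustration of the bad case (hypothetical rules, not taken from the attached lexer): b belongs to both classes, so after reading one, two, or three b's the automaton cannot yet tell which rule will win, and the states of y* have to be unfolded at least three times to keep count:

```
/*!re2c
    x = [ab];
    y = [bc];

    x{3} { return A; }
    y*   { return B; }
*/
```

If x and y were disjoint (say [ab] and [cd]), the first character would decide the rule and the two tails would stay independent.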