Enumerate Regular Expressions the Fun Way.
The regular expression $(0^+,)^*0^+$ becomes $\frac{z}{1 - z - z^2}$, which tells us that there are $F_n$ (the $n$-th Fibonacci number) words of size $n$ in this language.
Regex Enumerator takes in a regular expression, and spits out a closed-form formula for the number of $n$-letter words in your language.
Or how I learned to stop worrying and then lost my mind; TeX rendered using readme2tex
Have you ever wondered how many different strings you can form that fit your favorite regex?
Yeah, chances are you probably haven't. But it's on your mind now.
Here's one of my favorite regular expressions:

$$(0^+,)^*0^+$$

It specifies the language of comma-separated lists of strings of zeros. For example, `000, 0, 00000` belongs to this language, but `0, , 0` and `0, 0, ` do not.

Now, it might seem like a masochistic endeavor, but if you enumerate every possible word in this language, you'll find that there are 0 empty strings, 1 single-letter string, 1 two-letter string, 2 three-letter strings, 3 four-letter strings, and so on. This pattern looks like
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, ...
Why, that is the Fibonacci sequence! How did it end up here of all places?
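If you'd rather not enumerate by hand, a brute-force check is easy to sketch. The snippet below is not part of the library: it assumes the language is written as `(0+,)*0+` in Python's `re` syntax and simply tries every string over the two-letter alphabet `0` and `,`. The helper name `count_words` is made up for this illustration.

```python
import re
from itertools import product

# The comma-separated-zeros language, written in Python's `re` syntax.
PATTERN = re.compile(r"(0+,)*0+")

def count_words(n, alphabet="0,"):
    """Count the strings of length n over `alphabet` that the pattern fully matches."""
    return sum(
        1
        for letters in product(alphabet, repeat=n)
        if PATTERN.fullmatch("".join(letters))
    )

counts = [count_words(n) for n in range(10)]
print(counts)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

This only scales to small sizes, since it visits all $2^n$ candidate strings, which is exactly why a smarter method is worth having.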
Now, I could give you a combinatorial interpretation for this amazeballs result, but I still get shivers up my spine whenever I think back to my undergrad Combinatorics course. Instead, I'll give you a more general way to compute these enumerations as well as an algorithm that can do all of the tedious pencil-pushing on your behalf.
However, that's not the end of it. It turns out that this algorithm can also compute a closed-form formula for this sequence:

$$f_n = \frac{1}{\sqrt{5}}\left(\left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n\right)$$

where $f_n$ is the number of comma-separated lists of size $n$.

Yes, that's right, there is an algorithm that can determine the closed-form counting expression for every (unambiguous) regular expression!
You will need Python 2.7 or up, though it seems to be most stable on Python 3+.

    git clone https://github.com/leegao/RegexEnumerator.git
    cd RegexEnumerator
    sudo python setup.py develop

Note that you will need to install `numpy`, `scipy`, and `sympy` in order to support solving a few linear equations and to translate numerically computed roots into algebraic forms, if they are available.
To uninstall, run

    pip uninstall RegexEnumerator

We are using vanilla regular expressions, so the standard `*`, `+`, `?`, `|` variety. Note that for `+` and `?`, we've encoded them using just `*` and `|` instead:

$$e^+ = ee^* \qquad e? = (\% \mid e)$$

Here, `%` denotes the "empty" transition in formal languages. In effect, it acts as the identity element of concatenation, so that $\%e = e\% = e$. For example, the regular expression of the comma-delimited language can be encoded as

    e = '0' # or any other regular expression
    regex = '({e}{e}*,)*{e}{e}*'.format(e = e)

`regex_enumerate` offers a few library functions for you to use.
- `enumerate_coefficients`: Returns an infinite generator counting the number of words of size $n$ in your language. Since it depends on numerical approximations, you'll have to contend with round-off and truncation errors.

      from regex_enumerate import enumerate_coefficients
      from itertools import islice
      print(list(islice(enumerate_coefficients('(0+1)*0+'), 10)))
      # [0.0, 1.0, 0.99999999999999989, 1.9999999999999998, 2.9999999999999996, 4.9999999999999982, 7.9999999999999982, 12.999999999999998, 20.999999999999993, 33.999999999999986]

- `exact_coefficients`: Uses a dynamic program to compute the same coefficients. Useful for validation and pure computation, but does not reveal any algebraic structure within the problem.

      from regex_enumerate import exact_coefficients
      from itertools import islice
      print(list(islice(exact_coefficients('(0+1)*0+'), 10)))
      # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

- `algebraic_form`: Computes the algebraic closed form counting formula of a regular expression. The magic behind this will be discussed in the next section. The code looks like

      from regex_enumerate import algebraic_form, evaluate_expression
      from sympy import latex, pprint

      formula = algebraic_form('(0+1)0+')

      # Normal Form
      print(formula)
      # 2.0*DiracDelta(n) + 1.0*DiracDelta(n - 1) + binomial(n + 1, 1) - 3

      # Latex
      print(latex(formula))
      # 2.0 \delta\left(n\right) + 1.0 \delta\left(n - 1\right) + {\binom{n + 1}{1}} - 3

      # ASCII/Unicode pretty print (pprint prints on its own and returns None)
      pprint(formula)
      #                                             /n + 1\
      # 2.0*DiracDelta(n) + 1.0*DiracDelta(n - 1) + |     | - 3
      #                                             \  1  /

      print(evaluate_expression(formula, 10))
      # 8

  Note that this differs from the above since we're enumerating `(0+1)0+` instead of `(0+1)*0+`.

- `check_on_oeis`: This will search https://oeis.org for a potential combinatorial interpretation of your enumeration.

      from regex_enumerate import check_on_oeis
      sequences = check_on_oeis("(0+,)*0+", start=5)
      for oeis in sequences:
          print('%s: https://oeis.org/%s' % (oeis.name, oeis.id))

      # Fibonacci numbers: https://oeis.org/A000045
      # Pisot sequences E(3,5), P(3,5): https://oeis.org/A020701
      # Expansion of (1-x)/(1-x-x^2): https://oeis.org/A212804
      # Pisot sequence E(2,3): https://oeis.org/A020695
      # Least k such that the maximum number of elements among the continued fractions for k/1, k/2, k/3, k/4 : https://oeis.org/A071679
      # a(n) = Fibonacci(n) mod n^3: https://oeis.org/A132636
      # Expansion of 1/(1 - x - x^2 + x^18 - x^20): https://oeis.org/A185357
      # Nearly-Fibonacci sequence: https://oeis.org/A264800
      # Pisot sequences E(5,8), P(5,8): https://oeis.org/A020712
      # a(n) = s(1)t(n) + s(2)t(n-1) + : https://oeis.org/A024595

In addition, regular expressions correspond to the family of rational functions (quotients of two polynomials). To see the generating function of a regular expression, try

    from regex_enumerate import generating_function
    from sympy import latex
    print(latex(generating_function("(0+1)*0+")))
    # \frac{1.0 z}{- 1.0 z^{2} - 1.0 z + 1.0}

which outputs

$$\frac{1.0 z}{- 1.0 z^{2} - 1.0 z + 1.0}$$

There are many regular expressions that are ambiguous. For example, the regular expression

$$0 \mid 0$$

is inherently ambiguous. On encountering a `0`, it's not clear which side of the bar it belongs to. While this poses no challenges to parsing (since we don't output a parse-tree), it does matter in enumeration. In particular, the direct translation of this expression will claim that there are 2 strings of size 1 in this language.

[](To remedy this, you can try to use `regex_enumerate.disambiguate(regex)`, but it's not completely clear that this is correct. Therefore, know that for some regular expressions, this technique will fail unless you manually reduce them to an unambiguous form. There is always a way to do this, though it might create an exponential number of additional states.)
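To see the overcounting concretely, here is a minimal sketch that uses `0|0` as a stand-in for an ambiguous expression (picked purely for illustration). It treats the bar as addition of counting series, which is how the direct translation reads it, and compares the result with the number of distinct strings that Python's `re` actually accepts. The helper `add_series` is hypothetical and not part of the library.

```python
import re

def add_series(a, b):
    """Coefficient-wise sum of two truncated counting series (index = word size)."""
    size = max(len(a), len(b))
    a = a + [0] * (size - len(a))
    b = b + [0] * (size - len(b))
    return [x + y for x, y in zip(a, b)]

letter = [0, 1]                          # one word of size 1: the letter `0`
translated = add_series(letter, letter)  # `0|0` with the bar read as a plus

# Distinct strings of size 1 over the alphabet {0} that the regex accepts.
distinct = sum(1 for word in ["0"] if re.fullmatch(r"0|0", word))

print(translated[1], distinct)  # 2 1
```

The translation counts ways of matching, while the language only contains distinct strings, so the two numbers agree only when every word can be matched in exactly one way.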
Now, all of this might feel a little bullshitty. (Shameless plug, for more bullshitty math, check out http://bullshitmath.lol) Is there any real justification for what you are doing here? Am I just enumerating a bunch of pre-existing cases and running through a giant table lookup?
Well, it's actually a lot simpler (at least the algorithm is) than that. However, there's a bit of a setup for the problem.
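Let's rewind back to our first example; that of enumerating comma-separated sequences of `x`es:

$$(x^+,)^*x^+$$

Let's start with the sequence of `x`es: $x^*$. This language, in an infinitely expanded form, looks like

$$x^* = \% \mid x \mid xx \mid xxx \mid xxxx \mid \cdots$$

Now, here's a trick. Let's pretend that our bar ($\mid$) is a plus sign ($+$), so that

$$x^* = 1 + x + x^2 + x^3 + x^4 + \cdots$$

This looks remarkably familiar. In fact, if you are working within a numerical field, then a little bit of precalculus would also show that

$$\frac{1}{1 - x} = 1 + x + x^2 + x^3 + x^4 + \cdots$$

Could there be some connection here? Well, let's find out. To do this, let's equate the two expressions:

$$x^* = 1 + x + x^2 + x^3 + \cdots = \frac{1}{1 - x}$$

so $x^* = \frac{1}{1 - x}$ and $x^+ = xx^* = \frac{x}{1 - x}$ if we pretend that each regular expression has a numerical value.

In fact, this works for every regular expression. For any regular expressions $e_1, e_2$ and for any letters $x$ we have

$$\begin{aligned} x &\mapsto x \\ e_1 e_2 &\mapsto e_1 \times e_2 \\ e_1 \mid e_2 &\mapsto e_1 + e_2 \\ e_1^* &\mapsto \frac{1}{1 - e_1} \end{aligned}$$

As long as you don't need to invoke the axiom of multiplicative commutativity, this reduction works.

For example, for the comma-separated list example, we have

$$(x^+,)^*x^+ = \frac{1}{1 - \frac{x}{1 - x} \times ,} \times \frac{x}{1 - x}$$

Note here that $,$ is a variable! It might be tempting to try to simplify this further. Letting $c$ denote the comma, we might try

$$\frac{1}{1 - \frac{x}{1 - x} \times c} \times \frac{x}{1 - x} = \frac{x}{1 - x - xc}$$

But this requires a crucial axiom that we do not have:
- We do not have multiplicative commutativity, so we couldn't merge the two factors into a single fraction, since we no longer know whether a product of $x$ and $c$ is $x \times c$ or $c \times x$. []( This begs a natural question. If we can't take inverses or negate things, then why do we admit the expression $\frac{1}{1 - x}$? Well, in this language, that term is atomic. Therefore, we cannot break it down and look at it as a subtraction followed by an inverse; it is just $x^*$. I'll clear this up later. )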
Now that we have this weird "compiler" taking us from regular expressions to numerical formulas, can you tell us what it means for a regular expression to take a numerical value?
The answer: none. There is no meaning in assigning a value of, say, $\frac{1}{2}$ to $x$, or in claiming that $x^* = 2$. It doesn't mean anything, it's just pure gibberish. Don't do it, except maybe values of $0$ or $1$; we'll get to that later.
Okay. So why did we go on this wild goose chase if their values don't even mean anything?
It turns out that the value of a formula is not what we are interested in; these objects are compact and have nice algebraic properties.
When we count things, we just care about how many objects there are that satisfy a certain property. When we count all words of, say, size 5 in a language, we don't care whether these strings are `000,0` or `0,0,0`. The ordering of the letters in these strings is an extraneous detail that we no longer care about. Therefore, it would be nice to be able to forget these details. More formally, if the order of letters in a word doesn't matter, we would say that we want the concatenation operator to be commutative. If there's a representational equivalence to the numerical "field", then the translation would be that we want the multiplication operator to be commutative.
This is a huge game-changer. In the above example, we weren't able to fully simplify that ugly product of fractions precisely because we lacked this crucial axiom. Luckily for us, it now allows us to fully simplify the expression (writing $c$ for the comma)

$$\frac{1}{1 - \frac{x}{1 - x} c} \times \frac{x}{1 - x} = \frac{1 - x}{1 - x - xc} \times \frac{x}{1 - x} = \frac{x}{1 - (x + xc)}$$

which tells us that our regular expression is isomorphic to the regular expression $(x \mid x,)^*x$. That is, for each comma-separated list, you can map it to one of the words in $(x \mid x,)^*x$. In fact, not only are these two languages isomorphic; they are the same! A moment of thought reveals that this new regular expression also matches only comma-separated lists of sequences as well.

That's a pretty cool trick to deduce equivalences between regular expressions, but is that all there is to it?
It turns out that each of these translated numerical expressions also admits an infinite series expansion (in terms of its free variables). So, writing $c$ for the comma,

$$\frac{x}{1 - x - xc} = x + x^2 + x^2c + x^3 + 2x^3c + x^3c^2 + \cdots$$

and in general, we have the multivariable expansion

$$f(x, c) = \sum_{i, j} f_{ij} x^i c^j$$

where $f_{ij}$ is the coefficient attached to the $x^ic^j$ term.

However, recall that each of the $x^ic^j$ corresponds to exactly one of the words in our language. Therefore, since there are 4 words of size 6 with just one comma in our language (`x,xxxx`, `xx,xxx`, `xxx,xx`, and `xxxx,x`), the coefficient in front of $x^5c$ in the series expansion must be 4.

Herein lies the key to our approach. Once we grant the freedom of commutativity, each of these regular expressions "generates" a numerical function with some infinite series expansion. The coefficient of the $x^iy^jz^k$ term in this expansion is then the total count of all objects in this regular language that have $i$ `x`s, $j$ `y`s, and $k$ `z`s.
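The claim that coefficients count words can be checked by brute force. The sketch below is an illustration only (it is not how the library works): it enumerates every short string over `x` and `,`, keeps the ones matching `(x+,)*x+` in Python's `re` syntax, and tallies them by how many `x`s and commas they contain. It also compares each tally with $\binom{i-1}{j}$, which is what you get by choosing which $j$ of the $i - 1$ gaps between `x`s hold a comma. The function name `bivariate_counts` is made up.

```python
import re
from collections import Counter
from itertools import product
from math import comb

PATTERN = re.compile(r"(x+,)*x+")

def bivariate_counts(max_size):
    """Map (number of x's, number of commas) to the number of matching words."""
    counts = Counter()
    for n in range(max_size + 1):
        for letters in product("x,", repeat=n):
            word = "".join(letters)
            if PATTERN.fullmatch(word):
                counts[(word.count("x"), word.count(","))] += 1
    return counts

counts = bivariate_counts(8)
print(counts[(5, 1)])  # 4
print(all(value == comb(i - 1, j) for (i, j), value in counts.items()))  # True
```

Each key `(i, j)` plays the role of the monomial $x^i c^j$ (with $c$ standing for the comma), and its tally is the coefficient in front of it.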
This approach is called the generating function approach within elementary combinatorics. It is a powerful idea to create these compact analytical (if a bit nonsensical) representations of your combinatorial objects of interest in order to use more powerful analytical tools to find properties about them.
We know that there's a translation for our regular expression

$$(x^+,)^*x^+$$

into some numerical field. We also know that this numerical formula admits a two-variable infinite series expansion. The task at hand now is one familiar to most students of complex analysis: coefficient extraction. Given a function $f(x, c)$ (with $c$ standing for the comma), how are we going to find the coefficients of $x^ic^j$?

Before we tackle that beast, let's develop some more intuition about the functions that we will be working with. In general combinatorics, you may face complicated functions using an exotic variety of functions, differential forms, and even implicit functions that can't be expressed in some explicit form. So where do regular expressions sit on this spectrum?
As it turns out, things are much nicer with regular expressions (part of the reason they are called "regular"; their regularities ensure that their algebraic properties are easier to analyze than general unbounded constructions). In particular, if a regular expression has a translation

$$f(x_1, \dots, x_k)$$

then we know for a fact that $f$ is rational. What this means is that there's some pair of **polynomials** $p, q$ such that

$$f = \frac{p}{q}$$

The proof of this fact will be included in the appendix for interested readers; however, that proof does not contribute much here.

Polynomials are interesting in the context of infinite expansions. Since polynomials are already in the form

$$p(z) = p_0 + p_1 z + p_2 z^2 + \cdots + p_d z^d$$

their infinite expansions are in fact finite. Now, the same cannot be said of $\frac{1}{q(z)}$, but a bit of algebra shows that the series expansion of this inverse is also computable. Normalizing so that $q(0) = 1$,

$$\frac{1}{q(z)} = \frac{1}{1 - (1 - q(z))} = \sum_{k = 0}^{\infty} (1 - q(z))^k$$

and since $1 - q(z)$ has no constant term, only the first $n + 1$ terms of this sum contribute to the coefficient of $z^n$.

This form is particularly amenable to coefficient extraction, and a memoized version of this sits at the heart of the validation algorithm we use to test that the algebra for everything else is done correctly. See the appendix for a derivation of the dynamic program that can turn this into a somewhat fast coefficient extraction algorithm.
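One standard way to turn $f = p/q$ into coefficients, sketched here with the Python standard library only, is to multiply out $q(z) f(z) = p(z)$ and solve for one coefficient at a time. This is an illustrative recurrence and not necessarily the dynamic program that the library uses; `series_coefficients` is a hypothetical name, and polynomials are assumed to be coefficient lists with the lowest degree first.

```python
from fractions import Fraction

def series_coefficients(p, q, count):
    """First `count` series coefficients of p(z)/q(z), assuming q[0] != 0."""
    if not q or q[0] == 0:
        raise ValueError("q(0) must be nonzero for a power series to exist")
    memo = []
    for n in range(count):
        # Comparing z^n on both sides of q(z) * f(z) = p(z) gives
        # q_0 * f_n = p_n - sum_{k >= 1} q_k * f_{n - k}.
        total = Fraction(p[n]) if n < len(p) else Fraction(0)
        for k in range(1, min(n, len(q) - 1) + 1):
            total -= q[k] * memo[n - k]
        memo.append(total / q[0])
    return memo

# z / (1 - z - z^2): numerator [0, 1], denominator [1, -1, -1].
fib = series_coefficients([0, 1], [1, -1, -1], 10)
print([int(c) for c in fib])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Using `Fraction` keeps the arithmetic exact, so there is no round-off to contend with, at the cost of revealing nothing about the algebraic structure of the sequence.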
Now, up to now, we've been talking about multivariable functions $f(x_1, \dots, x_k)$. This makes sense since we need to parameterize our model on each of the letters in our alphabet (oh boy). In general however, multivariable coefficient extraction problems are prohibitively difficult. Not only that, the numerical tools needed to compute saddle-points are outside the scope of this toy project. For more on general methods of multivariate enumeration techniques, check out ACVS.
The situation isn't so bleak within the rational-function realm however, and while there is a straightforward extension of the traditional coefficient-extraction technique to multivariable rational-functions, I just never got to it. See Stoutemyer08 for a brief summary of the multivariate partial-fraction decomposition method. Just know that this isn't supported currently.
Instead, we will only support the class of enumeration problems that count the total number of words of a certain (singular) size in some language family. The trick here is to turn a blind eye to the fact that $x$ and $c$ (the comma) are different variables. In order to do this, we just set them both equal to some other variable $z$. Therefore, the Fibonacci generating function above becomes

$$\frac{x}{1 - x - xc} \mapsto \frac{z}{1 - z - z^2}$$

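Collapsing all letters into one variable amounts to adding up every coefficient whose exponents sum to the same size. As a small sketch of that bookkeeping for the comma-separated list example: a word with $i$ `x`s and $j$ commas is fixed by choosing which $j$ of the $i - 1$ gaps hold a comma, so there are $\binom{i-1}{j}$ of them, and the words of size $n$ are those with $i + j = n$. The function name `words_of_size` is hypothetical.

```python
from math import comb

def words_of_size(n):
    """Sum the per-letter counts comb(i - 1, j) over all splits i + j == n, i >= 1."""
    return sum(comb(n - j - 1, j) for j in range(n))

totals = [words_of_size(n) for n in range(10)]
print(totals)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The per-letter detail is gone, but what is left is exactly the single-variable sequence that the rest of the method works with.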
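Now that we have a univariate rational function of the form

$$f(z) = \frac{p(z)}{q(z)}$$

where $p, q$ are mutually irreducible (that is, there isn't some other polynomial that evenly divides both $p$ and $q$).

There's a concept within polynomial algebra known as a partial fraction decomposition. This decomposition theorem tells us that (when $\deg p < \deg q$)

$$\frac{p(z)}{q(z)} = \sum_{\lambda \in V(q)} \sum_{k = 1}^{m_\lambda} \frac{c_{\lambda, k}}{(z - \lambda)^k}$$

where $V(q)$ (the variety) is the set of roots of $q$ and $m_\lambda$ is the multiplicity of the root $\lambda$.

So for example, the rational function $\frac{p(z)}{(z - 1)^2 (z - 2)}$ has the partial fraction decomposition of

$$\frac{c_{1,1}}{z - 1} + \frac{c_{1,2}}{(z - 1)^2} + \frac{c_{2,1}}{z - 2}$$

no matter what $p(z)$ is (as long as its degree is at most 2). To solve for the $c_{\lambda, k}$, you can exploit the fact that expanding the numerator and setting it equal to $p(z)$ will give you a linear system to solve. The details of how we are going to solve this linear system don't matter; it'll be taken care of for you under the hood by `numpy`.

Now, how does the partial fraction decomposition help us? Recall that

$$\frac{1}{1 - z} = 1 + z + z^2 + z^3 + \cdots$$

and in general (by way of the binomial theorem)

$$\frac{1}{(1 - z)^k} = \sum_{n = 0}^{\infty} \binom{n + k - 1}{k - 1} z^n$$

which means that if

$$f(z) = \sum_{\lambda \in V(q)} \sum_{k = 1}^{m_\lambda} \frac{c_{\lambda, k}}{(z - \lambda)^k}$$

then the coefficient on $z^n$ is

$$[z^n] f(z) = \sum_{\lambda \in V(q)} \sum_{k = 1}^{m_\lambda} (-1)^k c_{\lambda, k} \binom{n + k - 1}{k - 1} \lambda^{-(n + k)}$$

Bam! Closed form expression for any arbitrary regular expression!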
While this might seem super complicated, at the heart of this method, we're just using a very well-known method to expand a rational function. This is in part why the functional part of this project that deals with computing this closed form is only a couple of lines long. It's actually a really simple idea.
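As a rough sketch of that method on $\frac{z}{1 - z - z^2}$, the snippet below finds the two roots of the denominator, solves the 2x2 linear system for the partial fraction coefficients by Cramer's rule, and evaluates the resulting closed form. It uses only the standard library rather than `numpy`, and the names `r1`, `r2`, `c1`, `c2`, and `coefficient` are made up for this illustration.

```python
from math import sqrt

# Roots of the denominator q(z) = 1 - z - z^2 = -(z - r1)(z - r2).
r1 = (-1 + sqrt(5)) / 2
r2 = (-1 - sqrt(5)) / 2

# z / q(z) = c1/(z - r1) + c2/(z - r2) means -z = c1*(z - r2) + c2*(z - r1).
# Matching powers of z gives the linear system
#     c1 + c2       = -1
#     c1*r2 + c2*r1 =  0
# which Cramer's rule solves directly.
det = r1 - r2
c1 = -r1 / det
c2 = r2 / det

def coefficient(n):
    """Coefficient of z^n, using 1/(z - r) = -sum_n r^-(n+1) z^n."""
    return -c1 * r1 ** -(n + 1) - c2 * r2 ** -(n + 1)

values = [round(coefficient(n)) for n in range(15)]
print(values)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]
```

The rounding is there because the roots are only known numerically, which is the same round-off caveat that applies to any floating-point version of this computation.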
Let's come back to our favorite example once more. Given the regular expression $(0^+,)^*0^+$, we know that it has the generating function

$$\frac{z}{1 - z - z^2} = \frac{-z}{(z - \lambda_1)(z - \lambda_2)}$$

where $\lambda_1 = \frac{-1 + \sqrt{5}}{2}$ and $\lambda_2 = \frac{-1 - \sqrt{5}}{2}$ are the roots of the quadratic equation $1 - z - z^2 = 0$. We know that this admits a partial fraction decomposition of

$$\frac{c_1}{z - \lambda_1} + \frac{c_2}{z - \lambda_2}$$

therefore $c_1 + c_2 = -1$ and $c_1 \lambda_2 + c_2 \lambda_1 = 0$. Solving this linear system will yield the coefficients $c_1$ and $c_2$, and you'll find that

$$[z^n] \frac{z}{1 - z - z^2} = \frac{1}{\sqrt{5}}\left(\left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n\right)$$

In addition, if you plot the generating function (as in the picture at the top of this page), you'll find that the singularities (the points where the graph suddenly jumps up and forms an infinitely tall column) are located exactly where the roots of $q(z)$ are found. This isn't surprising, since by the fact that $\frac{p}{q}$ is irreducible, the roots of the denominator must be non-removable singularities!

In fact, if all you cared about is the asymptotic exponential behavior, then there's a simple graphical method to compute the asymptotic complexity of enumerating your regular expression. Take $\lambda$ to be the root of the denominator that is closest to the origin on the complex plane, then

$$\limsup_{n \to \infty} |f_n|^{1/n} = \frac{1}{|\lambda|}$$

In addition, if you can figure out the multiplicity $m$ of $\lambda$ (repeatedly divide out $(z - \lambda)$ until that column disappears), you can get an exact asymptotic characterization (provided no other root lies at the same distance from the origin)

$$f_n \in \Theta\left(n^{m - 1} |\lambda|^{-n}\right)$$

It's definitely not too difficult to compute all of this by hand, but the math is really tedious and error prone. This is why this library exists: it automates away the boring parts. In particular, nothing really complicated is going on here.
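The link between growth rate and the nearest root can be checked numerically without any root finder. The sketch below uses $\frac{1}{1 - z - z^2 - z^3}$, a tribonacci-style generating function picked for illustration: it computes coefficients from the recurrence implied by the denominator, estimates the growth factor from the ratio of consecutive coefficients, and confirms that the reciprocal of that factor is (numerically) a root of the denominator. All names are hypothetical.

```python
def tribonacci_like(count):
    """Coefficients of 1 / (1 - z - z^2 - z^3), computed from its recurrence."""
    coeffs = [1, 1, 2]
    while len(coeffs) < count:
        coeffs.append(coeffs[-1] + coeffs[-2] + coeffs[-3])
    return coeffs[:count]

def q(z):
    """The denominator whose smallest root controls the growth."""
    return 1 - z - z**2 - z**3

coeffs = tribonacci_like(60)
growth = coeffs[-1] / coeffs[-2]  # estimates 1/|lambda|
root_estimate = 1 / growth        # should sit on the root closest to the origin

print(abs(q(root_estimate)) < 1e-9)  # True
```

Going the other way, from the root to the growth rate, is what the graphical method does: find where the first column of the plot stands and take the reciprocal of its distance from the origin.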
- `regex_enumerate.parse` has a Shunting-Yard style stack-based parser to convert a regular expression into a regex tree.
- `regex_enumerate.transfer` translates a regex tree into its equivalent numerical expression tree. In addition, it includes several algorithms for computations on polynomial rings and can simplify any induced numerical expression into a canonical form of $\frac{p(z)}{q(z)}$.
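In general, it is not immediately obvious that the expression trees created from regular expressions are rational functions. However, if we already know (ahead of time) that a function is rational, we can rationalize the expression into the form

$$\frac{p(z)}{q(z)}$$

(where $p$ and $q$ are not necessarily irreducible) through a pair of mutually inductive reductions.

Suppose that the language of unreduced numerical expressions is given by

$$e ::= z \mid e + e \mid e \times e \mid \frac{1}{1 - e}$$

we would like to reduce an arbitrary expression $e$ into some $\frac{p}{q}$ where $p, q$ are straightforward polynomials.

To do this, let's start by defining the canonical class of polynomial expressions $P$:

$$P ::= a_0 + a_1 z + a_2 z^2 + \cdots + a_d z^d$$

and define the canonical form of $e$ as $R$, where

$$R ::= \frac{P}{P}$$

To construct this reduction $\Downarrow_r$, we need another inductive class of the simple ring of polynomials, $S$, defined by

$$S ::= P \mid S + S \mid S - S \mid S \times S$$

with an associated reduction operator $\Downarrow_p$ that simplifies arithmetic on polynomials.

Now, let us give the inductive definition of the reduction relations. Writing $e_1 \Downarrow_r \frac{p_1}{q_1}$ and $e_2 \Downarrow_r \frac{p_2}{q_2}$ for the reduced subexpressions,

$$\begin{aligned} z &\Downarrow_r \frac{z}{1} \\ e_1 + e_2 &\Downarrow_r \frac{p}{q} && \text{where } p_1 \times q_2 + p_2 \times q_1 \Downarrow_p p \text{ and } q_1 \times q_2 \Downarrow_p q \\ e_1 \times e_2 &\Downarrow_r \frac{p}{q} && \text{where } p_1 \times p_2 \Downarrow_p p \text{ and } q_1 \times q_2 \Downarrow_p q \\ \frac{1}{1 - e_1} &\Downarrow_r \frac{q_1}{q} && \text{where } q_1 - p_1 \Downarrow_p q \end{aligned}$$

while $\Downarrow_p$ evaluates sums, differences, and products of canonical polynomials coefficient by coefficient.

This pair of reduction rules is implemented in `regex_enumerate.transfer.down_r` and `regex_enumerate.transfer.down_p` respectively.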
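To make the reduction to $\frac{p}{q}$ concrete, here is a minimal standalone sketch. It is not the library's `down_r`/`down_p` code: regex trees are assumed to be nested tuples tagged `"letter"`, `"alt"`, `"cat"`, and `"star"`, polynomials are coefficient lists with the lowest degree first, every letter is mapped to the same variable $z$, and all function names are made up for this illustration.

```python
from fractions import Fraction

def poly_add(a, b):
    """Add two polynomials given as coefficient lists (lowest degree first)."""
    size = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(size)]

def poly_mul(a, b):
    """Multiply two polynomials by convolving their coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def to_rational(node):
    """Reduce a regex tree to a pair (p, q) standing for p(z)/q(z)."""
    kind = node[0]
    if kind == "letter":  # a single letter becomes z
        return [0, 1], [1]
    if kind == "alt":     # p1/q1 + p2/q2 = (p1*q2 + p2*q1) / (q1*q2)
        (p1, q1), (p2, q2) = to_rational(node[1]), to_rational(node[2])
        return poly_add(poly_mul(p1, q2), poly_mul(p2, q1)), poly_mul(q1, q2)
    if kind == "cat":     # (p1/q1) * (p2/q2) = (p1*p2) / (q1*q2)
        (p1, q1), (p2, q2) = to_rational(node[1]), to_rational(node[2])
        return poly_mul(p1, p2), poly_mul(q1, q2)
    if kind == "star":    # 1 / (1 - p/q) = q / (q - p)
        p, q = to_rational(node[1])
        return q, poly_add(q, [-c for c in p])
    raise ValueError("unknown node kind: %r" % (kind,))

def series(p, q, count):
    """First `count` coefficients of p(z)/q(z), from q(z) * f(z) = p(z)."""
    out = []
    for n in range(count):
        total = Fraction(p[n] if n < len(p) else 0)
        for k in range(1, min(n, len(q) - 1) + 1):
            total -= q[k] * out[n - k]
        out.append(total / q[0])
    return out

letter = ("letter",)
zeros = ("cat", letter, ("star", letter))                 # 0+ written as 00*
lists = ("cat", ("star", ("cat", zeros, letter)), zeros)  # (0+,)*0+
p, q = to_rational(lists)
print(p, q)  # [0, 1, -1] [1, -2, 0, 1]
print([int(c) for c in series(p, q, 10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Note that the result $\frac{z - z^2}{1 - 2z + z^3}$ is not in lowest terms (both sides share a factor of $1 - z$), which matches the remark that $p$ and $q$ are not necessarily irreducible after this reduction.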
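Theorem: The translation $\llbracket \cdot \rrbracket$ given by

$$\begin{aligned} \llbracket x \rrbracket &= x \\ \llbracket \% \rrbracket &= 1 \\ \llbracket e_1 \mid e_2 \rrbracket &= \llbracket e_1 \rrbracket + \llbracket e_2 \rrbracket \\ \llbracket e_1 e_2 \rrbracket &= \llbracket e_1 \rrbracket \times \llbracket e_2 \rrbracket \\ \llbracket e_1^* \rrbracket &= \frac{1}{1 - \llbracket e_1 \rrbracket} \end{aligned}$$

only generates rational functions.

Proof: By structural induction on $e$.
- $e = x$ or $e = \%$. A single letter translates to the polynomial $x$ and the empty word to the constant $1$, both of which are rational.
- $e = e_1 \mid e_2$. Now, by the induction hypothesis, both $\llbracket e_1 \rrbracket$ and $\llbracket e_2 \rrbracket$ are rational. Since the sum of two rational functions is still rational, so too is $\llbracket e_1 \rrbracket + \llbracket e_2 \rrbracket$.
- $e = e_1 e_2$. Again, by the induction hypothesis, both $\llbracket e_1 \rrbracket$ and $\llbracket e_2 \rrbracket$ are rational. Since the product of two rational functions is still rational, so too is $\llbracket e_1 \rrbracket \times \llbracket e_2 \rrbracket$.
- $e = e_1^*$. Again, we know that $\llbracket e_1 \rrbracket$ is rational by the induction hypothesis. As a result, $1 - \llbracket e_1 \rrbracket$ is also rational, and the inverse of a rational function is still rational as long as that function is not the zero function $0$. While it is possible to construct zero via expressions such as $\%^*$ (where $1 - \llbracket \% \rrbracket = 0$), they are inherently ambiguous, and hence not in the proper domain of our analysis. Therefore, $\llbracket e_1^* \rrbracket$ is rational.

Since this covers all cases of $e$, it must be the case that $\llbracket e \rrbracket$ is rational.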
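As mentioned before, we'll give an algorithm to compute the exact enumeration problem in cubic time. In essence, the input consists of the polynomials

$$p(z) = \sum_k p_k z^k \qquad q(z) = \sum_k q_k z^k$$

where without loss of generality, $\frac{p}{q}$ is irreducible (and $q$ is normalized so that $q(0) = 1$). We will consider the problem of finding $[z^n] \frac{p(z)}{q(z)}$, which is the coefficient of $z^n$ in the infinite series expansion of $\frac{p(z)}{q(z)}$. In particular, since it's easy to compute $[z^n] p(z)$ since it is a polynomial, we will instead focus on the problem of computing $[z^n] \frac{1}{q(z)}$. Furthermore, we can apply an identity transformation and consider

$$\frac{1}{q(z)} = \frac{1}{1 - (1 - q(z))}$$

There are a few ways to do this. An easy way is to let $r(z) = 1 - q(z)$, hence

$$\frac{1}{q(z)} = \frac{1}{1 - r(z)} = \sum_{k = 0}^{\infty} r(z)^k$$

By the binomial theorem, we know that

$$r(z)^k = (1 - q(z))^k = \sum_{j = 0}^{k} \binom{k}{j} (-1)^j q(z)^j$$

so, because $r(z)$ has no constant term and $r(z)^k$ therefore cannot contribute to $z^n$ once $k > n$,

$$[z^n] \frac{1}{q(z)} = \sum_{k = 0}^{n} \sum_{j = 0}^{k} \binom{k}{j} (-1)^j [z^n] q(z)^j$$

Since we can compute $q(z)^j$ from the solution of $q(z)^{j - 1}$, computing this product forms the basis of our dynamic program. We will in turn focus on the problem of computing

$$[z^n] q(z)^j = \sum_{m = 0}^{n} [z^m] q(z)^{j - 1} \times [z^{n - m}] q(z)$$

where $[z^m] q(z)^{j - 1}$ and $[z^{n - m}] q(z)$ are subproblems. Computing all of these in turn requires $O(n^3)$ time.

There is also a way around the ambiguity problem. In fact, one common transformation people do on regular expressions is compiling them down into a deterministic finite automaton (DFA). A DFA is inherently ambiguity-free since every path through a DFA must correspond to a different word in the language; otherwise there's some state that can transition to two different states on the same input, which is impossible since DFAs are, well, deterministic. This then solves the problem of having to specify an unambiguous grammar: all DFAs are unambiguous.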
A DFA is a triple $(V, E, \ell)$ where $(V, E)$ gives the graph underlying the DFA and $\ell$ labels each edge with (zero or more) letters it can transition on.

Given a DFA, we can also solve the same problem. Suppose that $|V| = k$, then we can construct a $k \times k$ matrix

$$A_{vu} = |\ell(u, v)|$$

where $A$ is the adjacency matrix of $(V, E)$ (transposed), but tracking the in-degree (the number of letters on each edge) instead of just whether $(u, v)$ is an edge.

Then, we can calculate the count of all $n$-letter words in the language induced by the DFA via

$$f_n = a^T A^n e_1$$

where $e_1$ is the first column of the identity (taking the start state to be the first state) and $a$ is the indicator vector of the accepting states. Given an eigenvalue decomposition, this will then reduce to solving a linear system fitting

$$f_n = \sum_i c_i \lambda_i^n$$

This has several advantages, but numerical computation of the eigenvalues of a linear system is particularly sensitive to perturbations. Nevertheless, this is an elegant approach that solves one of the biggest issues of the current technique, and all without doing more than simple linear algebra (plus NFA determinization, which is hard).
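Here is a small sketch of the matrix approach for the comma-separated list language, assuming a hand-built three-state DFA (start, inside a run of zeros, just after a comma) with the dead state left out since it never contributes to the count. The matrix layout and the names `A`, `accepting`, and `count_words` are illustrative choices, not library code, and plain lists are used instead of `numpy` to keep the example dependency-free.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a column vector."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

# States: 0 = start, 1 = inside a run of zeros (accepting), 2 = just read a comma.
# A[v][u] = number of letters that move the DFA from state u to state v.
A = [
    [0, 0, 0],  # nothing leads back to the start state
    [1, 1, 1],  # the letter `0` leads into state 1 from every state
    [0, 1, 0],  # the letter `,` leads from state 1 to state 2
]
accepting = [0, 1, 0]

def count_words(n):
    """Number of accepted words of length n: accepting^T * A^n * e_1."""
    state = [1, 0, 0]  # e_1: all weight on the start state
    for _ in range(n):
        state = mat_vec(A, state)
    return sum(a * s for a, s in zip(accepting, state))

counts = [count_words(n) for n in range(10)]
print(counts)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Repeated multiplication gives exact counts; the closed form would come from diagonalizing `A` instead, which is where the numerical sensitivity mentioned above enters.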
- `(00*1)*`: 1-separated strings that start with 0 and end with 1

Its generating function is

$$\frac{1 - z}{1 - z - z^2}$$

For words of sizes up to 20 in this language, their counts are:

1, 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584

Its closed form is
A list of OEIS entries that contains this subsequence.
- Fibonacci numbers: https://oeis.org/A000045
- Pisot sequences E(3,5), P(3,5): https://oeis.org/A020701
- Fibonacci numbers whose decimal expansion does not contain any digit 0: https://oeis.org/A177194
- Expansion of (1-x)/(1-x-x^2): https://oeis.org/A212804
- Pisot sequence E(2,3): https://oeis.org/A020695
- Least k such that the maximum number of elements among the continued fractions for k/1, k/2, k/3, k/4 : https://oeis.org/A071679
- a(n) = Fibonacci(n) mod n^3: https://oeis.org/A132636
- Expansion of 1/(1 - x - x^2 + x^18 - x^20): https://oeis.org/A185357
- Numbers generated by a Fibonacci-like sequence in which zeros are suppressed: https://oeis.org/A243063
- Fibonacci numbers Fib(n) whose decimal expansion does not contain any digit 6: https://oeis.org/A177247
- `(%|1|11)(00*(1|11))*0* | 1`: complete 1 or 11-separated strings

Its generating function is

$$\frac{1 + z + z^2}{1 - z - z^2 - z^3} + z$$

For words of sizes up to 20 in this language, their counts are:

1, 3, 4, 7, 13, 24, 44, 81, 149, 274, 504, 927, 1705, 3136, 5768, 10609, 19513, 35890, 66012, 121415

Its closed form is
A list of OEIS entries that contains this subsequence.
- Tribonacci numbers: https://oeis.org/A000073
- `(000)*(111)*(22)*(33)*(44)*`: an example whose denominator has complex roots

Its generating function is

$$\frac{1}{(1 - z^3)^2 (1 - z^2)^3}$$

For words of sizes up to 20 in this language, their counts are:

1, 0, 3, 2, 6, 6, 13, 12, 24, 24, 39, 42, 63, 66, 96, 102, 138, 150, 196, 210

Its closed form is
A list of OEIS entries that contains this subsequence.
- `1*(22)*(333)*(4444)*(55555)*`: number of ways to make change given coins of denominations 1, 2, 3, 4, and 5

Its generating function is

$$\frac{1}{(1 - z)(1 - z^2)(1 - z^3)(1 - z^4)(1 - z^5)}$$

For words of sizes up to 20 in this language, their counts are:

1, 1, 2, 3, 5, 7, 10, 13, 18, 23, 30, 37, 47, 57, 70, 84, 101, 119, 141, 164

Its closed form is
A list of OEIS entries that contains this subsequence.
- Number of partitions of n into at most 5 parts: https://oeis.org/A001401
- Number of partitions of n in which the greatest part is 5: https://oeis.org/A026811
- `11* 22* 33* 44* 55*`: compositions of n into 5 parts

Its generating function is

$$\frac{z^5}{(1 - z)^5}$$

For words of sizes up to 20 in this language, their counts are:

0, 0, 0, 0, 0, 1, 5, 15, 35, 70, 126, 210, 330, 495, 715, 1001, 1365, 1820, 2380, 3060

Its closed form is
A list of OEIS entries that contains this subsequence.
- Binomial coefficient binomial(n,4) = n*(n-1)(n-2)(n-3)/24: https://oeis.org/A000332
- `(11*)*`: all compositions of n

Its generating function is

$$\frac{1 - z}{1 - 2z}$$

For words of sizes up to 20 in this language, their counts are:

1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144

Its closed form is
A list of OEIS entries that contains this subsequence.
- Powers of 2: https://oeis.org/A000079
- Expansion of (1-x)/(1-2*x) in powers of x: https://oeis.org/A011782
- Zero followed by powers of 2 (cf: https://oeis.org/A131577
- Powers of 2, omitting 2 itself: https://oeis.org/A151821
- Orders of finite Abelian groups having the incrementally largest numbers of nonisomorphic forms (A046054): https://oeis.org/A046055
- a(n) = floor(2^|n-1|/2): https://oeis.org/A034008
- Smallest exponent such that -1+3^a(n) is divisible by 2^n: https://oeis.org/A090129
- Pisot sequences E(4,8), L(4,8), P(4,8), T(4,8): https://oeis.org/A020707
- Numbers n such that in the difference triangle of the divisors of n (including the divisors of n) the diagonal from the bottom entry to n gives the divisors of n: https://oeis.org/A273109
- a(n)=2*A131577(n): https://oeis.org/A155559
- `(.........................)* (..........)* (.....)* (.)*`: number of ways to make n cents with US coins.

Its generating function is

$$\frac{1}{(1 - z^{25})(1 - z^{10})(1 - z^5)(1 - z)}$$

For words of sizes up to 20 in this language, their counts are:

1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6

Its closed form is
A list of OEIS entries that contains this subsequence.
- Highest minimal distance of any Type I (strictly) singly-even binary self-dual code of length 2n: https://oeis.org/A105674
- Number of ways of making change for n cents using coins of 1, 5, 10, 25 cents: https://oeis.org/A001299
- Number of ways of making change for n cents using coins of 1, 5, 10, 25, 50 cents: https://oeis.org/A001300
- Number of ways of making change for n cents using coins of 1, 5, 10, 25, 50 and 100 cents: https://oeis.org/A169718
- Number of ways of making change for n cents using coins of 1, 5, 10, 20, 50, 100 cents: https://oeis.org/A001306
- Repetition of even numbers, with initial zeros, five times: https://oeis.org/A130496
- Number of ways of making change for n cents using coins of 1, 5, 10 cents: https://oeis.org/A187243
- Coefficients of the mock theta function chibar(q): https://oeis.org/A260984
- `(00*1)*00*`: list of 0-sequences

Its generating function is

$$\frac{z}{1 - z - z^2}$$

For words of sizes up to 20 in this language, their counts are:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181

Its closed form is
A list of OEIS entries that contains this subsequence.
- Fibonacci numbers: https://oeis.org/A000045
- Pisot sequences E(3,5), P(3,5): https://oeis.org/A020701
- Expansion of (1-x)/(1-x-x^2): https://oeis.org/A212804
- Pisot sequence E(2,3): https://oeis.org/A020695
- Least k such that the maximum number of elements among the continued fractions for k/1, k/2, k/3, k/4 : https://oeis.org/A071679
- a(n) = Fibonacci(n) mod n^3: https://oeis.org/A132636
- Expansion of 1/(1 - x - x^2 + x^18 - x^20): https://oeis.org/A185357
- Nearly-Fibonacci sequence: https://oeis.org/A264800
- Pisot sequences E(5,8), P(5,8): https://oeis.org/A020712
- a(n) = s(1)t(n) + s(2)t(n-1) + : https://oeis.org/A024595