evtn/rgx

Many people complain about the unreadable and complex syntax of regular expressions.
Many others complain that they can't remember all the constructs and features.

rgx solves those problems: it is a straightforward regexp builder. It also places non-capturing groups where needed to respect intended operator priority.
It can produce a regular expression string to use in re.compile or any other regex library of your choice.

In other words, with rgx you can build a regular expression from parts, using straightforward and simple expressions.

Installation

pip install rgx

That's it.

Basic usage

Hello, regex world

from rgx import pattern, meta
import re

separator = meta.WHITESPACE.some() + (meta.WHITESPACE | ",") + meta.WHITESPACE.some()

# matches "hello world", "hello, world", "hello            world", "hello,world", "hello ,  world"
hello_world = pattern((
    "hello",
    separator,
    "world"
)) # (?:hello(?:\s)*(?:\s|,)(?:\s)*world)

re.compile(
    hello_world.render_str("i") # global flag (case-insensitive)
)
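As a quick sanity check that needs only the standard library, the sketch below takes the rendered string from the comment above (assumed to be what rgx emits), prepends the inline flag group by hand, and runs it against the sample inputs. The names RENDERED, samples and the result variables are illustrative and not part of rgx.

```python
import re

# Rendered form copied from the comment above; the "(?i)" prefix is assumed
# to be equivalent to what render_str("i") would prepend.
RENDERED = r"(?i)(?:hello(?:\s)*(?:\s|,)(?:\s)*world)"
hello_re = re.compile(RENDERED)

samples = [
    "hello world",
    "hello, world",
    "hello            world",
    "hello,world",
    "hello ,  world",
]

# Every sample listed in the comment should match in full.
all_match = all(hello_re.fullmatch(s) for s in samples)

# At least one whitespace character or comma is required between the words.
no_separator = hello_re.fullmatch("helloworld")

# The "i" flag makes the match case-insensitive.
upper_case = hello_re.fullmatch("HELLO, WORLD")
```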

Match some integers

This regex matches decimal integers without leading zeros (a subset of valid Python integer literals):

from rgx import pattern
import re

nonzero = pattern("1").to("9") # [1-9]
zero = "0"
digit = zero | nonzero # [0-9]
integer = zero | (nonzero + digit.some()) # 0|[1-9][0-9]*

int_regex = re.compile(str(integer))

...or this one:

from rgx import pattern, meta
import re

nonzero = pattern("1").to("9") # [1-9]
digit = meta.DIGIT # \d
integer = digit | (nonzero + digit.some()) # \d|[1-9]\d*

int_regex = re.compile(str(integer))
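Neither rendered pattern is anchored, so how you call the compiled regex matters. The sketch below uses only the standard library and the rendered strings from the comments above (assumed to be what rgx produces). The helper is_integer is a hypothetical name introduced here.

```python
import re

# Rendered strings copied from the comments in the two examples above.
strict = re.compile(r"0|[1-9][0-9]*")
loose = re.compile(r"\d|[1-9]\d*")


def is_integer(regex, text):
    """Return True only if the whole string matches."""
    return regex.fullmatch(text) is not None


# Maps each input to (result with strict, result with loose).
results = {
    text: (is_integer(strict, text), is_integer(loose, text))
    for text in ["0", "7", "42", "007", "4a", ""]
}

# .match() only anchors at the start, so it accepts a prefix of "007".
prefix = strict.match("007").group()
```

With fullmatch, both variants accept "0", "7" and "42" and reject "007", "4a" and the empty string.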

Quickstart

In this readme, x means some pattern object. Occasionally, y is introduced to mean some other pattern object (or literal).

Literals and pattern objects

rgx operates mostly on so-called "pattern objects", which are rgx.entities.RegexPattern instances.
Your starting point is rgx.pattern: it creates pattern objects from literals (and from other pattern objects, although that is rarely useful).

rgx.pattern(str, escape: bool = True) creates a literal pattern, one that matches the given string exactly. To disable escaping, pass escape=False.
rgx.pattern(tuple[AnyRegexPattern]) creates a non-capturing group of patterns (nested literals are converted too).
rgx.pattern(list[str]) creates a character class (for example, rgx.pattern(["a", "b", "c"]) creates the pattern [abc], which matches any one of the characters in the brackets).
- The same can be achieved with rgx.pattern("a").to("c") or rgx.pattern("a") | "b" | "c".

Most operations on pattern objects accept a Python literal on one side. For example, rgx.pattern("a") | "b" produces an [ab] pattern object (specifically, rgx.entities.Chars).
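To see why escaping literals matters, here is a small standard-library sketch. It uses re.escape as a stand-in for the escaping step. That is an assumption made for illustration, and rgx's own escaping may differ in detail.

```python
import re

text = "1+1=2"

# Unescaped: "+" is a quantifier, so the pattern means "one or more 1s, then 1=2".
unescaped = re.fullmatch(text, text)
unescaped_other = re.fullmatch(text, "11=2")

# Escaped: every character is matched literally.
escaped = re.fullmatch(re.escape(text), text)
```

The unescaped pattern fails on its own source text but matches "11=2", while the escaped one matches the text exactly.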

Rendering patterns

from rgx import pattern

x = pattern("one")
y = pattern("two")
p = x | y

rendered_with_str = str(p) # "one|two"
rendered_with_method = p.render_str() # "one|two"
rendered_with_method_flags = p.render_str("im") # (?im)one|two
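The flag prefix in the last line is ordinary inline-flag syntax, so the rendered string can be handed to re.compile without extra flag arguments. A minimal check with the standard library, using the rendered string from the comment above (the variable names are illustrative):

```python
import re

# Inline flags, as in the rendered string above.
inline = re.compile(r"(?im)one|two")

# The inline group switches on the same flags one would otherwise pass explicitly.
has_ignorecase = bool(inline.flags & re.IGNORECASE)
has_multiline = bool(inline.flags & re.MULTILINE)

# Global flags apply to the whole expression, including the second alternative.
matches_upper = inline.fullmatch("TWO") is not None
```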

Capturing Groups

from rgx import pattern, reference, named

x = pattern("x")

print(x.capture()) # (x)

print(reference(1)) # \1


named_x = x.named("some_x") # x.named(name: str)

print(named_x) # (?P<some_x>x)

named_x_reference = named("some_x")

print(named_x_reference) # (?P=some_x)

To create a capturing group, use x.capture(); to refer back to it, use rgx.reference(group: int).
To create a named capturing group, use x.named(name: str) as in the example above (or rgx.named(name: str, x)); for a named reference, use rgx.named(name: str).
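For readers less familiar with backreferences, this is how a capture followed by a reference behaves once compiled. The patterns are written by hand in plain re syntax rather than generated by rgx.

```python
import re

# A capture followed by a reference to group 1.
numbered = re.compile(r"(x)\1")

# A named capture followed by a reference to that name.
by_name = re.compile(r"(?P<some_x>x)(?P=some_x)")

# The reference must repeat exactly what the group captured.
match = by_name.fullmatch("xx")
captured = match.group("some_x")
```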

Character classes

from rgx import pattern, meta


az = pattern("a").to("z") # rgx.Chars.to(other: str | Literal | Chars)
print(az) # [a-z]

digits_or_space = pattern(["1", "2", "3", meta.WHITESPACE])
print(digits_or_space) # [123\s]

print(az | digits_or_space) # [a-z123\s]


print( # rgx.Chars.reverse(self)
    (az | digits_or_space).reverse() # [^a-z123\s]
)

Excluding characters

If you have two instances of Chars (or compatible literals), you can exclude one from another:

from rgx import pattern

letters = pattern("a").to("z") | pattern("A").to("Z") # [A-Za-z]
vowels = pattern(list("aAeEiIoOuU")) # [AEIOUaeiou]
consonants = letters.exclude(vowels) # [BCDFGHJ-NP-TV-Zbcdfghj-np-tv-z]
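The resulting class can be checked against plain set arithmetic. This sketch uses only the standard library and the rendered class from the comment above (assumed to be what exclude produces).

```python
import re
import string

# Rendered character class copied from the comment above.
consonants_re = re.compile(r"[BCDFGHJ-NP-TV-Zbcdfghj-np-tv-z]")

# Letters the class accepts versus letters minus vowels computed directly.
matched = {ch for ch in string.ascii_letters if consonants_re.fullmatch(ch)}
expected = set(string.ascii_letters) - set("aAeEiIoOuU")
```

The two sets are equal, so the compact ranges cover exactly the ASCII consonants.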

Conditional pattern

from rgx import pattern, conditional

x = pattern("x")
y = pattern("y")
z = pattern("z")

capture = x.capture()

# (x)(?(1)y|z)
print(
    capture + conditional(1, y, z)
)
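A conditional picks its branch depending on whether the referenced group took part in the match. In the pattern above group 1 always participates, so only the y branch can ever be taken. The sketch below shows this with plain re and adds a hand-written variant with an optional group, which is not produced by the example above, to show the other branch.

```python
import re

# Pattern from the comment above: group 1 is mandatory, so "y" is required.
always = re.compile(r"(x)(?(1)y|z)")

# Hand-written variant: group 1 is optional, so either branch can be taken.
optional = re.compile(r"(x)?(?(1)y|z)")

results = {
    "always xy": always.fullmatch("xy") is not None,
    "always xz": always.fullmatch("xz") is not None,
    "optional xy": optional.fullmatch("xy") is not None,
    "optional z": optional.fullmatch("z") is not None,
    "optional xz": optional.fullmatch("xz") is not None,
}
```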

Repeating patterns

If you need to match a repeating pattern, you can use pattern.repeat(count, lazy):

from rgx import pattern

a = pattern("a")

a.repeat(5)                      # a{5}
# or
a * 5                            # a{5}, multiplication is an alias for .repeat

a.repeat(5).or_more()            # a{5,}
a.repeat(5).or_less()            # a{,5}

a.repeat_from(4).to(5)           # a{4,5}, .repeat_from is just an alias for .repeat
# or
a.repeat(4) >> 5                 # a{4,5}

a.repeat(1).or_less()            # a?
# or
-a.repeat(1)                     # a?
# or
a.maybe()                        # a?

a.repeat(1).or_more()            # a+
# or
+a.repeat(1)                     # a+
# or
+a                               # a+
# or
a.many()                         # a+

a.repeat(0).or_more()            # a*
# or
+a.repeat(0)                     # a*
# or
a.some()                         # a*
# or (what)
+-(a * 38)                       # a*
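# The quantifier strings in the comments can be checked with plain re.
# This part does not call rgx; the names below are illustrative.

```python
import re

exactly_five = re.compile(r"a{5}")
five_or_more = re.compile(r"a{5,}")
up_to_five = re.compile(r"a{,5}")    # Python reads a missing lower bound as 0
four_to_five = re.compile(r"a{4,5}")

# A space inside the braces is not a quantifier: Python's re treats
# the braces and their contents as literal text.
with_space = re.compile(r"a{4, 5}")

checks = {
    "a{5} on 5": exactly_five.fullmatch("a" * 5) is not None,
    "a{5} on 4": exactly_five.fullmatch("a" * 4) is not None,
    "a{5,} on 9": five_or_more.fullmatch("a" * 9) is not None,
    "a{,5} on 0": up_to_five.fullmatch("") is not None,
    "a{,5} on 6": up_to_five.fullmatch("a" * 6) is not None,
    "a{4,5} on 4": four_to_five.fullmatch("a" * 4) is not None,
    "spaced on 4": with_space.fullmatch("a" * 4) is not None,
    "spaced literal": with_space.fullmatch("a{4, 5}") is not None,
}
```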

Here's what's going on:
pattern.repeat(count, lazy) returns a {count, count} Range object
pattern * count is the same as pattern.repeat(count, False)

Range implements or_more, or_less and to methods:

Range.or_more() [or +Range] returns a copy with the upper bound moved to infinity (actually None)
Range.or_less() [or -Range] returns a copy with the lower bound moved to 0
Range.to(count) [or Range >> count (right shift)] replaces the upper bound with the given number

Also, RegexPattern implements unary plus (+pattern) as an alias for pattern.many()
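The operator behaviour described above can be modelled in a few lines. The class below is a toy stand-in written for illustration only: ToyRange, its fields and its render method are hypothetical names, and rgx's real Range may be implemented quite differently.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToyRange:
    """Toy model of a {lower, upper} repetition; not rgx's actual class."""

    body: str
    lower: int
    upper: Optional[int]  # None stands for "no upper bound"

    def or_more(self) -> "ToyRange":
        return replace(self, upper=None)   # copy with the upper bound removed

    def or_less(self) -> "ToyRange":
        return replace(self, lower=0)      # copy with the lower bound set to 0

    def to(self, count: int) -> "ToyRange":
        return replace(self, upper=count)  # copy with a new upper bound

    # Operator aliases: +r, -r and r >> count.
    __pos__ = or_more
    __neg__ = or_less
    __rshift__ = to

    def render(self) -> str:
        shorthand = {(0, None): "*", (1, None): "+", (0, 1): "?"}
        bounds = (self.lower, self.upper)
        if bounds in shorthand:
            return self.body + shorthand[bounds]
        if self.lower == self.upper:
            return f"{self.body}{{{self.lower}}}"
        lower = "" if self.lower == 0 else str(self.lower)
        upper = "" if self.upper is None else str(self.upper)
        return f"{self.body}{{{lower},{upper}}}"


five = ToyRange("a", 5, 5)
star = +-ToyRange("a", 38, 38)  # lower bound to 0, then upper bound removed
```

Under this model, +-(a * 38) ends up as a*: the unary minus drops the lower bound to 0 and the unary plus removes the upper bound, so the original count no longer matters.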

Docs

Pattern methods

`pattern.render_str(flags: str = '') -> str`

Renders the pattern into a string, prefixed with the specified global flags (if any).

`pattern.set_flags(flags: str) -> LocalFlags`

This method adds local flags to the given pattern.

x.set_flags("i") # "(?i:x)"
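Local flags apply only inside their group, unlike the global flags passed to render_str. The standard-library sketch below shows the difference using a hand-written pattern with the real i flag; it does not call rgx.

```python
import re

# Only "x" is case-insensitive; "y" stays case-sensitive.
scoped = re.compile(r"(?i:x)y")

lower_ok = scoped.fullmatch("xy") is not None
upper_x_ok = scoped.fullmatch("Xy") is not None
upper_y_ok = scoped.fullmatch("XY") is not None
```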

`pattern.concat(other: AnyRegexPattern) -> Concat`

Use to match one pattern and then another.

A.concat(B) is equivalent to A + B (the operator form works as long as at least one of A or B is a pattern object rather than a Python literal)

x.concat(y) # "xy"
x + y # "xy"

`pattern.option(other: AnyRegexPattern) -> Chars | ReversedChars | Option`

Use to match either one pattern or another.

A.option(B) is equivalent to A | B (the operator form works as long as at least one of A or B is a pattern object rather than a Python literal)

x.option(y) # "x|y"
x | y # "x|y"
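Combining an option with a concatenation is where operator priority starts to matter, because | binds more loosely than concatenation in regex syntax. The sketch below shows the pitfall with hand-joined strings in plain re. rgx is described above as inserting the non-capturing group for you, and this code does not call rgx.

```python
import re

# Naive string joining gives "x|yz": either "x", or "y" followed by "z".
naive = re.compile("x|y" + "z")

# With a non-capturing group: "x" or "y", then "z".
grouped = re.compile("(?:x|y)" + "z")

results = {
    "naive xz": naive.fullmatch("xz") is not None,
    "naive x": naive.fullmatch("x") is not None,
    "grouped xz": grouped.fullmatch("xz") is not None,
    "grouped yz": grouped.fullmatch("yz") is not None,
    "grouped x": grouped.fullmatch("x") is not None,
}
```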