bielikb / Regex

Open source regex engine

Regex

Open source regex engine.

Warning: not meant to be used in production; created for learning purposes!
See the "Let's Build a Regex Engine" series to learn how this project came to be.

Usage

Create a Regex object by providing a pattern and an optional set of options (Regex.Options):

let regex = try Regex(#"<\/?[\w\s]*>|<.+[\W]>"#)

The pattern is parsed and compiled to a special internal representation. If there is an error in the pattern, the initializer throws a detailed error containing the index of the failing token and an error message.
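A minimal sketch of handling a malformed pattern, relying only on the throwing initializer described above. The module name `Regex` in the import and the helper `compile(_:)` are assumptions made for illustration. The properties of the thrown error are not documented here, so the example just prints it.

```swift
import Regex // assumption: the package exposes a module named `Regex`

/// Hypothetical helper: returns a compiled regex, or nil after logging
/// why the pattern was rejected.
func compile(_ pattern: String) -> Regex? {
    do {
        return try Regex(pattern)
    } catch {
        // The error is expected to carry the failing token's index and a message.
        print("Invalid pattern \(pattern): \(error)")
        return nil
    }
}

let valid = compile(#"\d+"#)   // a well-formed pattern
let invalid = compile("(abc")  // the group is never closed, so this should be rejected
```

Wrapping the initializer this way keeps `try` out of call sites that only need to know whether the pattern compiled.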

Use the isMatch(_:) method to check whether the regular expression pattern occurs in the input text:

regex.isMatch("<h1>Title</h1>")

Retrieve all occurrences of text that match the regular expression by calling the matches(in:) method. Each match contains a range in the input string.

for match in regex.matches(in: "<h1>Title</h1>\n<p>Text</p>") {
    print(match.value)
}
// Prints "<h1>", "</h1>", "<p>" and "</p>", one per line

If you just want a single match, use regex.firstMatch(in:).
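A short sketch of the single-match case. It assumes that firstMatch(in:) returns an optional that is nil when nothing matches, and that the module is named `Regex`; neither is confirmed above.

```swift
import Regex // assumption: the package exposes a module named `Regex`

let digits = try Regex(#"\d+"#)

// Assumption: firstMatch(in:) returns nil when the pattern does not occur.
if let match = digits.firstMatch(in: "Order 66 shipped in 3 days") {
    // Only the first run of digits is reported; the later "3" is ignored.
    print(match.value)
}
```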

Regex is fully thread-safe.

Features

Character Classes

A character class matches any one of a set of characters.

  • [character_group] – matches any single character in character_group, e.g. [ae]
  • [^character_group] – negation, matches any single character that is not in character_group, e.g. [^ae]
  • [first-last] – character range, matches any single character in the given range from first to last, e.g. [a-z]
  • . – wildcard, matches any single character except \n
  • \w – matches any word character (negation: \W)
  • \s – matches any whitespace character (negation: \S)
  • \d – matches any decimal digit (negation: \D)
  • \z – matches the end of the string; this is an anchor rather than a character class, and \Z is a separate anchor, not its negation (see "Anchors")
  • \p{name} – matches characters from the given Unicode category, e.g. \p{P} for punctuation characters (supported categories: P, Lt, Ll, N, S) (negation: \P)

Characters consisting of multiple Unicode scalars (extended grapheme clusters) are interpreted as single characters, e.g. the pattern "🇺🇸+" matches "🇺🇸" and "🇺🇸🇺🇸" but not "🇸🇸". Inside a character group, however, each Unicode scalar is interpreted separately, e.g. the pattern "[🇺🇸]" matches both "🇺🇸" and "🇸🇸", because every scalar in "🇸🇸" also appears in the group.
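The distinction between characters and scalars can be seen with the Swift standard library alone. The sketch below does not use the engine; it only shows why the two flags differ as characters yet share scalars.

```swift
// Flags are written as scalar escapes so the two regional indicators are visible.
let us = "\u{1F1FA}\u{1F1F8}" // 🇺🇸: regional indicators U + S
let ss = "\u{1F1F8}\u{1F1F8}" // 🇸🇸: regional indicator S twice

// Each flag is a single Character (extended grapheme cluster) made of two scalars.
print(us.count, us.unicodeScalars.count) // 1 2
print(ss.count, ss.unicodeScalars.count) // 1 2

// Compared as whole characters, the flags are different.
print(us == ss) // false

// Compared scalar by scalar, every scalar of 🇸🇸 is also found in 🇺🇸.
let groupScalars = Set(us.unicodeScalars)
let sharesScalars = ss.unicodeScalars.allSatisfy { groupScalars.contains($0) }
print(sharesScalars) // true
```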

Character Escapes

The backslash (\) either indicates that the character that follows is a special character or that the keyword should be interpreted literally.

  • \keyword – interprets the keyword literally, e.g. \{ matches a literal opening curly brace
  • \special_character – interprets the special character, e.g. \b matches word boundary (more info in "Anchors")
  • \u{nnnn} – matches a UTF-16 code unit, e.g. \u{0020} matches a space (Swift-specific feature)

Anchors

Anchors specify a position in the string where a match must occur.

  • ^ – matches the beginning of the string (or beginning of the line when .multiline option is enabled)
  • $ – matches the end of the string or \n at the end of the string (end of the line in .multiline mode)
  • \A – matches the beginning of the string (ignores .multiline option)
  • \Z – matches the end of the string or \n at the end of the string (ignores .multiline option)
  • \z – matches the end of the string (ignores .multiline option)
  • \G – match must occur at the point where the previous match ended
  • \b – match must occur on a boundary between a word character and a non-word character (negation: \B)
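A brief sketch of two of the anchors listed above, using only the initializer and isMatch(_:). The module name `Regex` is an assumption, and the results in the comments follow from the documented semantics rather than from a run against the engine.

```swift
import Regex // assumption: the package exposes a module named `Regex`

// \b requires a word boundary on both sides of "cat".
let wholeWord = try Regex(#"\bcat\b"#)
let standalone = wholeWord.isMatch("a cat sat")   // "cat" stands alone as a word
let embedded = wholeWord.isMatch("concatenate")   // "cat" sits inside a longer word

// ^ and $ tie the match to the start and end of the input (default, non-multiline mode).
let onlyDigits = try Regex(#"^\d+$"#)
let pureNumber = onlyDigits.isMatch("2024")       // the whole input is digits
let mixedText = onlyDigits.isMatch("year 2024")   // leading text breaks the ^ anchor

print(standalone, embedded, pureNumber, mixedText)
```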

Grouping Constructs

Grouping constructs delineate the subexpressions of a regular expression and capture the substrings of an input string.

  • (subexpression) – captures a subexpression in a group
  • (?:subexpression) – non-capturing group

Backreferences

Backreferences provide a convenient way to identify a repeated character or substring within a string.

  • \number – matches the content captured by the group at the given ordinal position, e.g. \4 matches the content of the fourth group

If the referenced group can't be found in the pattern, an error is thrown.
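As an illustration, the sketch below uses a backreference to find a doubled character and then references a group that does not exist. The module name `Regex` is an assumption, and the outcomes in the comments are what the description above implies.

```swift
import Regex // assumption: the package exposes a module named `Regex`

// (\w) captures one word character; \1 requires the same character again.
let doubled = try Regex(#"(\w)\1"#)
let hasDouble = doubled.isMatch("hello") // "ll" repeats the captured character
let noDouble = doubled.isMatch("world")  // no character is immediately repeated

// The pattern defines one group, so \2 refers to nothing and compilation should fail.
let missingGroup = try? Regex(#"(a)\2"#)

print(hasDouble, noDouble, missingGroup == nil)
```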

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

  • * – match zero or more times
  • + – match one or more times
  • ? – match zero or one time
  • {n} – match exactly n times
  • {n,} – match at least n times
  • {n,m} – match from n to m times, closed range, e.g. a{3,4}

All quantifiers are greedy by default: they try to match as many occurrences of the pattern as possible. Append the ? character to a quantifier to make it lazy, so that it matches as few occurrences as possible, e.g. a+?.

Warning: lazy quantifiers can be used to control which groups and matches are captured, but they should not be used to optimize matcher performance. The matcher already uses an algorithm that can handle even nested greedy quantifiers.
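The difference between greedy and lazy matching is easiest to see on the same input. This is a sketch using matches(in:) and match.value as shown in "Usage"; the module name `Regex` is an assumption, and the expected results follow from the semantics described above.

```swift
import Regex // assumption: the package exposes a module named `Regex`

let input = "<a><b>"

// Greedy: .+ consumes as much as possible, so one match spans both tags.
let greedy = try Regex("<.+>").matches(in: input).map { $0.value }

// Lazy: .+? stops at the first ">", so each tag is matched on its own.
let lazy = try Regex("<.+?>").matches(in: input).map { $0.value }

print(greedy) // expected: one match, "<a><b>"
print(lazy)   // expected: two matches, "<a>" and "<b>"
```

Here the lazy form changes what is captured, which is the intended use; it is not chosen for speed.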

Alternation

  • | – match either left side or right side

Options

Regex can be initialized with a set of options (Regex.Options).

  • .caseInsensitive – match letters in the pattern independent of case.
  • .multiline – controls the behavior of the ^ and $ anchors. By default, these match at the start and end of the input text. If this option is set, they match at the start and end of each line within the input text.
  • .dotMatchesLineSeparators – allow . to match any character, including line separators.

Unsupported Features

  • Most Unicode categories, e.g. \p{Sc} (currency symbols)
  • Character class subtraction, e.g. [a-z-[b-f]]
  • Named blocks, e.g. \p{IsGreek}

Grammar

See Grammar.ebnf for a formal description of the language using EBNF notation. See Grammar.xhtml for a visualization (railroad diagram) of the grammar generated thanks to https://www.bottlecaps.de/rr/ui.

License

Regex is available under the MIT license. See the LICENSE file for more info.

