vlcek / pyre2

Python wrapper for RE2

pyre2

Summary

pyre2 is a Python extension that wraps Google's RE2 regular expression library. The RE2 engine compiles (strictly) regular expressions to deterministic finite automata, which guarantees linear-time behavior.

It is intended as a drop-in replacement for re. Unicode is supported by encoding to UTF-8, and bytes strings are treated as UTF-8 when the UNICODE flag is given. For best performance, work with UTF-8 encoded bytes strings.

Backwards Compatibility

The stated goal of this module is to be a drop-in replacement for re, i.e.:

try:
    import re2 as re
except ImportError:
    import re

That being said, there are features of the ``re`` module that this module may never have; these will be handled through fallback to the original ``re`` module:

- lookahead assertions ``(?!...)``
- backreferences (``\n`` in search pattern)
- \W and \S not supported inside character classes

On the other hand, unicode character classes are supported (e.g., ``\p{Greek}``).
Syntax reference: https://github.com/google/re2/wiki/Syntax

However, there are times when you may want to be notified of a failover. The function ``set_fallback_notification`` determines the behavior in these cases::

    try:
        import re2 as re
    except ImportError:
        import re
    else:
        re.set_fallback_notification(re.FALLBACK_WARNING)

``set_fallback_notification`` takes three values: ``re.FALLBACK_QUIETLY`` (default), ``re.FALLBACK_WARNING`` (raise a warning), and ``re.FALLBACK_EXCEPTION`` (raise an exception).

Installation
============

Prerequisites:

* The `re2 library from Google <https://github.com/google/re2>`_
* The Python development headers (e.g. ``sudo apt-get install python-dev``)
* A build environment with ``gcc`` or ``clang`` (e.g. ``sudo apt-get install build-essential``)
* Cython 0.20+ (``pip install cython``)

After the prerequisites are installed, install as follows (``pip3`` for python3)::

    $ pip install https://github.com/andreasvc/pyre2/archive/master.zip

For development, get the source::

    $ git clone git://github.com/andreasvc/pyre2.git
    $ cd pyre2
    $ make install

(or ``make install3`` for Python 3)

Documentation
=============

Consult the docstring in the source code or interactively through ipython or ``pydoc re2`` etc.

Unicode Support
===============

Python ``bytes`` and ``unicode`` strings are fully supported, but note that ``RE2`` works with UTF-8 encoded strings under the hood, which means that ``unicode`` strings need to be encoded and decoded back and forth. There are two important factors:

* whether a ``unicode`` pattern and search string is used (will be encoded to UTF-8 internally)
* the ``UNICODE`` flag: whether operators such as ``\w`` recognize Unicode characters.
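As a rough illustration of the encoding round trip described above, the following sketch uses only the Python standard library (it makes no ``re2`` calls) to show how offsets in a UTF-8 encoded bytes string differ from offsets in the corresponding ``unicode`` string. The helper ``byte_span_to_char_span`` is a hypothetical name introduced for this example; it is not part of pyre2.

```python
# -*- coding: utf-8 -*-
# Sketch only: standard library, no re2 involved.


def byte_span_to_char_span(data, start, end):
    """Convert a (start, end) span of byte offsets in UTF-8 encoded `data`
    into character offsets in the decoded string.

    Assumes both offsets fall on character boundaries; otherwise the
    decode step raises UnicodeDecodeError.
    """
    return (len(data[:start].decode('utf8')),
            len(data[:end].decode('utf8')))


text = u'Mötley Crüe'
data = text.encode('utf8')

# Each of the two non-ASCII letters takes two bytes in UTF-8,
# so the encoded form is two bytes longer than the character count.
char_count = len(text)
byte_count = len(data)

# Locate 'ü' in the encoded data and map its byte span back to characters.
needle = u'ü'.encode('utf8')
byte_start = data.find(needle)
byte_span = (byte_start, byte_start + len(needle))
char_span = byte_span_to_char_span(data, *byte_span)
```

Decoding a prefix for every offset is linear in the prefix length, so this is meant to show the bookkeeping involved, not to be an efficient conversion.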
To avoid the overhead of encoding and decoding to UTF-8, it is possible to pass UTF-8 encoded bytes strings directly but still treat them as ``unicode``::

    In [18]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
    Out[18]: ['M', '\xc3\xb6', 't', 'l', 'e', 'y', 'C', 'r', '\xc3\xbc', 'e']

    In [19]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'))
    Out[19]: ['M', 't', 'l', 'e', 'y', 'C', 'r', 'e']

However, note that the indices in ``Match`` objects will refer to the bytes string. The indices of the match in the ``unicode`` string could be computed by decoding/encoding, but this is done automatically and more efficiently if you pass the ``unicode`` string::

    >>> re2.search(u'ü'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
    <re2.Match object; span=(10, 12), match='\xc3\xbc'>
    >>> re2.search(u'ü', u'Mötley Crüe', flags=re2.UNICODE)
    <re2.Match object; span=(9, 10), match=u'\xfc'>

Finally, if you want to match bytes without regard for Unicode characters, pass bytes strings and leave out the ``UNICODE`` flag (this will cause Latin 1 encoding to be used with ``RE2`` under the hood)::

    >>> re2.findall(br'.', b'\x80\x81\x82')
    ['\x80', '\x81', '\x82']

Performance
===========

Performance is of course the point of this module, so it had better perform well. Regular expressions vary widely in complexity, and the salient feature of ``RE2`` is that it behaves well asymptotically. This being said, for very simple substitutions, I've found that occasionally Python's regular ``re`` module is actually slightly faster. However, when the ``re`` module gets slow, it gets *really* slow, while this module buzzes along.

In the below example, I'm running the tests against 8 MB of text from the colossal Wikipedia XML file. I'm running them multiple times, being careful to use the ``timeit`` module. To see more details, please see the `performance script <http://github.com/axiak/pyre2/tree/master/tests/performance.py>`_.
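A minimal sketch of this kind of measurement is given here. It is not the project's performance script: the sample text is a small made-up string, not the Wikipedia data, the helper names ``time_findall`` and ``compare`` are hypothetical, and ``re2`` is only timed if it happens to be importable. It assumes nothing about ``re2`` beyond ``compile`` and ``findall``, which a drop-in replacement for ``re`` would need to provide.

```python
import re
import timeit

try:
    import re2  # optional; only timed when it is installed
except ImportError:
    re2 = None

# URI-or-email pattern, as in the "Findall URI|Email" test.
PATTERN = r'([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)'

# Made-up sample text; the real benchmark uses a much larger corpus.
TEXT = 'see http://example.com/page or mail someone@example.org ' * 200


def time_findall(module, pattern, text, runs=2):
    """Return the total seconds taken by `runs` calls to findall."""
    compiled = module.compile(pattern)  # compile once, outside the timed call
    return timeit.timeit(lambda: compiled.findall(text), number=runs)


def compare(pattern=PATTERN, text=TEXT, runs=2):
    """Time `re`, and `re2` when available, on the same pattern and text."""
    timings = {'re': time_findall(re, pattern, text, runs)}
    if re2 is not None:
        timings['re2'] = time_findall(re2, pattern, text, runs)
    return timings


timings = compare()
for name, seconds in sorted(timings.items()):
    print('%-4s %.4f s' % (name, seconds))
```

On a text this small the difference between engines may be negligible or even reversed; the asymptotic behavior only shows on larger inputs or more demanding patterns.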
+-----------------+---------------------------------------------------------------------------+------------+--------------+---------------+-------------+-----------------+----------------+
|Test             |Description                                                                |# total runs|``re`` time(s)|``re2`` time(s)|% ``re`` time|``regex`` time(s)|% ``regex`` time|
+=================+===========================================================================+============+==============+===============+=============+=================+================+
|Findall URI|Email|Find list of '([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)'|2           |6.262         |0.131          |2.08%        |5.119            |2.55%           |
+-----------------+---------------------------------------------------------------------------+------------+--------------+---------------+-------------+-----------------+----------------+
|Replace WikiLinks|This test replaces links of the form [[Obama|Barack_Obama]] to Obama.      |100         |4.374         |0.815          |18.63%       |1.176            |69.33%          |
+-----------------+---------------------------------------------------------------------------+------------+--------------+---------------+-------------+-----------------+----------------+
|Remove WikiLinks |This test splits the data by the <page> tag.                               |100         |4.153         |0.225          |5.43%        |0.537            |42.01%          |
+-----------------+---------------------------------------------------------------------------+------------+--------------+---------------+-------------+-----------------+----------------+

Feel free to add more speed tests to the bottom of the script and send a pull request my way!

Current Status
==============

The tests show the following differences with Python's ``re`` module:

* The ``$`` operator in Python's ``re`` matches twice if the string ends with ``\n``. This can be simulated using ``\n?$``, except when doing substitutions.
* ``pyre2`` and Python's ``re`` behave differently with nested and empty groups; ``pyre2`` will return an empty string in cases where Python would return None for a group that did not participate in a match.

Please report any further issues with ``pyre2``.
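The two differences listed above can be seen from the standard library side with the short sketch below. It only exercises Python's own ``re`` module; what ``pyre2`` returns for the same calls is as described above and is not verified here. The helper ``groups_as_strings`` is a hypothetical name, shown as one way to write calling code that does not depend on whether a non-participating group comes back as ``None`` or as an empty string.

```python
import re

# 1. In Python's re, `$` matches just before a trailing newline
#    and again at the very end of the string.
dollar_matches = re.findall(r'$', 'a\n')   # two empty matches
dollar_sub = re.sub(r'$', 'X', 'a\n')      # 'X' is inserted at both positions

# 2. A group that did not participate in the match is reported as None.
match = re.match(r'(a)|(b)', 'b')
raw_groups = match.groups()                # (None, 'b')


def groups_as_strings(m):
    """Return all groups, with non-participating groups as ''.

    Uses the standard `default` argument of Match.groups(), so the
    result no longer distinguishes "did not participate" from
    "matched the empty string".
    """
    return m.groups(default='')


portable_groups = groups_as_strings(match)  # ('', 'b')
```

Normalizing to empty strings loses information, so it only suits code that never needs to tell the two cases apart.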
Tests
=====

If you would like to help, one thing that would be very useful is writing comprehensive tests for this. It's actually really easy:

* Come up with regular expression problems using the regular python 're' module.
* Write a session in python traceback format `Example <http://github.com/axiak/pyre2/blob/master/tests/search.txt>`_.
* Replace your ``import re`` with ``import re2 as re``.
* Save it as a .txt file in the tests directory. You can comment on it however you like and indent the code with 4 spaces.
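The steps above can be sketched as follows, assuming that the .txt files are doctest-style interactive sessions; the project's own test runner is not shown here and may work differently. The session text, the file name ``search_example.txt`` and the helper ``run_session`` are made up for illustration, and the session falls back to ``re`` when ``re2`` is not installed so that it can be tried anywhere.

```python
import doctest
import os
import tempfile

# A small session file: free-form comments, with the code indented 4 spaces.
SESSION = r"""A small search session
======================

Digits separated by a dash are captured as two groups.

    >>> try:
    ...     import re2 as re
    ... except ImportError:
    ...     import re
    >>> re.search(r'(\d+)-(\d+)', 'call 555-1234').groups()
    ('555', '1234')
"""


def run_session(text, filename='search_example.txt'):
    """Write `text` to a temporary .txt file and run it with doctest.

    Returns doctest's TestResults(failed, attempted) tuple.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, filename)
        with open(path, 'w') as handle:
            handle.write(text)
        # module_relative=False: treat `path` as a plain filesystem path.
        return doctest.testfile(path, module_relative=False)


results = run_session(SESSION)
print('failed=%d attempted=%d' % (results.failed, results.attempted))
```

Running the same session once with ``re`` and once with ``re2`` is a quick way to find behavior that differs between the two.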

Credits

This code builds on the following projects (in chronological order):

License:BSD 3-Clause "New" or "Revised" License

