Python wrapper for Google's RE2 using Cython

Summary

pyre2 is a Python extension that wraps Google’s RE2 regular expression library.

This version of pyre2 is similar to the one in Facebook’s GitHub repository, except that its stated goal is to be a drop-in replacement for the re module.

Backwards Compatibility

The stated goal of this module is to be a drop-in replacement for re. My hope is that some will be able to go to the top of their module and put:

try:
    import re2 as re
except ImportError:
    import re

That being said, there are features of the re module that this module may never have. For example, RE2 does not handle lookahead assertions ((?=...)). For this reason, the module will automatically fall back to the original re module if there is a regex that it cannot handle.
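As a minimal sketch of the fallback described above (the sample pattern and text are illustrative assumptions, not taken from the project), a lookahead pattern should give the same result whether re2 or the standard re module ends up handling it:

```python
# Use re2 when it is installed; otherwise use the standard library.
try:
    import re2 as re
except ImportError:
    import re

# RE2 has no lookahead support, so per the description above re2 is
# expected to hand this pattern to the standard re module on its own.
words_before_comma = re.findall(r'\w+(?=,)', 'alpha, beta, gamma')
print(words_before_comma)
```

Either way, the calling code does not need to know which engine ran the pattern.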

However, there are times when you may want to be notified of a failover. For this reason, I’m adding the single function set_fallback_notification to the module. Thus, you can write:

try:
    import re2 as re
except ImportError:
    import re
else:
    re.set_fallback_notification(re.FALLBACK_WARNING)

In the above example, set_fallback_notification accepts one of three values: re.FALLBACK_QUIETLY (the default), re.FALLBACK_WARNING (issues a warning), and re.FALLBACK_EXCEPTION (raises an exception).
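One way to act on the warning mode is to capture whatever warnings a compile call emits. The sketch below is an assumption-laden illustration: compile_and_report is a hypothetical helper, it assumes the fallback notice is delivered through Python's standard warnings machinery, and it guards the re2-only call with hasattr so that it also runs when only the standard re module is available.

```python
import warnings

try:
    import re2 as re
except ImportError:
    import re


def compile_and_report(pattern):
    """Hypothetical helper: compile a pattern and collect warning messages."""
    # Only re2 provides set_fallback_notification; plain re does not.
    if hasattr(re, "set_fallback_notification"):
        re.set_fallback_notification(re.FALLBACK_WARNING)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide anything
        compiled = re.compile(pattern)
    return compiled, [str(w.message) for w in caught]


# A lookahead is one construct that RE2 cannot handle.
compiled, messages = compile_and_report(r'foo(?=bar)')
for message in messages:
    print("fallback notice:", message)
```

With only the standard re module installed, the list of messages is simply empty.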

Note: The re2 module treats byte strings as UTF-8. This is fully backwards compatible with 7-bit ASCII. However, bytes containing values larger than 0x7f are going to be treated very differently in re2 than in re. The RE2 library quietly ignores invalid UTF-8 in input strings, and throws an exception on invalid UTF-8 in patterns. For example:

>>> re.findall(r'.', '\x80\x81\x82')
['\x80', '\x81', '\x82']
>>> re2.findall(r'.', '\x80\x81\x82')
[]

If you require the use of regular expressions over an arbitrary stream of bytes, then this library might not be for you.
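If you are unsure whether your input qualifies, you can check it before handing it to re2. The following is a rough sketch in Python 3 syntax; is_valid_utf8 is a hypothetical helper name, not part of re2:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Plain ASCII is always valid UTF-8.
print(is_valid_utf8(b"hello"))
# Bare continuation bytes, as in the example above, are not.
print(is_valid_utf8(b"\x80\x81\x82"))
```

Input that fails this check is the kind that re and re2 are likely to treat differently.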

Installation

To install, you must first install the prerequisites:

  • The re2 library from Google

  • The Python development headers (e.g. sudo apt-get install python-dev)

  • A build environment with g++ (e.g. sudo apt-get install build-essential)

After the prerequisites are installed, you can install using easy_install (if you have setuptools installed) or pip:

$ sudo easy_install re2

If you don’t want to use setuptools, you can alternatively download the tarball from PyPI.

As another alternative, you can clone this repository and install from there. To do this, run:

$ git clone git://github.com/axiak/pyre2.git
$ cd pyre2
$ sudo python setup.py install

If you want to make changes to the bindings, you must have Cython >=0.13.

Unicode Support

One current issue is Unicode support. RE2 operates on UTF-8 encoded text, which is distinct from Python’s unicode objects. Right now the module automatically encodes any unicode string into UTF-8 for you, which is slow (it also has to decode UTF-8 strings back into unicode objects on every substitution or split). Therefore, you are better off working with UTF-8 byte strings while using RE2 and decoding them only once all the regex work is finished.
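A minimal sketch of that workflow, with made-up sample text: encode once up front, do all the regex work on UTF-8 byte strings, and convert back to text a single time at the end.

```python
try:
    import re2 as re
except ImportError:
    import re

text = u"caf\u00e9 na\u00efve caf\u00e9"

# Encode once, before any regex work.
data = text.encode("utf-8")

# Both the pattern and the replacement are UTF-8 byte strings.
pattern = re.compile(u"caf\u00e9".encode("utf-8"))
replaced = pattern.sub(b"tea", data)

# Convert back to text once, after all substitutions are done.
result = replaced.decode("utf-8")
print(result)
```

This keeps the per-call conversion cost out of the loop when many substitutions or splits run over the same data.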

Performance

Performance is of course the point of this module, so it had better perform well. Regular expressions vary widely in complexity, and the salient feature of RE2 is that it behaves well asymptotically. That said, for very simple substitutions, I’ve found that Python’s regular re module is occasionally slightly faster. However, when the re module gets slow, it gets really slow, while this module buzzes along.

In the example below, I’m running the tests against 8 MB of text from the colossal Wikipedia XML file. Each test is run multiple times and timed with the timeit module. For more details, please see the performance script.

Test descriptions:

  • Findall URI|Email: Find list of ‘([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)’

  • Replace WikiLinks: This test replaces links of the form [[Obama|Barack_Obama]] to Obama.

  • Remove WikiLinks: This test splits the data by the <page> tag.

Test               # total runs  re time(s)  re2 time(s)  % re time  regex time(s)  % regex time
Findall URI|Email  2             19.961      0.336        1.68%      11.463         2.93%
Replace WikiLinks  100           16.032      2.622        16.35%     2.895          90.54%
Remove WikiLinks   100           15.983      1.406        8.80%      2.252          62.43%

Feel free to add more speed tests to the bottom of the script and send a pull request my way!
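For a rough idea of what a timing comparison involves, here is a standalone sketch built on timeit. It does not reproduce the structure of the actual performance script; time_findall and the sample text are illustrative assumptions, and only the pattern comes from the Findall URI|Email test.

```python
import re
import timeit

# Pattern from the Findall URI|Email test.
PATTERN = r'([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)'


def time_findall(module, text, runs=2):
    """Time `runs` calls to findall using the given regex module."""
    compiled = module.compile(PATTERN)
    return timeit.timeit(lambda: compiled.findall(text), number=runs)


# Small synthetic input; the real tests use 8 MB of Wikipedia XML.
sample = "see http://example.com/page or mail user@example.com " * 100

re_seconds = time_findall(re, sample)
print("re: %.4f s" % re_seconds)

# Time re2 the same way when it is installed.
try:
    import re2
except ImportError:
    re2 = None
if re2 is not None:
    print("re2: %.4f s" % time_findall(re2, sample))
```

Timings on input this small say little about asymptotic behavior, so treat the numbers as a smoke test rather than a benchmark.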

Current Status

pyre2 has only received basic testing. Please use it and let me know if you run into any issues!

Contact

You can file bug reports on GitHub, or contact the author, Mike Axiak, through his contact page.

Tests

If you would like to help, one thing that would be very useful is writing comprehensive tests for this. It’s actually really easy:

  • Come up with regular expression problems using the regular Python re module.

  • Write a session in Python traceback format (see the linked example).

  • Replace your import re with import re2 as re.

  • Save it as a .txt file in the tests directory. You can comment on it however you like and indent the code with 4 spaces.
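The steps above can be tried out locally before sending a file in. The sketch below assumes the session files are doctest-compatible, which the steps suggest but do not state outright; the file name split.txt and the sample session are made up for illustration.

```python
import doctest
import os
import tempfile

# A sample session: free-form commentary, then code indented by 4 spaces.
SESSION = r"""
Splitting on commas

    >>> try:
    ...     import re2 as re
    ... except ImportError:
    ...     import re
    >>> re.split(r',\s*', 'a, b,c')
    ['a', 'b', 'c']
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "split.txt")
    with open(path, "w") as handle:
        handle.write(SESSION)
    # Run every example in the file and compare against the recorded output.
    results = doctest.testfile(path, module_relative=False)

print("failed:", results.failed, "attempted:", results.attempted)
```

The try/except import in the session is only there so the sketch runs without re2 installed; a file meant for the tests directory would use import re2 as re directly, as the steps describe.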

Missing Features

Currently the features missing are:

  • If you use substitution methods without a callback, a maxsplit argument other than 0 or 1 is not supported.

Credits

Though I ripped out the code, I’d like to thank David Reiss and Facebook for the initial inspiration. Plus, I got to gut this readme file!

Moreover, this library would of course not be possible if not for the immense work of the team at RE2 and the few people who work on Cython.
