Inconsistency with is_match and Python's search in Matching Specific Regex Patterns #1193

Closed
CodyPubNub opened this issue May 9, 2024 · 6 comments
@CodyPubNub

What version of regex are you using?

I am using Rust 1.78.0 with the regex crate version 1.10.4.

Describe the bug at a high level.

There is a discrepancy in regex pattern matching between Python's re module and Rust's regex crate. The same regex pattern, when tested in Python, matches all intended strings. However, in Rust, the pattern fails to match these strings.

What are the steps to reproduce the behavior?

Here is a complete Rust program that reproduces the behavior:

use regex::Regex;

fn main() {
    let pattern = "(?:private|group)[_[\\w\\d]*]?_abc1d2345678ef90ab3c4567890defab[_[\\w\\d]*]?";
    let compiled = Regex::new(pattern).unwrap();

    let test_haystacks = vec![
        "private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab",
        "private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab___[[[aaa111]",
        "private[_0f4f790_abc1d2345678ef90ab3c4567890defab",
    ];

    for test_haystack in &test_haystacks {
        match compiled.is_match(test_haystack) {
            true => println!("PASS: {}", test_haystack),
            false => eprintln!("FAIL: {}", test_haystack),
        }
    }
}

What is the actual behavior?

The actual output of the Rust program indicates failures where the regex pattern does not match the test strings. For example:

FAIL: private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab
FAIL: private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab___[[[aaa111]
FAIL: private[_0f4f790_abc1d2345678ef90ab3c4567890defab

What is the expected behavior?

I expect the Rust program's output to match the behavior observed in Python, where all provided test strings successfully match the regex pattern.

Additional Context

Below is the corresponding Python code that behaves as expected with the same regex pattern:

import re
compiled = re.compile('(?:private|group)[_[\\w\\d]*]?_abc1d2345678ef90ab3c4567890defab[_[\\w\\d]*]?')
assert compiled.search('private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab')
assert compiled.search('private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab___[[[aaa111]')
assert compiled.search('private[_0f4f790_abc1d2345678ef90ab3c4567890defab')

This can also be seen working as expected on https://regexr.com/

@stephenlb

Good find! Even linking to the RegEx standard showing that it works using the documented reference 👍

@stephenlb

Hope we can get this fixed soon 🔜🤞

@BurntSushi
Member

Your regex is kinda messed up. Specifically, this part (which is repeated):

[_[\w\d]*]?

Python regexes don't support nested character classes, unlike the regex crate. And because Python's regex engine follows the tradition of context-dependent escaping rules, metacharacters like ] are treated literally when used in a context in which they cannot possibly have any special significance. But, as can be seen in this case, that makes the regex quite deceptive. Here's a better way to write the same part of the pattern:

[_\[\w\d]*\]?

And indeed, using that with the regex crate produces the desired result:

use regex::Regex;

fn main() {
    let pattern = r"(?:private|group)[_\[\w\d]*\]?_abc1d2345678ef90ab3c4567890defab[_\[\w\d]*\]?";
    let compiled = Regex::new(pattern).unwrap();

    let test_haystacks = vec![
        "private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab",
        "private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab___[[[aaa111]",
        "private[_0f4f790_abc1d2345678ef90ab3c4567890defab",
    ];

    for test_haystack in &test_haystacks {
        match compiled.is_match(test_haystack) {
            true => println!("PASS: {}", test_haystack),
            false => eprintln!("FAIL: {}", test_haystack),
        }
    }
}

(I also switched to using raw strings via r"..." so that you don't need to do double escaping.)
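The two readings of the repeated fragment can be compared from the Python side alone. The sketch below is only an illustration: `nested_reading` is an assumed emulation of how a nested-class parser would treat `[_[\w\d]*]?` (one class containing `_`, `\w`, `\d`, and a literal `*`, made optional), and the variable names are hypothetical.

```python
import re

SUFFIX = "_abc1d2345678ef90ab3c4567890defab"

# The pattern as reported. Python reads "[_[\w\d]" as one class in which the
# inner "[" is a literal member, then "*", then an optional literal "]".
python_reading = re.compile(
    r"(?:private|group)[_[\w\d]*]?" + SUFFIX + r"[_[\w\d]*]?"
)

# Assumed emulation of the nested-class reading: "[_[\w\d]*]" becomes a single
# class with members "_", \w, \d and a literal "*", followed by "?".
# It can therefore consume at most one character.
nested_reading = re.compile(
    r"(?:private|group)[_\w\d*]?" + SUFFIX + r"[_\w\d*]?"
)

haystacks = [
    "private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab",
    "private_x9z45678abc12345d6e7890f123ghijk_abc1d2345678ef90ab3c4567890defab___[[[aaa111]",
    "private[_0f4f790_abc1d2345678ef90ab3c4567890defab",
]

for haystack in haystacks:
    # First column: Python's reading. Second column: emulated nested reading.
    print(bool(python_reading.search(haystack)), bool(nested_reading.search(haystack)))
```

Under the nested reading the fragment cannot bridge the long middle segment of each haystack, which is consistent with the three FAIL lines reported above.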

@BurntSushi closed this as not planned on May 9, 2024
@BurntSushi
Member

> Good find! Even linking to the RegEx standard showing that it works using the documented reference 👍

This isn't a bug and there is no requirement that this crate matches Python's regex engine in all cases. There's also no regex standard at play here (governing either Python's or Rust's regex engine).

@CodyPubNub
Author

Hi @BurntSushi 👋

I don't consider this issue invalid.

I'm not in a position to change the un-compiled regular expressions as they are provided by end users, and if they're compilable, which they are, they are expected to be searchable.

Do you have any particular guidance toward a solution for compatibility?

@BurntSushi
Member

I don't know what you mean by your assertion that they are "compatible."

There is literally an unbounded number of ways in which Python regexes differ from Rust regexes. And this generally applies to all pairs of regex engines unless they very strictly follow a standard. (Of which, generally speaking, only two are prevalent: POSIX and ECMA. Neither Python's regex engine nor Rust's regex engine follows either one.)

> I don't consider this issue invalid.

I want to be clear here that this issue is definitively invalid within the scope of this project. That doesn't mean you don't have a problem. You might have a problem on your end where you have a pile of regexes that worked with one regex engine and need to use them, unchanged, with some other regex engine. But that isn't really a problem I can help with and is in general not a problem that can be easily solved for any two regex engines. (Unless your patterns happen to incidentally behave the same, or as I mentioned above, the regex engines strictly adhere to an existing standard.)

> Do you have any particular guidance toward a solution for compatibility?

Well... of course not. Because I don't really know the structure of the problem you're trying to solve. All that's been presented to me here is a regex that works one way in Python and a seeming request to have it work the same way in Rust. But that will definitively not happen. As far as solving your problem in a different way, I don't know because I don't know what problem you're trying to solve. If, for example, these regexes are provided by end users and you've promised that the regex syntax is equivalent to whatever Python supports, then you need to use a regex engine with the goal of compatibility with Python's regex engine. (Of which, I believe only one exists: the re module in Python's standard library. The third-party regex Python package on PyPI might also have enough compatibility to work for you.)
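For readers with a pile of user-supplied patterns, the specific construct in this thread can at least be flagged before a pattern is handed to a different engine. The sketch below is a narrow heuristic and not a compatibility layer: `has_unescaped_bracket_in_class` is a hypothetical helper that reports an unescaped `[` inside a character class, which is the spot where a nested-class parser and Python's `re` disagree here. It knows nothing about the many other differences between engines.

```python
def has_unescaped_bracket_in_class(pattern: str) -> bool:
    """Return True if an unescaped '[' appears inside a character class.

    Heuristic only: it tracks backslash escapes and class boundaries,
    and ignores every other engine-specific construct.
    """
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_class:
            if ch == "[":
                return True  # literal in Python's re, nested class elsewhere
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # In Python's re, a "]" directly after "[" or "[^" is a literal
            # member of the class, so step over it instead of closing.
            j = i + 1
            if j < len(pattern) and pattern[j] == "^":
                j += 1
            if j < len(pattern) and pattern[j] == "]":
                i = j
        i += 1
    return False


# The reported fragment is flagged; the explicitly escaped rewrite is not.
print(has_unescaped_bracket_in_class(r"[_[\w\d]*]?"))
print(has_unescaped_bracket_in_class(r"[_\[\w\d]*\]?"))
```

A flagged pattern is only a candidate for review. Whether to reject it, rewrite it, or evaluate it with Python's own engine depends on what was promised to the end users.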
