feature request: extend unicode support with full case folding #1260

Closed
ReinierMaas opened this issue Apr 8, 2025 · 2 comments

@ReinierMaas

XRef: #428, https://docs.rs/regex/1.11.1/regex/index.html#unicode

According to the documentation:

Case insensitive searching is Unicode-aware and uses simple case folding.

I would like to request full case folding support.

Why do I need full case folding?

We currently use libicu, which supports full case folding. We tested regex on our datasets, and full case folding turns out to be a requirement for migrating from libicu to regex.

Do you need help (to implement this feature)?

I understand if you don't want to enable this by default; a feature flag would be a workable solution from our point of view. We can work on the feature if the regex team is open to having it.

We haven't made any progress toward implementing this feature yet, i.e. there is no patch lying around.

Example

Regex (?i)sss with full case folding enabled would match:

  • sss
  • ßs
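One way to see why simple case folding cannot produce the second match: simple folding maps each code point to exactly one code point, so it never changes the number of characters, while `ßs` has two characters and `sss` has three. The sketch below illustrates this with a toy one-to-one fold. `toy_simple_fold` and `equal_under_simple_fold` are hypothetical helpers written for this illustration; they cover only ASCII letters and U+1E9E and are not the regex crate's implementation.

```rust
/// Toy one-to-one fold (an assumption for illustration, not the regex
/// crate's tables): every input char maps to exactly one output char.
fn toy_simple_fold(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            // Simple folding maps capital sharp s (U+1E9E) to 'ß'.
            '\u{1E9E}' => 'ß',
            // 'ß' has no simple folding, so it falls through unchanged.
            _ => c.to_ascii_lowercase(),
        })
        .collect()
}

/// Compares two strings after the toy one-to-one fold.
fn equal_under_simple_fold(a: &str, b: &str) -> bool {
    toy_simple_fold(a) == toy_simple_fold(b)
}
```

With these helpers, `equal_under_simple_fold("SSS", "sss")` holds, but `equal_under_simple_fold("ßs", "sss")` does not, because the folded strings still differ in length.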

Feel free to inquire for additional information if something is missing from the request.

@BurntSushi
Member

This ain't happening, sorry. I'm surprised that you're offering to work on this, because this is likely something that is completely impractical to implement in a finite automaton regex engine (like this crate is).

Indeed, it's so impractical that the part of UTS#18 specifying full case folding was retracted some time ago. See https://unicode.org/reports/tr18/#Default_Loose_Matches and https://unicode.org/reports/tr18/#Canonical_Equivalents

UTS#18 offers an out though. Instead, what you should do is normalize both your pattern and your haystack using full default case folding (there are crates that do this).
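A minimal sketch of that normalization approach, using only the standard library. `toy_full_fold` is a hypothetical stand-in for a real full case folding implementation: it handles only ASCII letters and the two sharp-s code points, whereas a real one would cover the whole Unicode `CaseFolding.txt` table. A literal substring search stands in for the regex match here, so the pattern is assumed to be plain text.

```rust
/// Toy stand-in for Unicode full case folding (an assumption for
/// illustration). Unlike simple folding, one char may expand to several.
fn toy_full_fold(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Full folding maps both sharp-s code points to "ss".
            'ß' | '\u{1E9E}' => out.push_str("ss"),
            // Everything else: ASCII lowercase only in this sketch.
            _ => out.push(c.to_ascii_lowercase()),
        }
    }
    out
}

/// Folds both the needle and the haystack, then searches the folded
/// haystack for the folded needle as a literal substring.
fn folded_contains(haystack: &str, needle: &str) -> bool {
    let folded_haystack = toy_full_fold(haystack);
    let folded_needle = toy_full_fold(needle);
    folded_haystack.contains(folded_needle.as_str())
}
```

With this sketch, `folded_contains("ßs", "sss")` is true. Two caveats apply: match offsets refer to the folded haystack rather than the original text, and folding a pattern that contains regex syntax (for example `\S`) can change its meaning, so only the literal parts of a pattern should be folded.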

BurntSushi closed this as not planned on Apr 8, 2025
@BurntSushi
Member

BurntSushi commented Apr 8, 2025

If I'm wrong about its difficulty and you do end up implementing this in a fork, please feel free to re-open this and we can explore what it would take to upstream it.
