Word characters for search index should NOT include underscore #11253

Open
rogerbinns opened this issue Mar 19, 2023 · 1 comment

@rogerbinns

Describe the bug

When the search index is built, the text is split into words using the regex \w+ (line 85). \w matches letters and digits, but it also matches underscores, which it should not do when splitting text into searchable words.

The consequence is that if your text contains word1_word2_word3, a search for word2 or word3 will not find it. Underscore-separated words are common in Python and elsewhere, and the JavaScript tokenizer does treat underscore as a separator.

I ran into this with SQLITE_CONFIG_URI, which appears in my docs: searches for uri do not find it.
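
The behaviour can be shown with Python's re module alone. This is a minimal sketch, not Sphinx's actual indexing code: the variable names and the sample sentence are made up for illustration, and only the \w+ pattern comes from the report above.

```python
import re

# The pattern described in this issue; the name word_re is illustrative.
word_re = re.compile(r"\w+")

# Hypothetical sample text containing underscore-joined identifiers.
text = "Set SQLITE_CONFIG_URI before opening word1_word2_word3"

tokens = word_re.findall(text)
print(tokens)

# A simplified stand-in for an index: the set of lowercased tokens.
indexed = {token.lower() for token in tokens}
print("uri" in indexed)                # False: "uri" was never a separate token
print("sqlite_config_uri" in indexed)  # True: only the whole identifier is indexed
```

Because \w matches the underscore, each identifier is kept as a single token, so its parts never reach the index as separate words.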

How to Reproduce

You can see this in Sphinx's own documentation. Search for apply_source_workaround and it is found. Now search for workaround: apply_source_workaround is not found at all.

Environment Information

Platform:              linux; (Linux-5.19.0-31-generic-x86_64-with-glibc2.36)
Python version:        3.10.7 (main, Nov 24 2022, 19:45:47) [GCC 12.2.0])
Python implementation: CPython
Sphinx version:        6.1.3
Docutils version:      0.19
Jinja2 version:        3.1.2
Pygments version:      2.14.0

Sphinx extensions

No response

Additional context

No response

@rogerbinns
Author

As a quick local hack, I split each match returned by the regex on underscores, and that fixed the issue for me.
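
A rough sketch of what such a hack might look like is below. It is not the actual patch: split_words and split_words_alt are hypothetical helper names, and how the result would be wired into the index builder is left out.

```python
import re

# The pattern described in this issue; the name word_re is illustrative.
word_re = re.compile(r"\w+")

def split_words(text):
    # Tokenize with the existing pattern, then split each token on underscores.
    words = []
    for token in word_re.findall(text):
        # Skip empty pieces produced by leading, trailing, or doubled underscores.
        words.extend(part for part in token.split("_") if part)
    return words

# Alternative with the same effect: exclude the underscore in the pattern itself.
# [^\W_] matches any word character except the underscore.
no_underscore_re = re.compile(r"[^\W_]+")

def split_words_alt(text):
    return no_underscore_re.findall(text)

print(split_words("apply_source_workaround and __init__"))
print(split_words_alt("apply_source_workaround and __init__"))
```

Either variant indexes only the parts, so whether a search for the full apply_source_workaround still matches would depend on the query being tokenized the same way. That trade-off is an assumption worth checking before adopting either approach.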
