Description
Describe the bug
Hi, snowballstem upstream here.
The recent issue with our 3.0.0 release caused me to notice that you're using our "porter" stemmer, which is really still provided only for academic interest. It aims to be a faithful implementation of Martin Porter's English stemmer as described in his 1980 paper, and may be useful to people trying to reproduce past results which used it. This means that the implementation of "porter" is effectively frozen (we'd only fix deviations from the original paper). In the 45 years since the paper, numerous shortcomings in the algorithm it describes have come to light, and Martin himself has since devised an improved version of the stemmer, which he nicknamed "porter2". You can find the latest version of this as our "english" stemmer.
Looking at https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/en.py I see there's also a "JS porter" implementation, which looks hand-written. My initial thought was "oh, they're using the older and less good porter stemmer because they need to be compatible with that", but looking at the "JS porter" Javascript code, it isn't actually an implementation of Porter's 1980 algorithm. For example, line 48 has logi: 'log', which is one of the additional rules added in "porter2". My best guess is it's a hand-written implementation of an early version of "porter2".
If I follow how this is being used, you index with the Python "porter" stemmer and search with this Javascript "JS porter" stemmer. If that's correct, searches for some words will fail to match the same word in documentation (e.g. a search for "tautology" won't match "tautology" in the documentation because "porter" will stem it to "tautologi" while "JS porter" will stem it to "tautolog").
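To illustrate why this matters, here is a minimal sketch of a stem-keyed index that is built with one stemmer and queried with another. The two stemmers are hypothetical stand-ins that only encode the stems quoted above, and the index layout is an assumption for illustration, not how Sphinx actually stores its search index.

```python
from collections import defaultdict

# Hypothetical stand-ins: only the stems quoted in this report are encoded.
# No real stemmer is called here.
INDEX_STEMS = {"tautology": "tautologi"}  # what "porter" produces
QUERY_STEMS = {"tautology": "tautolog"}   # what "JS porter" produces


def index_stem(word):
    return INDEX_STEMS.get(word, word)


def query_stem(word):
    return QUERY_STEMS.get(word, word)


def build_index(docs, stem):
    """Map each stem to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(name)
    return index


def search(index, query, stem):
    """Look up the stemmed query term in the index."""
    return index.get(stem(query.lower()), set())


docs = {"logic.rst": "a tautology is always true"}
index = build_index(docs, index_stem)

same = search(index, "tautology", index_stem)   # same stemmer on both sides
mixed = search(index, "tautology", query_stem)  # mismatched stemmers
print(same, mixed)
```

With the same stemmer on both sides the document is found; with the mismatched pair the lookup comes back empty even though the query word appears verbatim in the document.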
I extracted this "JS porter" code to actually verify this, and ran it against Snowball's test suite, which reveals more problems. For example, its undoubling rule is buggy and it stems "wrapped" to "wrapp", while both "porter" and "english" stem it to "wrap", so it's a buggy implementation compared to either. That means a query for "wrapped" will fail to match "wrapped" in a document.
In total 1148 words from our English test vocabulary of 42603 words are stemmed differently by "porter" and "JS porter". That's 2.7%, so roughly 1 word in 40 will not match as it should (counting each word equally rather than trying to weight by frequency).
However, 5324 out of 42621 words (about 12.5%) are stemmed differently by "english" (from Snowball 3.0.0) and "JS porter", so that's worse (at least if we assume all words are equally important).
(The 42621 vs 42603 word list size difference is just because the word list for "english" has had a few extra words added over that for "porter" to provide better test coverage for some rule changes.)
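For reference, the rates follow directly from the counts above. This is just the arithmetic, with every word weighted equally; the function name is made up for illustration.

```python
def mismatch_rate(differing, total):
    """Percentage of vocabulary words whose stems differ."""
    return 100.0 * differing / total


porter_vs_js = mismatch_rate(1148, 42603)   # "porter" vs "JS porter"
english_vs_js = mismatch_rate(5324, 42621)  # "english" vs "JS porter"

print(f"porter vs JS porter:  {porter_vs_js:.1f}%")
print(f"english vs JS porter: {english_vs_js:.1f}%")
```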
I'd suggest the best way to resolve this would be to switch from "porter" to "english" (because the latter has improvements from 45 years of experience using the original so is a significantly better stemmer) and replace this "JS porter" implementation with a Javascript version of the same stemmer generated by Snowball. Snowball's upstream test suite should ensure these produce the same stems (at least if you take them from the same upstream Snowball release, but evolution is slow at this point so even version skew is not going to give you different stems for 2.7% of words).
It looks like you even already have Snowball-generated Javascript versions for many languages, though they're rather out of date (Snowball 2.1.0 was released 2021-01-21):
https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/non-minified-js/danish-stemmer.js
How to Reproduce
Compare stems for e.g. "wrapped" from the Python and Javascript code.
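A rough sketch of such a comparison, assuming each stemmer can be wrapped as a Python callable taking a word and returning its stem. The two lookup tables are hypothetical stand-ins that only contain stems quoted in this report; they are not the real stemmers.

```python
def differing_stems(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) for every disagreement."""
    diffs = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            diffs.append((word, a, b))
    return diffs


# Hypothetical stand-ins encoding only the stems quoted in this report.
PYTHON_PORTER = {"wrapped": "wrap", "tautology": "tautologi"}
JS_PORTER = {"wrapped": "wrapp", "tautology": "tautolog"}

words = ["wrapped", "tautology", "search"]
diffs = differing_stems(
    words,
    lambda w: PYTHON_PORTER.get(w, w),
    lambda w: JS_PORTER.get(w, w),
)
for word, a, b in diffs:
    print(f"{word}: {a} != {b}")
```

For a real run, the Python side could be something like `snowballstemmer.stemmer("porter").stemWord` from the snowballstemmer package, while the Javascript side would need its stems exported separately (for example to a word/stem file) since it can't be called from Python directly.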
Environment Information
Report based on inspecting code in git.
Sphinx extensions
Additional context
No response