
English stemming problems #13535

Closed
@ojwb


Describe the bug

Hi, snowballstem upstream here.

The recent issue with our 3.0.0 release caused me to notice that you're using our "porter" stemmer, which is really still provided only for academic interest. It aims to be a faithful implementation of Martin Porter's English stemmer as described in his 1980 paper, and may be useful to people trying to reproduce past results which used it. This means that the implementation of "porter" is effectively frozen (we'd only fix deviations from the original paper). In the 45 years since the paper, numerous shortcomings in the algorithm it describes have come to light, and Martin himself has since devised an improved version of the stemmer, which he nicknamed "porter2". You can find the latest version of this as our "english" stemmer.

Looking at https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/en.py I see there's also a "JS porter" implementation which looks hand-written. My initial thought was "oh, they're using the older and less good porter stemmer because they need to be compatible with that", but looking at the "JS porter" JavaScript code, it isn't actually an implementation of Porter's 1980 algorithm. For example, line 48 has logi: 'log', which is one of the additional rules added in "porter2". My best guess is it's a hand-written implementation of an early version of "porter2".

If I follow how this is being used, you index with the Python "porter" stemmer and search with this JavaScript "JS porter" stemmer. If that's correct, searches for some words will fail to match the same word in documentation (e.g. a search for tautology won't match tautology in the documentation because "porter" will stem it to tautologi while "JS porter" will stem it to tautolog).

I extracted this "JS porter" code to actually verify this, and ran it against Snowball's test suite, which reveals more problems. For example, its undoubling rule is buggy and it stems wrapped to wrapp, while both "porter" and "english" stem it to "wrap", so it's a buggy implementation compared to either. That means a query for wrapped will fail to match wrapped in a document.
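The failure mode can be sketched in a few lines of Python. The two lookup tables are hypothetical stand-ins for the real stemmers, filled in only with the stems quoted above, and `build_index` and `search` are illustrative names rather than Sphinx functions:

```python
# Stand-ins for the two stemmers, using only the stems quoted in this report.
INDEX_STEMS = {"tautology": "tautologi", "wrapped": "wrap"}   # Python "porter"
QUERY_STEMS = {"tautology": "tautolog", "wrapped": "wrapp"}   # "JS porter"


def index_stem(word):
    return INDEX_STEMS.get(word, word)


def query_stem(word):
    return QUERY_STEMS.get(word, word)


def build_index(docs, stem):
    """Map each stem to the set of document names containing it."""
    index = {}
    for name, text in docs.items():
        for word in text.lower().split():
            index.setdefault(stem(word), set()).add(name)
    return index


def search(index, query, stem):
    """Look up a single-word query after stemming it."""
    return index.get(stem(query.lower()), set())


docs = {"logic.rst": "a tautology", "gifts.rst": "wrapped presents"}
index = build_index(docs, index_stem)

# Same stemmer on both sides: the word finds itself.
print(search(index, "wrapped", index_stem))    # {'gifts.rst'}
# Different stemmers: the stems disagree, so the lookup misses.
print(search(index, "wrapped", query_stem))    # set()
print(search(index, "tautology", query_stem))  # set()
```

The point is only that a stem-keyed index needs the same stemming function at index time and at query time; how good the stems are is a separate question.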

In total 1148 words from our English test vocabulary of 42603 words are stemmed differently by "porter" and "JS porter" - that's 2.7% (counting each word equally rather than trying to weight by frequency, but maybe 1 word in 40 will not match as it should).

However, 5324 out of 42621 words (about 12.5%) are stemmed differently by "english" (from Snowball 3.0.0) and "JS porter", so that's worse (at least if we assume all words are equally important).

(The 42621 vs 42603 word list size difference is just because the word list for "english" has had a few extra words added over that for "porter" to provide better test coverage for some rule changes.)
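For reference, these figures come from a plain per-word comparison. A minimal sketch, assuming the two stemmers are available as Python callables (`mismatch_rate` is a hypothetical helper, not part of Snowball's test suite):

```python
def mismatch_rate(words, stem_a, stem_b):
    """Return (count, fraction) of words the two stemmers stem differently."""
    words = list(words)
    differing = sum(1 for w in words if stem_a(w) != stem_b(w))
    return differing, differing / len(words)


# The counts reported above, as percentages of each word list.
print(f"{1148 / 42603:.1%}")   # 2.7%
print(f"{5324 / 42621:.1%}")   # 12.5%
```

Every word counts equally here, so the result says nothing about how often the affected words actually turn up in queries.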

I'd suggest the best way to resolve this would be to switch from "porter" to "english" (because the latter has improvements from 45 years of experience using the original so is a significantly better stemmer) and replace this "JS porter" implementation with a JavaScript version of the same stemmer generated by Snowball. Snowball's upstream test suite should ensure these produce the same stems (at least if you take them from the same upstream Snowball release, but evolution is slow at this point so even version skew is not going to give you different stems for 2.7% of words).

It looks like you already have Snowball-generated JavaScript versions for many languages, though they're rather out of date (Snowball 2.1.0 was released 2021-01-21):

https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/non-minified-js/danish-stemmer.js

How to Reproduce

Compare stems for e.g. wrapped from the Python and JavaScript code.
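On the Python side, one way to do this is with the `snowballstemmer` package from PyPI. The sketch below assumes it is installed and returns `None` otherwise; `python_stems` is a hypothetical helper name. The "JS porter" stems have to be obtained separately by running that JavaScript code in a JavaScript engine.

```python
def python_stems(word):
    """Stem `word` with Snowball's "porter" and "english" algorithms."""
    try:
        import snowballstemmer  # third-party: pip install snowballstemmer
    except ImportError:
        return None  # package not available; nothing to compare
    return {
        name: snowballstemmer.stemmer(name).stemWord(word)
        for name in ("porter", "english")
    }


for word in ("wrapped", "tautology"):
    print(word, python_stems(word))
```

Per the description above, both algorithms should give wrap for wrapped, and "porter" should give tautologi for tautology.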

Environment Information

Report based on inspecting code in git.

Sphinx extensions

Additional context

No response
