Skip to content

feat: add std/text with word-similarity helpers #3488

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 45 commits into from
Nov 5, 2023
Merged

Conversation

jeff-hykin
Copy link
Contributor

Simple argument helper, but could also be useful in frontend search components as well

@jeff-hykin jeff-hykin requested a review from kt3k as a code owner July 12, 2023 16:09
@CLAassistant
Copy link

CLAassistant commented Jul 12, 2023

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the flags label Jul 12, 2023
@lino-levan
Copy link
Contributor

I'm in favor of adding this, but should we made split this out into a different file? Also, I feel like there is value is exporting levenshteinDistanceBetween (though I would probably rename it to levenshteinDistance).

@jeff-hykin
Copy link
Contributor Author

jeff-hykin commented Jul 13, 2023

Putting it in a different file and renaming the function sounds good to me.

I also found a probably-more-efficient implementation of levenshteinDistance on deno.land. I could just copy it, but I wasn't sure about the license so I left my function in there.

flags/mod.ts Outdated
* @param {number} [options.suggestionLimit=Infinity] - Number of results to return
* @returns {string[]} An array of possible correct spellings for the given word.
*/
export function didYouMean(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name should be something like guessWord, guessCandidate, etc (No prior example of this kind of naming in std)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a fan of guessWord.

flags/mod.ts Outdated
Comment on lines 862 to 863
givenWord,
possibleWords,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually make mandatory arguments positional args, instead of putting them in option bag. ref. https://deno.land/[email protected]/references/contributing/style_guide#exported-functions-max-2-args-put-the-rest-into-an-options-object

@kt3k
Copy link
Member

kt3k commented Jul 13, 2023

I agree with @lino-levan . I think this API should be in somewhere else. I also agree that levenshteinDistance is also worth a public API. (PHP seems having levenshtein builtin function)

@lino-levan Do you have any suggestion about the place for these APIs? What if we create std/text or std/string module?

flags/mod.ts Outdated
/**
* Calculates the Levenshtein distance between two strings.
*
* @param {string} str1 - The first string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Types in @param directive seem redundant as they are given in TypeScript. We usually omit them in our docs (For example, see https://github.com/denoland/deno_std/blob/6b1da3240ca5f9460df2571096d64566a7cb3c38/async/pool.ts#L31

@lino-levan
Copy link
Contributor

I'm of the opinion that this and #3440 ought to be in std/text. I think std/string would limit our future options.

@jeff-hykin
Copy link
Contributor Author

jeff-hykin commented Jul 13, 2023

I'm actually glad you guys bring up std/text cause I was planning on asking for such a thing in a future PR. I was literally going to add that kebab/pascal-case stuff that, it looks like, that other PR is adding

And just as a catch-all I don't really care about any small renaming, argument-change, file location kind of stuff. I'll probably implement all the suggestions tomorrow (including having it in a different file)

@jeff-hykin
Copy link
Contributor Author

alright I believe I implemented all the feedback

flags/mod.ts Outdated
@@ -784,4 +784,4 @@ export function parse<
}

return argv as Args<TArgs, TDoubleDash>;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like unrelated change, could you revert?

flags/test.ts Outdated
@@ -1906,3 +1906,41 @@ Deno.test("typesOfParseOptionsGenericDefaults", function () {
>
>(true);
});

Deno.test("basicDidYouMean", function () {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we move this to a seperate testing file in std/text?

text/mod.ts Outdated
* @param {number} [options.suggestionLimit=Infinity] - Number of results to return
* @returns {string[]} An array of possible correct spellings for the given word.
*/
export function didYouMean(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussion: Should we follow the single-function file pattern that we do elsewhere in std here? Does it make any sense to?

Copy link
Member

@kt3k kt3k Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should follow single-file export pattern. Users who only want levenshteinDistance would want to import from std/text/levenshtein_distance.ts

@kt3k
Copy link
Member

kt3k commented Jul 17, 2023

Could you also address didYouMean -> guessWord rename, and jsdoc typing clean up?

@timreichen
Copy link
Contributor

I think this is a really cool addition to for std, but I don't like the autoThrow option in didYouMean() (future guessWord()). It seems not very js api-like to me that a function returns or throws depending on an option.
The error is not really an error but a message meant to see by the end user which would require localisation imo.
Therefore I suggest to remove that option and the function to always return.

@lino-levan
Copy link
Contributor

I have to agree with @timreichen. autoThrow is a weird option, both in naming and in functionality. +1 for removing it.

@jeff-hykin
Copy link
Contributor Author

jeff-hykin commented Jul 17, 2023

@lino-levan

Could you also address didYouMean -> guessWord rename, and jsdoc typing clean up?

Sorry I misinterpreted your original request (I renamed givenWord to guessWord).

I do want to discuss this name change though (didYouMean -> guessWord). Based on evidence, including cliffy for deno, npm library names, pip module names, stack overflow questions, etc I believe people generally call this kind of behavior/feature "did you mean".

I'd say there's already a bit of visibilty problem with deno-std, like how 3rd party libraries show up first when searching for deep merge or debounce. If the deeply-merge function, for example, was called recursiveMerge in deno std, I think it exacerbates the visibilty problem (even if a name like "recursive merge" is technically still an accurate description of the behavior).

So for the sake of convention/searchability, I think didYouMean is the only name I have strong feelings about.

@jeff-hykin
Copy link
Contributor Author

jeff-hykin commented Jul 17, 2023

I think this is a really cool addition to for std, but I don't like the autoThrow option in didYouMean() (future guessWord()). It seems not very js api-like to me that a function returns or throws depending on an option.

@timreichen

I totally agree with this, it feels werid, but I would like brainstorm some alternatives. Maybe there should be another function with assert in the name?

I only added it because, in practice, it is a real pain to manually write the if-checks and error responses every time:

function somethingWithArgs(args) {
    const firstWordOptions = [ "length", "size", "help", ]
    const colorOptions = [ "red", "green", "blue", ]
    const videoFormatOptions = [ "mp4", "mov", ]
    
    // arg validation step
    didYouMean(args[0], firstWordOptions, { autoThrow: true })
    didYouMean(args.color, colorOptions, { autoThrow: true })
    didYouMean(args.format, videoFormatOptions, { autoThrow: true })

    /* actual logic */
}

// vs
function somethingWithArgs(args) {
     const firstWordOptions = [ "length", "size", "help", ]
     const colorOptions = [ "red", "green", "blue", ]
     const videoFormatOptions = [ "mp4", "mov", ]
     
     // arg validation step
    if (!firstWordOptions.includes(args[0])) {
        const suggestions = didYouMean({ guessWord: args[0], firstWordOptions }).join(", ")
        throw new Error(`${args[0]} isn't a valid argument, did you mean one of ${suggestions}?`)
    }
    if (!colorOptions.includes(args.color)) {
        const suggestions = didYouMean({ guessWord: args.color, videoFormatOptions }).join(", ")
        throw new Error(`${args.color} isn't a valid argument, did you mean one of ${suggestions}?`)
    }
    if (!colorOptions.includes(args.format)) {
        const suggestions = didYouMean({ guessWord: args.format, videoFormatOptions }).join(", ")
        throw new Error(`${args.format} isn't a valid argument, did you mean one of ${suggestions}?`)
    }

     /* actual logic */
}

@kt3k kt3k changed the title Add didYouMean feat(text): add didYouMean Aug 25, 2023
@jeff-hykin
Copy link
Contributor Author

Thanks for the changes @kt3k. I got halfway through them yesterday after updating from denoland:main but ran out of time.

Could you explain the _util.ts vs mod.ts, like why are some things in mod but others are in _util?

text/_util.ts Outdated
@@ -0,0 +1,129 @@
// Copyright 2018-2023 the Deno authors. All rights reserved. MIT license.
import { levenshteinDistance } from "./levenshtein_distance.ts";
import { assert } from "../assert/mod.ts";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kt3k confused by this one, why is this mod.ts and not assert.ts?

Copy link
Contributor Author

@jeff-hykin jeff-hykin Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a strong preference, but it was more generic; I could do a find-and-replace-all from the old location ("../testing/asserts.ts";) to the new location, but only if I used "mod.ts" (so that imports like assertEquals, assertThrows kept working)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My hope would be that in the std we try to minimize the module graph, so not sure I support this move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay I'll change it

text/_util.ts Outdated
* use a named-distance (e.g. levenshteinDistance) to
* guarantee a particular ordering
*/
export function closest(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is missing a word here. closestWord? Not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be closestString since spaces are allowed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

closestString makes sense to me. I don't feel strongly either way, just brought this up for bikeshedding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

text/_util.ts Outdated
* use a named-distance (e.g. levenshteinDistance) to
* guarantee a particular ordering
*/
export function similarityCompare(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compareSimilarity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

text/_util.ts Outdated
* @param options.caseSensitive - Flag indicating whether the distance should include case. Default is false.
* @returns {string[]} A sorted copy of possibleWords
*/
export function wordSimilaritySort(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're already exposing similarityCompare, do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave that up to you guys, its the difference between:

let suggestions = [...possibleWords].sort(similarityCompare("hep", { caseSensitive })).join(", ")

and

let suggestions = wordSimilaritySort("hep", possibleWords).join(", ")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding why one would need to do [...possibleWords]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cause otherwise it mutates possibleWords, which might be sorted in a particular order already (for example an order that is used when the help command is invoked)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I should have clarified. Why not use .toSorted()? It would just be

let suggestions = possibleWords.toSorted(similarityCompare("hep", { caseSensitive })).join(", ")

which is a little more readable imo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer older compatibility (e.g. avoiding bleeding-edge ES2023 features), and plus toSorted is literally only one character shorter than the more explicit version.

@kt3k
Copy link
Member

kt3k commented Aug 26, 2023

@jeff-hykin

Could you explain the _util.ts vs mod.ts, like why are some things in mod but others are in _util?

APIs exported from mod.ts are public APIs of the module, and ones exported from _util.ts are not (i.e. They are internal utilities). Once we expose the APIs as public APIs, we can't change them easily in the future. I think we are ready to expose levenshteinDistance and didYouMean, but others (closestString, compareSimilarity, wordSimilaritySort) don't look ready to be exposed

@kt3k
Copy link
Member

kt3k commented Aug 26, 2023

I do want to discuss this name change though (didYouMean -> guessWord). Based on evidence, including cliffy for deno, npm library names, pip module names, stack overflow questions, etc I believe people generally call this kind of behavior/feature "did you mean".

I revisited these links, but what the above pip module does is different from what this PR does (They seem correcting typo based on the dictionary). Stackoverflow question refers to Did you mean feature of google search, which is also different from what this PR does. cliffy has didYouMean util, but that is not a public API. npm:didyoumean is probably the only one example similar to this, but the behavior is very different (they don't throw automatically).

I now feel unsure this design of didYouMean is appropriate for deno_std

(BTW I think levenshteinDistance definitely deserve std API as there's a prior example in php)

@jeff-hykin
Copy link
Contributor Author

jeff-hykin commented Aug 26, 2023

This is starting to feel a bit like review purgatory 😅

I'm just simply trying to contribute:

  • 2 tools people can find by searching "did you mean"
    1. an assert that validates things like enum arguments (not everyone uses TS) with a helpful runtime "That's not one of the options, did you mean X" error message when it's an invalid option
    1. a pure function that returns the most similar string when given a list of strings (closestString/Spell check/NPM didYouMean/Ruby didYouMean)

cliffy has didYouMean util, but that is not a public API

Yes that's the point; I can't use his implementation and the cliffy author, myself, and others have all had to reimplement the same tool because there isn't one available in the std. Even if it was public in cliffy, it would be weird to depend on cliffy just to validate an enum argument in a function that's going to be used in a web interface. "I need to guarantee that this string belongs to a predefined list of strings, and throw a helpful error if that's not the case" is a common need, and for every duplicate implementation, there's probably 10 more devs that decided "I'm just going to throw an unhelpful error message, after all, I'm writing a tiny JS function and it doesn't make sense to create 2x the work implementing name similarity just to improve the error"

They seem correcting typo based on the dictionary

This is why it feels like purgatory; the original PR did correct typos based on a list of strings but based on feedback we renamed it to "closestString" and then the _utils change made it private. It feels like we're going around in circles while also blocking generic and commonly needed functionality.

I agree naming is important, and I feel like things that throw should be an assertX or requireX. So, would it be sufficient to:

  • rename+change the current didYouMean to assertOneOf, make it handle more generic types gracefully, and have it detect the case of input=String, and options=Array[String] in which case it provides a "did you mean" error.
  • Then rename or alias closestString to didYouMean

@kt3k
Copy link
Member

kt3k commented Aug 28, 2023

Sorry for pressuring!

rename+change the current didYouMean to assertOneOf, make it handle more generic types gracefully, and have it detect the case of input=String, and options=Array[String] in which case it provides a "did you mean" error.

I feel assertOneOf throwing "did you mean" error is too arbitrary design, and feels confusing.

I looked through the APIs again. Now I feel 3 APIs in _util.ts are more appropriate for deno_std in API naming and functionality (sorry for changing my mind again!), and if we have these 3 APIs, the feature like didYouMean can be easily implemented using closestString or wordSimilaritySort.

I'd like to suggest the below:

  • Make levenshteinDistance, closestString, compareSimilarity, and wordSimilaritySort public APIs.
  • In jsdoc example of wordSimilaritySort, we describe how didYouMean can be implemented using it (i.e. don't provide didYouMean as an API)

If we have didYouMean function in the jsdoc example, that'll be indexed in our doc search in Deno homepage, and the user should have access to it easily. What do you think?

@iuioiua
Copy link
Contributor

iuioiua commented Oct 28, 2023

Hi @jeff-hykin, I spoke with @kt3k. Are you still happy to pursue this PR with the above comment's suggestions?

@jeff-hykin
Copy link
Contributor Author

jeff-hykin commented Oct 30, 2023

Thanks for the ping, and sorry for leaving this PR hanging. I am still interested in it.

Yeah, I can make the changes and get this branch back up to date with main.

If we have didYouMean function in the jsdoc example, that'll be indexed in our doc search in Deno homepage, and the user should have access to it easily

This is good, thanks for addressing one of my core concerns.

assertOneOf throwing "did you mean" error is too arbitrary design, and feels confusing.

I'm onboard with this sentiment; purity/cleanliness is a top priority for std. Even small things, like --_ causing problems for the std flag parser bother me.

_

That said, I still feel there is a gap here. Maybe I we can merge those four functions to get this PR closed and open up a new PR to continue this discussion. For now I'll continue it here.

The "let the user implement it" is exactly what I'm trying to avoid. I mean, if the expression is so easy that the end user would just inline it; ok thats fine. But when they need to define a helper function, a function that isn't customized to their specific project in any way, then I think theres a problem.

In my opinion me, my coworker, and my student shouldn't each need our own stdlib. But if deno std doesn't implement stuff like assertOneOf, then we will (either that or we need to copy and paste our implementation from project to project which also seems terrible).

@iuioiua iuioiua changed the title feat(text): add didYouMean feat: add std/text with word-similarity helpers Oct 31, 2023
@iuioiua
Copy link
Contributor

iuioiua commented Oct 31, 2023

I've created #3752 to continue the discussion of didYouMean() in std/text. @jeff-hykin, can you please run deno fmt and make any final adjustments, then let us know once it's ready?

@jeff-hykin
Copy link
Contributor Author

If we have didYouMean function in the jsdoc example

Should I put all 80 lines of the didYouMean function in a jsdoc example? Or should we just merge as-is and discuss this on #3752 ? @kt3k

Copy link
Member

@kt3k kt3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now looks reasonable to me. Thanks for updating! @jeff-hykin

LGTM

@kt3k kt3k merged commit 06a676f into denoland:main Nov 5, 2023
@jeff-hykin
Copy link
Contributor Author

Thanks @kt3k !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants