[Request] Clarify token pruning docs #481

Open
kderusso opened this issue Feb 18, 2025 · 0 comments
Labels
Team:Search Issues owned by the Search Docs Team

Comments

@kderusso
Member

See Slack for more context.

Right now we say:

tokens_weight_threshold: Tokens whose weight is less than tokens_weight_threshold are considered insignificant and pruned. This value must be between 0 and 1. Default: 0.4.

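This is misleading: tokens_weight_threshold is not an absolute minimum weight. It is applied relative to the weight of the best-scoring token, and a token is pruned only if it also exceeds tokens_freq_ratio_threshold.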

By setting tokens_freq_ratio_threshold to 10, you are saying that in order to be pruned, a token must be 10x more frequent than the average token across all tokens in all documents for that field. This is higher than the default of 5, so you’re raising the bar and requiring tokens to be even more frequent before they are pruned. In practice, I would expect this to prune only extremely common tokens, such as "is" and "the".

By setting tokens_weight_threshold to 0.4, you are saying that you want to take the best-scoring token and never prune anything that scores 40% or more of that score. Because scores vary so widely across text search results, we can’t set a blanket “this is the minimum score” and still expect consistently good results. Instead, let’s say your top score was 0.2. That means that in order to be pruned, a token’s score would have to be below 0.08 (0.4 × 0.2).

Both of those criteria must match for a token to be pruned.
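
To make the combined rule concrete, here is a minimal Python sketch of the logic as described in this issue. It is not the Elasticsearch implementation: the names PruningConfig and should_prune, and the way token frequency is passed in as plain numbers, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class PruningConfig:
    # Defaults quoted in this issue: 5 for the frequency ratio, 0.4 for the weight.
    tokens_freq_ratio_threshold: float = 5.0
    tokens_weight_threshold: float = 0.4


def should_prune(token_freq, avg_token_freq, token_weight, best_weight, config):
    """Hypothetical helper: True only when BOTH pruning criteria match."""
    # Criterion 1: the token is far more frequent than the average token
    # across all documents for the field.
    is_frequent = token_freq > config.tokens_freq_ratio_threshold * avg_token_freq
    # Criterion 2: the token's weight is low relative to the best-scoring token
    # (a relative cutoff, not an absolute minimum weight).
    is_low_weight = token_weight < config.tokens_weight_threshold * best_weight
    return is_frequent and is_low_weight


# Settings from the example above: ratio threshold 10, weight threshold 0.4.
config = PruningConfig(tokens_freq_ratio_threshold=10, tokens_weight_threshold=0.4)

# Top score 0.2, so the weight cutoff is 0.4 * 0.2 = 0.08.
# Average token frequency 100, so the frequency cutoff is 10 * 100 = 1000.
very_common_and_weak = should_prune(1500, 100, 0.05, 0.2, config)  # both criteria match
common_but_not_enough = should_prune(600, 100, 0.05, 0.2, config)  # fails the frequency test
very_common_but_strong = should_prune(1500, 100, 0.10, 0.2, config)  # fails the weight test
```

In this sketch only the first token is pruned. With the default ratio threshold of 5, the second token (frequency 600 against an average of 100) would be pruned as well, which illustrates why raising the threshold to 10 prunes fewer tokens.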

@bmorelli25 added the needs-team (Issues pending triage by the Docs Team) and Team:Platform (Issues owned by the Platform Docs Team) labels on Apr 17, 2025
@github-actions (bot) removed the needs-team (Issues pending triage by the Docs Team) label on Apr 17, 2025
@georgewallace added the Team:Search (Issues owned by the Search Docs Team) label and removed the Team:Platform (Issues owned by the Platform Docs Team) label on Apr 30, 2025