Address the feedback on the tokenizer's library #7024

Merged 23 commits on Feb 26, 2024
f6e32f5  Fix cache when calling EncodeToIds (tarekgh, Feb 17, 2024)
0553922  Make EnglishRoberta _mergeRanks thread safe (tarekgh, Feb 17, 2024)
a4cb1f5  Delete Trainer (tarekgh, Feb 19, 2024)
6a13025  Remove the setters on the Bpe properties (tarekgh, Feb 19, 2024)
3278aff  Remove Roberta and Tiktoken special casing in the Tokenizer and suppo… (tarekgh, Feb 19, 2024)
b5f7fa2  Support text-embedding-3-small/large embedding (tarekgh, Feb 19, 2024)
a11f4e0  Remove redundant TokenToId abstraction and keep the one with the extr… (tarekgh, Feb 19, 2024)
865068a  Enable creating Tiktoken asynchronously or directly using the tokeniz… (tarekgh, Feb 20, 2024)
4077de0  Add cancellationToken support in CreateAsync APIs (tarekgh, Feb 21, 2024)
5aaf849  Rename sequence to text and Tokenize to Encode (tarekgh, Feb 21, 2024)
b5e0927  Rename skipSpecialTokens to considerSpecialTokens (tarekgh, Feb 21, 2024)
5e26b3e  Rename TokenizerResult to EncodingResult (tarekgh, Feb 21, 2024)
985de8a  Make Token publicly immutable (tarekgh, Feb 21, 2024)
b551e7d  Change offset tuples from (Index, End) to (Index, Length) (tarekgh, Feb 21, 2024)
7ea7f70  Rename NormalizedString method's parameters (tarekgh, Feb 21, 2024)
b0c8244  Rename Model's methods to start with verb (tarekgh, Feb 21, 2024)
450418a  Convert Model.GetVocab() method to a Vocab property (tarekgh, Feb 21, 2024)
6f53de8  Some method's parameters and variable renaming (tarekgh, Feb 22, 2024)
62334c6  Remove Vocab and VocabSize from the abstraction (tarekgh, Feb 22, 2024)
d48b32d  Cleanup normalization support (tarekgh, Feb 22, 2024)
191ab03  Minor Bpe cleanup (tarekgh, Feb 22, 2024)
b9b0f58  Resolve rebase change (tarekgh, Feb 23, 2024)
1ad157f  Address the feedback (tarekgh, Feb 25, 2024)
Rename NormalizedString method's parameters
tarekgh committed Feb 23, 2024
commit 7ea7f701e78fcacd0ac273ce119abf26724c426f
@@ -21,6 +21,6 @@ public LowerCaseNormalizer() { }
         /// </summary>
         /// <param name="original">The original string to normalize to lowercase form.</param>
         /// <returns>The lower-cased normalized string.</returns>
-        public override NormalizedString Normalize(string original) => new NormalizedString(original, original.ToLowerInvariant(), mapping: null, isOneToOneMapping: true);
+        public override NormalizedString Normalize(string original) => new NormalizedString(original, original.ToLowerInvariant(), normalizedToOriginalMapping: null, isOneToOneMapping: true);
     }
 }
14 changes: 7 additions & 7 deletions src/Microsoft.ML.Tokenizers/Normalizer/NormalizedString.cs
@@ -16,18 +16,18 @@ public readonly struct NormalizedString
         /// between the original and the normalized string.
         /// </summary>
         /// <param name="original">The original string before normalization.</param>
-        /// <param name="normalizedString">The normalized string.</param>
-        /// <param name="mapping">The mapping between the normalized string and the original string.</param>
+        /// <param name="normalized">The normalized string.</param>
+        /// <param name="normalizedToOriginalMapping">The mapping between the normalized string and the original string.</param>
         /// <param name="isOneToOneMapping">Indicate whether the mapping is one-to-one.</param>
-        public NormalizedString(string original, string normalizedString, int[]? mapping, bool isOneToOneMapping)
+        public NormalizedString(string original, string normalized, int[]? normalizedToOriginalMapping, bool isOneToOneMapping)
         {
             Original = original;
-            Normalized = normalizedString;
-            NormalizedToOriginalMapping = mapping;
+            Normalized = normalized;
+            NormalizedToOriginalMapping = normalizedToOriginalMapping;
 
-            if (mapping is not null && mapping.Length < normalizedString.Length)
+            if (normalizedToOriginalMapping is not null && normalizedToOriginalMapping.Length < normalized.Length)
             {
-                throw new ArgumentException($"Mapping array has to cover the whole normalized string length mapping", nameof(mapping));
+                throw new ArgumentException($"Mapping array has to cover the whole normalized string length mapping", nameof(normalizedToOriginalMapping));
             }
 
             IsOneToOneMapping = isOneToOneMapping;
@@ -21,6 +21,6 @@ public UpperCaseNormalizer() { }
         /// </summary>
         /// <param name="original">The original string to normalize to uppercase form.</param>
         /// <returns>The upper-cased normalized string.</returns>
-        public override NormalizedString Normalize(string original) => new NormalizedString(original, original.ToUpperInvariant(), mapping: null, isOneToOneMapping: true);
+        public override NormalizedString Normalize(string original) => new NormalizedString(original, original.ToUpperInvariant(), normalizedToOriginalMapping: null, isOneToOneMapping: true);
     }
 }
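
Note on the rename: C# callers may pass arguments by name, so changing `mapping` to `normalizedToOriginalMapping` is source-breaking for any caller that wrote `mapping:`. That is why both case normalizers are touched in this commit. The sketch below shows a call site using the new names with a non-null mapping. It is an illustration only: `NormalizedStringSketch` is a hypothetical stand-in that copies just the constructor from the diff so the snippet compiles without the library, `NormalizerSketch.RemoveSpaces` is an invented helper, and the reading of the mapping as "index in the original string for each normalized character" is an assumption taken from the parameter name, not something the diff states.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in mirroring only the constructor shown in the diff;
// the real NormalizedString lives in Microsoft.ML.Tokenizers.
public readonly struct NormalizedStringSketch
{
    public NormalizedStringSketch(string original, string normalized, int[]? normalizedToOriginalMapping, bool isOneToOneMapping)
    {
        Original = original;
        Normalized = normalized;
        NormalizedToOriginalMapping = normalizedToOriginalMapping;

        // Same guard as the diff: a supplied mapping must cover every normalized character.
        if (normalizedToOriginalMapping is not null && normalizedToOriginalMapping.Length < normalized.Length)
        {
            throw new ArgumentException("Mapping array has to cover the whole normalized string length", nameof(normalizedToOriginalMapping));
        }

        IsOneToOneMapping = isOneToOneMapping;
    }

    public string Original { get; }
    public string Normalized { get; }
    public int[]? NormalizedToOriginalMapping { get; }
    public bool IsOneToOneMapping { get; }
}

public static class NormalizerSketch
{
    // Invented example normalizer: drops ASCII spaces and records, for each kept
    // character, its index in the original string (assumed mapping semantics).
    public static NormalizedStringSketch RemoveSpaces(string original)
    {
        var normalized = new StringBuilder(original.Length);
        var mapping = new List<int>(original.Length);

        for (int i = 0; i < original.Length; i++)
        {
            if (original[i] != ' ')
            {
                normalized.Append(original[i]);
                mapping.Add(i);
            }
        }

        // Named arguments use the post-rename parameter names.
        return new NormalizedStringSketch(
            original,
            normalized.ToString(),
            normalizedToOriginalMapping: mapping.ToArray(),
            isOneToOneMapping: false);
    }
}
```

Under these assumptions, `NormalizerSketch.RemoveSpaces("a b")` yields a normalized string of `ab` with mapping `[0, 2]`, and passing a mapping shorter than the normalized string raises `ArgumentException` naming `normalizedToOriginalMapping`.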