24 releases
| Version | Released |
|---|---|
| 0.12.0 | Aug 13, 2025 |
| 0.11.3 | Apr 21, 2025 |
| 0.11.2 | Jul 22, 2023 |
| 0.11.1 | Mar 19, 2023 |
| 0.1.3 | Feb 7, 2020 |
sentencepiece
This Rust crate is a binding for the sentencepiece unsupervised text tokenizer. The crate documentation is available online.
libsentencepiece dependency
This crate depends on the sentencepiece C++ library. By default,
this dependency is treated as follows:
- If sentencepiece can be found with pkg-config, the crate will link against the library found through pkg-config. Warning: dynamic linking only works correctly with sentencepiece 0.1.95 or later, due to a bug in earlier versions.
- Otherwise, the crate's build script will do a static build of the sentencepiece library. This requires that cmake is available.
If you wish to override this behavior, the sentencepiece-sys crate
offers two features:
- system: always attempt to link to the sentencepiece library found with pkg-config.
- static: always do a static build of the sentencepiece library and link against that.
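
The selection rules above can be summarized as a small decision function. The following is only a sketch of the described behavior, not the crate's actual build script: `LinkStrategy` and `choose_strategy` are hypothetical names, and treating both features enabled at once as an error is an assumption made for this example.

```rust
/// How the native library ends up being linked (hypothetical type).
#[derive(Debug, PartialEq)]
enum LinkStrategy {
    /// Link against the library that pkg-config reported.
    System,
    /// Build sentencepiece from source with cmake and link it statically.
    StaticBuild,
}

/// Hypothetical model of the documented rules; not the real build script.
fn choose_strategy(
    system_feature: bool,
    static_feature: bool,
    found_by_pkg_config: bool,
) -> Result<LinkStrategy, String> {
    match (system_feature, static_feature) {
        // Assumption: enabling both features is a configuration error.
        (true, true) => Err("enable at most one of `system` and `static`".to_string()),
        // `system`: always go through pkg-config, so a failed lookup is an error.
        (true, false) if found_by_pkg_config => Ok(LinkStrategy::System),
        (true, false) => Err("sentencepiece not found via pkg-config".to_string()),
        // `static`: always build from source, whatever pkg-config reports.
        (false, true) => Ok(LinkStrategy::StaticBuild),
        // Default: prefer the pkg-config library, otherwise build statically.
        (false, false) if found_by_pkg_config => Ok(LinkStrategy::System),
        (false, false) => Ok(LinkStrategy::StaticBuild),
    }
}
```

With no feature set, `choose_strategy(false, false, true)` selects the system library, while `choose_strategy(false, false, false)` falls back to the static build.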
lib.rs:
This crate binds the sentencepiece library. sentencepiece is an unsupervised text tokenizer.
The main data structure of this crate is SentencePieceProcessor,
which is used to tokenize sentences:
```rust
use sentencepiece::SentencePieceProcessor;

let spp = SentencePieceProcessor::open("testdata/toy.model").unwrap();
let pieces = spp.encode("I saw a girl with a telescope.").unwrap()
    .into_iter().map(|p| p.piece).collect::<Vec<_>>();
assert_eq!(pieces, vec!["▁I", "▁saw", "▁a", "▁girl", "▁with",
    "▁a", "▁t", "el", "es", "c", "o", "pe", "."]);
```
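
In the pieces above, the `▁` character (U+2581) marks a position where the original text had a space. The sketch below shows how that convention maps pieces back to text using only the standard library. `join_pieces` is a hypothetical helper written for this example, not part of the crate's API; the crate may provide its own decoding, which should be preferred in real code.

```rust
/// U+2581, the marker sentencepiece uses for a preceding space.
const WORD_BOUNDARY: char = '\u{2581}';

/// Hypothetical helper: concatenate pieces and turn boundary markers into spaces.
fn join_pieces(pieces: &[&str]) -> String {
    // Glue the pieces together, then replace every marker with a plain space.
    let text = pieces.concat().replace(WORD_BOUNDARY, " ");
    // The first piece usually starts with a marker; drop that single leading space.
    match text.strip_prefix(' ') {
        Some(rest) => rest.to_string(),
        None => text,
    }
}
```

Applied to the thirteen pieces from the example, this returns the original sentence, "I saw a girl with a telescope."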