24 releases

0.12.0 Aug 13, 2025
0.11.3 Apr 21, 2025
0.11.2 Jul 22, 2023
0.11.1 Mar 19, 2023
0.1.3 Feb 7, 2020

#201 in Encoding

7,403 downloads per month
Used in 16 crates (11 directly)

MIT/Apache

2.5MB
26K SLoC

C++ 25K SLoC // 0.1% comments Rust 1K SLoC // 0.0% comments Bitbake 370 SLoC // 0.5% comments Shell 4 SLoC

sentencepiece

This Rust crate is a binding for the sentencepiece unsupervised text tokenizer. The crate documentation is available online.

libsentencepiece dependency

This crate depends on the sentencepiece C++ library. By default, this dependency is treated as follows:

  • If sentencepiece can be found with pkg-config, the crate links against the library that pkg-config reports. Warning: dynamic linking only works correctly with sentencepiece 0.1.95 or later, due to a bug in earlier versions.
  • Otherwise, the crate's build script will do a static build of the sentencepiece library. This requires that cmake is available.

If you wish to override this behavior, the sentencepiece-sys crate offers two features:

  • system: always attempt to link to the sentencepiece library found with pkg-config.
  • static: always do a static build of the sentencepiece library and link against that.
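
The default behavior and the two overrides can be read as a small decision table. The following is a minimal sketch of that table, not the actual build script of sentencepiece-sys: the names LinkStrategy and choose_link_strategy are hypothetical, and the precedence of static over system when both features are enabled is an assumption made only for illustration.

```rust
/// How the native sentencepiece library would be linked.
/// Hypothetical type, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkStrategy {
    /// Link against the library that pkg-config reports.
    System,
    /// Build sentencepiece from source (requires cmake) and link statically.
    StaticBuild,
}

/// Pick a strategy from the two feature flags and the pkg-config probe result.
///
/// Assumption: if both features are enabled, `static` takes precedence.
/// The real build script may resolve that case differently.
fn choose_link_strategy(
    system_feature: bool,
    static_feature: bool,
    pkg_config_found: bool,
) -> LinkStrategy {
    if static_feature {
        // `static`: always build from source.
        LinkStrategy::StaticBuild
    } else if system_feature {
        // `system`: always attempt the pkg-config library, even if the
        // probe failed (the build would then fail instead of falling back).
        LinkStrategy::System
    } else if pkg_config_found {
        // Default: prefer the library found through pkg-config.
        LinkStrategy::System
    } else {
        // Default fallback: static build of the bundled sources.
        LinkStrategy::StaticBuild
    }
}
```

With no features enabled, the outcome depends only on whether pkg-config finds the library; enabling a feature removes that dependence.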

lib.rs:

This crate binds the sentencepiece library. sentencepiece is an unsupervised text tokenizer.

The main data structure of this crate is SentencePieceProcessor, which is used to tokenize sentences:

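use sentencepiece::SentencePieceProcessor;

// Load a trained sentencepiece model from disk.
let spp = SentencePieceProcessor::open("testdata/toy.model").unwrap();
// Encode the sentence and keep only the piece strings.
let pieces = spp.encode("I saw a girl with a telescope.").unwrap()
  .into_iter().map(|p| p.piece).collect::<Vec<_>>();
// "▁" (U+2581) is sentencepiece's marker for a preceding space.
assert_eq!(pieces, vec!["▁I", "▁saw", "▁a", "▁girl", "▁with",
  "▁a", "▁t", "el", "es", "c", "o", "pe", "."]);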
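
The "▁" (U+2581) character in the pieces above is how sentencepiece represents whitespace, which is what makes the tokenization reversible. The following is a minimal sketch of that mapping in plain Rust, independent of the crate: detokenize is a hypothetical helper written for illustration, not part of the crate's API, and it assumes the model adds a leading "▁" to the first piece, as in the output above.

```rust
/// Rebuild the original text from sentencepiece-style pieces.
/// Hypothetical helper for illustration; not part of the crate.
fn detokenize(pieces: &[&str]) -> String {
    // Concatenate all pieces without any separator.
    let joined: String = pieces.concat();
    // Turn every whitespace marker (U+2581) back into a plain space.
    let text = joined.replace('\u{2581}', " ");
    // Drop the single leading space produced by the first piece's marker.
    text.strip_prefix(' ').unwrap_or(&text).to_string()
}
```

Applied to the pieces from the example, this yields the input sentence again. For real use, prefer the decoding functionality documented by the crate over a hand-written helper like this one.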

Dependencies