This repository was archived by the owner on Jan 18, 2025. It is now read-only.
Merged
2 changes: 1 addition & 1 deletion .github/workflows/cd.yml
@@ -17,7 +17,7 @@ jobs:
       - name: Setup mdBook
         uses: peaceiris/actions-mdbook@v1
         with:
-          mdbook-version: '0.4.3'
+          mdbook-version: '0.4.4'

       - name: Build the book
         run: |
14 changes: 7 additions & 7 deletions .github/workflows/ci.yml
@@ -26,54 +26,54 @@ jobs:
             rm -v "$HOME/.rustup"
             rm -v "$HOME/.cargo"
           fi
-          echo "::add-path::$HOME/.cargo/bin"
+          echo "$HOME/.cargo/bin" >> $GITHUB_PATH

       - name: Cache rustup
         uses: actions/cache@v1
         with:
           path: ~/.rustup
           key: ${{ runner.os }}-rustup-${{ hashFiles('**/Cargo.lock') }}
-          restore-key: |
+          restore-keys: |
             ${{ runner.os }}-rustup-

       - name: Cache cargo binaries
         uses: actions/cache@v1
         with:
           path: ~/.cargo/bin
           key: ${{ runner.os }}-cargo-bin-${{ hashFiles('**/Cargo.lock') }}
-          restore-key: |
+          restore-keys: |
             ${{ runner.os }}-cargo-bin-

       - name: Cache cargo registry cache
         uses: actions/cache@v1
         with:
           path: ~/.cargo/registry/cache
           key: ${{ runner.os }}-cargo-registry-cache-${{ hashFiles('**/Cargo.lock') }}
-          restore-key: |
+          restore-keys: |
             ${{ runner.os }}-cargo-registry-cache-

       - name: Cache cargo registry index
         uses: actions/cache@v1
         with:
           path: ~/.cargo/registry/index
           key: ${{ runner.os }}-cargo-registry-index-${{ hashFiles('**/Cargo.lock') }}
-          restore-key: |
+          restore-keys: |
             ${{ runner.os }}-cargo-registry-index-

       - name: Cache cargo git db
         uses: actions/cache@v1
         with:
           path: ~/.cargo/git/db
           key: ${{ runner.os }}-cargo-git-db-${{ hashFiles('**/Cargo.lock') }}
-          restore-key: |
+          restore-keys: |
             ${{ runner.os }}-cargo-git-db-

       - name: Cache cargo build
         uses: actions/cache@v1
         with:
           path: target
           key: ${{ runner.os }}-cargo-build-target-${{ hashFiles('**/Cargo.lock') }}
-          restore-key: |
+          restore-keys: |
             ${{ runner.os }}-cargo-build-target-

       - name: Install stable Rust
2 changes: 1 addition & 1 deletion README.md
@@ -91,7 +91,7 @@ Special-purpose sorts:

 ### String Manipulation

-- [🚧 Hamming distance](src/hamming_distance)
+- [Hamming distance](src/hamming_distance)
 - [Levenshtein distance](src/levenshtein_distance)
 - [🚧 Longest common substring](src/longest_common_substring)

2 changes: 1 addition & 1 deletion src/README.md
@@ -88,7 +88,7 @@

 ### String Manipulation

-- [🚧 Hamming distance](hamming_distance)
+- [Hamming distance](hamming_distance)
 - [Levenshtein distance](levenshtein_distance)
 - [🚧 Longest common substring](longest_common_substring)

2 changes: 1 addition & 1 deletion src/SUMMARY.md
@@ -76,7 +76,7 @@

 # 🧵 String Manipulation

-- [🚧 Hamming distance]()
+- [Hamming distance](hamming_distance/README.md)
 - [Levenshtein distance](levenshtein_distance/README.md)
 - [🚧 Longest common substring]()

63 changes: 63 additions & 0 deletions src/hamming_distance/README.md
@@ -0,0 +1,63 @@
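# Hamming distance

The Hamming distance between two sequences of equal length is the number of positions at which the corresponding values differ; for binary sequences, it is simply the number of differing bits. It is also a kind of edit distance: the number of substitutions needed to turn one string into the other.

Hamming distance is used mostly for error correction in coding theory; it is the distance measure used in Hamming codes.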
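
To make the definition concrete, here is a minimal sketch that counts differing positions in two equal-length slices. The name `hamming_distance_slice` and the choice to return `None` for unequal lengths are assumptions made for illustration; they are not part of this crate.

```rust
/// Hypothetical helper (not part of this crate): counts the positions
/// at which two slices of equal length hold different values.
/// Returns `None` when the lengths differ, since the distance is undefined.
pub fn hamming_distance_slice<T: PartialEq>(a: &[T], b: &[T]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    // Pair up elements by position and count the pairs that differ.
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}
```

For example, "karolin" and "kathrin" differ in three positions (`r`/`t`, `o`/`h`, `l`/`r`), so comparing their bytes with this helper gives `Some(3)`.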

## Bitwise implementation

Counting the differing bits boils down to a few bitwise operations:

```rust
{{#include mod.rs:bit}}
```

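1. XOR the two sequences so that differing bits become 1 and matching bits become 0.
2. If the XOR result is non-zero, some differing bits remain, so keep counting.
3. AND the XOR result with 1 to check whether the least significant bit is 1.
4. Shift the XOR result right by one bit to discard the least significant bit, then go back to step 2.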

> According to [The Rust Reference][], a right shift in Rust is
>
> - a logical right shift on unsigned integers: the most significant bit is filled with 0;
> - an arithmetic right shift on signed integers: the sign is preserved after the shift.

[The Rust Reference]: https://doc.rust-lang.org/reference/expressions/operator-expr.html#arithmetic-and-logical-binary-operators
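
The difference between the two shifts can be checked with a small sketch. The function names `shift_unsigned` and `shift_signed` are hypothetical and exist only for this illustration.

```rust
/// Hypothetical helper: logical right shift on an unsigned integer.
/// The vacated most significant bit is filled with 0.
pub fn shift_unsigned(x: u8) -> u8 {
    x >> 1
}

/// Hypothetical helper: arithmetic right shift on a signed integer.
/// The vacated most significant bit is filled with a copy of the sign bit.
pub fn shift_signed(x: i8) -> i8 {
    x >> 1
}
```

With these helpers, `shift_unsigned(0b1000_0000)` gives `0b0100_0000`, while `shift_signed(-128)` gives `-64` and `shift_signed(-1)` stays `-1`. This is one reason the loop above works on an unsigned type: a negative signed value never becomes 0 under repeated right shifts, so a `while xor != 0` condition would never become false.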

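In fact, Rust provides a built-in method that counts the number of one bits in an integer, [`{integer_type}::count_ones`][], which saves us from doing the bit manipulation by hand. The implementation becomes pleasantly short: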

```rust
pub fn hamming_distance(source: u64, target: u64) -> u32 {
    (source ^ target).count_ones()
}
```

[`{integer_type}::count_ones`]: https://doc.rust-lang.org/stable/std/?search=count_ones
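
Because `count_ones` is defined on every primitive integer type and always returns `u32`, the same one-liner carries over to other widths. The sketch below assumes two hypothetical wrappers, `hamming_distance_u8` and `hamming_distance_u128`, which are not part of this crate.

```rust
/// Hypothetical 8-bit variant of the XOR + `count_ones` approach.
pub fn hamming_distance_u8(source: u8, target: u8) -> u32 {
    (source ^ target).count_ones()
}

/// Hypothetical 128-bit variant; `count_ones` still returns `u32`.
pub fn hamming_distance_u128(source: u128, target: u128) -> u32 {
    (source ^ target).count_ones()
}
```

For instance, `0b1100_0011 ^ 0b0110_0110` is `0b1010_0101`, which has four one bits, so `hamming_distance_u8(0b1100_0011, 0b0110_0110)` is 4.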

## String implementation

The string version of Hamming distance is comparatively easy to follow.

```rust
{{#include mod.rs:str}}
```

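The string version likewise takes two inputs, `source` and `target`.

1. Use [`str::chars`][] so the comparison works on Unicode strings rather than ASCII only.
2. `str::chars` returns an `Iterator`, so call `Iterator::next` on both iterators to compare one character at a time.
3. The `if c1 != c2` here is a [match guard][], an extra condition checked on top of the pattern. This arm is taken only when both `source` and `target` have a next character and the two characters differ.
4. If one side is `None` and the other is `Some`, the input strings have different lengths, so panic immediately.
5. If neither side has a next character, break out of the loop.
6. Otherwise, for example when the two characters are equal, continue iterating.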

[`str::chars`]: https://doc.rust-lang.org/std/primitive.str.html#method.chars
[match guard]: https://doc.rust-lang.org/reference/expressions/match-expr.html#match-guards
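
If panicking on unequal lengths is not desirable, one possible alternative is to report the mismatch through the return type. The sketch below is an assumption for illustration only: the name `try_hamming_distance_str` and the `Option<usize>` return type are not part of this crate.

```rust
/// Hypothetical non-panicking variant: returns `None` when the two
/// strings contain a different number of `char`s.
pub fn try_hamming_distance_str(source: &str, target: &str) -> Option<usize> {
    let mut count = 0;
    let mut source = source.chars();
    let mut target = target.chars();

    loop {
        match (source.next(), target.next()) {
            // Both sides have a character: count it only if they differ.
            (Some(c1), Some(c2)) => {
                if c1 != c2 {
                    count += 1;
                }
            }
            // Both iterators ended together: the lengths match.
            (None, None) => return Some(count),
            // One side ended early: the lengths differ.
            _ => return None,
        }
    }
}
```

Since the comparison goes through `chars`, multi-byte characters count as single positions: comparing `"漢明"` with `"漢字"` gives `Some(1)`, while `"abc"` against `"z"` gives `None`.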

## Performance

For sequences of length n, computing the Hamming distance takes $O(n)$ time and $O(1)$ space.

## References

- [Wiki: Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)
- [Algorithm Notes (演算法筆記): Correction](https://web.ntnu.edu.tw/~algo/Correction.html)
80 changes: 80 additions & 0 deletions src/hamming_distance/mod.rs
@@ -0,0 +1,80 @@
/// Calculate the Hamming distance between two unsigned integers.
// ANCHOR: bit
pub fn hamming_distance(source: u64, target: u64) -> u32 {
    let mut count = 0;
    let mut xor = source ^ target; // 1

    // 2
    while xor != 0 {
        count += xor & 1; // 3
        xor >>= 1; // 4
    }

    count as u32
}
// ANCHOR_END: bit

/// Calculate Hamming distance of two UTF-8 encoded strings.
// ANCHOR: str
pub fn hamming_distance_str(source: &str, target: &str) -> usize {
    let mut count = 0;
    // 1
    let mut source = source.chars();
    let mut target = target.chars();

    loop {
        // 2
        match (source.next(), target.next()) {
            // 3
            (Some(c1), Some(c2)) if c1 != c2 => count += 1,
            // 4
            (None, Some(_)) | (Some(_), None) => panic!("Must have the same length"),
            // 5
            (None, None) => break,
            // 6
            _ => continue,
        }
    }

    count
}
// ANCHOR_END: str

#[cfg(test)]
mod base {
    use super::*;

    #[test]
    fn bit() {
        let cases = [
            (0, 0b0000_0000, 0b0000_0000),
            (0, 0b1111_1111, 0b1111_1111),
            (1, 0b0000_0001, 0b0000_0000),
            (2, 0b0000_0000, 0b0001_1000),
            (4, 0b1100_0011, 0b0110_0110),
            (8, 0b0101_0101, 0b1010_1010),
        ];
        for &(dist, c1, c2) in &cases {
            assert_eq!(hamming_distance(c1, c2), dist);
        }
    }

    #[test]
    fn str() {
        let cases = [
            (0, "", ""),
            (0, "rust", "rust"),
            (1, "cat", "bat"),
            (3, "abc", "xyz"),
        ];
        for &(dist, c1, c2) in &cases {
            assert_eq!(hamming_distance_str(c1, c2), dist);
        }
    }

    #[test]
    #[should_panic(expected = "Must have the same length")]
    fn str_panic() {
        hamming_distance_str("abc", "z");
    }
}
3 changes: 3 additions & 0 deletions src/lib.rs
@@ -14,3 +14,6 @@ pub mod sorting;

 mod levenshtein_distance;
 pub use levenshtein_distance::{levenshtein_distance, levenshtein_distance_naive};
+
+mod hamming_distance;
+pub use hamming_distance::{hamming_distance, hamming_distance_str};