Commit 17dede4

situ2001, leviding, and songhn233 authored
feat: translation 1-js/99-js-misc/06-unicode/article.md (javascript-tutorial#1125)
* feat: translation for article `Unicode, String internals`
* Update article.md
* Update 1-js/99-js-misc/06-unicode/article.md

Co-authored-by: LeviDing <[email protected]>
Co-authored-by: Songhn <[email protected]>
1 parent e9ab2e3 commit 17dede4

1 file changed

+65
-65
lines changed

1-js/99-js-misc/06-unicode/article.md

Lines changed: 65 additions & 65 deletions
@@ -1,172 +1,172 @@
11

2-
# Unicode, String internals
2+
# Unicode —— 字符串内幕
33

4-
```warn header="Advanced knowledge"
5-
The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or hieroglyphic characters, or other rare symbols.
4+
```warn header="进阶知识"
5+
本节将更深入地介绍字符串的内部原理。如果你打算处理表情符号(emoji)、罕见的数学或象形文字字符,或其他罕见字符,这些知识将对你很有用。
66
```
77

8-
As we already know, JavaScript strings are based on [Unicode](https://en.wikipedia.org/wiki/Unicode): each character is represented by a byte sequence of 1-4 bytes.
8+
正如我们所知,JavaScript 的字符串是基于 [Unicode](https://en.wikipedia.org/wiki/Unicode) 的:每个字符由 1-4 个字节的字节序列表示。
99

10-
JavaScript allows us to insert a character into a string by specifying its hexadecimal Unicode code with one of these three notations:
10+
JavaScript 允许我们通过下述三种表示方式之一将一个字符以其十六进制 Unicode 编码的方式插入到字符串中:
1111

1212
- `\xXX`
1313

14-
`XX` must be two hexadecimal digits with a value between `00` and `FF`, then `\xXX` is the character whose Unicode code is `XX`.
14+
`XX` 必须是介于 `00` 和 `FF` 之间的两位十六进制数,`\xXX` 表示 Unicode 编码为 `XX` 的字符。
1515

16-
Because the `\xXX` notation supports only two hexadecimal digits, it can be used only for the first 256 Unicode characters.
16+
因为 `\xXX` 符号只支持两位十六进制数,所以它只能用于前 256 个 Unicode 字符。
1717

18-
These first 256 characters include the Latin alphabet, most basic syntax characters, and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).
18+
这前 256 个字符包括拉丁字母、最基本的语法字符和其他一些字符。例如,`"\x7A"` 表示 `"z"`(Unicode 编码为 `U+007A`)。
1919

2020
```js run
2121
alert( "\x7A" ); // z
22-
alert( "\xA9" ); // ©, the copyright symbol
22+
alert( "\xA9" ); // © (版权符号)
2323
```
2424

2525
- `\uXXXX`
26-
`XXXX` must be exactly 4 hex digits with the value between `0000` and `FFFF`, then `\uXXXX` is the character whose Unicode code is `XXXX`.
26+
`XXXX` 必须是 4 位十六进制数,值介于 `0000` 和 `FFFF` 之间。此时,`\uXXXX` 便表示 Unicode 编码为 `XXXX` 的字符。
2727
28-
Characters with Unicode values greater than `U+FFFF` can also be represented with this notation, but in this case, we will need to use a so called surrogate pair (we will talk about surrogate pairs later in this chapter).
28+
Unicode 值大于 `U+FFFF` 的字符也可以用这种方法来表示,但在这种情况下,我们要用到代理对(我们将在本章的后面讨论它)。
2929
3030
```js run
31-
alert( "\u00A9" ); // ©, the same as \xA9, using the 4-digit hex notation
32-
alert( "\u044F" ); // я, the Cyrillic alphabet letter
33-
alert( "\u2191" ); // ↑, the arrow up symbol
31+
alert( "\u00A9" ); // ©, 等同于 \xA9,只是使用了四位十六进制数表示而已
32+
alert( "\u044F" ); // я(西里尔字母)
33+
alert( "\u2191" ); // ↑(上箭头符号)
3434
```
3535

3636
- `\u{X…XXXXXX}`
3737
38-
`XXXXXXX` must be a hexadecimal value of 1 to 6 bytes between `0` and `10FFFF` (the highest code point defined by Unicode). This notation allows us to easily represent all existing Unicode characters.
38+
`XXXXXXX` 必须是介于 `0` 和 `10FFFF`(Unicode 定义的最高码位)之间的 1 到 6 个字节的十六进制值。这种表示方式让我们能够轻松地表示所有现有的 Unicode 字符。
3939
4040
```js run
41-
alert( "\u{20331}" ); // 佫, a rare Chinese character (long Unicode)
42-
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
41+
alert( "\u{20331}" ); // 佫, 一个不常见的中文字符(长 Unicode)
42+
alert( "\u{1F60D}" ); // 😍, 一个微笑符号(另一个长 Unicode)
4343
```
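As a quick check of the three escape notations shown in this part of the diff, the following sketch compares each form against a literal character. It logs with `console.log` rather than the article's `alert` so that it also runs outside a browser; that substitution and the `checks` object are assumptions of this example, not part of the commit.

```js
// Each escape form denotes the same string as the literal character.
const checks = {
  hex2: "\x7A" === "z",                 // \xXX: exactly two hex digits
  hex4: "\u00A9" === "\xA9",            // \uXXXX: exactly four hex digits
  braces: "\u{1F60D}" === "😍",         // \u{...}: one to six hex digits
  // The same code point written as an explicit surrogate pair
  pair: "\uD83D\uDE0D" === "\u{1F60D}",
};

console.log(checks);
```

All four comparisons should come out `true`, since the escapes are only different spellings of the same string contents.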
4444

45-
## Surrogate pairs
45+
## 代理对
4646

47-
All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic unified CJK ideographic sets (CJK -- from Chinese, Japanese, and Korean writing systems), have a 2-byte representation.
47+
所有常用字符都有对应的 2 字节长度的编码(4 位十六进制数)。大多数欧洲语言的字母、数字、以及基本统一的 CJK 表意文字集(CJK —— 来自中文、日文和韩文书写系统)中的字母,均有对应的 2 字节长度的 Unicode 编码。
4848

49-
Initially, JavaScript was based on UTF-16 encoding that only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol of Unicode.
49+
最初,JavaScript 是基于 UTF-16 编码的,只允许每个字符占 2 个字节长度。但 2 个字节只允许 65536 种组合,这对于表示 Unicode 里每个可能的符号来说,是不够的。
5050

51-
So rare symbols that require more than 2 bytes are encoded with a pair of 2-byte characters called "a surrogate pair".
51+
因此,需要使用超过 2 个字节长度来表示的稀有符号,我们则使用一对 2 字节长度的字符编码,它被称为“代理对”(surrogate pair)。
5252

53-
As a side effect, the length of such symbols is `2`:
53+
这种做法也有副作用,这些符号的长度为 `2`:
5454

5555
```js run
56-
alert( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
57-
alert( '😂'.length ); // 2, FACE WITH TEARS OF JOY
58-
alert( '𩷶'.length ); // 2, a rare Chinese character
56+
alert( '𝒳'.length ); // 2, 大写的数学符号 X
57+
alert( '😂'.length ); // 2, 笑哭的表情
58+
alert( '𩷶'.length ); // 2, 一个少见的中文字符
5959
```
6060

61-
That's because surrogate pairs did not exist at the time when JavaScript was created, and thus are not correctly processed by the language!
61+
这是因为在 JavaScript 被创造出来的时候,代理对这个概念并不存在,因此语言并没有正确处理它们!
6262

63-
We actually have a single symbol in each of the strings above, but the `length` property shows a length of `2`.
63+
虽然上面的每个字符串都只有一个字符,但其 `length` 属性显示其长度为 `2`。
6464

65-
Getting a symbol can also be tricky, because most language features treat surrogate pairs as two characters.
65+
如何获取这些符号,也是一个棘手的问题:因为编程语言的大部分功能都将代理对当作两个字符对待。
6666

67-
For example, here we can see two odd characters in the output:
67+
举个例子,我们可以在输出中看到两个奇怪的字符:
6868

6969
```js run
70-
alert( '𝒳'[0] ); // shows strange symbols...
71-
alert( '𝒳'[1] ); // ...pieces of the surrogate pair
70+
alert( '𝒳'[0] ); // 显示出了一个奇怪的符号...
71+
alert( '𝒳'[1] ); // ...代理对的片段
7272
```
7373

74-
Pieces of a surrogate pair have no meaning without each other. So the alerts in the example above actually display garbage.
74+
代理对的片段失去彼此就没有意义。所以上面示例中 `alert()` 打印出的内容其实就是没有任何意义的垃圾信息。
7575

76-
Technically, surrogate pairs are also detectable by their codes: if a character has the code in the interval of `0xd800..0xdbff`, then it is the first part of the surrogate pair. The next character (second part) must have the code in interval `0xdc00..0xdfff`. These intervals are reserved exclusively for surrogate pairs by the standard.
76+
从技术上讲,可以通过代理对的编码来检测代理对:如果一个字符的编码在 `0xd800..0xdbff` 这个范围中,那么它就是代理对的前一个部分。下一个字符(第二部分)的编码必须在 `0xdc00..0xdfff` 范围中。这两个范围中的编码是规范中专为代理对预留的。
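The range check described in this paragraph can be sketched as follows. The helper names `isHighSurrogate`, `isLowSurrogate`, and `isSurrogatePairAt` are hypothetical, introduced only for illustration; they are not built-in functions and do not appear in the article.

```js
// Hypothetical helpers (not built-ins): classify a UTF-16 code unit by range.
function isHighSurrogate(unit) {
  return unit >= 0xd800 && unit <= 0xdbff; // first part of a pair
}

function isLowSurrogate(unit) {
  return unit >= 0xdc00 && unit <= 0xdfff; // second part of a pair
}

// True when a complete surrogate pair starts at index i of str.
// charCodeAt returns NaN past the end, so both range checks fail there.
function isSurrogatePairAt(str, i) {
  return isHighSurrogate(str.charCodeAt(i)) &&
         isLowSurrogate(str.charCodeAt(i + 1));
}

console.log(isSurrogatePairAt('𝒳', 0)); // true
console.log(isSurrogatePairAt('z', 0)); // false
```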
7777

78-
So the methods [String.fromCodePoint](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint) and [str.codePointAt](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt) were added in JavaScript to deal with surrogate pairs.
78+
基于此,JavaScript 新增了 [String.fromCodePoint](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint) 和 [str.codePointAt](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt) 这两个方法来处理代理对。
7979

80-
They are essentially the same as [String.fromCharCode](mdn:js/String/fromCharCode) and [str.charCodeAt](mdn:js/String/charCodeAt), but they treat surrogate pairs correctly.
80+
它们本质上与 [String.fromCharCode](mdn:js/String/fromCharCode) 和 [str.charCodeAt](mdn:js/String/charCodeAt) 相同,但它们可以正确地处理代理对。
8181

82-
One can see the difference here:
82+
在这里可以看出它们的区别:
8383

8484
```js run
85-
// charCodeAt is not surrogate-pair aware, so it gives codes for the 1st part of 𝒳:
85+
// charCodeAt 不会考虑代理对,所以返回了 𝒳 前半部分的编码:
8686
8787
alert( '𝒳'.charCodeAt(0).toString(16) ); // d835
8888
89-
// codePointAt is surrogate-pair aware
90-
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair
89+
// codePointAt 可以正确处理代理对
90+
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3,读取到了完整的代理对
9191
```
9292

93-
That said, if we take from position 1 (and that's rather incorrect here), then they both return only the 2nd part of the pair:
93+
也就是说,如果我们从 `𝒳` 的位置 1 开始获取对应的编码(这么做是不对的),那么这两个方法都只会返回此代理对的后半部分:
9494

9595
```js run
9696
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3
9797
alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3
98-
// meaningless 2nd half of the pair
98+
// 无意义的代理对后半部分
9999
```
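To make the relationship between the two halves (`d835`, `dcb3`) and the full code point (`1d4b3`) concrete, here is a minimal sketch of the standard UTF-16 arithmetic that combines a surrogate pair. `combineSurrogates` is a hypothetical helper written for this example; in practice `codePointAt` already does this work.

```js
// Hypothetical helper: rebuild a code point from its two UTF-16 halves.
// Each half carries 10 bits; the result is offset by 0x10000.
function combineSurrogates(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

const s = '𝒳';
const cp = combineSurrogates(s.charCodeAt(0), s.charCodeAt(1));

console.log(cp.toString(16));                // 1d4b3
console.log(cp === s.codePointAt(0));        // true
console.log(String.fromCodePoint(cp) === s); // true
```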
100100

101-
You will find more ways to deal with surrogate pairs later in the chapter <info:iterable>. There are probably special libraries for that too, but nothing famous enough to suggest here.
101+
你稍后可以在 <info:iterable> 一章中找到更多处理代理对的方式。可能也有专门处理代理对的库,但没有足够流行到可以让我们在这里推荐的库。
102102

103-
````warn header="Takeaway: splitting strings at an arbitrary point is dangerous"
104-
We can't just split a string at an arbitrary position, e.g. take `str.slice(0, 4)` and expect it to be a valid string, e.g.:
103+
````warn header="注意:在任意点拆分字符串是很危险的"
104+
我们不能随意在任意位置对字符串进行拆分,例如通过 `str.slice(0, 4)` 获取一个字符串,并期待它是一个有效的字符串:
105105

106106
```js run
107107
alert( 'hi 😂'.slice(0, 4) ); // hi [?]
108108
```
109109

110-
Here we can see a garbage character (first half of the smile surrogate pair) in the output.
110+
在这里,我们看到一个没有意义的垃圾字符被打印了出来(笑哭表情代理对的前半部分)。
111111

112-
Just be aware of it if you intend to reliably work with surrogate pairs. May not be a big problem, but at least you should understand what happens.
112+
如果你期望可靠地使用代理对,请注意这一点。这可能并不是什么大问题,但至少你应该知道发生了什么。
113113
````
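One possible way to avoid cutting a surrogate pair in half is to slice by code points instead of code units. The sketch below relies on `Array.from`, which iterates a string by code points; `sliceByCodePoints` is a hypothetical helper, not something the article defines. It still does not keep a base character together with its combining marks.

```js
// Hypothetical helper: slice by code points rather than UTF-16 code units.
function sliceByCodePoints(str, start, end) {
  return Array.from(str).slice(start, end).join('');
}

// The plain slice ends in a lone high surrogate (first half of 😂).
console.log('hi 😂'.slice(0, 4) === 'hi \ud83d'); // true

// The code-point-aware slice keeps the whole symbol.
console.log(sliceByCodePoints('hi 😂', 0, 4)); // hi 😂
```

Note that the indexes change meaning: position 4 here counts four symbols, not four code units.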
114114

115-
## Diacritical marks and normalization
115+
## 变音符号和规范化
116116

117-
In many languages, there are symbols that are composed of the base character with a mark above/under it.
117+
很多语言都有由基础字符及其上方/下方的标记所组成的符号。
118118

119-
For instance, the letter `a` can be the base character for these characters: `àáâäãåā`.
119+
举个例子,字母 `a` 就是这些字符 `àáâäãåā` 的基础字符。
120120

121-
Most common "composite" characters have their own code in the Unicode table. But not all of them, because there are too many possible combinations.
121+
大多数常见的“复合”字符在 Unicode 表中都有自己的编码。但不是所有这些字符都有自己的编码,因为可能的组合形式太多了。
122122

123-
To support arbitrary compositions, the Unicode standard allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
123+
为了支持任意的组合,Unicode 标准允许我们使用多个 Unicode 字符:基础字符后跟着一个或多个“装饰”它的“标记”字符。
124124

125-
For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.
125+
例如,如果我们在 `S` 后附加上特殊的“上方的点”字符(编码为 `\u0307`),则显示为 Ṡ。
126126

127127
```js run
128128
alert( 'S\u0307' ); // Ṡ
129129
```
130130

131-
If we need an additional mark above the letter (or below it) -- no problem, just add the necessary mark character.
131+
如果我们需要在字母上方(或下方)添加一个额外的标记 —— 很简单,只需添加必要的标记字符即可。
132132

133-
For instance, if we append a character "dot below" (code `\u0323`), then we'll have "S with dots above and below": `Ṩ`.
133+
例如,如果我们继续在后面附加一个“下方的点”符号(编码 `\u0323`),那么我们将得到一个“上下都有一个点符号的 S”:`Ṩ`。
134134

135-
For example:
135+
就像这样:
136136

137137
```js run
138138
alert( 'S\u0307\u0323' ); // Ṩ
139139
```
140140

141-
This provides great flexibility, but also an interesting problem: two characters may visually look the same, but be represented with different Unicode compositions.
141+
这提供了极大的灵活性,但也带来了一个有趣的问题:两个字符可能在视觉上看起来相同,但却使用的是不同的 Unicode 组合。
142142

143-
For instance:
143+
举个例子:
144144

145145
```js run
146-
let s1 = 'S\u0307\u0323'; // Ṩ, S + dot above + dot below
147-
let s2 = 'S\u0323\u0307'; // Ṩ, S + dot below + dot above
146+
let s1 = 'S\u0307\u0323'; // Ṩ, S + 上方点符号 + 下方点符号
147+
let s2 = 'S\u0323\u0307'; // Ṩ, S + 下方点符号 + 上方点符号
148148
149149
alert( `s1: ${s1}, s2: ${s2}` );
150150
151-
alert( s1 == s2 ); // false though the characters look identical (?!)
151+
alert( s1 == s2 ); // 尽管这两个字符在我们看来是相同的,但结果却是 false
152152
```
153153

154-
To solve this, there exists a "Unicode normalization" algorithm that brings each string to the single "normal" form.
154+
“Unicode 规范化”算法可以解决这个问题,该算法将每个字符串转换为单一的“规范的”形式。
155155

156-
It is implemented by [str.normalize()](mdn:js/String/normalize).
156+
可以借助 [str.normalize()](mdn:js/String/normalize) 实现这一点。
157157

158158
```js run
159159
alert( "S\u0307\u0323".normalize() == "S\u0323\u0307".normalize() ); // true
160160
```
161161

162-
It's funny that in our situation `normalize()` actually brings together a sequence of 3 characters to one: `\u1e68` (S with two dots).
162+
有意思的是,在我们这个例子中,`normalize()` 将 3 个字符的序列合并为了一个字符:`\u1e68`(带有上下两个点的 S)。
163163

164164
```js run
165165
alert( "S\u0307\u0323".normalize().length ); // 1
166166
167167
alert( "S\u0307\u0323".normalize() == "\u1e68" ); // true
168168
```
169169

170-
In reality, this is not always the case. The reason is that the symbol `Ṩ` is "common enough", so Unicode creators included it in the main table and gave it the code.
170+
但实际并非总是如此。出现这种情况的原因是符号 `Ṩ` 是“足够常见的”,所以 Unicode 创建者将其囊括在了 Unicode 主表中,并为其提供了对应的编码。
171171

172-
If you want to learn more about normalization rules and variants -- they are described in the appendix of the Unicode standard: [Unicode Normalization Forms](https://www.unicode.org/reports/tr15/), but for most practical purposes the information from this section is enough.
172+
如果你想了解关于 Unicode 规范化规则和变体的更多信息,可以参阅 Unicode 标准的附录中的内容:[Unicode 规范化形式](https://www.unicode.org/reports/tr15/)。但就实用而言,本节中的信息就已经足够了。
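As a small illustration of the normalization forms mentioned here, the sketch below passes an explicit form name to `str.normalize()`: `"NFC"` (the default, composed) and `"NFD"` (decomposed). The variable names and the `sameText` helper are assumptions made for this example only.

```js
const composed = '\u1e68';                    // Ṩ as a single code point
const decomposed = composed.normalize('NFD'); // S + dot below + dot above

console.log(decomposed.length);                                // 3
console.log(decomposed === 'S\u0323\u0307');                   // true
// NFD also puts the marks into one canonical order:
console.log('S\u0307\u0323'.normalize('NFD') === decomposed);  // true
console.log(decomposed.normalize('NFC') === composed);         // true

// Hypothetical helper: compare two strings after normalizing both.
function sameText(a, b) {
  return a.normalize() === b.normalize();
}

console.log(sameText('S\u0307\u0323', 'S\u0323\u0307')); // true
```

Normalizing both sides before comparing is the design choice that matters: either form works, as long as the same one is applied to both strings.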
