UTF-8 bit representation

2024-6-12

I'm studying the UTF-8 standard, and this is what I've learned so far:

Definition and bytes used
UTF-8 binary representation         Meaning
0xxxxxxx                            1 byte for 1 to 7 bits chars
110xxxxx 10xxxxxx                   2 bytes for 8 to 11 bits chars
1110xxxx 10xxxxxx 10xxxxxx          3 bytes for 12 to 16 bits chars
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4 bytes for 17 to 21 bits chars
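
To make the table concrete, here is a minimal Python sketch that packs a code point into those bit patterns by hand. The function name `encode_utf8` is made up for illustration; in real code Python's built-in `str.encode("utf-8")` does this job, and the sketch only mirrors the layout above plus the usual range checks.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point using the bit layout from the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")

    if code_point < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# Compare the hand-rolled result with Python's own encoder.
for ch in "A", "é", "€", chr(0x1F600):
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
```

Every byte after the first starts with `10`, which is exactly the prefix the question below asks about.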

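I'm wondering, why not use 10xxxxxx for the 2-byte UTF-8 code instead, and so gain 1 bit all the way up to 22 bits with a 4-byte UTF-8 code? As it stands, 64 possible values are lost (from 10000000 to 10111111). I'm not trying to argue with the standard, but I'd like to know why it is this way.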

**Edit**

Or even, why not

UTF-8 binary representation         Meaning
0xxxxxxx                            1 byte for 1 to 7 bits chars
110xxxxx xxxxxxxx                   2 bytes for 8 to 13 bits chars
1110xxxx xxxxxxxx xxxxxxxx          3 bytes for 14 to 20 bits chars
11110xxx xxxxxxxx xxxxxxxx xxxxxxxx 4 bytes for 21 to 27 bits chars

...?

Thanks!

Answer 1

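UTF-8 is self-synchronizing. By examining a single byte, a parser can tell whether it is at the start of a UTF-8 character or in the middle of one.

Suppose your scheme contains two 2-byte characters in a row: 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx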

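If a parser picks up at the second octet, it has no way of telling that it should not read the second and third octets as one character. With UTF-8, the parser can tell that it is in the middle of a character, skip forward to the start of the next one, and emit some status to report the damaged symbol.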
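
The recovery described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names `is_continuation` and `resync` are hypothetical, and the sketch checks prefixes only, without validating the sequence as a whole.

```python
def is_continuation(octet: int) -> bool:
    """True if the octet has the 10xxxxxx pattern (middle of a character)."""
    return (octet & 0xC0) == 0x80


def resync(data: bytes, start: int) -> int:
    """Return the index of the first octet at or after `start`
    that is not a continuation octet, i.e. where a character can begin."""
    i = start
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


# Octet layout: 61 | C3 A9 | E2 82 AC | 62
data = "aé€b".encode("utf-8")

# Start reading in the middle of "é" (index 2 is the octet A9).
pos = resync(data, 2)
assert pos == 3                              # start of "€"
assert data[pos:].decode("utf-8") == "€b"    # decoding resumes cleanly
```

Python's own decoder behaves in the same spirit: `data[2:].decode("utf-8", errors="replace")` substitutes U+FFFD for the stray octet and then decodes the rest normally.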

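As for the edit: if the top bit is clear, a UTF-8 parser knows it is looking at a character represented in a single octet. If the top bit is set, the octet belongs to a multi-octet character. In the edited scheme, a trailing xxxxxxxx octet could have its top bit either way, so this test would no longer work.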
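
As a rough sketch of that classification, the function below sorts any octet into a category by looking at its leading bits alone. The name `classify` and the category labels are made up for this example. It checks bit patterns only, so a few octets that match a lead pattern but never occur in valid UTF-8 (such as C0 and C1) are still reported as lead octets here.

```python
def classify(octet: int) -> str:
    """Classify one octet by its leading bits, without any context."""
    if (octet & 0x80) == 0x00:
        return "single"        # 0xxxxxxx: complete one-octet character
    if (octet & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: middle of a character
    if (octet & 0xE0) == 0xC0:
        return "lead-2"        # 110xxxxx: starts a 2-octet character
    if (octet & 0xF0) == 0xE0:
        return "lead-3"        # 1110xxxx: starts a 3-octet character
    if (octet & 0xF8) == 0xF0:
        return "lead-4"        # 11110xxx: starts a 4-octet character
    return "invalid"           # 11111xxx: not used by UTF-8


for octet in "aé€".encode("utf-8"):
    print(f"{octet:08b} {classify(octet)}")
```

Because every octet falls into exactly one category, a parser never needs to look backwards to know where it stands.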

It is all about error recovery and easy classification of octets.