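My day-to-day work involves English, French, and Spanish, so when I save personal copies of web pages or other files, the full character range of those languages shows up in both file names and file contents.
When I scan the file names on a partition (using find) for various purposes (cleaning up spurious characters, statistical reports, archive review), I get the following message: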
Invalid multibyte data detected. There may be a mismatch between your data and your locale.
Here are a few lines from a run that actually reports the condition:
Fri 18 Nov 2022 06:51:33 PM EST Creating sorted list ... '/DB001_F4/0-DB001_F4-20221118185030.files' ...
Extracting type jpg ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20221118185030.details/0-DB001_F4-20221118185030.files_jpg FNR=303) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
Extracting type htm ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20221118185030.details/0-DB001_F4-20221118185030.files_htm FNR=325) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
Extracting type txt ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20221118185030.details/0-DB001_F4-20221118185030.files_txt FNR=322) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
Extracting type c ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20221118185030.details/0-DB001_F4-20221118185030.files_c FNR=1006) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
Extracting type url ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20221118185030.details/0-DB001_F4-20221118185030.files_url FNR=288) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
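Each warning carries an FNR value, which is awk's record number within the named file, so the reported line can be pulled out and its raw bytes inspected. The following is only a minimal sketch, assuming POSIX sed and od are available; show_line_bytes is a hypothetical helper name, not part of any existing tool.

```shell
# Hypothetical helper: dump line LINENO of FILE as hex bytes, so that
# bytes which are not valid UTF-8 (for example a lone 0x82) stand out.
show_line_bytes() {
    file=$1
    lineno=$2
    # LC_ALL=C makes sed treat the input as plain bytes, not multibyte text.
    LC_ALL=C sed -n "${lineno}p" "$file" | od -An -tx1
}

# Usage, with the file name and FNR taken from the first warning above:
# show_line_bytes 0-DB001_F4-20221118185030.files_jpg 303
```

A correctly encoded "é" appears as the two bytes c3 a9 in UTF-8; a single byte of 0x80 or above standing on its own is the kind of data the warning is about.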
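When I dug further and inspected the generated list in Vim, I saw something like this:
This shows how it renders a character from the extended ASCII set. Instead of "<82>", it should display "é", a character that is everywhere in French.
When I use the following command to display the problem characters by byte range,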
grep --color='auto' -P "[\x80-\xFF]" *files_htm
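what I get is not the same garbage that comes out of "find", but a report in which the characters are displayed correctly.
Since the message complains about a mismatch with my locale (which is UTF-8, see the bottom), I am considering switching my Ubuntu MATE 20.04 desktop environment from UTF-8 to UTF-16, so that these characters are handled correctly in every context on my desktop computer.
a) How do I do that post-install, for my current distro?
b) How do I do that post-install, for my future distro (22.04)?
c) If possible, how do I do that pre-install (or during installation), so that the distro is installed with UTF-16 from the start?
The output of my locale command is as follows: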
root@OasisMega1:/DB001_F4/0-DB001_F4-20221118185030.details# locale
LANG=en_CA.UTF-8
LANGUAGE=en_CA:en
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=
root@OasisMega1:/DB001_F4/0-DB001_F4-20221118185030.details#
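Answer 1
UTF-16 is not ASCII-compatible, and file names, shells, and tools such as find and awk all expect ASCII-compatible byte strings. You probably should not do this.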
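The awk warning points at data that is not valid UTF-8 rather than at a wrong locale, so one alternative is to find the offending entries and re-encode them. What follows is a rough sketch, not a definitive fix. It assumes iconv is installed, and it assumes the stray bytes come from a DOS-style code page such as CP850, where byte 0x82 is "é"; that is a guess based on the single "<82>" shown in the question. The function names are hypothetical.

```shell
# Succeeds only if standard input is well-formed UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Print the line numbers of FILE whose content is not valid UTF-8.
list_invalid_utf8_lines() {
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        printf '%s' "$line" | is_valid_utf8 || printf '%s\n' "$n"
    done < "$1"
}

# Convert standard input to UTF-8, ASSUMING it is CP850.
# Only apply this to lines already known to be invalid UTF-8;
# running it on valid UTF-8 text would corrupt it.
cp850_to_utf8() {
    iconv -f CP850 -t UTF-8
}

# Usage on one of the lists from the question:
# list_invalid_utf8_lines 0-DB001_F4-20221118185030.files_htm
```

If the converted names look wrong, the source encoding is probably something other than CP850, and the -f argument would need to change; the locale itself can stay at en_CA.UTF-8.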