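For the official Ubuntu documentation, whose English source files are in DocBook XML, there is a requirement to use only ASCII characters. We use a "checker" command line (see here) to verify this: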
grep --color='auto' -P -n "[\x80-\xFF]" *.xml
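However, the command has a flaw: on some computers, though apparently not all, it misses some lines that contain non-ASCII characters, which can result in a false OK.
Does anyone have a better suggestion for an ASCII checker command line?
Anyone interested might consider using this file (a text file, not a DocBook XML file) as a test case. The first three lines with non-ASCII characters are lines 9, 14 and 18. Lines 14 and 18 are missed by the check: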
$ grep --color='auto' -P -n "[\x80-\xFF]" install.en.txt | head -13
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
520:1.2.1.4. Freedom and Philosophy
534:1.2.1.5. Ubuntu and other Debian derivatives
555:1.3. What is GNU/Linux?
Answer 1
You can print all non-ASCII lines of a file using a Python 3 script that I host on GitHub:
GitHub: ByteCommander/encoding-check
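You can either clone or download the whole repository, or simply save the file encoding-check and make it executable using chmod +x encoding-check.
Then you can run it like this, with the file to check as the only argument: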
./encoding-check FILENAME
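if it is located in your current working directory, or...
/path/to/encoding-check FILENAME
if it is located in /path/to/, or...
encoding-check FILENAME
if it is located in a directory that is part of the $PATH environment variable, for example /usr/local/bin or ~/bin.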
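Without any optional arguments, it prints every line in which it found a non-ASCII character, together with its line number. At the end there is a summary line that tells you how many lines the file has in total and how many of them contain non-ASCII characters.
This method is guaranteed to decode all ASCII characters correctly and to detect everything that is definitely not ASCII.
Here is an example run on a file containing the first 20 lines of the given install.en.txt: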
$ ./encoding-check install-first20.en.txt
9: Appendix��F, GNU General Public License.
14: (codename "���Xenial Xerus���"), for the 64-bit PC ("amd64") architecture. It also
18: ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
--------------------------------------------------------------------------------
20 lines in 'install-first20.en.txt', thereof 3 lines with non-ASCII characters.
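The behaviour described above (print each offending line with its number, then a summary) can be approximated in a few lines. The following is only a minimal sketch and not the source of encoding-check; the function names `non_ascii_lines` and `report` are made up for illustration. It reads the file as bytes, so no locale is involved, and treats any byte above 0x7F as non-ASCII.

```python
import sys


def non_ascii_lines(path):
    """Return (total_line_count, [(line_number, raw_line_bytes), ...])."""
    hits = []
    total = 0
    with open(path, "rb") as handle:  # bytes, so the locale plays no role
        for total, raw in enumerate(handle, start=1):
            if any(byte > 0x7F for byte in raw):  # outside 7-bit ASCII
                hits.append((total, raw))
    return total, hits


def report(path):
    """Print the offending lines and a summary line for one file."""
    total, hits = non_ascii_lines(path)
    for number, raw in hits:
        # Each non-ASCII byte is shown as one U+FFFD replacement character.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        print(f"{number}: {text}")
    print(f"{total} lines in '{path}', thereof {len(hits)} lines "
          "with non-ASCII characters.")


if __name__ == "__main__":
    for name in sys.argv[1:]:
        report(name)
```

Saved under a hypothetical name such as ascii_lines.py, it could be run as python3 ascii_lines.py install.en.txt.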
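The encoding-check script also has some additional arguments to adjust the encoding being checked and the output format. Have a look at the help and try them out: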
$ encoding-check -h
usage: encoding-check [-h] [-e ENCODING] [-s | -c | -l] [-m] [-w] [-n] [-f N]
                      [-t]
                      FILE [FILE ...]
Show all lines of a FILE containing characters that don't match the selected
ENCODING.
positional arguments:
  FILE                  the file to be examined
optional arguments:
  -h, --help            show this help message and exit
  -e ENCODING, --encoding ENCODING
                        file encoding to test (default 'ascii')
  -s, --summary         only print the summary
  -c, --count           only print the detected line count
  -l, --lines           only print the detected lines
  -m, --only-matching   hide files without matching lines from output
  -w, --no-warnings     hide warnings from output
  -n, --no-numbers      do not show line numbers in output
  -f N, --fit-width N   trim lines to N characters, or terminal width if N=0;
                        non-printable characters like tabs will be removed
  -t, --title           print title line above each file
For --encoding, every codec that Python 3 knows is valid. Just try one; in the worst case you will get a small error message...
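To find out in advance whether Python knows a codec name, the standard `codecs` module can be asked directly. This is a small sketch, independent of encoding-check; the helper names `is_known_codec` and `fits_encoding` are invented here, and how the script itself validates its --encoding argument is not shown in this answer.

```python
import codecs


def is_known_codec(name):
    """Return True if this Python installation can look up the codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def fits_encoding(raw, encoding="ascii"):
    """Return True if the bytes decode cleanly with the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


for candidate in ("ascii", "utf-8", "latin-1", "no-such-codec"):
    print(candidate, is_known_codec(candidate))
```

The last name is deliberately bogus, to show what an unknown codec looks like.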
Answer 2
If you want to look for non-ASCII characters, perhaps you should invert the search to exclude ASCII characters:
grep -Pn '[^\x00-\x7F]'
For example:
$ curl https://help.ubuntu.com/16.04/installation-guide/amd64/install.en.txt -s | grep -nP '[^\x00-\x7F]' | head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368: • Ubuntu will always be free of charge, and there is no extra fee for the "
372: • Ubuntu includes the very best in translations and accessibility
376: • Ubuntu is shipped in stable and regular release cycles; a new release will
380: • Ubuntu is entirely committed to the principles of open source software
In lines 9, 330, 337 and 359, Unicode non-breaking space characters are present.
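Because a non-breaking space looks exactly like an ordinary space in the output, it can help to have the offending characters named. The following sketch uses Python's `unicodedata` module; the helper name `describe_non_ascii` is made up, and the sample string assumes the non-breaking space in line 9 sits between "Appendix" and "F".

```python
import unicodedata


def describe_non_ascii(text):
    """Return (column, code point, Unicode name) for each non-ASCII character."""
    found = []
    for column, char in enumerate(text, start=1):
        if ord(char) > 0x7F:
            name = unicodedata.name(char, "<unnamed>")
            found.append((column, f"U+{ord(char):04X}", name))
    return found


# Assumed content of line 9, with U+00A0 between "Appendix" and "F".
sample = "Appendix\u00a0F, GNU General Public License."
for entry in describe_non_ascii(sample):
    print(entry)  # (9, 'U+00A0', 'NO-BREAK SPACE')
```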
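The particular output you get may be due to grep's support for UTF-8. In a Unicode locale, some of those characters might compare equal to a normal ASCII character. In that case, forcing the C locale will show the expected results: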
$ LANG=C grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368: • Ubuntu will always be free of charge, and there is no extra fee for the "
372: • Ubuntu includes the very best in translations and accessibility
376: • Ubuntu is shipped in stable and regular release cycles; a new release will
380: • Ubuntu is entirely committed to the principles of open source software
$ LANG=en_GB.UTF-8 grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
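One possible reading of the difference between the two runs, offered as an assumption and not something the answer states: in the C locale the class [\x80-\xFF] is applied to raw bytes, whereas in a UTF-8 locale it may be applied to decoded characters, that is, to code points U+0080 through U+00FF. The sketch below illustrates that distinction with Python's `re` module, not with grep itself; the sample strings are shortened stand-ins for lines 9, 14 and 18.

```python
import re

BYTE_RANGE = re.compile(rb"[\x80-\xFF]")  # applied to raw UTF-8 bytes
CHAR_RANGE = re.compile(r"[\x80-\xFF]")   # applied to decoded code points
NOT_ASCII = re.compile(r"[^\x00-\x7F]")   # inverted search on code points

samples = {
    "no-break space": "Appendix\u00a0F",       # U+00A0, like line 9
    "curly quotes": "\u2018Xenial Xerus\u2019",  # U+2018/U+2019, like line 14
    "box drawing": "\u2501\u2501\u2501",        # U+2501, like line 18
}


def classify(text):
    """Return (byte-range hit, code-point-range hit, inverted-search hit)."""
    return (
        bool(BYTE_RANGE.search(text.encode("utf-8"))),
        bool(CHAR_RANGE.search(text)),
        bool(NOT_ASCII.search(text)),
    )


for label, text in samples.items():
    print(label, classify(text))
```

Under this reading, only the no-break space falls inside the code point range, which would match line 9 being reported while lines 14 and 18 are missed; the byte-level and inverted searches flag all three.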
Answer 3
This Perl command is pretty much a drop-in replacement for that grep command (the only thing missing is the colours):
perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' *.xml
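-n: causes Perl to assume the following loop around your program, which makes it iterate over the filename arguments somewhat like sed -n or awk:
LINE: while (<>) {
    ... # your program goes here
}
-e: may be used to enter one line of program.
/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_): if the line contains a character in the range \x80-\xFF, prints the name of the current file, the line number, the string :\t^ and the content of the current line. Note that $. keeps counting across files unless ARGV is closed at the end of each file, so when several files are given the number is relative to the first file's start.
Output on a sample directory containing the sample file from the question and a file containing only ààààà and a newline character: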
% perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' file | head -n 10
file(9): ^Appendix F, GNU General Public License.
file(14): ^(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
file(18): ^â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”
file(330): ^when things go wrong. The Installation Howto can be found in Appendix A,
file(337): ^Chapter 1. Welcome to Ubuntu
file(359): ^1.1. What is Ubuntu?
file(368): ^ • Ubuntu will always be free of charge, and there is no extra fee for the "
file(372): ^ • Ubuntu includes the very best in translations and accessibility
file(376): ^ • Ubuntu is shipped in stable and regular release cycles; a new release will
file(380): ^ • Ubuntu is entirely committed to the principles of open source software
% perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' file1
file1(1): ^ààààà
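For comparison, roughly the same loop can be written with Python's `fileinput` module, which also iterates over a list of files and additionally offers a per-file line counter. This is a sketch rather than a translation of the Perl one-liner: the function names `scan` and `main` are invented, and the output format merely imitates the one shown above.

```python
import fileinput
import os
import sys


def scan(paths):
    """Yield (filename, line number within that file, raw line) for each
    line that contains a byte above 0x7F."""
    if not paths:  # an empty list would make fileinput read stdin
        return
    with fileinput.FileInput(files=paths, mode="rb") as stream:
        for raw in stream:
            if any(byte > 0x7F for byte in raw):
                # filelineno() restarts at 1 for every file.
                yield stream.filename(), stream.filelineno(), raw


def main(paths):
    for name, number, raw in scan(paths):
        prefix = os.fsencode(name) + f"({number}):\t^".encode("ascii")
        sys.stdout.buffer.write(prefix + raw)  # bytes pass through unchanged


if __name__ == "__main__":
    main(sys.argv[1:])
```

Writing the raw bytes back out means the terminal, not the script, decides how the non-ASCII characters are displayed.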