ASCII source file checker

Answer 1

You can use my Python 3 script, hosted on GitHub, to print all lines of a file that contain non-ASCII characters:

GitHub: ByteCommander/encoding-check

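You can either clone or download the whole repository, or simply save the file encoding-check and make it executable with chmod +x encoding-check.

Then you can run it like this, with the file to check as the only argument:

./encoding-check FILENAME if it is located in your current working directory, or...
/path/to/encoding-check FILENAME if it is located in /path/to/, or...
encoding-check FILENAME if it is located in a directory listed in the $PATH environment variable, for example /usr/local/bin or ~/bin.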

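Without any optional arguments, it prints every line in which it found non-ASCII characters, along with its line number. At the end there is a summary line telling you how many lines the file has in total and how many of them contain non-ASCII characters.

This method is guaranteed to decode all ASCII characters correctly and to detect everything that is definitely not ASCII.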
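
The idea behind such a check can be sketched in a few lines of Python. This is only an illustration, not the code of the encoding-check script: the function name find_undecodable_lines and the sample data are made up for this example. It tries to decode each line with the chosen codec and reports the lines where that fails.

```python
def find_undecodable_lines(data, encoding="ascii"):
    """Return (line_number, line_bytes) for every line that fails to decode."""
    hits = []
    for number, raw in enumerate(data.splitlines(), start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            hits.append((number, raw))
    return hits


# Hypothetical sample: the second line contains a UTF-8 encoded no-break space
sample = "plain text\nAppendix\u00a0F\nmore text\n".encode("utf-8")

for number, raw in find_undecodable_lines(sample):
    # Bytes that are not valid ASCII are shown as U+FFFD replacement characters
    print(f"{number:6}: {raw.decode('ascii', errors='replace')}")
```

Working on bytes and decoding strictly means that nothing is guessed: a line either decodes cleanly with the codec or it is reported.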

Here is an example run on a file containing the first 20 lines of the given install.en.txt:

$ ./encoding-check install-first20.en.txt
     9: Appendix��F, GNU General Public License.
    14: (codename "���Xenial Xerus���"), for the 64-bit PC ("amd64") architecture. It also
    18: ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
--------------------------------------------------------------------------------
20 lines in 'install-first20.en.txt', thereof 3 lines with non-ASCII characters.

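The script also has some additional arguments to adjust the encoding to check against and the output format. Have a look at the help and try them out: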

$ encoding-check -h
usage: encoding-check [-h] [-e ENCODING] [-s | -c | -l] [-m] [-w] [-n] [-f N]
                     [-t]
                     FILE [FILE ...]

Show all lines of a FILE containing characters that don't match the selected
ENCODING.

positional arguments:
  FILE                  the file to be examined

optional arguments:
  -h, --help            show this help message and exit
  -e ENCODING, --encoding ENCODING
                        file encoding to test (default 'ascii')
  -s, --summary         only print the summary
  -c, --count           only print the detected line count
  -l, --lines           only print the detected lines
  -m, --only-matching   hide files without matching lines from output
  -w, --no-warnings     hide warnings from output
  -n, --no-numbers      do not show line numbers in output
  -f N, --fit-width N   trim lines to N characters, or terminal width if N=0;
                        non-printable characters like tabs will be removed
  -t, --title           print title line above each file

For --encoding, every codec that Python 3 knows is valid. Just try one; in the worst case, you get a small error message...
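
If you would rather find out in advance whether Python knows a codec name, the standard codecs module can tell you. The sketch below is independent of the script (how encoding-check itself validates the name is not shown here), and the helper name is_known_codec is invented for this example.

```python
import codecs


def is_known_codec(name):
    """Return True if Python's codec registry can resolve this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # Raised for names that no registered codec search function recognises
        return False


for name in ("ascii", "utf-8", "latin-1", "no-such-codec"):
    print(name, is_known_codec(name))
```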

Answer 2

If you want to find non-ASCII characters, perhaps you should invert the search to exclude ASCII characters:

grep -Pn '[^\x00-\x7F]'

For example:

$ curl https://help.ubuntu.com/16.04/installation-guide/amd64/install.en.txt -s | grep -nP '[^\x00-\x7F]' | head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

Lines 9, 330, 337 and 359 contain Unicode non-breaking space characters.
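
Characters like these are invisible in most terminals, so it can help to name them explicitly. The following Python sketch is an addition for illustration (the function name describe_non_ascii is made up); it lists the position, code point and Unicode name of every non-ASCII character in a line.

```python
import unicodedata


def describe_non_ascii(line):
    """Return (column, code point, Unicode name) for each non-ASCII character."""
    return [
        (column, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for column, ch in enumerate(line, start=1)
        if ord(ch) > 0x7F
    ]


# Sample line modelled on line 9 of the file, with an explicit no-break space
for item in describe_non_ascii("Appendix\u00a0F, GNU General Public License."):
    print(item)
```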

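The particular output you get is probably due to grep's support for UTF-8. In a Unicode locale, some of these characters might compare equal to a normal ASCII character. Forcing the C locale will show the expected results in that case: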

$ LANG=C grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

$ LANG=en_GB.UTF-8 grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
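
One way to picture the difference between the two runs above is to compare byte-wise and character-wise matching of the same class. The Python sketch below is only an analogy (Python's re module is not grep, and the sample strings are made up from the output above): on raw bytes, [\x80-\xFF] matches any byte of a UTF-8 multi-byte sequence, whereas on decoded text it matches only the code points U+0080 to U+00FF.

```python
import re

BYTE_PATTERN = re.compile(rb"[\x80-\xFF]")  # any single byte 0x80-0xFF
TEXT_PATTERN = re.compile("[\x80-\xFF]")    # code points U+0080-U+00FF only

samples = {
    "no-break space": "Appendix\u00a0F",          # U+00A0
    "curly quote": "\u2018Xenial Xerus\u2019",    # U+2018, U+2019
    "bullet": "\u2022 Ubuntu",                    # U+2022
}

for label, text in samples.items():
    as_bytes = bool(BYTE_PATTERN.search(text.encode("utf-8")))
    as_text = bool(TEXT_PATTERN.search(text))
    print(f"{label:15} bytes: {as_bytes}  text: {as_text}")
```

In this sketch the no-break space is found both ways, while the curly quotes and the bullet are found only by the byte-wise match, which is consistent with lines 14, 18 and 368 being absent from the UTF-8 locale run.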

Answer 3

This Perl command mostly replaces that grep command (the one thing missing is the color):

perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' *.xml

-n: causes Perl to assume the following loop around your program, which makes it iterate over the filename arguments somewhat like sed -n or awk:
```
LINE:
  while (<>) {
      ...             # your program goes here
  }
```
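-e: may be used to enter one line of program.
/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_): if the line contains a character in the range \x80-\xFF, prints the name of the current file, the current line number, the string :\t^ and the content of the current line. Note that with several files $. keeps counting across them unless ARGV is closed at the end of each file.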
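
For comparison, roughly the same behaviour can be sketched in Python. This is an illustrative equivalent, not part of the original command; the function name format_high_byte_lines is made up. Files are read as bytes so that, like the Perl one-liner, the test looks at raw bytes rather than decoded characters.

```python
import re
import sys

HIGH_BYTE = re.compile(rb"[\x80-\xFF]")


def format_high_byte_lines(name, lines):
    """Yield 'name(lineno):<TAB>^line' as bytes for each line with a byte >= 0x80."""
    for number, raw in enumerate(lines, start=1):
        if HIGH_BYTE.search(raw):
            yield f"{name}({number}):\t^".encode() + raw


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 highbytes.py *.xml
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            sys.stdout.buffer.writelines(format_high_byte_lines(path, handle))
```

Unlike $. in the Perl version, the line counter in this sketch restarts at 1 for every file.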

Output on a sample directory containing the sample file from the question and a file containing only ààààà and a newline character:

% perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' file | head -n 10
file(9):    ^AppendixÂ F, GNU General Public License.
file(14):   ^(codename "â€˜Xenial Xerusâ€™"), for the 64-bit PC ("amd64") architecture. It also
file(18):   ^â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”
file(330):  ^when things go wrong. The Installation Howto can be found in AppendixÂ A, 
file(337):  ^ChapterÂ 1.Â Welcome to Ubuntu
file(359):  ^1.1.Â What is Ubuntu?
file(368):  ^  â€¢ Ubuntu will always be free of charge, and there is no extra fee for the "
file(372):  ^  â€¢ Ubuntu includes the very best in translations and accessibility
file(376):  ^  â€¢ Ubuntu is shipped in stable and regular release cycles; a new release will
file(380):  ^  â€¢ Ubuntu is entirely committed to the principles of open source software
% perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' file1
file1(1):   ^ààààà
