Converting a "binary" encoding that head and Notepad can read to UTF-8

Answer 1

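"binary" isn't an encoding (character-set name). iconv needs an encoding name to do its job.

The file utility doesn't give useful information when it doesn't recognize the file format. The file could be UTF-16 without a byte order mark (BOM), for example. notepad reads that. The same applies to UTF-8 (and head would display that, since your terminal is probably set to UTF-8 encoding and doesn't care about a BOM).

If the file is UTF-16, your terminal would still display it with head, because most of the characters would be ASCII (or even Latin-1), which makes the "other" byte of each UTF-16 character a null.

In either case, the lack of a BOM will (depending on the version of file) confuse it. Other programs may still cope, because these file formats are used on Microsoft Windows as well as by portable applications that can run on Windows.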
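Since so much hinges on whether a BOM is present, here is a minimal sketch of checking for one by dumping the first few bytes. The file name bom-sample.txt and the sample contents are made up for illustration; head -c is assumed to be available (it is not strictly POSIX, but GNU, BSD and BusyBox all provide it).

```shell
# Write a tiny UTF-16LE sample that starts with a BOM (bytes FF FE).
# bom-sample.txt is a hypothetical file name used only for this sketch.
printf '\377\376T\000h\000u\000\n\000' > bom-sample.txt

# Dump the first four bytes as one hex string, e.g. "fffe5400".
first_bytes=$(head -c 4 bom-sample.txt | od -An -tx1 | tr -d ' \n')
echo "$first_bytes"

# Compare against the well-known BOM byte sequences.
# The UTF-32 patterns come first because a UTF-32LE BOM also starts with ff fe.
case "$first_bytes" in
  fffe0000) echo "UTF-32LE BOM" ;;
  0000feff) echo "UTF-32BE BOM" ;;
  fffe*)    echo "UTF-16LE BOM" ;;
  feff*)    echo "UTF-16BE BOM" ;;
  efbbbf*)  echo "UTF-8 BOM" ;;
  *)        echo "no BOM found" ;;
esac
```

If this reports no BOM, the file may still be UTF-16 or UTF-8; it just means you have to work out the encoding some other way, such as the dump shown further down.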

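To convert the file to UTF-8, you have to know which encoding it uses and what iconv calls that encoding. If it is already UTF-8, adding a BOM (at the beginning) is optional. UTF-16 comes in two flavors, depending on which byte comes first. You could even have UTF-32. iconv -l lists these: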

ISO-10646/UTF-8/
ISO-10646/UTF8/
UTF-7//
UTF-8//
UTF-16//
UTF-16BE//
UTF-16LE//
UTF-32//
UTF-32BE//
UTF-32LE//
UTF7//
UTF8//
UTF16//
UTF16BE//
UTF16LE//
UTF32//
UTF32BE//
UTF32LE//

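"LE" and "BE" refer to little-endian and big-endian byte order. Windows uses the "LE" flavors, and iconv probably assumes that for the names lacking "LE" or "BE".

You can see this using an octal (sic) dump: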

$ od -bc big-end
0000000 000 124 000 150 000 165 000 040 000 101 000 165 000 147 000 040
         \0   T  \0   h  \0   u  \0      \0   A  \0   u  \0   g  \0    
0000020 000 061 000 070 000 040 000 060 000 065 000 072 000 060 000 061
         \0   1  \0   8  \0      \0   0  \0   5  \0   :  \0   0  \0   1
0000040 000 072 000 065 000 067 000 040 000 105 000 104 000 124 000 040
         \0   :  \0   5  \0   7  \0      \0   E  \0   D  \0   T  \0    
0000060 000 062 000 060 000 061 000 066 000 012
         \0   2  \0   0  \0   1  \0   6  \0  \n
0000072

$ od -bc little-end
0000000 124 000 150 000 165 000 040 000 101 000 165 000 147 000 040 000
          T  \0   h  \0   u  \0      \0   A  \0   u  \0   g  \0      \0
0000020 061 000 070 000 040 000 060 000 065 000 072 000 060 000 061 000
          1  \0   8  \0      \0   0  \0   5  \0   :  \0   0  \0   1  \0
0000040 072 000 065 000 067 000 040 000 105 000 104 000 124 000 040 000
          :  \0   5  \0   7  \0      \0   E  \0   D  \0   T  \0      \0
0000060 062 000 060 000 061 000 066 000 012 000
          2  \0   0  \0   1  \0   6  \0  \n  \0
0000072
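To reproduce dumps like these yourself, something along the following lines should work. It is only a sketch: the text "Thu" is a shortened stand-in for the date string above, and it uses a hex dump instead of octal to keep the output short.

```shell
# Encode the same short text in both byte orders and dump the bytes as hex.
# Naming UTF-16BE or UTF-16LE explicitly means iconv writes no BOM.
be=$(printf 'Thu\n' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')
le=$(printf 'Thu\n' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

# Big-endian puts the zero byte first; little-endian puts it second.
echo "UTF-16BE: $be"
echo "UTF-16LE: $le"
```

Each character takes two bytes either way; only the position of the zero byte changes, which is what the two od listings show.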

Assuming UTF-16LE, you could convert using

iconv -f UTF-16LE// -t UTF-8// <input >output
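As a quick sanity check of that conversion, here is a small round trip on a made-up sample. The contents ("Thé" plus a newline) are an assumption chosen so that one character is outside ASCII; the encoding names are written without the trailing slashes, which iconv also accepts.

```shell
# Build a small UTF-16LE file without a BOM: "Th", then é (U+00E9), then newline.
printf 'T\000h\000\351\000\n\000' > input

# Convert it the same way as the command above.
iconv -f UTF-16LE -t UTF-8 < input > output

# Dump the result: the NUL bytes are gone and é has become the
# two-byte UTF-8 sequence c3 a9.
od -An -tx1 output
```

The point of going through iconv rather than just dropping the zero bytes is visible in the last two bytes before the newline: the non-ASCII character is re-encoded, not merely stripped of its padding.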

Answer 2

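strings (from binutils) succeeds in "print[ing] the strings of printable characters in files" when both iconv and recode fail, with file still reporting the content as binary data: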

$ file -i /tmp/textFile
/tmp/textFile: application/octet-stream; charset=binary

$ chardetect /tmp/textFile
/tmp/textFile: utf-8 with confidence 0.99

$ iconv -f utf-8 -t utf-8 /tmp/textFile -o /tmp/textFile.iconv
$ file -i /tmp/textFile.iconv
/tmp/textFile.iconv: application/octet-stream; charset=binary

$ cp /tmp/textFile /tmp/textFile.recode ; recode utf-8 /tmp/textFile.recode
$ file -i /tmp/textFile.recode 
/tmp/textFile.recode: application/octet-stream; charset=binary

$ strings /tmp/textFile > /tmp/textFile.strings
$ file -i /tmp/textFile.strings
/tmp/textFile.strings: text/plain; charset=us-ascii
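The transcript doesn't show why file keeps calling the content binary. One plausible cause, and this is purely an assumption, is a handful of stray control bytes in otherwise valid UTF-8 text. The sketch below imitates that with a made-up sample (sample.bin), counts the control bytes, and removes them with tr instead of strings, so multi-byte UTF-8 characters survive.

```shell
# Hypothetical sample: UTF-8 text with three stray control bytes mixed in
# (NUL, 0x01, 0x02). The real /tmp/textFile was not shown, so this only
# imitates one way such a file could look.
printf 'caf\303\251 \000\001 au lait\002\n' > sample.bin

# Count C0 control bytes other than tab, newline and carriage return.
# LC_ALL=C makes tr work on raw bytes regardless of the current locale.
controls=$(LC_ALL=C tr -cd '\000-\010\013\014\016-\037' < sample.bin | wc -c | tr -d ' ')
echo "control bytes: $controls"

# Delete those bytes and keep everything else, including the bytes of é.
LC_ALL=C tr -d '\000-\010\013\014\016-\037' < sample.bin > sample.txt
cat sample.txt
```

Unlike strings, which in the transcript above produced us-ascii output, this keeps the non-ASCII characters, at the cost of assuming the control bytes really are junk.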

Answer 3

chardet (https://pypi.python.org/pypi/chardet) can be used to determine the encoding of your text; you can then convert from that to whatever encoding you need.

pip install chardet
chardetect /my/path/to/file

When file -i prints

application/octet-stream; charset=binary

chardet correctly detects

ascii with confidence 1.0
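To go from detection to conversion, the encoding name has to be pulled out of the chardetect line. A rough sketch follows; parse_encoding is a hypothetical helper (not part of chardet), it assumes the "path: encoding with confidence N" output shape shown in these answers, and it uses a canned line so it runs without chardet installed. Whether iconv accepts every name chardet reports is not guaranteed.

```shell
# Extract the encoding name from a line of the form
# "<path>: <encoding> with confidence <n>".
# parse_encoding is a hypothetical helper, not part of chardet.
parse_encoding() {
  sed -E 's/^.*: ([^ ]+) with confidence .*$/\1/'
}

# A canned line stands in for real chardetect output.
enc=$(printf '%s\n' '/tmp/textFile: utf-8 with confidence 0.99' | parse_encoding)
echo "$enc"

# With chardet installed, the same helper could feed iconv
# (the paths are placeholders):
#   enc=$(chardetect /my/path/to/file | parse_encoding)
#   iconv -f "$enc" -t UTF-8 /my/path/to/file > /my/path/to/file.utf8
```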

Answer 4

First, check the file's MIME type with these commands:

file -b --mime-type <yourfile>
file -b <yourfile>

Once you see application/octet-stream, run this command: cat <yourfile> | tr -d '\0' > <yournewfile>
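Here is a small sketch of what that tr command does to UTF-16 text, on made-up samples (utf16-sample and stripped are hypothetical file names). It also shows the limit of the approach: it only gives clean results when everything other than the NUL bytes is plain ASCII.

```shell
# Hypothetical UTF-16LE sample (no BOM): "Thu" plus a newline.
printf 'T\000h\000u\000\n\000' > utf16-sample

# Same idea as the pipeline above, minus the extra cat: delete every NUL byte.
tr -d '\000' < utf16-sample > stripped
cat stripped

# Caveat check: é (U+00E9) is the byte pair e9 00 in UTF-16LE. Stripping
# the NUL leaves the lone byte e9, which is Latin-1 rather than UTF-8
# (UTF-8 would be c3 a9).
leftover=$(printf '\351\000' | tr -d '\000' | od -An -tx1 | tr -d ' \n')
echo "$leftover"
```

So for files that contain anything beyond ASCII, a real conversion with iconv is the safer route; deleting NULs is a shortcut for the pure-ASCII case.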
