How can I automatically detect a text file's encoding?

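Answer 1

Try the chardet Python module, which is available on PyPI:

pip install chardet

Then run chardetect myfile.txt.

Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided the input text is long enough for statistical analysis. Do read the project documentation.

As mentioned in the comments, it is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version somewhere.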
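To capture just the encoding name in a script, the report line can be parsed. This is a minimal sketch, assuming chardetect prints one line per file shaped like "myfile.txt: utf-8 with confidence 0.99" (and "myfile.txt: no result" when it cannot decide). That format is an assumption worth checking against your installed version, and the function name is made up for illustration:

```shell
# Hypothetical helper: extract the encoding name from one chardetect report line.
# Assumed line format: "<name>: <encoding> with confidence <score>".
chardetect_encoding() {
  cd_line=$1
  cd_rest=${cd_line##*: }            # drop everything up to the last ": "
  case $cd_rest in
    *" with confidence "*)
      # keep only the text before " with confidence ..."
      printf '%s\n' "${cd_rest%% with confidence *}"
      ;;
    *)
      return 1                       # e.g. "no result": nothing to extract
      ;;
  esac
}

# Usage (requires chardet to be installed):
#   encoding=$(chardetect_encoding "$(chardetect myfile.txt)")
```

The function returns a non-zero status when the line carries no encoding, so a calling script can fall back to another detector.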

Answer 2

I would use this simple command:

encoding=$(file -bi myfile.txt)

Or, if you want just the actual character set (like utf-8):

encoding=$(file -b --mime-encoding myfile.txt)
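Once the character set is known, a common next step is converting the file to UTF-8. A rough sketch, assuming iconv is installed and recognises the name that file reports; the function name to_utf8 and the choice to refuse "binary" and "unknown-8bit" results are illustrative assumptions, not part of the commands above:

```shell
# Hypothetical helper: convert SRC to UTF-8 at DST using the charset that
# `file` reports. Assumes iconv is installed and knows that charset name.
to_utf8() {
  tu_src=$1
  tu_dst=$2
  tu_enc=$(file -b --mime-encoding "$tu_src") || return 1
  case $tu_enc in
    binary|unknown-8bit|"")
      # `file` could not identify a text encoding; refuse to guess.
      echo "cannot convert $tu_src: detected '$tu_enc'" >&2
      return 1
      ;;
  esac
  # Write the converted text to DST; the source file is left untouched.
  iconv -f "$tu_enc" -t UTF-8 "$tu_src" > "$tu_dst"
}

# Usage:
#   to_utf8 myfile.txt myfile.utf8.txt
```

Writing to a separate destination keeps the original around in case the detection was wrong, which can happen with short files.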

Answer 3

On Debian-based Linux, the uchardet package (Debian/Ubuntu) provides a command-line tool. See the package description below:

 universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is a encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
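
In a script, the tool can be wrapped so it degrades gracefully on machines where it is missing. A small sketch, assuming that "uchardet myfile.txt" prints only the encoding name for a single file; the function name and the fallback to file are illustrative choices, not something the package provides:

```shell
# Hypothetical helper: print the detected encoding of a file, preferring
# uchardet and falling back to `file` when uchardet is not installed.
# On Debian/Ubuntu the tool comes from the uchardet package:
#   sudo apt-get install uchardet
detect_encoding() {
  if command -v uchardet >/dev/null 2>&1; then
    uchardet "$1"
  elif command -v file >/dev/null 2>&1; then
    file -b --mime-encoding "$1"
  else
    echo "no encoding detector found" >&2
    return 127
  fi
}

# Usage:
#   encoding=$(detect_encoding myfile.txt)
```

The two tools may spell the same encoding differently (for example in upper or lower case), so compare the result case-insensitively if a script branches on it.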

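Answer 4

Regular Emacs users might find the following useful (it lets you inspect and validate the conversion manually).

I also often find that Emacs's character-set auto-detection is much more efficient than other auto-detection tools (such as chardet).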

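;; Files to open; `file-truename' resolves each path to its true absolute name.
(setq paths (mapcar 'file-truename '(
 "path/to/file1"
 "path/to/file2"
 "path/to/file3"
)))

;; Visit each file and mark its buffer to be saved as UTF-8 with Unix
;; line endings. The file on disk only changes once the buffer is saved.
(dolist (path paths)
  (find-file path)
  (set-buffer-file-coding-system 'utf-8-unix))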

Then simply call Emacs with this script as an argument (see the "-l" option) to do the job.
