How can I tell the character encoding of a filename on Linux?

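Answer 1

There is no 100% accurate way to do this, but there is a way to make a good guess.

There is a Python library, chardet, available here: https://pypi.python.org/pypi/chardet

For example:

Check the current setting of the LANG variable: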

$ echo $LANG
en_IE.UTF-8

Create a file whose name needs UTF-8 to be encoded:

$ touch mÉ.txt

Change the locale and see what happens when we try to list the file:

$ ls m*
mÉ.txt
$ export LANG=C
$ ls m*
m??.txt

OK, so now we have a filename encoded in UTF-8, and our current locale is C (the standard Unix codepage).
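
To see why ls prints two question marks for a single character, it may help to look at the raw bytes behind the name. The following is a minimal Python 3 sketch; the variable names are illustrative and not part of the original answer. It shows that É takes two bytes in UTF-8, and in the C locale ls presumably shows each of those non-ASCII bytes as a ?.

```python
# Python 3 sketch: inspect the bytes behind the filename from the example.
name = "m\u00c9.txt"  # "mÉ.txt"

utf8_bytes = name.encode("utf-8")      # the form a UTF-8 locale writes to disk
latin1_bytes = name.encode("latin-1")  # the same name in ISO-8859-1, for comparison

print(utf8_bytes)    # b'm\xc3\x89.txt'
print(latin1_bytes)  # b'm\xc9.txt'

# Count the bytes outside the ASCII range in the UTF-8 form.
non_ascii = [b for b in utf8_bytes if b > 0x7F]
print(len(non_ascii))  # 2
```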

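So start up python, import chardet, and have it read the filename. I'm using shell globbing (i.e. expansion through the * wildcard) to get my file. Change "ls m*" to whatever matches one of your example files.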

>>> import chardet
>>> import os
>>> chardet.detect(os.popen("ls m*").read())
{'confidence': 0.505, 'encoding': 'utf-8'}

As you can see, it is only a guess. The "confidence" value shows how sure chardet is of that guess.

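Answer 2

You may find this useful for testing the current working directory (Python 2.7):

import chardet
import os

# In Python 2.7, os.listdir('.') returns byte strings, which chardet.detect() accepts
for n in os.listdir('.'):
    guess = chardet.detect(n)  # run the detector once per name
    print '%s => %s (%s)' % (n, guess['encoding'], guess['confidence'])

The output looks like this: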

Vorlagen => ascii (1.0)
examples.desktop => ascii (1.0)
Öffentlich => ISO-8859-2 (0.755682154041)
Videos => ascii (1.0)
.bash_history => ascii (1.0)
Arbeitsfläche => EUC-KR (0.99)

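To recurse from the current directory, cut and paste this into a small Python script:

#!/usr/bin/python

import chardet
import os

# Walk the tree below the current directory (Python 2.7)
for root, dirs, names in os.walk('.'):
    print root
    for n in names:
        guess = chardet.detect(n)  # run the detector once per name
        print '%s => %s (%s)' % (n, guess['encoding'], guess['confidence'])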
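
The sample output above suggests that short names can mislead a statistical detector: EUC-KR at 0.99 confidence is an unlikely result for a German directory name such as Arbeitsfläche. One possible sanity check is to try a strict UTF-8 decode before trusting the guess, since non-ASCII text in a legacy single-byte encoding rarely happens to be valid UTF-8. This is a different technique from chardet and is offered only as a Python 3 sketch; the function name classify_name is hypothetical.

```python
def classify_name(raw: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a filename given as bytes."""
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        # A strict decode raises on any byte sequence that is not valid UTF-8.
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8: this is where a detector such as chardet could take over.
        return "unknown"


print(classify_name(b"Videos"))                                   # ascii
print(classify_name("Arbeitsfl\u00e4che".encode("utf-8")))        # utf-8
print(classify_name("\u00d6ffentlich".encode("latin-1")))         # unknown
```

A name that passes the strict decode is not guaranteed to be UTF-8, but a failure is a reliable sign that it is not.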

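Answer 3

Landing here in 2021 with Python 3, I found the answers by @philip-reynolds and @klaus-kappel useful, but they no longer work, because chardet.detect() requires a bytes-like object. I edited the code slightly to get the encoding of all the files in the current working directory, as follows:

import os
import chardet

for n in os.listdir('.'):
    # os.fsencode() turns the str name back into the bytes stored on disk
    print(n, chardet.detect(os.fsencode(n)))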
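
The recursive script from the earlier answer can be adapted to Python 3 in a similar way. The sketch below is an assumption-laden illustration rather than code from any of the answers: the function name guess_encodings and its detect parameter are hypothetical, and chardet is imported lazily so that another detector can be passed in. It relies on os.walk() yielding bytes names when given a bytes path, which avoids the round trip through os.fsencode() for each name.

```python
import os


def guess_encodings(top=".", detect=None):
    """Yield (raw_name, guess) for every file below `top`.

    `detect` is any callable that takes bytes and returns a dict;
    it defaults to chardet.detect (hypothetical hook, not a chardet feature).
    """
    if detect is None:
        import chardet  # third-party package, installed separately
        detect = chardet.detect
    # A bytes path makes os.walk() yield directory and file names as raw bytes.
    for root, dirs, names in os.walk(os.fsencode(top)):
        for raw in names:
            yield raw, detect(raw)
```

To print the guesses for the current directory tree, loop over guess_encodings() and print each raw name together with guess['encoding'] and guess['confidence'].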
