How do I conditionally recode to UTF-8?

Question 1

This thread is quite old, but I think I can help with this problem.
First, create a script named recodeifneeded:

#!/bin/bash
# Usage: recodeifneeded TARGET_ENCODING FILE
# Find the current encoding of the file
encoding=$(file -i "$2" | sed 's/.*charset=\(.*\)$/\1/')

if [ "$1" != "${encoding}" ]
then
    # Encodings differ, we have to recode (recode overwrites the file in place)
    echo "recoding from ${encoding} to $1 file : $2"
    recode "${encoding}..$1" "$2"
fi

You can use it like this:

recodeifneeded utf-8 file.txt

So, if you want to run it recursively and change the encoding of all *.txt files to (say) utf-8:

find . -name "*.txt" -exec recodeifneeded utf-8 {} \;
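
One thing to watch with the plain string comparison in the script: file reports encoding names such as utf-8 or us-ascii in lowercase, so a call like recodeifneeded UTF-8 file.txt would recode every time, and pure ASCII files (which are already valid UTF-8) would be recoded too. Below is a minimal sketch of a more forgiving comparison, written in Python and assuming the detected name comes from file; the function name needs_recode is hypothetical and not part of the original script.

```python
import codecs

def needs_recode(target: str, detected: str) -> bool:
    """Return True if a file detected as `detected` should be recoded to `target`."""
    try:
        # codecs.lookup() maps aliases such as "UTF-8", "utf8" and "utf-8" to one name
        target_name = codecs.lookup(target).name
        detected_name = codecs.lookup(detected).name
    except LookupError:
        # Names Python does not know (e.g. "binary", "unknown-8bit"): leave the file alone
        return False
    if target_name == detected_name:
        return False
    # ASCII bytes are already valid UTF-8, so recoding would change nothing
    if detected_name == "ascii" and target_name == "utf-8":
        return False
    return True
```

With this check, needs_recode("UTF-8", "utf-8") and needs_recode("utf-8", "us-ascii") are both false, while needs_recode("utf-8", "iso-8859-1") is true.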

I hope this helps.

Question 2

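This script, adapted from harrymc's idea, conditionally recodes a single file (based on the presence of certain UTF-8 encoded Scandinavian characters), and it seems to work well for me.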

$ cat recode-to-utf8.sh 

#!/bin/sh
# Recodes specified file to UTF-8, except if it seems to be UTF-8 already

# Count the lines that contain any of these characters encoded as UTF-8
# (the pattern is quoted so the shell does not treat it as a glob)
result=$(grep -c '[åäöÅÄÖ]' "$1")
if [ "$result" -eq 0 ]
then
    echo "Recoding $1 from ISO-8859-1 to UTF-8"
    recode ISO-8859-1..UTF-8 "$1" # overwrites file
else
    echo "$1 was already UTF-8 (probably); skipping it"
fi

(Batch processing is of course simple, e.g. for f in *txt; do recode-to-utf8.sh "$f"; done.)

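Note: this depends entirely on the script file itself being saved as UTF-8. It is obviously a very limited solution that suits the kind of files I happen to have; better answers that solve the problem in a more general way are welcome.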
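
The script depends on its own encoding because grep matches the bytes stored in the pattern. A minimal Python sketch of the same heuristic makes this explicit by building the UTF-8 byte sequences for the six characters. The function names are illustrative only, and ISO-8859-1 is assumed as the source encoding, as in the script above.

```python
# UTF-8 byte sequences of the characters the shell script greps for
UTF8_MARKERS = [ch.encode("utf-8") for ch in "åäöÅÄÖ"]

def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if any marker character appears UTF-8 encoded in the data."""
    return any(marker in data for marker in UTF8_MARKERS)

def recode_to_utf8(data: bytes) -> bytes:
    """Recode from ISO-8859-1 unless the data already seems to be UTF-8."""
    if looks_like_utf8(data):
        return data
    return data.decode("iso-8859-1").encode("utf-8")
```

A UTF-8 file that contains none of these six characters (for instance one whose only non-ASCII character is é) would still be recoded a second time, which is the limitation mentioned in the note.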

Question 3

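UTF-8 has strict rules about which byte sequences are valid. This means that if the data is valid as UTF-8, you will rarely be wrong if you assume that it is UTF-8.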

So you can do something like this (in Python):

def convert_to_utf8(data):
    """Return data (a byte string) as UTF-8, assuming ISO-8859-1 if it is not valid UTF-8."""
    try:
        data.decode('UTF-8')
        return data  # was already UTF-8
    except UnicodeError:
        return data.decode('ISO-8859-1').encode('UTF-8')

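In a shell script you can use iconv to perform the conversion, but you need a way to detect UTF-8. One way is to run iconv with UTF-8 as both the source and the target encoding. If the file is valid UTF-8, the output will be identical to the input.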
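
As a sketch of how the same validity check could be applied to files on disk (in Python rather than with iconv), the helper below rewrites a file only when its bytes are not valid UTF-8. The name recode_file_to_utf8 and the ISO-8859-1 fallback are assumptions carried over from the example above, not part of any library.

```python
from pathlib import Path

def recode_file_to_utf8(path, fallback="ISO-8859-1"):
    """Rewrite `path` as UTF-8 if needed; return True if the file was changed."""
    file = Path(path)
    data = file.read_bytes()
    try:
        data.decode("UTF-8")
        return False  # already valid UTF-8, leave the file untouched
    except UnicodeError:
        # Not valid UTF-8: assume the fallback encoding and convert in place
        file.write_bytes(data.decode(fallback).encode("UTF-8"))
        return True
```

Running it twice on the same file is safe: the second call finds valid UTF-8 and does nothing.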

Question 4

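I'm a bit late, but I have been struggling with the same problem... Now that I have found a good way to do it, I can't help sharing it :)

Although I'm an emacs user, today I'd recommend vim.

This simple command recodes your file to the desired encoding, whatever its contents are: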

vim +'set nobomb | set fenc=utf8 | x' <filename>

I have never found anything that gave me better results than this.

I hope it helps someone else.
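
Vim chooses the source encoding by trying the entries of its 'fileencodings' option in order, which is why the command copes with files in different encodings. A rough Python analogue of that try-in-order idea is sketched below. The candidate list is an assumption for illustration, not Vim's actual configuration, and "utf-8-sig" is used so that a leading BOM is dropped, similar in effect to nobomb.

```python
def to_utf8(data: bytes, candidates=("utf-8-sig", "iso-8859-1")) -> bytes:
    """Decode with the first candidate that fits, then encode as UTF-8 without a BOM."""
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeError:
            continue  # this candidate does not fit, try the next one
        return text.encode("utf-8")
    raise ValueError("none of the candidate encodings could decode the data")
```

Because ISO-8859-1 can decode any byte sequence, it only makes sense as the last candidate in the list.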

Answer