Fixing a character encoding mess

Answer 1

Depending on what you mean by "not by hand", iconv may be useful for your task.

iconv - convert text from one character encoding to another

Options

   -f from-encoding, --from-code=from-encoding
          Use from-encoding for input characters.

   -t to-encoding, --to-code=to-encoding
          Use to-encoding for output characters.

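In my experience, iconv works even when you have to deal with wrongly declared encodings. For example, you can tell iconv that the input data is UTF-8 even though it is really ISO-8859, so that iconv behaves as if the input were UTF-8. This lets you repair data that was encoded incorrectly.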
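
A common instance of this kind of damage is double-encoded text, where UTF-8 bytes were read as ISO-8859-1 and then saved as UTF-8 again, so that "ó" shows up as "Ã³". The sketch below assumes exactly that damage (an assumption for illustration, not something stated above) and undoes one round of it by decoding the data as UTF-8 and writing it back out as ISO-8859-1. The function name fix_double_utf8 is made up for this example.

```shell
# Undo one round of "UTF-8 bytes misread as ISO-8859-1 and saved as UTF-8".
# Reads from stdin and writes to stdout, so it can sit in a pipeline.
# fix_double_utf8 is a hypothetical helper name.
fix_double_utf8() {
  iconv -f UTF-8 -t ISO-8859-1
}

# "Liberación" after one round of double encoding. The bytes are written
# as octal escapes so the example does not depend on the editor's encoding.
printf 'Liberaci\303\203\302\263n\n' | fix_double_utf8
# prints "Liberación" on a UTF-8 terminal
```

If the text went through the same mistake more than once, the filter would have to be applied once per round.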

Since iconv can be used as a filter, you can chain it with something like curl. Chaining it with wget should also work when you use --output-document -.
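
A minimal sketch of that chaining follows. The function names are hypothetical, and the source encoding has to be supplied by the caller because iconv will not guess it.

```shell
# fetch_as_utf8 URL FROM_ENCODING
# Download URL with curl and convert the body to UTF-8 on the fly.
# (hypothetical helper name)
fetch_as_utf8() {
  curl --silent --fail "$1" | iconv -f "$2" -t UTF-8
}

# fetch_as_utf8_wget URL FROM_ENCODING
# Same idea with wget; "--output-document -" sends the download to stdout.
fetch_as_utf8_wget() {
  wget --quiet --output-document - "$1" | iconv -f "$2" -t UTF-8
}

# Example call (placeholder URL and encoding):
#   fetch_as_utf8 'https://example.com/page.html' ISO-8859-1 > page.utf8.html
```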

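As far as I know, iconv cannot detect or guess the correct input encoding. Depending on how messed up your input data is, a fix may be "impossible" if the website has (too many) different kinds of wrong or mixed encodings. If the whole site is messed up in the same way, you should be able to fix it.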

Answer 2

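First, you need your locale to use UTF-8.

Detection

chardetect (from the python3-chardet package; AKA chardet)
uchardet, an encoding detector library (now hosted on freedesktop)
enca, which focuses on Eastern and Central European languages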

file --brief --mime-encoding FILE
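
When detection does succeed, its result can be fed straight into iconv. The sketch below assumes file's guess is trustworthy for the input at hand, which is not always the case; the function name to_utf8_detected is made up for this example.

```shell
# to_utf8_detected FILE
# Ask file(1) for the encoding, then let iconv convert FILE to UTF-8 on stdout.
# (hypothetical helper name)
to_utf8_detected() {
  enc=$(file --brief --mime-encoding "$1")
  case $enc in
    binary|unknown-8bit)
      # file could not name a text encoding, so there is nothing to pass to iconv
      echo "cannot detect encoding of $1 (file reports: $enc)" >&2
      return 1
      ;;
  esac
  iconv -f "$enc" -t UTF-8 "$1"
}

# Example call (placeholder file name):
#   to_utf8_detected notes.txt > notes.utf8.txt
```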

The usual suspects are: CP850, CP437, latin1 (AKA ISO-8859-1), CP1252 (AKA windows-1252).

In my experience, these tools often fail to do the job. Sometimes a single file mixes several encodings.
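
For files that mix encodings line by line, one possible approach is to keep every line that is already valid UTF-8 and convert only the others from a fallback encoding. This is a sketch under two assumptions that are not from the text above: each line uses a single encoding, and every non-UTF-8 line uses the same fallback encoding. The function name normalize_mixed is hypothetical.

```shell
# normalize_mixed FALLBACK_ENCODING < input > output
# Lines that are valid UTF-8 pass through unchanged; all other lines are
# converted from FALLBACK_ENCODING to UTF-8. (hypothetical helper name)
normalize_mixed() {
  fallback=$1
  # "|| [ -n "$line" ]" also handles a last line without a trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    # iconv exits non-zero when the input is not valid in the -f encoding
    if printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      printf '%s\n' "$line"
    else
      printf '%s\n' "$line" | iconv -f "$fallback" -t UTF-8
    fi
  done
}

# Example call (placeholder file names):
#   normalize_mixed CP1252 < mixed.txt > clean.txt
```

Starting one iconv process per line is slow on large files, so this is meant for one-off cleanups rather than bulk processing.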

I found this handy little brute-force script somewhere:

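#!/bin/bash

# Usage: string-encoding-detector.sh fileWithLiberaci°n.txt | grep Liberación
# Decodes the first line of the file with every encoding iconv knows and
# prints each result, so grep can pick out the encodings that look right.

# iconv --list appends "//" to each name when piped; sed strips it.
iconv --list | sed -e 's/\/\///g' | while read -r encoding
do
  # -c drops characters that cannot be converted instead of aborting
  transcoded=$(head -n1 "$1" | iconv -c -f "$encoding" -t UTF-8)
  echo "$encoding $transcoded"
done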

Conversion

iconv (recommended)
recode
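
Once the source encoding is known, a file can be converted with iconv roughly as sketched below. iconv should not write to the file it is still reading, so the output goes to a temporary file first. The function name convert_in_place is hypothetical.

```shell
# convert_in_place FROM_ENCODING FILE
# Convert FILE to UTF-8, replacing its contents only if iconv succeeds.
# (hypothetical helper name)
convert_in_place() {
  tmp=$(mktemp) || return 1
  if iconv -f "$1" -t UTF-8 "$2" > "$tmp"; then
    # Copy the bytes back instead of moving the temp file, so FILE keeps
    # its original permissions.
    cat "$tmp" > "$2"
    rm -f "$tmp"
  else
    # Conversion failed: leave FILE untouched.
    rm -f "$tmp"
    return 1
  fi
}

# Example call (placeholder file name):
#   convert_in_place CP1252 report.txt
```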
