I'm not sure whether I should rather have asked:
How do I display Windows UTF-16 encoding in Linux?
because I have just found out that Windows uses UTF-16LE specifically:
https://stackoverflow.com/q/66072117/1997354
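I'm running Debian 11 (bullseye) on a headless server that I connect to over SSH. I often copy files from a SATA hard drive formatted as NTFS by Win10. However, I see some file names mangled (or however one should put it); for example, today I stumbled upon: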
KÃ_¼ndigung.pdf
which should actually be displayed as Kündigung.pdf
(question deleted here).
'Ledové ostÅ'$'\302\231''à 1.avi'
Another example, where I am only guessing, is the Czech Ledové ostří.avi
or maybe Ledově ostří.avi.
ls 'Ledové ostÅ'$'\302\231''à 1.avi' | hexdump -C
00000000 4c 65 64 6f 76 c3 83 c2 a9 20 6f 73 74 c3 85 c2 |Ledov.... ost...|
00000010 99 c3 83 c2 ad 20 31 2e 61 76 69 0a |..... 1.avi.|
0000001c
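The byte sequence in this hexdump is consistent with a UTF-8 name that was at some point misread as ISO-8859-1 and then encoded as UTF-8 a second time: for instance `c3 83 c2 a9` is what `é` (UTF-8 `c3 a9`) turns into after such a round trip. A minimal sketch of undoing one layer, assuming that diagnosis holds and that glibc's `iconv` is installed; the variable names are illustrative only:

```shell
# The file name bytes exactly as shown in the hexdump (octal escapes keep this portable).
mangled=$(printf 'Ledov\303\203\302\251 ost\303\205\302\231\303\203\302\255 1.avi')

# Undo one layer of encoding: read the bytes as UTF-8 and write each
# character back out as a single ISO-8859-1 byte.
repaired=$(printf '%s' "$mangled" | iconv -f UTF-8 -t ISO-8859-1)

# Show the resulting name and its raw bytes.
printf '%s\n' "$repaired"
printf '%s' "$repaired" | od -An -tx1
```

On a UTF-8 terminal the first line of output reads `Ledové ostří 1.avi`, which matches the first of the two guesses above.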
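Is it possible to somehow enable all locales so that every printable/possible character is displayed correctly and can be copied?
I tried the following, with no good result. I spent the whole day trying in vain to achieve this.
Any help is appreciated.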
apt-cache policy locales-all
locales-all:
Installed: (none)
Candidate: 2.31-13+deb11u5
Version table:
2.31-13+deb11u5 500
500 http://deb.debian.org/debian bullseye/main amd64 Packages
500 http://deb.debian.org/debian bullseye-updates/main amd64 Packages
Then:
apt-get install locales-all
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
locales-all
0 upgraded, 1 newly installed, 0 to remove and 2 not upgraded.
Need to get 10.8 MB of archives.
After this operation, 227 MB of additional disk space will be used.
Get:1 http://deb.debian.org/debian bullseye/main amd64 locales-all amd64 2.31-13+deb11u5 [10.8 MB]
Fetched 10.8 MB in 2s (5,539 kB/s)
Selecting previously unselected package locales-all.
(Reading database ... 209758 files and directories currently installed.)
Preparing to unpack .../locales-all_2.31-13+deb11u5_amd64.deb ...
Unpacking locales-all (2.31-13+deb11u5) ...
Setting up locales-all (2.31-13+deb11u5) ...
update-locale
dpkg-reconfigure locales
I selected all of them.
update-locale
locale-gen all
Generating locales (this might take a while)...
aa_DJ.UTF-8... done
aa_DJ.ISO-8859-1... done
aa_ER.UTF-8... done
aa_ER.UTF-8@saaho... done
aa_ET.UTF-8... done
af_ZA.UTF-8... done
af_ZA.ISO-8859-1... done
agr_PE.UTF-8... done
ak_GH.UTF-8... done
am_ET.UTF-8... done
an_ES.UTF-8... done
an_ES.ISO-8859-15... done
anp_IN.UTF-8... done
ar_AE.UTF-8... done
ar_AE.ISO-8859-6... done
ar_BH.UTF-8... done
ar_BH.ISO-8859-6... done
ar_DZ.UTF-8... done
ar_DZ.ISO-8859-6... done
ar_EG.UTF-8... done
ar_EG.ISO-8859-6... done
ar_IN.UTF-8... done
ar_IQ.UTF-8... done
ar_IQ.ISO-8859-6... done
ar_JO.UTF-8... done
ar_JO.ISO-8859-6... done
ar_KW.UTF-8... done
ar_KW.ISO-8859-6... done
ar_LB.UTF-8... done
ar_LB.ISO-8859-6... done
ar_LY.UTF-8... done
ar_LY.ISO-8859-6... done
ar_MA.UTF-8... done
ar_MA.ISO-8859-6... done
ar_OM.UTF-8... done
ar_OM.ISO-8859-6... done
ar_QA.UTF-8... done
ar_QA.ISO-8859-6... done
ar_SA.UTF-8... done
ar_SA.ISO-8859-6... done
ar_SD.UTF-8... done
ar_SD.ISO-8859-6... done
ar_SS.UTF-8... done
ar_SY.UTF-8... done
ar_SY.ISO-8859-6... done
ar_TN.UTF-8... done
ar_TN.ISO-8859-6... done
ar_YE.UTF-8... done
ar_YE.ISO-8859-6... done
ayc_PE.UTF-8... done
az_AZ.UTF-8... done
az_IR.UTF-8... done
as_IN.UTF-8... done
ast_ES.UTF-8... done
ast_ES.ISO-8859-15... done
be_BY.UTF-8... done
be_BY.CP1251... done
be_BY.UTF-8@latin... done
bem_ZM.UTF-8... done
ber_DZ.UTF-8... done
ber_MA.UTF-8... done
bg_BG.UTF-8... done
bg_BG.CP1251... done
bhb_IN.UTF-8... done
bho_IN.UTF-8... done
bho_NP.UTF-8... done
bi_VU.UTF-8... done
bn_BD.UTF-8... done
bn_IN.UTF-8... done
bo_CN.UTF-8... done
bo_IN.UTF-8... done
br_FR.UTF-8... done
br_FR.ISO-8859-1... done
br_FR.ISO-8859-15@euro... done
brx_IN.UTF-8... done
bs_BA.UTF-8... done
bs_BA.ISO-8859-2... done
byn_ER.UTF-8... done
ca_AD.UTF-8... done
ca_AD.ISO-8859-15... done
ca_ES.UTF-8... done
ca_ES.ISO-8859-1... done
ca_ES.ISO-8859-15@euro... done
ca_ES.UTF-8@valencia... done
ca_FR.UTF-8... done
ca_FR.ISO-8859-15... done
ca_IT.UTF-8... done
ca_IT.ISO-8859-15... done
ce_RU.UTF-8... done
chr_US.UTF-8... done
cmn_TW.UTF-8... done
crh_UA.UTF-8... done
cs_CZ.UTF-8... done
cs_CZ.ISO-8859-2... done
csb_PL.UTF-8... done
cv_RU.UTF-8... done
cy_GB.UTF-8... done
cy_GB.ISO-8859-14... done
da_DK.UTF-8... done
da_DK.ISO-8859-1... done
de_AT.UTF-8... done
de_AT.ISO-8859-1... done
de_AT.ISO-8859-15@euro... done
de_BE.UTF-8... done
de_BE.ISO-8859-1... done
de_BE.ISO-8859-15@euro... done
de_CH.UTF-8... done
de_CH.ISO-8859-1... done
de_DE.UTF-8... done
de_DE.ISO-8859-1... done
de_DE.ISO-8859-15@euro... done
de_IT.UTF-8... done
de_IT.ISO-8859-1... done
de_LI.UTF-8... done
de_LU.UTF-8... done
de_LU.ISO-8859-1... done
de_LU.ISO-8859-15@euro... done
doi_IN.UTF-8... done
dsb_DE.UTF-8... done
dv_MV.UTF-8... done
dz_BT.UTF-8... done
el_GR.UTF-8... done
el_GR.ISO-8859-7... done
el_GR.ISO-8859-7@euro... done
el_CY.UTF-8... done
el_CY.ISO-8859-7... done
en_AG.UTF-8... done
en_AU.UTF-8... done
en_AU.ISO-8859-1... done
en_BW.UTF-8... done
en_BW.ISO-8859-1... done
en_CA.UTF-8... done
en_CA.ISO-8859-1... done
en_DK.UTF-8... done
en_DK.ISO-8859-15... done
en_DK.ISO-8859-1... done
en_GB.UTF-8... done
en_GB.ISO-8859-1... done
en_GB.ISO-8859-15... done
en_HK.UTF-8... done
en_HK.ISO-8859-1... done
en_IE.UTF-8... done
en_IE.ISO-8859-1... done
en_IE.ISO-8859-15@euro... done
en_IL.UTF-8... done
en_IN.UTF-8... done
en_NG.UTF-8... done
en_NZ.UTF-8... done
en_NZ.ISO-8859-1... done
en_PH.UTF-8... done
en_PH.ISO-8859-1... done
en_SC.UTF-8... done
en_SG.UTF-8... done
en_SG.ISO-8859-1... done
en_US.UTF-8... done
en_US.ISO-8859-1... done
en_US.ISO-8859-15... done
en_ZA.UTF-8... done
en_ZA.ISO-8859-1... done
en_ZM.UTF-8... done
en_ZW.UTF-8... done
en_ZW.ISO-8859-1... done
eo.UTF-8... done
es_AR.UTF-8... done
es_AR.ISO-8859-1... done
es_BO.UTF-8... done
es_BO.ISO-8859-1... done
es_CL.UTF-8... done
es_CL.ISO-8859-1... done
es_CO.UTF-8... done
es_CO.ISO-8859-1... done
es_CR.UTF-8... done
es_CR.ISO-8859-1... done
es_CU.UTF-8... done
es_DO.UTF-8... done
es_DO.ISO-8859-1... done
es_EC.UTF-8... done
es_EC.ISO-8859-1... done
es_ES.UTF-8... done
es_ES.ISO-8859-1... done
es_ES.ISO-8859-15@euro... done
es_GT.UTF-8... done
es_GT.ISO-8859-1... done
es_HN.UTF-8... done
es_HN.ISO-8859-1... done
es_MX.UTF-8... done
es_MX.ISO-8859-1... done
es_NI.UTF-8... done
es_NI.ISO-8859-1... done
es_PA.UTF-8... done
es_PA.ISO-8859-1... done
es_PE.UTF-8... done
es_PE.ISO-8859-1... done
es_PR.UTF-8... done
es_PR.ISO-8859-1... done
es_PY.UTF-8... done
es_PY.ISO-8859-1... done
es_SV.UTF-8... done
es_SV.ISO-8859-1... done
es_US.UTF-8... done
es_US.ISO-8859-1... done
es_UY.UTF-8... done
es_UY.ISO-8859-1... done
es_VE.UTF-8... done
es_VE.ISO-8859-1... done
et_EE.UTF-8... done
et_EE.ISO-8859-1... done
et_EE.ISO-8859-15... done
eu_ES.UTF-8... done
eu_ES.ISO-8859-1... done
eu_ES.ISO-8859-15@euro... done
eu_FR.UTF-8... done
eu_FR.ISO-8859-1... done
eu_FR.ISO-8859-15@euro... done
fa_IR.UTF-8... done
ff_SN.UTF-8... done
fi_FI.UTF-8... done
fi_FI.ISO-8859-1... done
fi_FI.ISO-8859-15@euro... done
fil_PH.UTF-8... done
fo_FO.UTF-8... done
fo_FO.ISO-8859-1... done
fr_BE.UTF-8... done
fr_BE.ISO-8859-1... done
fr_BE.ISO-8859-15@euro... done
fr_CA.UTF-8... done
fr_CA.ISO-8859-1... done
fr_CH.UTF-8... done
fr_CH.ISO-8859-1... done
fr_FR.UTF-8... done
fr_FR.ISO-8859-1... done
fr_FR.ISO-8859-15@euro... done
fr_LU.UTF-8... done
fr_LU.ISO-8859-1... done
fr_LU.ISO-8859-15@euro... done
fur_IT.UTF-8... done
fy_NL.UTF-8... done
fy_DE.UTF-8... done
ga_IE.UTF-8... done
ga_IE.ISO-8859-1... done
ga_IE.ISO-8859-15@euro... done
gd_GB.UTF-8... done
gd_GB.ISO-8859-15... done
gez_ER.UTF-8... done
gez_ER.UTF-8@abegede... done
gez_ET.UTF-8... done
gez_ET.UTF-8@abegede... done
gl_ES.UTF-8... done
gl_ES.ISO-8859-1... done
gl_ES.ISO-8859-15@euro... done
gu_IN.UTF-8... done
gv_GB.UTF-8... done
gv_GB.ISO-8859-1... done
ha_NG.UTF-8... done
hak_TW.UTF-8... done
he_IL.UTF-8... done
he_IL.ISO-8859-8... done
hi_IN.UTF-8... done
hif_FJ.UTF-8... done
hne_IN.UTF-8... done
hr_HR.UTF-8... done
hr_HR.ISO-8859-2... done
hsb_DE.UTF-8... done
hsb_DE.ISO-8859-2... done
ht_HT.UTF-8... done
hu_HU.UTF-8... done
hu_HU.ISO-8859-2... done
hy_AM.UTF-8... done
hy_AM.ARMSCII-8... done
ia_FR.UTF-8... done
id_ID.UTF-8... done
id_ID.ISO-8859-1... done
ig_NG.UTF-8... done
ik_CA.UTF-8... done
is_IS.UTF-8... done
is_IS.ISO-8859-1... done
it_CH.UTF-8... done
it_CH.ISO-8859-1... done
it_IT.UTF-8... done
it_IT.ISO-8859-1... done
it_IT.ISO-8859-15@euro... done
iu_CA.UTF-8... done
ja_JP.UTF-8... done
ja_JP.EUC-JP... done
ka_GE.UTF-8... done
ka_GE.GEORGIAN-PS... done
kab_DZ.UTF-8... done
kk_KZ.UTF-8... done
kk_KZ.PT154... done
kk_KZ.RK1048... done
kl_GL.UTF-8... done
kl_GL.ISO-8859-1... done
km_KH.UTF-8... done
kn_IN.UTF-8... done
ko_KR.UTF-8... done
ko_KR.EUC-KR... done
kok_IN.UTF-8... done
ks_IN.UTF-8... done
ks_IN.UTF-8@devanagari... done
ku_TR.UTF-8... done
ku_TR.ISO-8859-9... done
kw_GB.UTF-8... done
kw_GB.ISO-8859-1... done
ky_KG.UTF-8... done
lb_LU.UTF-8... done
lg_UG.UTF-8... done
lg_UG.ISO-8859-10... done
li_BE.UTF-8... done
li_NL.UTF-8... done
lij_IT.UTF-8... done
ln_CD.UTF-8... done
lo_LA.UTF-8... done
lt_LT.UTF-8... done
lt_LT.ISO-8859-13... done
lv_LV.UTF-8... done
lv_LV.ISO-8859-13... done
lzh_TW.UTF-8... done
mag_IN.UTF-8... done
mai_IN.UTF-8... done
mai_NP.UTF-8... done
mfe_MU.UTF-8... done
mg_MG.UTF-8... done
mg_MG.ISO-8859-15... done
mhr_RU.UTF-8... done
mi_NZ.UTF-8... done
mi_NZ.ISO-8859-13... done
miq_NI.UTF-8... done
mjw_IN.UTF-8... done
mk_MK.UTF-8... done
mk_MK.ISO-8859-5... done
ml_IN.UTF-8... done
mn_MN.UTF-8... done
mni_IN.UTF-8... done
mnw_MM.UTF-8... done
mr_IN.UTF-8... done
ms_MY.UTF-8... done
ms_MY.ISO-8859-1... done
mt_MT.UTF-8... done
mt_MT.ISO-8859-3... done
my_MM.UTF-8... done
nan_TW.UTF-8... done
nan_TW.UTF-8@latin... done
nb_NO.UTF-8... done
nb_NO.ISO-8859-1... done
nds_DE.UTF-8... done
nds_NL.UTF-8... done
ne_NP.UTF-8... done
nhn_MX.UTF-8... done
niu_NU.UTF-8... done
niu_NZ.UTF-8... done
nl_AW.UTF-8... done
nl_BE.UTF-8... done
nl_BE.ISO-8859-1... done
nl_BE.ISO-8859-15@euro... done
nl_NL.UTF-8... done
nl_NL.ISO-8859-1... done
nl_NL.ISO-8859-15@euro... done
nn_NO.UTF-8... done
nn_NO.ISO-8859-1... done
nr_ZA.UTF-8... done
nso_ZA.UTF-8... done
oc_FR.UTF-8... done
oc_FR.ISO-8859-1... done
om_ET.UTF-8... done
om_KE.UTF-8... done
om_KE.ISO-8859-1... done
or_IN.UTF-8... done
os_RU.UTF-8... done
pa_IN.UTF-8... done
pa_PK.UTF-8... done
pap_AW.UTF-8... done
pap_CW.UTF-8... done
pl_PL.UTF-8... done
pl_PL.ISO-8859-2... done
ps_AF.UTF-8... done
pt_BR.UTF-8... done
pt_BR.ISO-8859-1... done
pt_PT.UTF-8... done
pt_PT.ISO-8859-1... done
pt_PT.ISO-8859-15@euro... done
quz_PE.UTF-8... done
raj_IN.UTF-8... done
ro_RO.UTF-8... done
ro_RO.ISO-8859-2... done
ru_RU.UTF-8... done
ru_RU.KOI8-R... done
ru_RU.ISO-8859-5... done
ru_RU.CP1251... done
ru_UA.UTF-8... done
ru_UA.KOI8-U... done
rw_RW.UTF-8... done
sa_IN.UTF-8... done
sah_RU.UTF-8... done
sat_IN.UTF-8... done
sc_IT.UTF-8... done
sd_IN.UTF-8... done
sd_IN.UTF-8@devanagari... done
se_NO.UTF-8... done
sgs_LT.UTF-8... done
shn_MM.UTF-8... done
shs_CA.UTF-8... done
si_LK.UTF-8... done
sid_ET.UTF-8... done
sk_SK.UTF-8... done
sk_SK.ISO-8859-2... done
sl_SI.UTF-8... done
sl_SI.ISO-8859-2... done
sm_WS.UTF-8... done
so_DJ.UTF-8... done
so_DJ.ISO-8859-1... done
so_ET.UTF-8... done
so_KE.UTF-8... done
so_KE.ISO-8859-1... done
so_SO.UTF-8... done
so_SO.ISO-8859-1... done
sq_AL.UTF-8... done
sq_AL.ISO-8859-1... done
sq_MK.UTF-8... done
sr_ME.UTF-8... done
sr_RS.UTF-8... done
sr_RS.UTF-8@latin... done
ss_ZA.UTF-8... done
st_ZA.UTF-8... done
st_ZA.ISO-8859-1... done
sv_FI.UTF-8... done
sv_FI.ISO-8859-1... done
sv_FI.ISO-8859-15@euro... done
sv_SE.UTF-8... done
sv_SE.ISO-8859-1... done
sv_SE.ISO-8859-15... done
sw_KE.UTF-8... done
sw_TZ.UTF-8... done
szl_PL.UTF-8... done
ta_IN.UTF-8... done
ta_LK.UTF-8... done
tcy_IN.UTF-8... done
te_IN.UTF-8... done
tg_TJ.UTF-8... done
tg_TJ.KOI8-T... done
th_TH.UTF-8... done
th_TH.TIS-620... done
the_NP.UTF-8... done
ti_ER.UTF-8... done
ti_ET.UTF-8... done
tig_ER.UTF-8... done
tk_TM.UTF-8... done
tl_PH.UTF-8... done
tl_PH.ISO-8859-1... done
tn_ZA.UTF-8... done
to_TO.UTF-8... done
tpi_PG.UTF-8... done
tr_CY.UTF-8... done
tr_CY.ISO-8859-9... done
tr_TR.UTF-8... done
tr_TR.ISO-8859-9... done
ts_ZA.UTF-8... done
tt_RU.UTF-8... done
tt_RU.UTF-8@iqtelif... done
ug_CN.UTF-8... done
uk_UA.UTF-8... done
uk_UA.KOI8-U... done
unm_US.UTF-8... done
ur_IN.UTF-8... done
ur_PK.UTF-8... done
uz_UZ.UTF-8... done
uz_UZ.ISO-8859-1... done
uz_UZ.UTF-8@cyrillic... done
ve_ZA.UTF-8... done
vi_VN.UTF-8... done
wa_BE.UTF-8... done
wa_BE.ISO-8859-1... done
wa_BE.ISO-8859-15@euro... done
wae_CH.UTF-8... done
wal_ET.UTF-8... done
wo_SN.UTF-8... done
xh_ZA.UTF-8... done
xh_ZA.ISO-8859-1... done
yi_US.UTF-8... done
yi_US.CP1255... done
yo_NG.UTF-8... done
yue_HK.UTF-8... done
yuw_PG.UTF-8... done
zh_CN.UTF-8... done
zh_CN.GB18030... done
zh_CN.GBK... done
zh_CN.GB2312... done
zh_HK.UTF-8... done
zh_HK.BIG5-HKSCS... done
zh_SG.UTF-8... done
zh_SG.GBK... done
zh_SG.GB2312... done
zh_TW.UTF-8... done
zh_TW.EUC-TW... done
zh_TW.BIG5... done
zu_ZA.UTF-8... done
zu_ZA.ISO-8859-1... done
Generation complete.
I even rebooted the server,
but the problem persists, with no improvement.
For example, I still don't see the German character in this file name:
ll Downloads/Fila\ Martin\ *
-rwxrwxrwx 2 root root 180K 2021-Dec-23 'Downloads/Fila Martin - KÃ_¼ndigung.pdf'
In detail:
\ls -l Fila\ Martin\ -\ KÃ_¼ndigung.pdf | hexdump -C
00000000 2d 72 77 78 72 77 78 72 77 78 20 32 20 72 6f 6f |-rwxrwxrwx 2 roo|
00000010 74 20 72 6f 6f 74 20 31 38 33 35 31 36 20 44 65 |t root 183516 De|
00000020 63 20 32 33 20 20 32 30 32 31 20 46 69 6c 61 20 |c 23 2021 Fila |
00000030 4d 61 72 74 69 6e 20 2d 20 4b c3 83 5f c3 82 c2 |Martin - K.._...|
00000040 bc 6e 64 69 67 75 6e 67 2e 70 64 66 0a |.ndigung.pdf.|
0000004d
and
locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=
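I found this resource, which is probably good, but since I don't understand Unicode it is beyond me: https://www.debian.org/doc/manuals/debian-reference/ch08.en.html
AB tried to help me, but for some reason it did not work out: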
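It's the file name, not the content. I deleted my previous comment because even Kündigung does not match the second part under a faulty double encoding (echo -n ü | iconv -f iso-8859-1 | xxd -p gives c383c2bc, not c382c2bc; reversing that gives a ¼). And German should have switched to iso-8859-15, where ¼ no longer exists. My guess is that the way the file was transferred altered its name in strange ways.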
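The comparison made in this comment can be reproduced independently of the terminal's locale by giving `iconv` both charsets explicitly. A small sketch, using `od` in place of `xxd`; the helper name `hex` is made up for illustration:

```shell
# Print stdin as one contiguous lowercase hex string.
hex() { od -An -v -tx1 | tr -d ' \n'; }

# "ü" as UTF-8 is c3 bc. Misreading those two bytes as ISO-8859-1 and
# re-encoding them as UTF-8 produces the classic double encoding.
double=$(printf '\303\274' | iconv -f ISO-8859-1 -t UTF-8 | hex)

# The bytes between "K" and "ndigung" in the hexdump of the real file name.
actual=$(printf '\303\203_\303\202\302\274' | hex)

echo "double-encoded ü: $double"
echo "bytes on disk:    $actual"
```

The two strings differ (the name on disk carries an extra `5f` and `c3 82`), which supports the point that plain double encoding alone does not explain this particular name.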
The last piece of information that comes to mind is how I mount the NTFS partition of the SATA disk:
mount /dev/sdX4 /mnt/something/
As you can see, I don't specify any extra switches or pass any unnecessary arguments.
I connect to that Debian server from Linux Mint 21.1 Vera.
Answer 1
Is it possible to somehow enable all locales
The short answer is: no.
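The problem here is that, for a single byte with a value in the 128-255 range, the computer cannot guess which locale's letter that byte represents. That is why you have to tell the operating system which code page (locale) to use, that is, which alphabet's letter to draw on screen when a file name contains byte values above ASCII.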
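To make the ambiguity concrete, here is a minimal sketch that decodes one and the same byte under three single-byte charsets. The byte 0xBC and the helper name `decode_bc` are chosen purely for illustration; `iconv` from glibc is assumed:

```shell
# Decode the single byte 0xBC (octal 274) under the charset given as $1.
decode_bc() { printf '\274' | iconv -f "$1" -t UTF-8; }

for charset in ISO-8859-1 ISO-8859-15 ISO-8859-2; do
    printf '%-12s 0xBC -> %s\n' "$charset" "$(decode_bc "$charset")"
done
```

This prints ¼, Œ and ź respectively: the stored byte is identical, and only the assumed charset decides which letter appears.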
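This is why UTF-8 keeps gaining popularity. In UTF-8, a byte in the 128-255 range never stands alone: a lead byte signals that it and the next one to three continuation bytes together encode a single Unicode character, whatever alphabet that character belongs to. This lets us have multilingual strings without switching locales by hand.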
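The multi-byte structure is easy to inspect from the shell. A short sketch, assuming the script itself is saved as UTF-8; the helper name `utf8_hex` and the three sample characters are arbitrary choices:

```shell
# Print the UTF-8 bytes of the string in $1 as contiguous hex.
utf8_hex() { printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'; }

# One German, one Czech and one Chinese character in the same loop.
for ch in ü ř 中; do
    printf '%s -> %s\n' "$ch" "$(utf8_hex "$ch")"
done
```

The first two characters come out as two bytes each (`c3bc`, `c599`) and the third as three (`e4b8ad`); in every case the first byte tells the decoder how many bytes belong to the character, so no locale switch is needed to mix them in one name.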
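You currently have files whose names were created under several different locales, and you want to bring them together in one place. The simplest solution is to use the rename tool.
Switch to the national locale in which the file name is readable, then run rename -u Kündigung.pdf. The -u switch renames the file from the current locale to Unicode.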
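Where a name turns out to be UTF-8 that was encoded twice, as the Czech name in the question appears to be, one layer can also be undone with plain `iconv` and `mv`. The following is a cautious sketch under that assumption, not a general repair tool; the function name `fix_double_utf8` is hypothetical, and any name that does not convert cleanly is left untouched:

```shell
# Rename one file whose name is double-encoded UTF-8 (hypothetical helper).
fix_double_utf8() {
    old=$1
    dir=$(dirname -- "$old")
    base=$(basename -- "$old")
    # Undo one encoding layer; iconv fails if a character has no ISO-8859-1 byte.
    new=$(printf '%s' "$base" | iconv -f UTF-8 -t ISO-8859-1 2>/dev/null) || return 1
    # Accept the result only if it is itself valid UTF-8.
    printf '%s' "$new" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || return 1
    # Nothing to do for names that are unchanged (for example plain ASCII).
    [ "$new" != "$base" ] || return 0
    # -n: never overwrite an existing file.
    mv -n -- "$old" "$dir/$new"
}
```

It would be called with one path at a time, for example `fix_double_utf8 /mnt/something/some-file`, ideally on a copy first. A correctly encoded name such as Kündigung.pdf fails the validity check and is therefore not renamed.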