What else, besides a non-C locale, could be messing up my sort?

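I'm using Ubuntu 16.04 xenial with Xfce. This is more than a matter of setting the right locale, or of sorting the operands in "natural order".

I sorted my apt sources file. Every line starts with "#", "##", or "deb". I expected to see all the blank lines, then all the lines with "#", then those with "##", and finally those starting with "deb". Look at around line 9 of my output, and then line 25: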

root@HEJ ~ $ sort /etc/apt/sources.list







## Also, please note that software in backports WILL NOT receive any review
# deb cdrom:[Xubuntu 16.04.1 LTS _Xenial Xerus_ - Release i386 (20160719)]/ xenial main multiverse restricted univer
# deb http://archive.canonical.com/ubuntu xenial partner
deb http://archive.canonical.com/ubuntu/ xenial partner
deb http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-backports main restricted universe multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security universe
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial universe
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates universe
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial-backports main restricted universe multiverse
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial main restricted universe multiverse
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial-updates main restricted universe multiverse
deb http://ppa.launchpad.net/cdemu/ppa/ubuntu xenial main
# deb http://reflection.oss.ou.edu/linuxmint/repos serena main upstream import backport
deb http://security.ubuntu.com/ubuntu/ xenial-security restricted universe multiverse main
deb http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://archive.canonical.com/ubuntu xenial partner
# deb-src http://archive.canonical.com/ubuntu/ xenial partner
# deb-src http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
# deb-src http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial-backports main restricted universe multiverse
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial main restricted universe multiverse
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial-updates main restricted universe multiverse
# deb-src http://ppa.launchpad.net/cdemu/ppa/ubuntu xenial main
# deb-src http://reflection.oss.ou.edu/linuxmint/repos serena main upstream import backport
# deb-src http://security.ubuntu.com/ubuntu xenial-security main restricted
# deb-src http://security.ubuntu.com/ubuntu/ xenial-security main restricted universe multiverse
# deb-src http://security.ubuntu.com/ubuntu/ xenial-security main restricted universe multiverse
# deb-src http://security.ubuntu.com/ubuntu xenial-security multiverse
# deb-src http://security.ubuntu.com/ubuntu xenial-security universe
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-backports main restricted universe multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial main restricted
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial universe
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates main restricted
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates universe
# deb-src http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://www.scootersoftware.com/ bcompare4 non-free
## distribution.
## extensively as that contained in the main release, although it includes
## Major bug fix updates produced after the final release of the
## multiverse WILL NOT receive any review or updates from the Ubuntu
## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu
## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu 
## N.B. software from this repository may not have been tested as
## newer versions of some applications which may provide useful features.
# newer versions of the distribution.
## or updates from the Ubuntu security team.
## 'partner' repository.
## respective vendors as a service to Ubuntu users.
## security team.
# See http://help.ubuntu.com/community/UpgradeNotes for how to upgrade to
## team.
## team, and may not be under a free licence. Please satisfy yourself as to
## team, and may not be under a free licence. Please satisfy yourself as to 
## This software is not part of Ubuntu, but is offered by Canonical and the
## Uncomment the following two lines to add software from Canonical's
## universe WILL NOT receive any review or updates from the Ubuntu security
## your rights to use the software. Also, please note that software in
## your rights to use the software. Also, please note that software in 

Locale in effect:

root@HEJ ~ $ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

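Research suggests that I need to override LC_COLLATE="en_US.UTF-8" (with LC_COLLATE="C.UTF-8", or better, LC_ALL=C) to get sensible output. But there is another problem here...

If this were only a matter of character order, all the "#" lines should sort together and all the "##" lines should sort together.

What seems to be happening instead is that "#" and "##" are stripped from the sort key, and I can't believe that is a function of the collation rules.

What else is wrong with my sort key?

While we're on the subject of collation order, where is the binary order of characters in a particular locale documented? That is, where is a human-readable list of every possible character, in collation order?

(A pox on those who didn't make the en_US locale definition upward compatible!)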

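Answer 1

This behavior is purely a function of the collation rules controlled by your LC_COLLATE locale setting. Because you have a Unicode locale set, glibc uses the specified Unicode collation order in one of its defined variants, which attempts to be a somewhat "natural" sort.

This order is the UTS #10 Unicode Collation Algorithm ordering, with Shift-Trimmed handling of variable collation elements, and using (probably) the Default Collation Element Table. In effect, characters such as # and most other punctuation and whitespace are treated as less significant than differences between the alphanumeric characters that follow them, and are used only to break ties. The whole algorithm is defined in some detail in the standard, and it gets more complicated.

This is why it is sometimes recommended not to set LANG or LC_COLLATE. You can instead set LC_CTYPE (to UTF-8) and LC_MESSAGES (to your preferred message language), and leave collation at the POSIX default. Either choice has knock-on effects.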
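As a rough illustration of leaving collation at the POSIX default, the sketch below overrides LC_COLLATE for a single sort invocation rather than changing the environment permanently. The sample_lines function is a hypothetical stand-in for /etc/apt/sources.list, using a few lines taken from the question's output; LC_ALL is set to the empty string only so that an inherited LC_ALL cannot override LC_COLLATE.

```shell
# Hypothetical stand-in for /etc/apt/sources.list (lines taken from the question)
sample_lines() {
  printf '%s\n' \
    'deb http://www.4pane.co.uk/ubuntu/ xenial main' \
    '## team.' \
    '# deb-src http://www.4pane.co.uk/ubuntu/ xenial main' \
    ''
}

# An empty LC_ALL counts as unset, so LC_COLLATE=C decides the ordering here,
# while LC_CTYPE and LC_MESSAGES keep whatever the environment provides.
sorted_c=$(sample_lines | LC_ALL= LC_COLLATE=C sort)
printf '%s\n' "$sorted_c"
```

With byte-order collation the blank line comes first, then the "# " line, then the "## " line, then the "deb" line, which is the grouping the question expected.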


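On your system, this is probably defined in /usr/share/i18n/locales/iso14651_t1_common, which is included by iso14651_t1, which is in turn included by en_US. The orderings for other locales are defined in nearby files, generally based on the same default with localized changes (for example, sv_SE uses the same base but reorders ...zåäöø and folds v and w together). The table selected by LC_COLLATE is what actually determines the behavior on your system, and it is derived from (a past version of) the Unicode standard. On a newer or older system, with a different Unicode version, the same strings may compare differently.

Other encodings will have their own separate tables, which may be entirely unrelated.

You can check your system's behavior against the specification by sorting a file containing the strings from the comparison table given in the UTR: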

demark
de‐Luge
death
deluge
☠sad
de-luge
de Luge
☠happy
de‐luge
♡sad
deLuge
de luge
♡happy
de-Luge

(These words contain both hyphens and hyphen-minuses.)

The order you should get is:

death
deluge
de luge
de-luge
de‐luge
deLuge
de Luge
de-Luge
de‐Luge
demark
☠happy
♡happy
☠sad
♡sad

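The report gives (some) explanation of this result:

  • Shifted. The hyphen-minus and hyphen are grouped together, and their
    differences are less significant than the casing differences in the
    letter "l". This grouping results from the fact that they are
    ignorable, but their fourth-level differences are according to the
    original primary order, which is more intuitive than Unicode order.
    The symbols ☠ and ♡ are ignored on levels 1-3.

  • Shift-Trimmed. Note how "deLuge" comes between the cased versions with
    spaces and hyphens. The symbols ☠ and ♡ are ignored on levels 1-3.

This is somewhat complicated. "Levels 1-3" are the different levels of tie-breaking weights in the algorithm, with the primary level 1 being the most significant distinguishing factor. That is probably already more information than you need, but you can at least be sure that it is the specified sort order that produces the result you see.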
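The check described above can be scripted roughly as follows. This is only a sketch: it assumes the en_US.UTF-8 locale is installed (if it is not, sort will most likely fall back to plain byte order), and the variable names are made up for illustration. U+2010 HYPHEN is written as octal escapes so that it cannot be confused with the ASCII hyphen-minus.

```shell
# U+2010 HYPHEN as UTF-8 bytes, to keep it distinct from the ASCII "-"
hy=$(printf '\342\200\220')

# The 14 test strings from the list above, in their unsorted order
words=$(printf '%s\n' demark "de${hy}Luge" death deluge '☠sad' 'de-luge' 'de Luge' \
  '☠happy' "de${hy}luge" '♡sad' deLuge 'de luge' '♡happy' 'de-Luge')

# Assumption: en_US.UTF-8 exists on this system
by_locale=$(printf '%s\n' "$words" | LC_ALL=en_US.UTF-8 sort)
# Byte-order result for comparison
by_bytes=$(printf '%s\n' "$words" | LC_ALL=C sort)

printf 'en_US.UTF-8:\n%s\n\nC:\n%s\n' "$by_locale" "$by_bytes"
```

Comparing the first listing against the expected order shows whether the installed collation table matches that column of the report's table; the second listing shows what pure byte order does with the same input.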

Answer 2

I think your sort command may be overridden by an alias or a shell function.
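One way to test that suspicion, sketched here with standard shell built-ins, is to ask the shell how it resolves the name and then to bypass any alias or function for a single call. The two-line input is only a placeholder.

```shell
# Report how the shell resolves "sort": alias, function, builtin, or a file path
type sort

# "command" skips shell functions, and because "sort" is no longer the first
# word of the command line, alias expansion does not apply to it either
direct=$(printf '%s\n' b a | command sort)
printf '%s\n' "$direct"
```

If type reports anything other than a path such as /usr/bin/sort, the extra options are coming from the shell configuration rather than from the locale.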

The documentation for sort's -d option says:

-d, --dictionary-order
     consider only blanks and alphanumeric characters

So if this option is in use, the # is ignored even with LC_ALL=C, giving the same result as your original output.
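A small sketch of that effect, using two made-up input lines and byte-order collation in both runs so that the only difference is the -d flag:

```shell
# Without -d, "#" (0x23) sorts before "a" (0x61) in the C locale
plain=$(printf '%s\n' '#b' 'a' | LC_ALL=C sort)

# With -d, "#" is dropped from the comparison, so "#b" is compared as "b"
dict=$(printf '%s\n' '#b' 'a' | LC_ALL=C sort -d)

printf 'plain:\n%s\n\ndictionary order:\n%s\n' "$plain" "$dict"
```

The plain run lists "#b" first, while the -d run lists "a" first, so a stray -d in an alias would be enough to scatter the comment lines among the others.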
