Remove adjacent duplicate words from a string

Question 1

Using GNU awk for multi-character RS and the \s shorthand:

$ echo 'one one tow tow three three tow one three' |
awk -v RS='\\s+' '
    $0 != prev { out = (NR>1 ? out OFS : "") $0; prev = $0 }
    END { print out }
'
one tow three tow one three

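Or, still with GNU awk, inspired by @nezabudka's answer but with some fixes so that it works no matter which sequence of blanks separates the input fields and no matter which characters the fields contain, and so that the output ends in \n and is therefore a valid POSIX text file: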

$ echo one one tow tow three three tow one three |
awk -v RS='[[:blank:]]+' '
    $1 != prev { out = out $1 RT; prev=$1 }
    END { print out }
'
one tow three tow one three

Otherwise, using any awk:

$ echo 'one one tow tow three three tow one three' |
awk '{
    out = $1
    for ( i=2; i<=NF; i++ ) {
        if ( $i != $(i-1) ) {
            out = out OFS $i
        }
    }
    print out
}'
one tow three tow one three
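
One caveat with the comparisons above: awk compares two fields numerically when both look like numbers, so 1 and 1.0 would count as the same word. A minimal sketch of a stricter variant follows, assuming a plain string comparison is what you want. The function name dedupe_words is hypothetical, and appending "" to each operand is the usual way to force awk to compare as strings:

```shell
# dedupe_words: hypothetical helper, not part of the answer above.
# Reads lines from the named files (or stdin) and drops any word
# that is identical to the word just before it on the same line.
dedupe_words() {
    awk '{
        out = $1
        for (i = 2; i <= NF; i++) {
            # Concatenating "" forces a string comparison, so 1 and 1.0 stay distinct.
            if (($i "") != ($(i-1) "")) {
                out = out OFS $i
            }
        }
        print out
    }' "$@"
}
```

For example, `echo '1 1.0 one one' | dedupe_words` prints `1 1.0 one`.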

Question 2

If the line does not exceed 2500 columns (1000 in this example):

echo one one tow tow three three tow one three |
    fmt -1 | uniq | fmt -1000
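
If lines may be longer than that, a sketch that avoids fmt altogether might use tr and paste instead. This is a swapped-in technique rather than the pipeline above. The function name dedupe_stream is hypothetical, and the sketch assumes that words are separated by spaces and that joining all input into a single output line is acceptable:

```shell
# dedupe_stream: hypothetical helper that reads stdin and writes one line.
dedupe_stream() {
    # Put one word per line (runs of spaces are squeezed),
    # drop adjacent repeats, then rejoin the words with single spaces.
    tr -s ' ' '\n' | uniq | paste -sd ' ' -
}
```

For example, `echo one one tow tow three three tow one three | dedupe_stream` prints `one tow three tow one three`.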

GNU awk:

echo one one tow tow three three tow one three |
    awk -v RS=' ' '$1 != D {printf "%s", $1 (RT?RS:ORS); D=$1}'

Update (if you are sure the line ends with a newline):

echo one one tow tow three three tow one three |
    awk -v RS='[[:space:]]' '$1 != D {printf "%s", $1 RT; D=$1}'

Otherwise (the general way):

echo -n one one tow tow three three tow one three |
    awk -v RS='[[:space:]]' '$1 != D {printf "%s", $1 (RT?RT:ORS); D=$1}'

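Explanation:
GNU awk has a built-in variable RT that holds the actual text that matched the RS pattern for the current record. For example, if RS is set to the pattern [[:space:]], RT is set for each record to whichever character terminated it: a space, a tab, or a newline. So when RS is a character class such as RS='[[:space:]]', the ternary operator should be changed to (RT?RT:ORS), or to just RT.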
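
To see what RT holds for each record, a small sketch can print every record next to a readable name for its terminator. It assumes GNU awk is installed as gawk. The function name show_rt and the labels are made up for illustration:

```shell
# show_rt: hypothetical helper; requires GNU awk, since RT is a gawk extension.
show_rt() {
    gawk -v RS='[[:space:]]' '{
        t = RT
        if      (t == " ")  t = "SPACE"
        else if (t == "\t") t = "TAB"
        else if (t == "\n") t = "NEWLINE"
        else if (t == "")   t = "EMPTY"   # end of input, nothing matched RS
        printf "%s -> %s\n", $0, t
    }'
}
```

Running `printf 'one two\tthree\nfour' | show_rt` prints `one -> SPACE`, `two -> TAB`, `three -> NEWLINE` and `four -> EMPTY`, one per line. The empty RT on the last record is the case that the (RT?RT:ORS) form covers when the input has no trailing newline.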

Question 3

After putting each word on its own line, you can use uniq:

string='one one tow tow three three tow one three'
printf '%s\n' "${string// /
}" | uniq | paste -sd ' ' -

Or with perl, allowing several whitespace characters to separate words and preserving the spacing between groups of duplicates:

string='  one one tow   tow  three three tow one three '
perl -le 'print s/(?<!\S)(\S+)(\s+\1)+(?!\S)/\1/gr for @ARGV' -- "$string"

gives:

  one tow  three tow one three

The same with ksh93's ${var//pattern/replacement} parameter expansion operator (some other shells, including bash, copied that operator but not the more advanced pattern matching operators):

$ string='  one one tow   tow  three three tow one three '
$ print -r - "${string//~(<!\S)+(\S)+(+(\s)\1)~(!\S)/\1}"
  one tow  three tow one three

Or with zsh (another shell with support for perl-like pattern matching operators), modifying the variable in place:

$ string='  one one tow   tow  three three tow one three '
$ autoload regexp-replace
$ set -o rematchpcre
$ regexp-replace string '(?<!\S)(\S+)(\s+\1)+(?!\S)' '$match[1]'
$ print -r - "$string"
  one tow  three tow one three

Or with fish:

$ set string '  one one tow   tow  three three tow one three '
$ string replace -a --regex '(?<!\S)(\S+)(\s+\1)+(?!\S)' '$1' $string
  one tow  three tow one three

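If, as in your sample, the words consist only of alphanumerics (or underscores), you can take a similar approach with the busybox implementation of awk, where the negative lookaround perl operators can be replaced with the \< and \> word boundary operators (similar to perl's \b, so closer to (?<!\w) / (?!\w) when written as perl lookaround operators):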

$ printf '%s\n' "$string" | busybox awk '{print gensub("\\<(\\S+)(\\s+\\1)+\\>", "\\1", "g")}'
  one tow  three tow one three

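You can't use that approach if your words contain characters other than alphanumerics or underscores. For instance, it would change one-two two three to one-two three, as there is a word boundary between - and two.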
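
The difference between the two kinds of boundary can be checked side by side. The sketch below uses perl rather than busybox awk, purely so that both variants can be written with the same tool. The function names are hypothetical:

```shell
# Word-boundary variant: \b treats the hyphen in one-two as a word edge.
dedupe_wordboundary() {
    perl -pe 's/\b(\w+)(?:\s+\1)+\b/$1/g'
}

# Whitespace-delimited variant: a word is any run of non-whitespace characters.
dedupe_whitespace() {
    perl -pe 's/(?<!\S)(\S+)(?:\s+\1)+(?!\S)/$1/g'
}
```

With the input `one-two two three`, dedupe_wordboundary prints `one-two three`, while dedupe_whitespace leaves the line unchanged.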

Question 4

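With Perl. For example, the following removes adjacent duplicate words even across line boundaries (using perl's -0777 option to slurp the entire input at once):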

$ printf 'one one two\n two two\ntwo three three two\none\nthree\nthree\n' |
    perl -0777 -p -e 's/\b(\w+)(?:\s+\1)+\b/$1/g'
one two three two
one
three

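The \1 in the left-hand side (LHS) of the s/search (LHS)/replace (RHS)/ operation is a back-reference to the previously matched capture group (\w+). $1 is the same capture group on the right-hand side, that is, in the replacement.

By the way, without being piped into perl, the input looks like this, with adjacent duplicate words spread over several lines: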

$ printf 'one one two\n two two\ntwo three three two\none\nthree\nthree\n' 
one one two
 two two
two three three two
one
three
three

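Notes:

\b is an anchor similar to ^ or $, but instead of matching the beginning or end of a line, it matches the (zero-width) boundary at the start or end of a word.
\w matches any word character, which the perlre man page defines as:

\w [3] Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)

...

[3] See "Unicode Character Properties" in perlunicode for details

If you want to strictly match only alphabetic characters (no digits or underscores), you can use [[:alpha:]]+ instead of \w+.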
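
As a rough illustration of that choice, the sketch below wraps both patterns in shell functions so they can be compared on the same input. The function names are hypothetical:

```shell
# Treats letters, digits and underscores as word characters.
dedupe_w() {
    perl -0777 -pe 's/\b(\w+)(?:\s+\1)+\b/$1/g'
}

# Only runs of alphabetic characters count as words here.
dedupe_alpha() {
    perl -0777 -pe 's/\b([[:alpha:]]+)(?:\s+\1)+\b/$1/g'
}
```

With the input `a1 a1 b b`, dedupe_w prints `a1 b`, whereas dedupe_alpha prints `a1 a1 b`, because a1 is not a purely alphabetic word.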

If your input text may contain Unicode characters, there are several ways to handle that, but the simplest is to use perl's -C option:

$ echo 'öne öne öne two öne one' |
    perl -C -0777 -p -e 's/\b([[:alpha:]]+)(?:\s+\1)+\b/$1/g'
öne two öne one

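See man perlrun and search for -C for details on this option. If you are really interested in the topic, see also the man pages for perlunicode, perlunitut, perluniintro, and perlunifaq. As you might guess from the sheer volume of documentation, dealing with Unicode is simple and straightforward most of the time, but can be quite complex and subtle in some cases.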

Answer