How to extract multiple patterns from each line using sed, awk, or grep

Answer 1

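Assumptions

The input is a text file containing strings (sequences of non-whitespace characters) separated by sequences of whitespace. Each line contains a particular word (known at run time), followed (not necessarily immediately) by a string that is a number in the form of a version number. (Apparently this just means that it begins with a digit.)

It must be possible to specify the word to look for at run time, as a parameter. For example, to search for the word tech, we should be able to say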

word=tech

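and have the command (or script) use $word. The word should match exactly; for example, “technology”, “nanotech” and “Tech” should not match. The word should contain only letters, digits and _ (underscore); punctuation, and especially characters that are special in regular expressions, may produce unwanted results. For each qualifying line, the command should output the word and the number, separated by a space (and nothing else). If the file contains lines that do not conform to these assumptions (e.g., lines that do not contain the desired word, or any number), the behavior is undefined. In particular, such non-conforming lines may simply be ignored.

For all the commands below, $word is assumed to be defined as shown above.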
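Since the word is restricted to letters, digits and underscores, it may be worth checking that before building a regular expression from it. The following is a minimal sketch, assuming a POSIX shell; the function name is_safe_word is made up for this illustration and is not part of the original answer.

```shell
# Hypothetical helper: succeeds only if the argument is non-empty and
# consists solely of letters, digits and underscores.
is_safe_word() {
    case $1 in
        '' | *[!A-Za-z0-9_]*) return 1 ;;   # empty, or contains some other character
        *) return 0 ;;
    esac
}

word=tech
if is_safe_word "$word"; then
    echo "ok: $word"
else
    echo "refusing to search for: $word" >&2
fi
```

A word such as te.ch would be rejected here, because the dot would otherwise act as a regular-expression wildcard in the commands that follow.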

Note: each of these commands can be expressed in different ways. In some cases the differences are trivial.

`grep`

Plain `grep`

I can't figure out how to do it.

Plain `grep` with assistance

The command

grep    "\<$word\>\|\<[[:digit:]][[:graph:]]*\>"

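will match every line that contains either the word (\<$word\>) or (\|) a number (\<[[:digit:]][[:graph:]]*\>). ([[:graph:]] means a letter, a digit or a punctuation character; i.e., any character other than whitespace.) The output of this command in --color mode is somewhat interesting, since each match is highlighted within its line. With -o added, the command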

grep -o "\<$word\>\|\<[[:digit:]][[:graph:]]*\>"

outputs each matching string, and only the matching strings, on a separate line:

tech
1.2
tech
1
tech
0.1
tech
10.1.3
tech
7.5
tech
8.0
tech
0.x
tech
1.3.x
tech
5.x
tech
2.0.4x

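So we then do

grep -o "\<$word\>\|\<[[:digit:]][[:graph:]]*\>" input_file | sed "/$word/ { N; s/\n/ / }"

which takes the above output and joins each line containing the word (tech) with the line that follows it (separated by a space):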

tech 1.2
tech 1
tech 0.1
tech 10.1.3
tech 7.5
tech 8.0
tech 0.x
tech 1.3.x
tech 5.x
tech 2.0.4x
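As one of the "different ways" mentioned earlier, the joining step can also be done with paste instead of sed. This is a sketch rather than part of the original answer. It assumes GNU grep (for \< and \>) and that every line yields exactly one word followed by one number, since paste simply pairs consecutive lines. The three sample lines and the variable name result are for illustration only.

```shell
word=tech

# grep -o prints the word and the number on alternating lines;
# "paste -d' ' - -" reads two lines at a time and joins them with a space.
result=$(
    printf '%s\n' \
        'tech -this is sample and ... 1.2 sample' \
        'bla bla bla test tech 1 one two' \
        'tech 7.5 and continue this line' |
    grep -o "\<$word\>\|\<[[:digit:]][[:graph:]]*\>" |
    paste -d' ' - -
)
printf '%s\n' "$result"
```

If a line contained two numbers, or a number before the word, the pairing would drift out of step, so this variant is only reasonable when the stated assumptions hold.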

`pcregrep`

pcregrep -o1 -o2 --om-separator=' ' "\b($word)\b.*?\b(\d\S*)"

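matches the word and a number (\b is a word boundary, \d is a digit, and \S is any character other than whitespace), capturing each of them in a (…) group. It then uses -o to output only the matching strings; in pcregrep, however, you can say -o1 -o2 to output capture groups 1 and 2. --om-separator=' ', obviously, specifies what to put between the strings.

Note: because this uses .*? (non-greedy matching), if an input line contains more than one number, it finds the first one. The sed and awk commands below find the last one.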

`sed`

sed -n "s/.*\(\<$word\>\).*[[:blank:]]\(\<[[:digit:]][[:graph:]]*\).*/\1 \2/p"

Like the pcregrep command, this matches the strings in capture groups and then outputs them as \1 \2.
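To see which number this picks when a line has more than one, here is a small check. It is a sketch that assumes GNU sed (for \< and \>); the input line and the variable name last are made up for the example.

```shell
word=tech

# The second .* is greedy, so it swallows everything up to the last
# blank-separated string that starts with a digit.
last=$(
    printf '%s\n' 'use tech 1.2 then 3.4 today' |
    sed -n "s/.*\(\<$word\>\).*[[:blank:]]\(\<[[:digit:]][[:graph:]]*\).*/\1 \2/p"
)
printf '%s\n' "$last"
```

With this input the command reports 3.4 rather than 1.2, which is the "last number" behavior the pcregrep note refers to.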

`awk`

awk -v the_word="$word" '
    {
        w=0             # Index of word
        n=0             # Index of number
        for (i=1; i<=NF; i++) {
                if ($i == the_word) w=i
                if (substr($i,1,1) ~ /[[:digit:]]/) n=i
        }
        if (w>0  &&  n>w) print $w, $n
    }'

This looks for the word (the_word) and a number (a string whose first character is a digit). If it finds them, in that order, it prints them.
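If the first number after the word is wanted instead (as with the pcregrep command), the loop can stop as soon as it sees one. This is a sketch under that assumption, not the original author's code. The function name first_number_after_word is hypothetical, and [0-9] is used in place of [[:digit:]] only because some older awk implementations lack POSIX character classes.

```shell
# Hypothetical helper: for each line on standard input, print the word and
# the first later field that begins with a digit.
first_number_after_word() {
    awk -v the_word="$1" '
    {
        w = 0                              # index of the word, 0 until it is seen
        for (i = 1; i <= NF; i++) {
            if (w == 0) {
                if ($i == the_word) w = i
            } else if ($i ~ /^[0-9]/) {
                print $w, $i               # first number after the word
                break
            }
        }
    }'
}

printf '%s\n' 'use tech 1.2 then 3.4 today' | first_number_after_word tech
```

Lines that lack the word, or that have no number after it, produce no output, which matches the "ignore non-conforming lines" assumption.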

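Note: this recognizes the word only when it stands entirely on its own. The other commands will match it if it is touching punctuation; for example,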

The cyber clock goes tech, tock …
This contains the word (tech) …
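The difference can be checked directly. In this sketch (GNU sed assumed; the input line is an illustrative variation on the example above with a version number appended, and the variable names are made up), awk prints nothing because the field is "(tech)" rather than "tech", while sed still finds the word.

```shell
word=tech
line='This contains the word (tech) 1.0'

# awk compares whole whitespace-separated fields, so "(tech)" != "tech".
from_awk=$(printf '%s\n' "$line" | awk -v the_word="$word" '
    {
        w = 0; n = 0
        for (i = 1; i <= NF; i++) {
            if ($i == the_word) w = i
            if ($i ~ /^[0-9]/) n = i
        }
        if (w > 0 && n > w) print $w, $n
    }')

# sed uses word boundaries, which also hold next to punctuation.
from_sed=$(printf '%s\n' "$line" |
    sed -n "s/.*\(\<$word\>\).*[[:blank:]]\(\<[[:digit:]][[:graph:]]*\).*/\1 \2/p")

printf 'awk: [%s]\nsed: [%s]\n' "$from_awk" "$from_sed"
```

Which behavior is preferable depends on whether punctuation next to the word should count as part of it.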

Answer 2

The following code should do what you want.

searchword="tech"
(cat << 'EOF'
tech -this is sample and ... 1.2 sample
bla bla bla test tech 1 one two
foo bar foo bar tech and [email protected] 0.1
bar bar bar foo foo tech 10.1.3
tech 7.5 and continue this line
sample_test one two tech 8.0
one two there tech 0.x *sample bla bla
www sample tech 1.3.x foo bar $foo #bar
foobar barbar bla bla tech 5.x
tech testtest 2.0.4x this is sample
EOF
) | grep "$searchword" |\
grep -o '\b[0-9x][0-9x]*\b\|\b[0-9][0-9]*\.[0-9x][0-9x]*\b\|\b[0-9][0-9]*\.[0-9][0-9]*\.[0-9x][0-9x]*\b' |\
sed "s/^/$searchword /"

will give you

tech 1.2
tech 1
tech 0.1
tech 10.1.3
tech 7.5
tech 8.0
tech 0.x
tech 1.3.x
tech 5.x
tech 2.0.4x

at least with

bash: GNU bash, version 4.4.5(1)-release
sed: sed (GNU sed) 4.2.2
grep: grep (GNU grep) 2.27

I would be glad if this answer helps you; otherwise, please consider making your question more explicit and more detailed.
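Note that the first grep in this pipeline also accepts lines that merely contain the search word inside a longer word. A possible variation, sketched here with GNU grep's -w option and a hypothetical function name extract_versions, restricts it to whole-word matches and reads its input from stdin. The two sample lines are made up.

```shell
# Hypothetical wrapper around the pipeline above.
# Usage: extract_versions WORD < file
extract_versions() {
    grep -w -- "$1" |
    grep -o '\b[0-9x][0-9x]*\b\|\b[0-9][0-9]*\.[0-9x][0-9x]*\b\|\b[0-9][0-9]*\.[0-9][0-9]*\.[0-9x][0-9x]*\b' |
    sed "s/^/$1 /"
}

printf '%s\n' 'www sample tech 1.3.x foo bar' 'technology 9.9 is skipped' |
    extract_versions tech
```

As in the original pipeline, the word is prepended by sed rather than extracted from the line, so every number-like string on a qualifying line is reported with the word in front of it.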

Answer 3

I couldn't quite get this with grep or sed, but Perl comes to the rescue:

$ perl -e 'while($stdin = <>) {@matches = $stdin =~ /(tech)[^0-9]*([0-9x][0-9x.]*)/g; print "@matches\n" if @matches}' < INPUTFILE

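Note that the < INPUTFILE is only there to feed the file to the script through stdin; stdin can be supplied from any other source (e.g., a pipe, or a <<< here-string redirection).

Explanation: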

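# Assign each line of standard input (in Perl) to a variable to match against,
# and detect when the input is exhausted (i.e., EOF)
while ($stdin = <>) {
    # Assign the matches found in $stdin to the array @matches
    # /(tech)[^0-9]*([0-9x][0-9x.]*)/g;
    # (tech): match the term "tech"; the () puts it into @matches
    # [^0-9]*: match any non-digits, to be discarded, because .* is too greedy
    # and would match through to the last digit
    # ([0-9x][0-9x.]*): match patterns such as 0, 0.x, 0.9.x, etc.; the () puts it into @matches
    @matches = $stdin =~ /(tech)[^0-9]*([0-9x][0-9x.]*)/g;
    # Print the array above if @matches is not empty
    print "@matches\n" if @matches;
}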

When this is applied to the file, the output is:

tech 1.2
tech 1
tech 0.1
tech 10.1.3
tech 7.5
tech 8.0
tech 1.3.x
tech 5.x
tech 2.0.4x
