I have a file that looks like this:
Text1 somethingAAxxxxxxxsomething,elseAAxxxxxxxfoo text1
Text2 somethingAAxxxxxxxsomething,elseAAxxxxxxxfoo text2
Text3 somethingAAxxxxxxxsomething,elseAAxxxxxxxfoo text3
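Here "something", something, else, and foo are random letters, spaces, and commas; AAxxxxxxx is the part I want to match. Each x is a digit: there are always exactly seven digits 0-9, for example AA0000001 or AA9999999. I want to extract only the AAxxxxxxx parts of column 2, so that my output looks like this: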
Text1 AAxxxxxxx,AAxxxxxxx text1
Text2 AAxxxxxxx,AAxxxxxxx text2
Text3 AAxxxxxxx,AAxxxxxxx text3
Sample input
Text1 somethingAA0123456something,elseAA6543210foo text1
Text2 somethingAA1234567something,elseAA7654321foo text2
Text3 somethingAA2345678something,elseAA8765432foo text3
Desired output
Text1 AA0123456,AA6543210 text1
Text2 AA1234567,AA7654321 text2
Text3 AA2345678,AA8765432 text3
Edit: some lines contain more than two AAxxxxxxx segments, for example:
Input
Text1 somethingAAxxxxxxxsomething,elseAAxxxxxxxfooblahAAxxxxxxx^blahblahAAxxxxxxx text1
Text2 somethingAAxxxxxxxsomething,elseAAxxxxxxxfooblahAAxxxxxxx^blah text2
Text3 somethingAAxxxxxxxsomething,elseAAxxxxxxxfoo text3
Desired output
Text1 AA0123456,AA6543210,AA1231252,AA1256712 text1
Text2 AA1234567,AA7654321,AA1926572 text2
Text3 AA2345678,AA8765432 text3
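Answer 1
sed can do this. We can use four capture groups to match the prefix, the two IDs in the middle, and the suffix. Because the replacement contains only two ID groups, it prints exactly two IDs per line, so it does not cover the lines with more than two IDs added in the question's edit.
Code: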
sed -rn 's/([^ ]+) .*(AA[0-9]{7}).*(AA[0-9]{7}).* ([^ ]+)/\1 \2,\3 \4/p' file1
Test data:
Text1 somethingAA0123456something,elseAA6543210foo text1
Text2 somethingAA1234567something,elseAA7654321foo text2
Text3 somethingAA2345678something,elseAA8765432foo text3
Results:
Text1 AA0123456,AA6543210 text1
Text2 AA1234567,AA7654321 text2
Text3 AA2345678,AA8765432 text3
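The capture-group idea above can be tried without creating file1 by feeding a line on standard input. The following is only a sketch: the function name extract_two_ids is made up for illustration, the anchors ^ and $ are an addition, and -E is used in place of -r because both GNU and BSD sed accept it.

```shell
# Hypothetical helper (not from the original answer): reads lines on stdin and
# prints "<first column> <ID>,<ID> <last column>" for lines holding two IDs.
extract_two_ids() {
  # \1 = first column, \2 and \3 = the two IDs, \4 = last column.
  # -n with the p flag prints only the lines on which the substitution succeeded.
  sed -n -E 's/^([^ ]+) .*(AA[0-9]{7}).*(AA[0-9]{7}).* ([^ ]+)$/\1 \2,\3 \4/p'
}

printf '%s\n' 'Text1 somethingAA0123456something,elseAA6543210foo text1' | extract_two_ids
```

For the sample line this prints Text1 AA0123456,AA6543210 text1. Lines with fewer than two IDs are dropped, since the pattern does not match them.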
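Answer 2
A Perl approach, similar to the sed one. Each line is tested against a long regular expression that captures the needed parts into $1, $2, $3, and $4. The answer is then built in $_ and printed by the -p flag.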
$ perl -pe 'if(/^(Text\d+) .*(AA\d{7}).*(AA\d{7}).* (.*)/){$_="$1 $2,$3 $4$/"}' file1
Text1 AA0123456,AA6543210 text1
Text2 AA1234567,AA7654321 text2
Text3 AA2345678,AA8765432 text3
Answer 3
An awk solution (the three-argument form of match() requires GNU awk):
$ a="Text1 somethingAA0123456something,elseAA9876543foo text1"
$ awk -F"[ ,]" '{match($2,/(AA[0-9]{7})/,a);match($3,/(AA[0-9]{7})/,b);print $1,a[1]","b[1],$NF}' <<<"$a"
Text1 AA0123456,AA9876543 text1
This also works:
$ awk '{match($0,/(\w+\s)(\w+)(\w\w[0-9]{7})(\w+,\w+)(\w\w[0-9]{7})(\w+\s)(\w+)/,a);print a[1] a[3] "," a[5],a[7]}' <<<"$a"
Update
For your new requirement, with GNU awk, you can use something like this:
$ echo "$b"
Text1 somethingAA1111111something,elseAA2222222fooblahAA3333333^blahblahAA4444444 text1
Text2 somethingAA1111111something,elseAA7777777fooblahAA5454545^blah text2
Text3 somethingAA1111111something,elseAA2222222foo text3
$ awk '{gsub(/(AA[0-9]{7})/," & ",$2)}1' <<<"$b" |awk '{printf("%s ",$1);for (i=2;i<NF;i++) {if($i ~ /AA[0-9]+/) printf("%s%s",$i,(i==NF-1)?" ":",")}}{printf(" %s\n",$NF)}'
Text1 AA1111111,AA2222222,AA3333333,AA4444444 text1
Text2 AA1111111,AA7777777,AA5454545, text2
Text3 AA1111111,AA2222222, text3
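The only flaw is an extra comma after the last AAXXXXXXX in some records. Hopefully that is not a big problem.
This solution chains two awk commands. The first awk transforms each line by inserting a space before and after every AAXXXXXXX it finds: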
$ echo "$a"
Text2 somethingAA1234567something,elseAA0987654fooblahAA3333333^blah text2
$ awk '{gsub(/(AA[0-9]{7})/," & ",$2)}1' <<<"$a"
Text2 something AA1234567 something,else AA0987654 fooblah AA3333333 ^blah text2
The transformed records are then fed to the second awk, which prints the first field, the last field, and every middle field that matches the AAXXXXXXX pattern.
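If the trailing comma matters, the two stages can be folded into a single awk pass that collects the IDs itself. This is a sketch rather than the answer's own method: the function name extract_all_ids is hypothetical, it assumes every field between the first and the last may contain IDs, and it spells out the seven digit classes instead of using {7} so that it does not depend on interval-expression support or on GNU-only features.

```shell
# Hypothetical helper (an assumption, not part of the original answer):
# reads lines on stdin and prints "<first field> <ID>,<ID>,... <last field>".
extract_all_ids() {
  awk '{
    ids = ""
    for (i = 2; i < NF; i++) {          # every field between first and last
      rest = $i
      # match() sets RSTART and RLENGTH; consume the field one ID at a time
      while (match(rest, /AA[0-9][0-9][0-9][0-9][0-9][0-9][0-9]/)) {
        ids = ids (ids == "" ? "" : ",") substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
      }
    }
    print $1, ids, $NF
  }'
}

printf '%s\n' 'Text2 somethingAA1111111something,elseAA7777777fooblahAA5454545^blah text2' | extract_all_ids
```

For the sample line this prints Text2 AA1111111,AA7777777,AA5454545 text2. The comma is added only before the second and later IDs, so none is left dangling at the end.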
Answer 4
Perl
perl -pale '$_ = join $", $F[0], join(",", $F[1] =~ /AA\d{7}/g), @F[2..$#F]' yourfile
Bash
The use of cat here is intentional: we do not want to clobber the positional parameters ($1, $2, ..., $#), so the while loop is run in a subshell.
cat yourfile |
while read -r f1 f2 rem; do
set -- "$f1" "$(printf '%s\n' "$f2" | grep -oP 'AA\d{7}' | paste -sd,)" "$rem"
printf '%s\n' "$*"
done
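The subshell point can be checked with a small experiment. This is a sketch assuming bash with default options (the lastpipe option off), where each part of a pipeline runs in its own subshell; the function name demo_subshell and the sample values are invented for illustration.

```shell
# Hypothetical demonstration (not from the original answer).
demo_subshell() {
  set -- outer1 outer2                 # positional parameters of this function
  printf 'a b c\n' | while read -r f1 f2 rem; do
    set -- "$f1" "$f2" "$rem"          # changes only the subshell copy
  done
  printf '%s\n' "$*"                   # the outer values are still in place
}

demo_subshell
```

Under those assumptions this prints outer1 outer2. Shells that run the last pipeline stage in the current shell, such as ksh or zsh, may behave differently.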
sed
sed -e '
s/[^ ]*[ ]*/&\
\
/
s/AA[0-9]\{7\}/\
&\
/g
:loop
s/\nAA[0-9]\{7\}\(\n\)/\1&/
s/\n\n.*\(\n\n\)/\1/
s/\(\n\n\)\(AA[0-9]\{7\}\)\n/\2,\1/
/\nAA[0-9]\{7\}\n/bloop
s/,\n\n[^ ]*//
' yourfile