Explanation

Maybe my question needs a two-part answer, but I am hoping it can be done with a single "sed":

I have the following lines, each with a different ID:

ID1_TRINITY_DN120587_c0_g1::TRINITY_DN120587_c0_g1_i1::g.8298::m.8298

I would like to get:

TRINITY_DN120587_c0_g1_i1[ID1]

Answer 1

sed -e '
   s/::/\n/;s//\n/
   s/^\([^_]*\)_.*\n\(.*\)\n.*/\2[\1]/
   ;#  |--1---|      |-2-|
' ID.data

Place markers around the ID string, grab the part before the first _, and replace the whole line with those two values. Output:

TRINITY_DN120587_c0_g1_i1[ID1]

Explanation

              ID1_TRINITY_DN120587_c0_g1::TRINITY_DN120587_c0_g1_i1::g.8298::m.8298
              |-|                         |-----------------------|

You said you want to extract the ID located between the first and second occurrences of ::.

Step 1: Place a marker (usually \n) on each side of the region of interest:

       s/::/\n/;s//\n/

   This is how the pattern space looks after the above transformation

              ID1_TRINITY_DN120587_c0_g1\nTRINITY_DN120587_c0_g1_i1\ng.8298::m.8298
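To look at the marker step in isolation, here is a minimal sketch that runs only the two substitutions on the sample line. It assumes GNU sed, where \n in the replacement inserts a newline (other sed implementations may insert a literal n instead). The variable names line and marked are only for illustration.

```shell
# Sample line from the question (variable names are illustrative).
line='ID1_TRINITY_DN120587_c0_g1::TRINITY_DN120587_c0_g1_i1::g.8298::m.8298'

# The first s/// turns the first :: into a newline. The empty regex in the
# second s/// reuses the last regex used (::), so it hits the next occurrence.
# Assumes GNU sed for \n in the replacement.
marked=$(printf '%s\n' "$line" | sed -e 's/::/\n/;s//\n/')

# The markers are real newlines, so the pattern space prints as three lines.
printf '%s\n' "$marked"
```

Under that assumption the ID of interest ends up alone on the second printed line, and the remaining :: further right is left untouched.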

Step 2: Extract the ID between the two \ns, and the string to the left of the first occurrence of _

                    s/^\([^_]*\)_.*\n\(.*\)\n.*/\2[\1]/
                    ;#  |------|      |---|
                    ;#     \1           \2

   [^_]       => matches any char but an underscore

   [^_]*      => matches 0 or more non underscore char(s)

   \([^_]*\)  => store what was matched into a memory, recallable as \1

   ^\([^_]*\) => anchor your matching from the start of the string

   .*\n       => go up to the rightmost \n you can see in the string

   \n\(.*\)\n => Ooops!! we see another \n, hence we need to backtrack to
                 the previous \n position and from there start moving right again
                 and stop at the rightmost \n. Whatever is between these positions
                 is the string ID and is recallable as \2. Since the \ns fall outside
                 the \(...\), they are not stored in \2.

   .*         => This is a catchall: starting from the rightmost \n position, we
                 stroll to the end of the string and do nothing with it.

 So our regex engine has matched against the input string it was given in
 the pattern space and was able to store in two memory locations the data
 it was able to gather, viz.: \1 => stores the string portion which is in
 between the beginning of the pattern space and the 1st occurrence of the
 underscore.

 \2 => stores the string portion which is in between the 1st and 2nd
       occurrences of :: in the pattern space.

                      \1 = ID1
                      \2 = TRINITY_DN120587_c0_g1_i1
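One way to check these two captures for yourself is to keep the same regex but change the replacement so that it labels each group instead of building the final string. This is only a sketch, again assuming GNU sed for \n in the replacement. The labels first= and second= and the variable names are made up for the demonstration.

```shell
line='ID1_TRINITY_DN120587_c0_g1::TRINITY_DN120587_c0_g1_i1::g.8298::m.8298'

# Same markers and same regex as in steps 1 and 2; only the replacement
# differs, so each capture can be inspected separately.
captures=$(printf '%s\n' "$line" | sed -e '
   s/::/\n/;s//\n/
   s/^\([^_]*\)_.*\n\(.*\)\n.*/first=\1 second=\2/
')

printf '%s\n' "$captures"
```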

 Now comes the replacement part. Remember that the regex engine was able to scan
 the whole of the pattern space from beginning to end, hence the replacement
 will affect the whole of the pattern space.

 \2[\1] => We replace the matched portion of the pattern space (in our case it
           happens to be the entire string) with what has been stored in
           the memory \2 literal [ memory \1 literal ]
           leading to what we see below:

                  TRINITY_DN120587_c0_g1_i1[ID1]

In other words, you have just managed to turn the pattern space from:

              ID1_TRINITY_DN120587_c0_g1::TRINITY_DN120587_c0_g1_i1::g.8298::m.8298

into the following:

                  TRINITY_DN120587_c0_g1_i1[ID1]

Answer 2

awk solution:

awk -F'::' '{ print $2"[" substr($1,1,index($1,"_")-1) "]"}' file

Output:

TRINITY_DN120587_c0_g1_i1[ID1]

  • -F'::' - field separator

  • substr($1,1,index($1,"_")-1) - extracts a substring from the first field, starting at position 1 and ending just before the first occurrence of _ (i.e. ID1)
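If some lines might not follow this layout, the same idea can be written a little more defensively. The sketch below is one possible variant, not part of the original answer: it skips lines that have no underscore in the first field or no second field. The second input line is a made-up ID added only to show several records, and the names ids, result and pos are illustrative.

```shell
# First line is from the question; the second is a made-up ID for the demo.
ids='ID1_TRINITY_DN120587_c0_g1::TRINITY_DN120587_c0_g1_i1::g.8298::m.8298
ID2_TRINITY_DN99_c1_g2::TRINITY_DN99_c1_g2_i3::g.12::m.12'

result=$(printf '%s\n' "$ids" |
  awk -F'::' '
    {
      pos = index($1, "_")          # position of the first underscore, 0 if none
      if (pos == 0 || NF < 2) next  # skip lines that do not fit the layout
      print $2 "[" substr($1, 1, pos - 1) "]"
    }')

printf '%s\n' "$result"
```

Storing index() in a variable keeps the guard and the substr() call in sync. Without the guard, a line with no underscore would print an empty pair of brackets.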

Answer 3

I am assuming here that your pattern stays the same; in that case this single sed command should work.

sed -n "s/^\([^_]*\)_[^:]*::\([^:]*\)::.*/\2\[\1\]/p" filename

Output for the example input:

TRINITY_DN120587_c0_g1_i1[ID1]

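Explanation: starting from the beginning of the line, [^_]* matches everything up to the first underscore and stores it in the first group; [^:]* then matches the text between the first and second double colons as the second group. The whole line is replaced with the desired output format. -n suppresses the default output, and p prints only the lines that were modified.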
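The same substitution can also be written with extended regular expressions, which drops most of the backslashes. The sketch below assumes a sed that supports -E (current GNU and BSD versions do). The second and third input lines are made up: one line that does not fit the layout and one extra ID, to show that -n with p only prints lines where the substitution succeeded.

```shell
# First line is from the question; the other two are made up for the demo.
ids='ID1_TRINITY_DN120587_c0_g1::TRINITY_DN120587_c0_g1_i1::g.8298::m.8298
not a matching line
ID2_TRINITY_DN99_c1_g2::TRINITY_DN99_c1_g2_i3::g.12::m.12'

# -E switches to extended regexes, so groups are plain ( ) rather than \( \).
# -n plus the p flag prints only lines where the substitution succeeded.
result=$(printf '%s\n' "$ids" |
  sed -nE 's/^([^_]*)_[^:]*::([^:]*)::.*/\2[\1]/p')

printf '%s\n' "$result"
```

Single quotes around the sed script also avoid any question of how the shell treats backslashes inside double quotes.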
