Grep words from one file, match them in another file, and append what follows each match

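I want to grep several words from file1 and use each word to grep, in file2, for whatever follows its match. I then want to append the string that follows the match to the word I searched with, so that file03 contains

word1 [the thing that was found using word1 in a grep in file2]
word2 [the thing that was found using word2 in a grep in file2]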

Parts of the files I have are shown below. file1:

JAN1319964: PGSC|PGSC0003DMP400068385_PGSC0003DMT400096710  PGSC|PGSC0003DMP400062633_PGSC0003DMT400090958 PGSC|PGSC0003DMP400066271_PGSC0003DMT400094596 PGSC|PGSC0003DMP400064671_PGSC0003DMT400092996 PGSC|PGSC0003DMP400068967_PGSC0003DMT400097292
JAN1327159: PGSC|PGSC0003DMP400016823_PGSC0003DMT400024599 PGSC|PGSC0003DMP400017933_PGSC0003DMT400026257 Dul|Dul_comp58749_c0_seq2-1
JAN1330513: Des|Des_g36886.t1 PGSC|PGSC0003DMP400049952_PGSC0003DMT400073802

file2:

>Dul|Dul_g997.t1
ESECRVQYFSDDEVSPVTEVTGRRGSICVVCRLVPKASVSESSFLK
>Dul|Dul_g998.t1
MDDKRLWEEEERRRIAVRQREERGKIYERQKALEEQEKLAAIESYQDAIRREREEEERLKEKKKKKKKTEIRDDYLDDFLPRRNDRRIPDRDRSVKRRQTFESGRHAKEHAPPTKRRRGGEVGLSNILEEIVDTLKNNVNVSYLFLKPVTRKEAPDYHKYVKRPMDLSTIKERARKLEYKNRGQFRHDVAQITINAHLYNDGRNPGIPPLADQLLEICDYLLEENESILAEAESAI
>Dul|Dul_g999.t1
MDDKRLWEEEERRRIAVRQREERGKIYERQKALEEQEKLAAIESYQDAIRREREEEERLKEKKKKKKKTEIRDDYLDDFLPRRNDRRIPDRDRSVKRRQTFESGRHAKEHAPPTKRRRGGEVGLSNILEEIVDTLKNNVNVSYLFLKPVTRKEAPDYHKYVKRPMDLSTIKERARKLEYKNRGQFRHDVAQITINAHLYNDGRNPGIPPLADQLLEICDYLLEENESILAEAESGIEQ
>Des|Des_g1.t1
FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK

The output I want for this example is:

JAN1319964: PGSC|PGSC0003DMP400068385_PGSC0003DMT400096710 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400062633_PGSC0003DMT400090958 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400066271_PGSC0003DMT400094596 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400064671_PGSC0003DMT400092996 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400068967_PGSC0003DMT400097292  [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
JAN1327159: PGSC|PGSC0003DMP400016823_PGSC0003DMT400024599 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400017933_PGSC0003DMT400026257 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
Dul|Dul_comp58749_c0_seq2-1
JAN1330513: Des|Des_g36886.t1  [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400049952_PGSC0003DMT400073802 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK

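As you can see, file1 is missing some information that file2 contains, and that information needs to be added to file1. If anyone knows how to do this, I would really appreciate it!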

Answer 1

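I don't fully understand your question, so I will answer what I think you are asking. Suppose you have a file of identifiers of interest like this (I am assuming that the first field is never an identifier, and that at least some of the IDs are present in the sequence file, although none of the IDs in your example are):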

Jan12345: ID1 ID2 ... IDN1
Jan67899: ID11 ID12 ... IDN2

and a FASTA file like this:

>ID1
ABCDEFG
>ID2
HIJKLMN
>IDN1
OPQRSTU
>ID11
WXYZABC
>ID12
DEFGHIJ
>IDN2
KLMNOPQ

and you want an output file like this:

Jan12345: ID1 ABCDEFG ID2 HIJKLMN ... IDN1 OPQRSTU

then you can do something like this:

  1. Save this script as FastaToTbl and make it executable (chmod 744 FastaToTbl):

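    #! /bin/sh
    # Convert FASTA to one "ID<TAB>SEQUENCE" record per line.
    gawk '{
        if (substr($1,1,1)==">") {
            # Header line: start a new record (newline first, except on line 1)
            if (NR>1)
                printf "\n%s\t", substr($0,2,length($0)-1)
            else
                printf "%s\t", substr($0,2,length($0)-1)
        } else {
            # Sequence line: append it to the current record
            printf "%s", $0
        }
    }END{printf "\n"}'  "$@"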
    

    This converts the FASTA file to one tab-separated record per line (ID<TAB>SEQUENCE).

  2. Use FastaToTbl together with this script to take the IDs from file1 and the sequences from file2:

    FastaToTbl file2 | 
      perl -ne 'chomp;@a=split(/\t/); $k{$a[0]}=$a[1]; ## Collect the sequences
                                                       ## $k{ID}=SEQUENCE
          END{open(A,"<","file1") or die "file1: $!";  ## Open ID file
             while(<A>){         ## and process it line by line
               @a=split(/\s+/);  ## Gather the IDs in array @a
               print shift(@a);  ## Print the first element (Jan123:)
               print " $_ $k{$_}" for @a; ## Print each ID and its seq
               print "\n";
     }}' 

    For the example above, this prints:

    Jan12345: ID1 ABCDEFG ID2 HIJKLMN IDN1 OPQRSTU
    Jan67899: ID11 WXYZABC ID12 DEFGHIJ IDN2 KLMNOPQ
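
As a rough alternative sketch (not part of the two-step approach above), both steps could probably be folded into a single awk call that reads the FASTA file first and the ID file second. The function name annotate_ids, the NA placeholder for IDs that have no sequence, and the sample file names are all assumptions made for illustration. As in FastaToTbl, the whole header line after > is taken as the ID, and wrapped sequences are joined.

```shell
#!/bin/sh
# annotate_ids (hypothetical helper name): $1 = FASTA file, $2 = ID file.
# Assumes the FASTA file is non-empty, since NR == FNR is used to detect it.
annotate_ids() {
    awk '
        NR == FNR {                       # first file: the FASTA
            if (substr($0, 1, 1) == ">")
                id = substr($0, 2)        # header line: remember the ID
            else if (id != "")
                seq[id] = seq[id] $0      # sequence line: append (joins wrapped lines)
            next
        }
        {                                 # second file: the ID list
            printf "%s", $1               # first field is the label, e.g. Jan12345:
            for (i = 2; i <= NF; i++)     # remaining fields are IDs
                printf " %s %s", $i, (($i in seq) ? seq[$i] : "NA")  # NA is an assumed placeholder
            printf "\n"
        }
    ' "$1" "$2"
}

# Small demonstration with made-up data
tmp=$(mktemp -d)
printf '>ID1\nABCDEFG\n>ID2\nHIJ\nKLMN\n' > "$tmp/seqs.fa"
printf 'Jan12345: ID1 ID2 ID3\n' > "$tmp/ids.txt"
annotate_ids "$tmp/seqs.fa" "$tmp/ids.txt"
# prints: Jan12345: ID1 ABCDEFG ID2 HIJKLMN ID3 NA
rm -r "$tmp"
```

With the files from the question this would be called as annotate_ids file2 file1 > file03. Printing NA for a missing ID is only one possible choice; the perl version above prints nothing after such an ID instead.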
    
