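I want to grep several words from file1 and, for each word, grep file2 for whatever follows its match. I then want to append the string that follows the match to the word I searched with, so that file03 contains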
word1 [the thing that was found using word1 in a grep in file2]
word2 [the thing that was found using word2 in a grep in file2]
Parts of the files I have are: file1:
JAN1319964: PGSC|PGSC0003DMP400068385_PGSC0003DMT400096710 PGSC|PGSC0003DMP400062633_PGSC0003DMT400090958 PGSC|PGSC0003DMP400066271_PGSC0003DMT400094596 PGSC|PGSC0003DMP400064671_PGSC0003DMT400092996 PGSC|PGSC0003DMP400068967_PGSC0003DMT400097292
JAN1327159: PGSC|PGSC0003DMP400016823_PGSC0003DMT400024599 PGSC|PGSC0003DMP400017933_PGSC0003DMT400026257 Dul|Dul_comp58749_c0_seq2-1
JAN1330513: Des|Des_g36886.t1 PGSC|PGSC0003DMP400049952_PGSC0003DMT400073802
file2:
>Dul|Dul_g997.t1
ESECRVQYFSDDEVSPVTEVTGRRGSICVVCRLVPKASVSESSFLK
>Dul|Dul_g998.t1
MDDKRLWEEEERRRIAVRQREERGKIYERQKALEEQEKLAAIESYQDAIRREREEEERLKEKKKKKKKTEIRDDYLDDFLPRRNDRRIPDRDRSVKRRQTFESGRHAKEHAPPTKRRRGGEVGLSNILEEIVDTLKNNVNVSYLFLKPVTRKEAPDYHKYVKRPMDLSTIKERARKLEYKNRGQFRHDVAQITINAHLYNDGRNPGIPPLADQLLEICDYLLEENESILAEAESAI
>Dul|Dul_g999.t1
MDDKRLWEEEERRRIAVRQREERGKIYERQKALEEQEKLAAIESYQDAIRREREEEERLKEKKKKKKKTEIRDDYLDDFLPRRNDRRIPDRDRSVKRRQTFESGRHAKEHAPPTKRRRGGEVGLSNILEEIVDTLKNNVNVSYLFLKPVTRKEAPDYHKYVKRPMDLSTIKERARKLEYKNRGQFRHDVAQITINAHLYNDGRNPGIPPLADQLLEICDYLLEENESILAEAESGIEQ
>Des|Des_g1.t1
FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
The output I want for this example is:
JAN1319964: PGSC|PGSC0003DMP400068385_PGSC0003DMT400096710 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400062633_PGSC0003DMT400090958 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400066271_PGSC0003DMT400094596 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400064671_PGSC0003DMT400092996 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400068967_PGSC0003DMT400097292 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
JAN1327159: PGSC|PGSC0003DMP400016823_PGSC0003DMT400024599 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400017933_PGSC0003DMT400026257 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
Dul|Dul_comp58749_c0_seq2-1
JAN1330513: Des|Des_g36886.t1 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400049952_PGSC0003DMT400073802 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
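As you can see, file1 lacks some information that file2 contains, and that information needs to be added to file1. If anyone knows how to do this, I would be very grateful!
Answer 1
I don't fully understand your question, so I'll answer what I think you are asking. Suppose you have a file of identifiers of interest like this (I'm assuming the first field is never an identifier, and also that at least some of the IDs exist in the sequence file, although none of the IDs in your example do):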
Jan12345: ID1 ID2 ... IDN1
Jan67899: ID11 ID12 ... IDN2
and a FASTA file like this:
>ID1
ABCDEFG
>ID2
HIJKLMN
>IDN1
OPQRSTU
>ID11
WXYZABC
>ID12
DEFGHIJ
>IDN2
KLMNOPQ
and you want an output file like this:
Jan12345 ID1 ABCDEFG ID2 HIJKLMN ... IDN1 OPQRSTU
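You could do something like this:
Save this script as FastaToTbl and make it executable (chmod 744 FastaToTbl):

    #! /bin/sh
    # Convert FASTA to a table: one line per record, ID<TAB>SEQUENCE.
    gawk '{
        if (substr($1,1,1)==">") {
            # Header line: start a new row (no leading newline for the first record)
            if (NR>1)
                printf "\n%s\t", substr($0,2,length($0)-1)
            else
                printf "%s\t", substr($0,2,length($0)-1)
        } else {
            # Sequence line: append it to the current row
            printf "%s", $0
        }
    }END{printf "\n"}' "$@"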
This converts the FASTA file to a table (ID<TAB>SEQUENCE). Combine FastaToTbl with the following script, which takes the IDs from file1 and the sequences from file2:

    FastaToTbl file2 | perl -ne 'chomp;@a=split(/\t/);
        $k{$a[0]}=$a[1];                ## Collect the sequences
                                        ## $k{ID}=SEQUENCE
        END{open(A,"file1");            ## Open ID file
          while(<A>){                   ## and process it line by line
            @a=split(/\s+/);            ## Gather the IDs in array @a
            print shift(@a);            ## Print the first element (Jan123:)
            print " $_ $k{$_}" for @a;  ## Print each ID and its seq
            print "\n";
        }}'

For the example files above, the output is:

    Jan12345: ID1 ABCDEFG ID2 HIJKLMN IDN1 OPQRSTU
    Jan67899: ID11 WXYZABC ID12 DEFGHIJ IDN2 KLMNOPQ
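
If you would rather not keep a separate helper script around, the same lookup can probably be done in a single awk pass. The following is only a sketch under a few assumptions, not a tested replacement for the pipeline above: the function name annotate_ids is made up for illustration, the ID is taken to be everything between ">" and the first whitespace on a header line, the FASTA file is assumed to be non-empty, and IDs with no sequence in the FASTA file are printed unchanged.

```shell
#!/bin/sh
# annotate_ids IDS_FILE FASTA_FILE   (hypothetical helper name)
# Prints each line of IDS_FILE with the matching sequence after every ID.
annotate_ids() {
    awk '
        # First file on the command line (the FASTA file): build ID -> sequence.
        NR == FNR {
            if (substr($0, 1, 1) == ">") {
                id = substr($1, 2)          # drop the leading ">"
            } else if (id != "") {
                seq[id] = seq[id] $0        # join multi-line sequences
            }
            next
        }
        # Second file (the ID list): field 1 is the label, the rest are IDs.
        {
            printf "%s", $1
            for (i = 2; i <= NF; i++) {
                printf " %s", $i
                if ($i in seq) printf " %s", seq[$i]
            }
            printf "\n"
        }
    ' "$2" "$1"
}
```

You would then call it as annotate_ids file1 file2 > file03. Holding all sequences in an awk array is the same trade-off the perl hash makes, so memory use grows with the size of the FASTA file.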