I have a file that looks like this:
Chr Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene AAChange.refGene Func.knownGene Gene.knownGene
1 53387379 53387379 G C UTR5 ECHDC2 NA NA UTR5 ECHDC2(FFF)
1 53387380 53387380 G C UTR5 C2(hhh) NA NA UTR5 C2(FFF)
1 1647814 1647814 T C exonic CDK11A,CDK11B synonymous SNV NA exonic CDK11A,CDK11B
1 1647814 1647814 T C exonic CDK11A23,CDK11B23 synonymous SNV NA exonic CDK11A23,CDK11B23
1 1670958 1670958 C G exonic SLC35E2A synonymous SNV NA exonic SLC35E2
1 1684347 1684347 - CCT exonic NADK nonframeshift insertion NA exonic NADK
1 7069620 7069620 T C intronic PTPN6(ggg),IL3 NA NA intronic PTPN6(ggg),IL3
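I want to output every row that contains the genes "C2", "CDK11A", or "IL3". My real file is much larger and the gene list much longer; this is just a small example for convenience.
I have been using the following script: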
tail -n+1 Book3.txt | awk -F'\t' 'BEGIN{OFS=FS}{if(NR==1 || $7=="C2" || $7~/C2[(]/ || $7~/C2/ || $11=="C2" || $11~/C2[(]/ || $11~/C2/ ||
$7=="CDK11A" || $7~/CDK11A[(]/ || $7~/CDK11A/ || $11=="CDK11A" || $11~/CDK11A[(]/ || $11~/CDK11A/ ||
$7=="IL3" || $7~/IL3[(]/ || $7~/IL3/ || $11=="IL3" || $11~/IL3[(]/ || $11~/IL3/) {print($0)}}' > Book3.genes.txt
The script also outputs rows I do not want, as shown here:
Chr Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene AAChange.refGene Func.knownGene Gene.knownGene
1 53387379 53387379 G C UTR5 ECHDC2 NA NA UTR5 ECHDC2(FFF)
1 53387380 53387380 G C UTR5 C2(hhh) NA NA UTR5 C2(FFF)
1 1647814 1647814 T C exonic CDK11A,CDK11B synonymous SNV NA exonic CDK11A,CDK11B
1 1647814 1647814 T C exonic CDK11A23,CDK11B23 synonymous SNV NA exonic CDK11A23,CDK11B23
1 7069620 7069620 T C intronic PTPN6(ggg),IL3 NA NA intronic PTPN6(ggg),IL3
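I do not need lines 2 and 5 of that output (the ECHDC2 and CDK11A23 rows). How can I modify the script so that the output contains only the genes in my list?
Answer 1
Put the genes you want to match in a file, one per line. Then it is just a single grep call: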
grep -Fwf genes.txt Book3.txt
To keep the header:
{ head -n1 Book3.txt; grep -Fwf genes.txt Book3.txt; }
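grep options:
-F
"fixed strings": disables regular expressions and looks only for literal substring matches
-w
"word match": matches only whole words
-f file
specifies the file containing the patterns (one per line)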
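To see what "whole word" means in practice, here is a small sketch that runs the same options over gene-like strings from the question. The name C2-AS1 is a hypothetical addition, included only to show that grep treats a hyphen as a word boundary; the patterns are passed with -e here instead of -f so the snippet needs no extra file.

```shell
# Gene-like strings taken from the question, plus one hypothetical hyphenated name.
names='ECHDC2
C2(hhh)
CDK11A23,CDK11B23
CDK11A,CDK11B
PTPN6(ggg),IL3
C2-AS1'

# -F: literal strings; -w: the match must not touch a letter, digit, or underscore.
matched=$(printf '%s\n' "$names" | grep -Fw -e C2 -e CDK11A -e IL3)
printf '%s\n' "$matched"
```

ECHDC2 and CDK11A23,CDK11B23 are rejected because the pattern is adjacent to a letter or digit there. C2-AS1 is accepted, so if the real data contains hyphenated gene names, -w alone may let some unwanted rows through.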
With your sample data:
$ cat genes.txt
C2
CDK11A
IL3
$ { head -n1 Book3.txt; grep -Fwf genes.txt Book3.txt; }
Chr Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene AAChange.refGene Func.knownGene Gene.knownGene
1 53387380 53387380 G C UTR5 C2(hhh) NA NA UTR5 C2(FFF)
1 1647814 1647814 T C exonic CDK11A,CDK11B synonymous SNV NA exonic CDK11A,CDK11B
1 7069620 7069620 T C intronic PTPN6(ggg),IL3 NA NA intronic PTPN6(ggg),IL3
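Note that grep looks at the whole line, not only the two gene columns. If that matters, an awk variant can restrict the comparison to columns 7 and 11, as the original script did. The following is a sketch under some assumptions rather than a drop-in replacement: the file is tab-separated, gene names within a field are comma-separated, and an annotation in parentheses contains no comma of its own. The file names sample_genes.txt, sample.tsv, and sample.genes.tsv are hypothetical and are created here only so the snippet runs on its own.

```shell
# Hypothetical sample files built from rows in the question (tab-separated).
printf 'C2\nCDK11A\nIL3\n' > sample_genes.txt
{
  printf 'Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\tExonicFunc.refGene\tAAChange.refGene\tFunc.knownGene\tGene.knownGene\n'
  printf '1\t53387379\t53387379\tG\tC\tUTR5\tECHDC2\tNA\tNA\tUTR5\tECHDC2(FFF)\n'
  printf '1\t53387380\t53387380\tG\tC\tUTR5\tC2(hhh)\tNA\tNA\tUTR5\tC2(FFF)\n'
  printf '1\t1647814\t1647814\tT\tC\texonic\tCDK11A,CDK11B\tsynonymous SNV\tNA\texonic\tCDK11A,CDK11B\n'
  printf '1\t1647814\t1647814\tT\tC\texonic\tCDK11A23,CDK11B23\tsynonymous SNV\tNA\texonic\tCDK11A23,CDK11B23\n'
  printf '1\t7069620\t7069620\tT\tC\tintronic\tPTPN6(ggg),IL3\tNA\tNA\tintronic\tPTPN6(ggg),IL3\n'
} > sample.tsv

awk -F'\t' '
    NR == FNR { genes[$1]; next }          # first file: remember each gene name
    FNR == 1  { print; next }              # second file: always keep the header
    {
        # Collect the gene names from columns 7 and 11.
        n = split($7 "," $11, names, ",")
        for (i = 1; i <= n; i++) {
            sub(/[(].*/, "", names[i])     # drop a trailing "(...)" annotation
            if (names[i] in genes) { print; next }
        }
    }
' sample_genes.txt sample.tsv > sample.genes.tsv

cat sample.genes.tsv
```

Because each name is compared exactly against the list, CDK11A23 and ECHDC2 are not selected, and a hyphenated name would match only if it appears in the gene file itself.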