I have a file containing 5 "blocks", as shown below:
AACP_AGRFC Agrobacterium fabrum A9CHM9 PDB; 2JQ4; NMR; -; A=1-83.
PDB; 4H2W_5GP.pdb; X-ray; 1.95 A; C/D=1-83.
PDB; 4H2X_G5A.pdb; X-ray; 2.15 A; C/D=1-83.
PDB; 4H2Y; X-ray; 2.10 A; C/D=1-83.
AADB1_KLEPN Klebsiella pneumoniae. P0AE05 PDB; 4WQK_GOL.pdb; X-ray; 1.48 A; A=1-177.
PDB; 4WQL_GOL.pdb; X-ray; 1.73 A; A=1-177.
PDB; 5KQJ; NMR; -; A=1-177.
AAKB2_RAT Rattus norvegicus Q9QZH4 PDB; 2LU3; NMR; -; A=67-163.
PDB; 2LU4; NMR; -; A=67-163.
PDB; 4Y0G_GOL.pdb; X-ray; 1.60 A; A/B=74-155.
PDB; 4YEE_GOL.pdb; X-ray; 2.00 A; A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R=74-155.
AAPK2_HUMAN Homo sapiens P54646 PDB; 2H6D; X-ray; 1.85 A; A=6-279.
PDB; 2LTU; NMR; -; A=282-339.
PDB; 2YZA; X-ray; 3.02 A; A=6-279.
PDB; 3AQV_TAK.pdb; X-ray; 2.08 A; A=6-279.
PDB; 4CFE; X-ray; 3.02 A; A/C=1-552.
PDB; 4CFF; X-ray; 3.92 A; A/C=1-552.
PDB; 4ZHX_4O7_C1V_C2Z.pdb; X-ray; 2.99 A; A/C=2-552.
PDB; 5EZV_C1V_C2Z_STU.pdb; X-ray; 2.99 A; A/C=2-347, A/C=397-552.
PDB; 5ISO_992_STU.pdb; X-ray; 2.63 A; A/C=1-552.
ABC3B_HUMAN Homo sapiens Q9UH17 PDB; 2NBQ; NMR; -; A=187-382.
PDB; 5CQD_GOL.pdb; X-ray; 2.08 A; A/C=187-378.
PDB; 5CQH; X-ray; 1.73 A; A=187-378.
PDB; 5CQI; X-ray; 1.68 A; A=187-378.
PDB; 5CQK_GOL_PGE.pdb; X-ray; 1.88 A; A=187-378.
PDB; 5TD5; X-ray; 1.72 A; A=187-378.
PDB; 5TKM; X-ray; 1.90 A; A/B=1-191.
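The lines differ in length, but we only care about one specific column: the one containing X-ray or NMR (they are always in the same column). For each "block", we want to check whether it has >= 5 lines with X-ray in that column. If it does, we want to print the block. If not, we want to drop the whole block. So the expected result should look like this: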
AAPK2_HUMAN Homo sapiens P54646 PDB; 2H6D; X-ray; 1.85 A; A=6-279.
PDB; 2LTU; NMR; -; A=282-339.
PDB; 2YZA; X-ray; 3.02 A; A=6-279.
PDB; 3AQV_TAK.pdb; X-ray; 2.08 A; A=6-279.
PDB; 4CFE; X-ray; 3.02 A; A/C=1-552.
PDB; 4CFF; X-ray; 3.92 A; A/C=1-552.
PDB; 4ZHX_4O7_C1V_C2Z.pdb; X-ray; 2.99 A; A/C=2-552.
PDB; 5EZV_C1V_C2Z_STU.pdb; X-ray; 2.99 A; A/C=2-347, A/C=397-552.
PDB; 5ISO_992_STU.pdb; X-ray; 2.63 A; A/C=1-552.
ABC3B_HUMAN Homo sapiens Q9UH17 PDB; 2NBQ; NMR; -; A=187-382.
PDB; 5CQD_GOL.pdb; X-ray; 2.08 A; A/C=187-378.
PDB; 5CQH; X-ray; 1.73 A; A=187-378.
PDB; 5CQI; X-ray; 1.68 A; A=187-378.
PDB; 5CQK_GOL_PGE.pdb; X-ray; 1.88 A; A=187-378.
PDB; 5TD5; X-ray; 1.72 A; A=187-378.
PDB; 5TKM; X-ray; 1.90 A; A/B=1-191.
PS. We can't use ; as the column separator, but we do know that the X-ray and NMR column always appears in the form PDB; XXXX(.pdb); X-ray or NMR.
Does anyone know how to do this in bash? Thanks
Answer 1
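Assuming your criterion can be expressed as the number of lines matching the regular expression /PDB; [^;]*; X-ray/, and that the blocks are separated by blank lines (which the paragraph-mode options used below rely on), you can do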
awk -vRS= -F'\n' '
{c=0; for(i=1;i<=NF;i++) c += $i ~ /PDB; [^;]*; X-ray/ ? 1 : 0} c >= 5
'
or (slightly more concise, in my opinion)
perl -F'\n' -00ne 'print unless (grep { /PDB; [^;]*; X-ray/ } @F) < 5'
Ex.
$ perl -F'\n' -00ne 'print unless (grep { /PDB; [^;]*; X-ray/ } @F) < 5' file
AAPK2_HUMAN Homo sapiens P54646 PDB; 2H6D; X-ray; 1.85 A; A=6-279.
PDB; 2LTU; NMR; -; A=282-339.
PDB; 2YZA; X-ray; 3.02 A; A=6-279.
PDB; 3AQV_TAK.pdb; X-ray; 2.08 A; A=6-279.
PDB; 4CFE; X-ray; 3.02 A; A/C=1-552.
PDB; 4CFF; X-ray; 3.92 A; A/C=1-552.
PDB; 4ZHX_4O7_C1V_C2Z.pdb; X-ray; 2.99 A; A/C=2-552.
PDB; 5EZV_C1V_C2Z_STU.pdb; X-ray; 2.99 A; A/C=2-347, A/C=397-552.
PDB; 5ISO_992_STU.pdb; X-ray; 2.63 A; A/C=1-552.
ABC3B_HUMAN Homo sapiens Q9UH17 PDB; 2NBQ; NMR; -; A=187-382.
PDB; 5CQD_GOL.pdb; X-ray; 2.08 A; A/C=187-378.
PDB; 5CQH; X-ray; 1.73 A; A=187-378.
PDB; 5CQI; X-ray; 1.68 A; A=187-378.
PDB; 5CQK_GOL_PGE.pdb; X-ray; 1.88 A; A=187-378.
PDB; 5TD5; X-ray; 1.72 A; A=187-378.
PDB; 5TKM; X-ray; 1.90 A; A/B=1-191.
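If the blocks in your file are not separated by blank lines, paragraph mode will see the whole file as a single record. As a rough sketch for that case, the function below (filter_blocks is a hypothetical name, not part of the answer above) assumes instead that a new block starts at every non-empty line that does not begin with PDB; and counts matches of the same regular expression per block. The threshold defaults to 5 and can be overridden with a second argument.

```shell
# Hypothetical helper: print only the blocks with at least MIN X-ray lines.
# Assumption: a block starts at any non-empty line that does not begin
# with "PDB;" (optionally preceded by spaces or tabs).
# Usage: filter_blocks FILE [MIN]
filter_blocks() {
  awk -v min="${2:-5}" '
    # Print the buffered block if it has enough X-ray lines, then reset.
    function emit() {
      if (n >= min) printf "%s", block
      block = ""
      n = 0
    }
    # A non-empty line that is not a PDB continuation starts a new block.
    NF && !/^[ \t]*PDB;/ { emit() }
    # Buffer every non-empty line and count the X-ray entries.
    NF {
      block = block $0 "\n"
      if ($0 ~ /PDB; [^;]*; X-ray/) n++
    }
    END { emit() }
  ' "$1"
}
```

Calling filter_blocks file would then apply the >= 5 rule from the question; blank lines in the input, if any, are skipped and not reproduced in the output.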