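I have two files: file A contains all the data, while file B contains only IDs. I want to compare file B against file A and retrieve the data that belongs to the IDs listed in file B. I am using SUSE Linux.
File A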
C 02020 Two-component system [PATH:aap02020]
D NT05HA_1798 sensor protein CpxA
D NT05HA_1797 CpxR K07662 cpxR
C 02030 *Bacterial chemotaxis* [PATH:aap02030]
D NT05HA_0919 maltose-binding periplasmic protein
D NT05HA_0918 maltose-binding periplasmic protein
C 03070 *Bacterial secretion system* [PATH:aap03070]
D NT05HA_1309 protein-export membrane protein SecD
D NT05HA_1310 protein-export membrane protein SecF
D NT05HA_1819 preprotein translocase subunit SecE
D NT05HA_1287 protein-export membrane protein
C 02060 Phosphotransferase system (PTS) [PATH:aap02060]
D NT05HA_0618 phosphoenolpyruvate-protein
D NT05HA_0617 phosphocarrier protein HPr
D NT05HA_0619 pts system
File B
Bacterial chemotaxis
Bacterial secretion system
Desired output:
C 02030 *Bacterial chemotaxis* [PATH:aap02030]
D NT05HA_0919 maltose-binding periplasmic protein
D NT05HA_0918 maltose-binding periplasmic protein
C 03070 *Bacterial secretion system* [PATH:aap03070]
D NT05HA_1309 protein-export membrane protein SecD
D NT05HA_1310 protein-export membrane protein SecF
D NT05HA_1819 preprotein translocase subunit SecE
D NT05HA_1287 protein-export membrane protein
Answer 1
You can use awk:
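awk 'NR==FNR{              # While reading the first file (fileB),
  a[$0]                    # store each line as a key of the array a
  next
}
{                          # While reading the second file (fileA),
  sub(/^C/, "")            # drop the "C" that the very first record still carries
  sub(/\n$/, "")           # drop the newline left at the end of the last record
  for(i in a)              # for each element of the array a,
    if(index($0,i)) {      # check whether it occurs in the current record
      print "C" $0         # if so, print the record, restoring the "C" consumed by RS
      next
    }
}' fileB RS='\nC' fileA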
C 02030 *Bacterial chemotaxis* [PATH:aap02030]
D NT05HA_0919 maltose-binding periplasmic protein
D NT05HA_0918 maltose-binding periplasmic protein
C 03070 *Bacterial secretion system* [PATH:aap03070]
D NT05HA_1309 protein-export membrane protein SecD
D NT05HA_1310 protein-export membrane protein SecF
D NT05HA_1819 preprotein translocase subunit SecE
D NT05HA_1287 protein-export membrane protein
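A multi-character RS such as '\nC' is supported by gawk and mawk as an extension, but POSIX leaves its behavior unspecified. As a rough sketch under that caveat, the same substring idea can be written line by line without touching RS. The scratch directory and the names wanted and selected are made up for illustration, and the sample is shortened from the question.

```shell
# Hypothetical scratch files holding a shortened copy of the question's sample
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
C 02020 Two-component system [PATH:aap02020]
D NT05HA_1798 sensor protein CpxA
C 02030 *Bacterial chemotaxis* [PATH:aap02030]
D NT05HA_0919 maltose-binding periplasmic protein
C 03070 *Bacterial secretion system* [PATH:aap03070]
D NT05HA_1309 protein-export membrane protein SecD
C 02060 Phosphotransferase system (PTS) [PATH:aap02060]
D NT05HA_0618 phosphoenolpyruvate-protein
EOF
printf '%s\n' 'Bacterial chemotaxis' 'Bacterial secretion system' > "$tmp/fileB"

result=$(awk '
  NR == FNR { wanted[$0]; next }   # first file: remember every query string
  /^C/ {                           # header line: decide for the whole block
    selected = 0
    for (s in wanted)
      if (index($0, s)) { selected = 1; break }
  }
  selected                         # print every line of a selected block
' "$tmp/fileB" "$tmp/fileA")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the decision is taken on each C line and then applied to the D lines that follow, the first and last blocks need no special handling.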
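Answer 2
If you want to match exactly on the part between C <word> and [PATH:...] (and assuming the * characters in the sample are only there for emphasis and are not part of the actual data), you could do: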
awk '
!start {all_strings[$0]; next}
/^C/ {
key = $0
# strip the leading C <word>:
sub(/^C[[:blank:]]+[^[:blank:]]+[[:blank:]]*/, "", key)
# strip the trailing [...]:
sub(/[[:blank:]]*\[[^]]*][[:blank:]]*$/, "", key)
selected = key in all_strings
}
selected' fileB start=1 fileA
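Besides being more reliable (for example, Bacterial secretion only matches a Bacterial secretion record, not a Bacterial secretion system one), it is also very efficient: each file is read only once, and matching is a single hash-table lookup per C line rather than many substring searches or regular-expression matches.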
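To make the reliability point concrete, here is a small sketch that runs a substring test and the exact-key test against the same data, with the shorter query Bacterial secretion. The scratch files and the variable names substring_hits and exact_hits are made up for illustration. The key extraction follows the command above, written with [ \t] in place of [[:blank:]] only so that it also runs on older awk implementations lacking POSIX character classes.

```shell
# Hypothetical scratch files; the data has no * characters, as assumed above
tmp=$(mktemp -d)
printf '%s\n' \
  'C 03070 Bacterial secretion system [PATH:aap03070]' \
  'D NT05HA_1309 protein-export membrane protein SecD' \
  'C 02060 Phosphotransferase system (PTS) [PATH:aap02060]' \
  'D NT05HA_0618 phosphoenolpyruvate-protein' > "$tmp/fileA"
printf '%s\n' 'Bacterial secretion' > "$tmp/fileB"

# Substring test: a header that merely contains the query selects its block
substring_hits=$(awk '
  NR == FNR { q[$0]; next }
  /^C/ { selected = 0; for (s in q) if (index($0, s)) selected = 1 }
  selected' "$tmp/fileB" "$tmp/fileA")

# Exact test: the text between "C <word>" and "[...]" must equal a query
exact_hits=$(awk '
  !start { all_strings[$0]; next }
  /^C/ {
    key = $0
    sub(/^C[ \t]+[^ \t]+[ \t]*/, "", key)    # strip the leading C <word>
    sub(/[ \t]*\[[^]]*\][ \t]*$/, "", key)   # strip the trailing [...]
    selected = (key in all_strings)
  }
  selected' "$tmp/fileB" start=1 "$tmp/fileA")

printf 'substring match:\n%s\n' "$substring_hits"
printf 'exact match:\n%s\n' "$exact_hits"
rm -rf "$tmp"
```

The substring test selects the Bacterial secretion system block even though only Bacterial secretion was asked for, while the exact test selects nothing.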
Answer 3
I'm sure I'll get knocked for using a loop, but still... here is one way.
#!/bin/bash
while read -r line; do
sed -n "/$line/,/^C/p" fileA | sed '$d'
done < fileB
Example:
./bacteria.sh
C 02030 *Bacterial chemotaxis* [PATH:aap02030]
D NT05HA_0919 maltose-binding periplasmic protein
D NT05HA_0918 maltose-binding periplasmic protein
C 03070 *Bacterial secretion system* [PATH:aap03070]
D NT05HA_1309 protein-export membrane protein SecD
D NT05HA_1310 protein-export membrane protein SecF
D NT05HA_1819 preprotein translocase subunit SecE
D NT05HA_1287 protein-export membrane protein
where fileA and fileB are your example files.
Regex breakdown:
sed -n "/$line/,/^C/p" fileA | sed '$d'
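prints the lines from the one matching $line up to the next line that starts with the letter C, then drops ( sed '$d' ) that last line, since it only serves as a "stop marker".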
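The two stages can be looked at separately. The sketch below uses a made-up scratch file and the illustrative variable names with_marker and without_marker to show what the range prints before and after the stop marker is removed.

```shell
# Hypothetical scratch file with two blocks taken from the question's sample
tmp=$(mktemp -d)
printf '%s\n' \
  'C 02030 *Bacterial chemotaxis* [PATH:aap02030]' \
  'D NT05HA_0919 maltose-binding periplasmic protein' \
  'D NT05HA_0918 maltose-binding periplasmic protein' \
  'C 03070 *Bacterial secretion system* [PATH:aap03070]' \
  'D NT05HA_1309 protein-export membrane protein SecD' > "$tmp/fileA"

line='Bacterial chemotaxis'

# Stage 1: the range runs from the matching header to the next line starting with C
with_marker=$(sed -n "/$line/,/^C/p" "$tmp/fileA")

# Stage 2: delete the last printed line, i.e. the header of the following block
without_marker=$(printf '%s\n' "$with_marker" | sed '$d')

printf 'range output:\n%s\n\n' "$with_marker"
printf 'after sed $d:\n%s\n' "$without_marker"
rm -rf "$tmp"
```

Two things to keep in mind with this approach: $line is interpolated into a sed regular expression, so characters such as / or . in a query are interpreted by sed rather than taken literally, and if the matching block is the last one in the file there is no closing C line, so sed '$d' would drop a real data line.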
sed --version
sed (GNU sed) 4.2.2
bash --version
GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)
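Answer 4
The data in fileA is divided into records that each start with a C at the beginning of a line. Each record is in turn divided into fields that start with a D at the beginning of a line.
We need to read the lines of fileB and use them to query the first field of each record in fileA: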
while read -r query; do
awk -vq="$query" 'BEGIN { RS="^C|\nC"; FS=OFS="\nD" } $1 ~ q {print "C" $0}' fileA
done <fileB
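I set the record separator ( RS ) to match a C either at the very start of the input or right after a newline, since otherwise we might fail to match anything in the first record. I use the awk variable q to hold the value read from fileB, and match the first field of each record against that value (which awk treats as a regular expression).
Result: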
C 02030 *Bacterial chemotaxis* [PATH:aap02030]
D NT05HA_0919 maltose-binding periplasmic protein
D NT05HA_0918 maltose-binding periplasmic protein
C 03070 *Bacterial secretion system* [PATH:aap03070]
D NT05HA_1309 protein-export membrane protein SecD
D NT05HA_1310 protein-export membrane protein SecF
D NT05HA_1819 preprotein translocase subunit SecE
D NT05HA_1287 protein-export membrane protein