Compare file B against file A and extract data from A using awk, sed, or grep

Answer 1

You can use awk:

awk 'NR==FNR{         # While reading the first file (fileB),
       a[$0];         # store each line as a key of the array a
       next
     }
     {                        # For each record of the second file (fileA),
         for(i in a)          # loop over every key stored in a
            if(index($0,i)) { # and check whether it occurs in the current record;
               print "C" $0   # if so, print the record, restoring the leading "C"
               next           # that the record separator consumed
            }
     }' fileB RS='\nC' fileA
C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein 
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD 
D      NT05HA_1310 protein-export membrane protein SecF 
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein
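
The command above relies on a multi-character RS, which POSIX leaves unspecified (gawk and mawk treat it as a regular expression). As a rough sketch, the same substring matching can be done line by line, which should work in any POSIX awk. The sample files, the scratch directory, the function name extract_by_substring, and the 09999 record are invented for illustration, and the asterisks in the sample output are assumed not to be part of the data.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (not the asker's real files).
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
C    02030 Bacterial chemotaxis [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein
C    09999 Invented unrelated pathway [PATH:aap09999]
D      NT05HA_0000 invented placeholder protein
C    03070 Bacterial secretion system [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
EOF
printf '%s\n' 'Bacterial chemotaxis' 'Bacterial secretion system' > "$tmp/fileB"

# extract_by_substring NAMES DATA
# Prints every C record of DATA whose C line contains a line of NAMES.
extract_by_substring() {
    awk '
        NR == FNR { if ($0 != "") names[$0]; next }  # first file: remember the names
        /^C/ {                                       # a C line starts a new record
            keep = 0
            for (n in names)
                if (index($0, n)) { keep = 1; break }
        }
        keep                                         # print lines of selected records
    ' "$1" "$2"
}

extract_by_substring "$tmp/fileB" "$tmp/fileA"
```

With this sample data, the two matching records are printed and the invented 09999 record is skipped. Like the original, this is a substring test, so a short name in fileB also selects longer names that contain it.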

Answer 2

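If you want to match exactly on the part between C <word> and [PATH:...] (and assuming the * characters in your sample are only there for emphasis and are not part of the actual data), you could do: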

awk '
  !start {all_strings[$0]; next}
  /^C/ {
    key = $0

    # strip the leading C <word>:
    sub(/^C[[:blank:]]+[^[:blank:]]+[[:blank:]]*/, "", key)

    # strip the trailing [...]:
    sub(/[[:blank:]]*\[[^]]*][[:blank:]]*$/, "", key)
    selected = key in all_strings
  }
  selected' fileB start=1 fileA

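Besides being more reliable (for example, Bacterial secretion matches only a Bacterial secretion record, not Bacterial secretion system), this is also very efficient: each file is read only once, and each match is a single hash table lookup rather than many substring searches or regular expression matches.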
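
To illustrate the difference between an exact lookup and a substring test, here is a small sketch. The file names keys and wanted, the scratch directory, and the function name compare_matching are made up for the example; the two pathway names come from the text above.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Keys as they would look after stripping "C <word>" and "[PATH:...]".
printf '%s\n' 'Bacterial secretion' 'Bacterial secretion system' > "$tmp/keys"
# A single wanted name, standing in for fileB.
printf '%s\n' 'Bacterial secretion' > "$tmp/wanted"

# compare_matching WANTED KEYS
# Reports, for each key, whether it matches exactly and as a substring.
compare_matching() {
    awk '
        NR == FNR { wanted[$0]; next }
        {
            exact = ($0 in wanted) ? "yes" : "no"   # one hash lookup
            partial = "no"
            for (w in wanted)                       # one search per wanted name
                if (index($0, w)) partial = "yes"
            print $0 ": exact=" exact ", substring=" partial
        }
    ' "$1" "$2"
}

compare_matching "$tmp/wanted" "$tmp/keys"
# Bacterial secretion: exact=yes, substring=yes
# Bacterial secretion system: exact=no, substring=yes
```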

Answer 3

I'm sure I'll get criticized for using a loop, but still... here is one way to do it.

#!/bin/bash

while read -r line; do
    sed -n "/$line/,/^C/p" fileA | sed '$d'
done < fileB

Example:

./bacteria.sh 
C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein 
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD 
D      NT05HA_1310 protein-export membrane protein SecF 
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein

Here, fileA and fileB are your example files.

Breakdown of the command:

sed -n "/$line/,/^C/p" fileA | sed '$d'

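This prints the lines from the line matching $line up to the next line that starts with the letter C, and then drops the last line (sed '$d'), because that line only serves as a "stop marker".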

sed --version
sed (GNU sed) 4.2.2

bash --version
GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)
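
One edge case in the loop above: if the matching record is the last one in fileA, no C line follows it, so sed '$d' removes a real data line instead of a stop marker. A possible workaround, sketched below, is to append an artificial C line to the input so that a stop marker always exists. The sample data, the scratch directory, and the function name extract_block are invented for illustration. As in the original, each line of fileB is used as a regular expression.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
C    02030 Bacterial chemotaxis [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
C    03070 Bacterial secretion system [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
D      NT05HA_1310 protein-export membrane protein SecF
EOF
printf '%s\n' 'Bacterial secretion system' > "$tmp/fileB"

# extract_block PATTERN DATA
# Prints the record of DATA whose C line matches PATTERN.
extract_block() {
    # The extra "C" line guarantees that the range always ends on a stop marker.
    { cat "$2"; echo C; } | sed -n "/$1/,/^C/p" | sed '$d'
}

while read -r line; do
    extract_block "$line" "$tmp/fileA"
done < "$tmp/fileB"
```

With this sample data, the last record is printed in full, including its final SecF line.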

Answer 4

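The data in fileA is divided into records, each starting with C at the beginning of a line. Each record is in turn divided into fields, each starting with D at the beginning of a line.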

We need to read the lines of fileB and use them to query the first field of each record in fileA:

while read -r query; do
    awk -v q="$query" 'BEGIN { RS="^C|\nC"; FS=OFS="\nD" } $1 ~ q {print "C" $0}' fileA
done <fileB

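I set the record separator (RS) to match C either at the very start of the data or right after a newline; otherwise we might not match anything in the first record correctly. I use the awk variable q to hold the value read from fileB, and match the first field of each record against that value.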

Result:

C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
D      NT05HA_1310 protein-export membrane protein SecF
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein
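
Note that $1 ~ q treats each line of fileB as a regular expression, so characters such as [ or * in a query are not matched literally. The sketch below contrasts a regex match with a literal index() test on the C lines only; it does not reproduce the record splitting used above. The sample data, the query, and the function names count_regex and count_literal are assumptions made for the example.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
C    02030 Bacterial chemotaxis [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
C    03070 Bacterial secretion system [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
EOF

# count_regex QUERY DATA
# Counts C lines matching QUERY as a regular expression; square brackets
# then form a bracket expression (any one of the listed characters).
count_regex() {
    awk -v q="$1" '/^C/ && $0 ~ q { n++ } END { print n + 0 }' "$2"
}

# count_literal QUERY DATA
# Counts C lines that contain QUERY as a literal substring.
count_literal() {
    awk -v q="$1" '/^C/ && index($0, q) { n++ } END { print n + 0 }' "$2"
}

query='[PATH:aap02030]'
echo "regex:   $(count_regex "$query" "$tmp/fileA") C lines"
echo "literal: $(count_literal "$query" "$tmp/fileA") C lines"
```

With this sample data the regex form matches both C lines, while the literal form matches only the one that actually contains [PATH:aap02030].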
