Finding strings within subgroups in a txt file

Question 1

This is easy to do in Python:

$ cat input.txt | ./find_strings.py PF13304.1 PF13401.1
AAA_21               PF13304.1  x_00004
AAA_22               PF13401.1  x_00004
AAA_21               PF13304.1  x_00005
AAA_22               PF13401.1  x_00005
AAA_21               PF13304.1  x_00006
AAA_22               PF13401.1  x_00007

Contents of find_strings.py:

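#!/usr/bin/env python3
import sys

# Search strings are given as command-line arguments
strings = sys.argv[1:]

# Read stdin line by line and print each line that contains any search string
for line in sys.stdin:
    if any(string in line for string in strings):
        print(line.strip())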

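The idea is that we pipe the contents of the input file into the script's standard input, read the stream line by line, and check each line against the list of arguments given on the command line. A fairly simple approach. Note that it prints every line containing any of the search strings, so the x_00006 and x_00007 lines appear even though neither subgroup contains both strings.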

Question 2

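It is certainly not as simple as a plain grep. This program (it needs GNU awk, because the |& coprocess operator is a gawk extension):

scans the text file, accumulating "blocks" of lines whose third field is the same string
when it finds a complete block, calls grep on it and collects the output
if the number of output lines equals the number of search terms, prints grep's output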

awk '
  function grep(block,    m, grep_out, cmd, line, i) {
    m = 0
    delete grep_out

    cmd = "grep -f " ARGV[1]    # define the grep command
    print block |& cmd          # invoke grep, and send the block of text as stdin
    close(cmd, "to")            # close greps stdin so we can start reading the output

    # read from grep until no more output
    while ((cmd |& getline line) > 0)
      grep_out[m++] = line
    close(cmd)

    # did grep find all search terms?  If yes, print the output 
    if (length(grep_out) == nterms)
      for (i=0; i<m; i++) 
        print grep_out[i]
  }

  # read the search terms file, just to count the number of lines
  NR == FNR {
    nterms++
    next
  }

  # if we detect a new block, call grep and start a new block
  section != $3 {
    if (block) grep(block)
    block = ""
    section = $3
  } 

  {block = block $0 RS}   # accumulate the lines in this block

  END {if (block) grep(block)}       # also call grep at end of file

' fileContainingStrings fileToScan

This produces the following output:

AAA_21               PF13304.1  x_00004
AAA_22               PF13401.1  x_00004
AAA_21               PF13304.1  x_00005
AAA_22               PF13401.1  x_00005
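
The three steps listed above can also be sketched in Python without spawning grep. This is only an illustration, not the author's method: the function name blocks_with_all_terms is made up for this sketch, the sample rows are a subset of the input shown elsewhere in this thread, and every row is assumed to have at least three whitespace-separated fields. It uses plain substring matching, and it checks that each term was found rather than comparing line counts.

```python
from itertools import groupby


def blocks_with_all_terms(lines, terms):
    """Yield the matching lines of every block (adjacent lines sharing a
    third field) in which each search term occurs at least once."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    # groupby only merges *adjacent* rows, like the awk block accumulation
    for _, block in groupby(rows, key=lambda row: row.split()[2]):
        block = list(block)
        # the lines a substring search for any term would report
        hits = [row for row in block if any(t in row for t in terms)]
        # check every term individually instead of counting output lines
        if all(any(t in row for row in hits) for t in terms):
            yield from hits


# Sample rows: a subset of the thread's input file
sample = """\
AAA_21               PF13304.1  x_00004
AAA_22               PF13401.1  x_00004
SMC_N                PF02463.14 x_00004
AAA_21               PF13304.1  x_00005
AAA_22               PF13401.1  x_00005
AAA_21               PF13304.1  x_00006
AAA_22               PF13401.1  x_00007
""".splitlines()

for row in blocks_with_all_terms(sample, ["PF13304.1", "PF13401.1"]):
    print(row)
```

Like the awk version, this relies on the lines of a subgroup being adjacent in the file; sort on the third field first if that is not guaranteed.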

Question 3

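So, if I understand correctly, you want to find all subgroups that contain every one of the patterns you specify. This can be done with sort and awk, for example: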

# make sure subgroups are adjacent 
sort -k3,3 infile |

# add a newline between subgroups, this allows the next 
# invocation of awk to read each subgroup as a record
awk 'NR > 1 && p!=$3 { printf "\n" } { p=$3 } 1' |

# match the desired patterns and print the subgroup name
awk '/\<PF13304\.1\>/ && /\<PF13401\.1\>/ { print $3 }' RS=

Output:

x_00004
x_00005

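Based on the output above, you can now extract the relevant lines from infile, for example by piping the output of the pipeline above into the following loop: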

while read sgrp; do
  grep -E "\b(PF13304\.1|PF13401\.1)\b +$sgrp\$" infile
done

Output:

AAA_21               PF13304.1  x_00004
AAA_22               PF13401.1  x_00004
AAA_21               PF13304.1  x_00005
AAA_22               PF13401.1  x_00005
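
The same two-stage idea (first find the qualifying subgroup names, then pull the relevant lines) can be sketched in Python. This is a rough equivalent under stated assumptions, not a translation of the pipeline: the function names subgroups_with_all and matching_lines are hypothetical, the second field is compared exactly instead of with a regular expression, and the sample rows are the thread's data in a deliberately shuffled order to show that no prior sort is needed.

```python
from collections import defaultdict


def subgroups_with_all(lines, wanted):
    """Return the sorted subgroup names (third field) whose set of
    second-field IDs includes every ID in `wanted`."""
    wanted = set(wanted)
    seen = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            seen[fields[2]].add(fields[1])
    return sorted(name for name, ids in seen.items() if wanted <= ids)


def matching_lines(lines, wanted):
    """Return, in input order, the lines whose ID is wanted and whose
    subgroup contains all wanted IDs."""
    wanted = set(wanted)
    keep = set(subgroups_with_all(lines, wanted))
    result = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[1] in wanted and fields[2] in keep:
            result.append(line)
    return result


# Rows taken from the thread's sample, intentionally not grouped
infile = [
    "AAA_21               PF13304.1  x_00004",
    "DUF258               PF03193.11 x_00005",
    "AAA_22               PF13401.1  x_00004",
    "AAA_21               PF13304.1  x_00005",
    "AAA_21               PF13304.1  x_00006",
    "AAA_22               PF13401.1  x_00005",
    "AAA_22               PF13401.1  x_00007",
]

print(subgroups_with_all(infile, ["PF13304.1", "PF13401.1"]))
for line in matching_lines(infile, ["PF13304.1", "PF13401.1"]):
    print(line)
```

Because the IDs are collected in a dictionary keyed by subgroup, the whole file is held in memory; for very large inputs the sort-based pipeline may be the better fit.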

Question 4

Something like this?

awk '(/x_00004/ || /x_00005/) && (/PF13401\.1/ || /PF13304\.1/)' your_file

Or this one, which follows the same principle but groups the conditions per subgroup:

awk '(/x_00004/ && (/PF13401\.1/ || /PF13304\.1/)) || (/x_00005/ && (/PF13401\.1/ || /PF13304\.1/))' your_file

Example

Input file

cat foo

AAA_21               PF13304.1  x_00004
AAA_22               PF13401.1  x_00004
SMC_N                PF02463.14 x_00004
AAA_29               PF13555.1  x_00004
DUF258               PF03193.11 x_00005
AAA_15               PF13175.1  x_00005
AAA_21               PF13304.1  x_00005
AAA_22               PF13401.1  x_00005
SMC_N                PF02463.14 x_00005
AAA_15               PF13175.1  x_00006
AAA_21               PF13304.1  x_00006
AAA_22               PF13401.1  x_00007
SMC_N                PF02463.14 x_00007

Command

awk '(/x_00004/ || /x_00005/) && (/PF13401\.1/ || /PF13304\.1/)' foo

AAA_21               PF13304.1  x_00004
AAA_22               PF13401.1  x_00004
AAA_21               PF13304.1  x_00005
AAA_22               PF13401.1  x_00005
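
A regular-expression match can also hit inside a longer field, for instance an ID that merely starts with PF13304.1. Where that matters, comparing whole fields is one way around it. The sketch below is an assumption-laden illustration: filter_rows is a made-up name, and the row with the ID PF13304.14 is hypothetical, added only to show the difference.

```python
def filter_rows(lines, subgroups, ids):
    """Keep lines whose third field is in `subgroups` and whose second
    field is exactly one of `ids` (no regex, no partial matches)."""
    subgroups, ids = set(subgroups), set(ids)
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in subgroups and fields[1] in ids:
            kept.append(line)
    return kept


rows = [
    "AAA_21               PF13304.1  x_00004",
    "AAA_22               PF13401.1  x_00004",
    "SMC_N                PF02463.14 x_00004",
    "FAKE_1               PF13304.14 x_00005",  # hypothetical row, not in the thread's data
    "AAA_22               PF13401.1  x_00005",
    "AAA_21               PF13304.1  x_00006",
]

for row in filter_rows(rows, ["x_00004", "x_00005"], ["PF13304.1", "PF13401.1"]):
    print(row)
```

The hypothetical PF13304.14 row is skipped here, whereas the pattern /PF13304\.1/ would match it as a substring.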
