计算匹配的行数

Question 1

您可以通过以下方式完成此操作：

grep -Fowf input1 input2 | sort | uniq -c

这将按照您想要的方式反向产生输出，但如果您需要按该顺序输出，您可以执行以下操作：

grep -Fowf input1 input2 | sort | uniq -c | awk '{c = $1; $1 = ""; print $0,c}'

Answer

您可以通过以下方式完成此操作：

grep -Fowf input1 input2 | sort | uniq -c

这将按照您想要的方式反向产生输出，但如果您需要按该顺序输出，您可以执行以下操作：

grep -Fowf input1 input2 | sort | uniq -c | awk '{c = $1; $1 = ""; print $0,c}'

Question 2

您可以在 Awk 中使用数组来做到这一点：

awk '
  NR==FNR {a[$0]==1; next} 
  {for(x in a) c[x] += $0 ~ x} 
  END {for (x in a) print x, c[x]}
' input1 input2
Equus caballus 10
Saccharomyces cerevisiae S288c 1
Monodelphis domestica 5

（可以使用单个数组来完成，但恕我直言，使用单独的数组来存储索引集和计数会更干净。）

Answer

您可以在 Awk 中使用数组来做到这一点：

awk '
  NR==FNR {a[$0]==1; next} 
  {for(x in a) c[x] += $0 ~ x} 
  END {for (x in a) print x, c[x]}
' input1 input2
Equus caballus 10
Saccharomyces cerevisiae S288c 1
Monodelphis domestica 5

（可以使用单个数组来完成，但恕我直言，使用单独的数组来存储索引集和计数会更干净。）

Question 3

perl -lne '
    $h{"$_"}=$h[@h]=$_,next if @ARGV && !exists $h{$_};
    for my $h (@h) { 1+index(s/\h+/ /rg, " $h chromosome ") && $s{$h}++; }
    }{print "$_ $s{$_}" for @h;
' file1 file2

输出：

Equus caballus 10
Monodelphis domestica 5
Saccharomyces cerevisiae S288c 1

解释：

-n将逐行调用Perl读取文件，除非要求，否则不会打印。
-l将使RS = ORS = \n
涉及的数据结构：
- 哈希%h将具有从读取的基因键file1。
- 数组@h将按照读取时遇到的顺序包含基因（非重复）file1。
- 散列%s应具有具有基因的键和值作为该基因在中出现的次数file2。
在职的：
- @ARGV读取第一个参数（file1）时应有 1 个文件的内容，读取第二个参数（file2）时应为空。因此，第一行将file仅适用于并将填充 hash%h和 array @h。
- 第二行将应用于读取的行file2，并将哈希值更新%s为在给定行上找到特定基因的次数。
  - index(str, substr)如果找到，函数将返回子字符串在字符串中的位置，失败时返回 -1。
- 第 3 行将在读取 file2 后运行，并%s根据 array 设置的键的顺序打印哈希的内容@h。

Answer

perl -lne '
    $h{"$_"}=$h[@h]=$_,next if @ARGV && !exists $h{$_};
    for my $h (@h) { 1+index(s/\h+/ /rg, " $h chromosome ") && $s{$h}++; }
    }{print "$_ $s{$_}" for @h;
' file1 file2

输出：

Equus caballus 10
Monodelphis domestica 5
Saccharomyces cerevisiae S288c 1

解释：

-n将逐行调用Perl读取文件，除非要求，否则不会打印。
-l将使RS = ORS = \n
涉及的数据结构：
- 哈希%h将具有从读取的基因键file1。
- 数组@h将按照读取时遇到的顺序包含基因（非重复）file1。
- 散列%s应具有具有基因的键和值作为该基因在中出现的次数file2。
在职的：
- @ARGV读取第一个参数（file1）时应有 1 个文件的内容，读取第二个参数（file2）时应为空。因此，第一行将file仅适用于并将填充 hash%h和 array @h。
- 第二行将应用于读取的行file2，并将哈希值更新%s为在给定行上找到特定基因的次数。
  - index(str, substr)如果找到，函数将返回子字符串在字符串中的位置，失败时返回 -1。
- 第 3 行将在读取 file2 后运行，并%s根据 array 设置的键的顺序打印哈希的内容@h。

Question 4

这是一个可以执行您想要的操作的 Python 脚本：

#!/usr/bin/env python
"""Count the occurrences of the lines in file1 inside file2."""

import sys

path1 = sys.argv[1]
path2 = sys.argv[2]

counts = dict()
with open(path1, 'r') as file1:
    for line1 in file1.read().splitlines():
        counts[line1] = 0
        with open(path2, 'r') as file2:
            for line2 in file2.read().splitlines():
                if line1 in line2:
                    counts[line1] += 1

for key, value in counts.iteritems():
    print("{}: {}".format(key, value))

你可以这样运行它：

python count.py file1 file2

在给定的示例数据上，它给出以下输出：

Monodelphis domestica: 5
Saccharomyces cerevisiae S288c: 1
Equus caballus: 10

这是一个执行相同操作的 Bash 脚本：

#!/bin/bash

file1 = "$1"
file2 = "$2"

while read line; do
    count="$(grep -F "${line}" file2 | wc -l)"; echo "${line}: ${count}";
done < file1

Answer

这是一个可以执行您想要的操作的 Python 脚本：

#!/usr/bin/env python
"""Count the occurrences of the lines in file1 inside file2."""

import sys

path1 = sys.argv[1]
path2 = sys.argv[2]

counts = dict()
with open(path1, 'r') as file1:
    for line1 in file1.read().splitlines():
        counts[line1] = 0
        with open(path2, 'r') as file2:
            for line2 in file2.read().splitlines():
                if line1 in line2:
                    counts[line1] += 1

for key, value in counts.iteritems():
    print("{}: {}".format(key, value))

你可以这样运行它：

python count.py file1 file2

在给定的示例数据上，它给出以下输出：

Monodelphis domestica: 5
Saccharomyces cerevisiae S288c: 1
Equus caballus: 10

这是一个执行相同操作的 Bash 脚本：

#!/bin/bash

file1 = "$1"
file2 = "$2"

while read line; do
    count="$(grep -F "${line}" file2 | wc -l)"; echo "${line}: ${count}";
done < file1

计算匹配的行数

答案1

答案2

答案3

答案4

相关内容