AWK: get random lines of a file that satisfy a condition?

Answer 1

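You can do this in awk, but picking random lines there is fiddly and takes a fair amount of code. I would use awk to get the lines that match your criteria and then use the standard tool shuf to pick a random selection from them: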

$ awk '$3-$2>3 && $3-$2 < 10' file | shuf -n2
g    1    8
a    1    5

If you run it a few times, you'll see that the lines are selected at random:

$ for i in {1..5}; do awk '$3-$2>3 && $3-$2 < 10' file | shuf -n2; echo "--";  done
g    1    8
e    6    14
--
g    1    8
e    6    14
--
b    4    12
g    1    8
--
b    4    12
e    6    14
--
e    6    14
b    4    12
--

The shuf tool is part of GNU coreutils, so it should be installed by default on most Linux systems and is easy to obtain for most other *nix systems.
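To try the pipeline without the original data, here is a minimal sketch. The file name sample.txt, its contents and the function names matching and pick_random are assumptions: only rows a, b, e and g appear in the output above, and rows c, d and f are invented filler that fails the condition.

```shell
# Hypothetical input: rows a, b, e, g come from the output above;
# rows c, d, f are made-up lines that do not satisfy the condition.
cat > sample.txt <<'EOF'
a    1    5
b    4    12
c    2    4
d    3    20
e    6    14
f    5    7
g    1    8
EOF

# Step 1: keep rows whose (column 3 - column 2) lies strictly between 3 and 10.
matching() {
    awk '$3 - $2 > 3 && $3 - $2 < 10' "$1"
}

# Step 2: pick N of those rows at random (N defaults to 2).
pick_random() {
    matching "$1" | shuf -n "${2:-2}"
}

matching sample.txt       # prints rows a, b, e, g in file order
pick_random sample.txt 2  # prints two of them, usually different between runs
```

If shuf is asked for more rows than actually match, it just prints all of the matching rows in random order.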

Answer 2

If you want a pure awk answer that iterates over the list only once:

awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[1+int(rand()*count)] = $0 } } END { for (i in s) print s[i] }' input.txt

Stored in a file for readability:

BEGIN { srand() }
$3 - $2 > 3 &&
$3 - $2 < 10 &&
rand() < count / ++n {
    if (n <= count) {
        s[n] = $0 
    } else { 
        s[1+int(rand()*count)] = $0 
    } 
} 
END { 
    for (i in s) print s[i] 
}

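This algorithm differs slightly from Knuth's Algorithm R; I'm fairly sure the change doesn't alter the distribution, but I'm not a statistician, so I can't guarantee it.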
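For what it's worth, the usual reservoir-sampling argument seems to carry over. A sketch, writing $c$ for count and $m$ for the total number of matching lines (both symbols are mine, not the answer's), and assuming rand() is uniform on $[0,1)$ and $m \ge c$:

```latex
% A matching line k > c enters the reservoir with probability c/k.
% A later matching line j overwrites one particular slot with probability
% (c/j)(1/c) = 1/j, so a stored line survives step j with probability (j-1)/j.
\[
P(\text{line } k \text{ is in the final sample})
  = \frac{c}{k}\prod_{j=k+1}^{m}\frac{j-1}{j}
  = \frac{c}{k}\cdot\frac{k}{m}
  = \frac{c}{m}, \qquad k > c .
\]
% The first c matching lines are always stored, because count / n >= 1 > rand(),
% and then have to survive steps c+1, ..., m:
\[
P(\text{line } k \text{ is in the final sample})
  = \prod_{j=c+1}^{m}\frac{j-1}{j}
  = \frac{c}{m}, \qquad k \le c .
\]
```

Under those assumptions every matching line ends up in the sample with the same probability $c/m$.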

Commented, for those less familiar with awk:

# Before the first line is read...
BEGIN { 
    # ...seed the random number generator.
    srand() 
}

# For each line:
# if the difference between the second and third columns is between 3 and 10 (exclusive)...
$3 - $2 > 3 &&
$3 - $2 < 10 &&
# ... with a probability of (total rows to select) / (total matching rows so far) ...
rand() < count / ++n {
    # ... If we haven't reached the number of rows we need, just add it to our list
    if (n <= count) {
        s[n] = $0 
    } else {
        # otherwise, replace a random entry in our list with the current line.
        s[1+int(rand()*count)] = $0 
    } 
} 

# After all lines have been processed...
END { 
    # Print all lines in our list.
    for (i in s) print s[i] 
}
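One caveat worth knowing: when srand() is called without an argument it seeds from the time of day, and in many awk implementations that means whole seconds, so several runs within the same second tend to print the same selection. A possible workaround is to pass a seed in from the shell. This is only a sketch: it assumes bash (for $RANDOM), and the names sample_matching, seed and rows.txt are illustrative, not part of the answer.

```shell
# Reservoir-sample COUNT matching rows from FILE, seeding awk's generator
# with SEED. Usage: sample_matching FILE COUNT SEED
sample_matching() {
    awk -v count="$2" -v seed="$3" '
        BEGIN { srand(seed) }
        $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n {
            if (n <= count) {
                s[n] = $0                        # still filling the reservoir
            } else {
                s[1 + int(rand() * count)] = $0  # overwrite a random slot
            }
        }
        END { for (i in s) print s[i] }
    ' "$1"
}

# Hypothetical test data: four matching rows and one (c) that does not match.
printf '%s\n' 'a 1 5' 'b 4 12' 'c 2 4' 'e 6 14' 'g 1 8' > rows.txt

# $RANDOM supplies a fresh seed per call, even within the same second.
for i in 1 2 3; do
    sample_matching rows.txt 2 "$RANDOM"
    echo "--"
done
```

Passing a fixed seed instead makes a run repeatable, which can help when debugging.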

Answer 3

Here is one way to do it in GNU awk (which supports custom sort routines):

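#!/usr/bin/gawk -f

# Compare two elements by the random key assigned to each index.
# A comparator must answer consistently (<0, 0 or >0) for the sort to be
# well defined; returning a fresh random value per call gives a biased order.
function mycmp(ia, va, ib, vb) {
  return key[ia] < key[ib] ? -1 : (key[ia] > key[ib]);
}

BEGIN {
  srand();
}

# Keep every matching line, tagged with a random sort key.
$3 - $2 > 3 && $3 - $2 < 10 {
  a[NR] = $0;
  key[NR] = rand();
}

END {
  n = asort(a, b, "mycmp");
  # Print the first two shuffled lines (fewer if fewer lines matched).
  for (i = 1; i <= 2 && i <= n; i++) print b[i];
}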

Testing with the given data:

$ for i in {1..6}; do printf 'Try %d:\n' $i; ../randsel.awk file; sleep 2; done
Try 1:
g    1    8
e    6    14
Try 2:
a    1    5
b    4    12
Try 3:
b    4    12
a    1    5
Try 4:
e    6    14
a    1    5
Try 5:
b    4    12
a    1    5
Try 6:
e    6    14
b    4    12
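asort() is a gawk extension. Where only a POSIX awk is available, a similar collect-then-pick approach could be sketched with a partial Fisher-Yates shuffle in place of the sort. This swaps in a different technique and is not the answer's method; the names pick_portable, want, seed and rows.txt are hypothetical.

```shell
# Usage: pick_portable FILE [HOW_MANY] [SEED]
pick_portable() {
    awk -v want="${2:-2}" -v seed="${3:-}" '
        BEGIN { if (seed != "") srand(seed); else srand() }
        # Collect the matching rows in a dense array a[1..n].
        $3 - $2 > 3 && $3 - $2 < 10 { a[++n] = $0 }
        END {
            if (want > n) want = n + 0
            # Partial Fisher-Yates: for each output slot i, swap in a row
            # chosen uniformly from positions i..n, then print it.
            for (i = 1; i <= want; i++) {
                j = i + int(rand() * (n - i + 1))
                tmp = a[i]; a[i] = a[j]; a[j] = tmp
                print a[i]
            }
        }
    ' "$1"
}

# Hypothetical test data: four matching rows and one (c) that does not match.
printf '%s\n' 'a 1 5' 'b 4 12' 'c 2 4' 'e 6 14' 'g 1 8' > rows.txt
pick_portable rows.txt 2
```

Unlike the reservoir approach, this keeps all matching rows in memory, which is the same trade-off the asort() version makes.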

Answer 4

Posting a perl solution, since I don't see any reason why this has to be done in awk (apart from the OP's wishes):

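#!/usr/bin/perl

use strict;
use warnings;
my $N = 2;    # number of lines to select
my $k = 0;    # number of matching lines seen so far
my @r;        # the reservoir of selected lines

while (<>) {
    # Split the way awk does, ignoring leading whitespace.
    my @line = split ' ';
    if ($line[2] - $line[1] > 3 && $line[2] - $line[1] < 10) {
        if (++$k <= $N) {
            # Fill the reservoir with the first $N matching lines.
            push @r, $_;
        } elsif (rand(1) < $N / $k) {
            # Keep this line with probability $N/$k, replacing a random entry.
            $r[rand(@r)] = $_;
        }
    }
}

print @r;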

This is a classic example of reservoir sampling. The algorithm was copied from here and modified by me to suit the OP's specific wishes.

When saved in a file named reservoir.pl, you can run it with ./reservoir.pl file1 file2 file3 or with cat file1 file2 file3 | ./reservoir.pl.
