Extracting IPs from a web list

Question 1

You can do the whole task with awk (assuming the pathname, of course):

#!/usr/bin/awk -f

/^[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[1-9][0-9]*$/ {
        print;
        next;
}
/^[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[1-9][0-9]*[^0-9\.:].*$/ {
        sub("[^0-9.].*$","");
        print;
}

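The first pattern matches a bare IPv4 address (nothing after it); the second allows some other text after the address (and rejects lines where a colon follows).

By the way, the patterns should be anchored with "^" and "$" so that unwanted matches are skipped.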
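To illustrate why the anchors matter, here is a small sketch comparing the first pattern with and without them. The sample line and the variable names are made up for the demonstration; they are not part of the original data.

```shell
# Hypothetical input line with an address embedded in other text.
line='see host 192.0.2.7 in logs'

# Without anchors the pattern matches anywhere in the line.
unanchored=$(printf '%s\n' "$line" |
  awk '/[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[1-9][0-9]*/ { print "match" }')

# With ^ and $ the whole line must be an address, so this line is skipped.
anchored=$(printf '%s\n' "$line" |
  awk '/^[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[1-9][0-9]*$/ { print "match" }')

printf 'unanchored: [%s]\nanchored:   [%s]\n' "$unanchored" "$anchored"
# unanchored: [match]
# anchored:   []
```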

This is shown as a script, which can then be run like any other command (for example, in a pipeline with grep):

./foo <foo.in

gives

129.130.100.100
1.160.118.30
91.121.120.228
62.210.111.59
52.90.253.169

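I split the match into two expressions to simplify handling of stray text after the IP address. The bracket expression [^0-9\.:] ensures that there is at least one stray character to process.

An awk program does not have to be a script; the language is free-format (and newlines can be dropped when building a single command string). The resulting one-liner is hard to read, however.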
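As a rough sketch of that one-line form, the same two rules can be passed to awk as a single command string. The wrapper name extract_ips and the sample lines (including the SBL ids) are invented here for illustration.

```shell
# Same two rules as the script above, with the newlines dropped.
# extract_ips is a hypothetical wrapper name, not part of the original answer.
extract_ips() {
  awk '/^[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[1-9][0-9]*$/{print;next} /^[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[1-9][0-9]*[^0-9\.:].*$/{sub("[^0-9.].*$","");print}' "$@"
}

# Made-up sample: a bare address, an address with trailing text,
# a network address ending in .0, and a comment line.
printf '%s\n' \
  '129.130.100.100' \
  '1.160.118.30 ; SBL1234' \
  '10.0.0.0/8 ; SBL5678' \
  '# comment' | extract_ips
# 129.130.100.100
# 1.160.118.30
```

It behaves the same as the script form, which is why I would still keep the multi-line version in a file.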

Unlike the suggestions that use grep's -o option or sed's -E option, this awk solution should work on any POSIX system.

For reference (POSIX):

Question 2

Just specify in your regex that the final octet must not begin with 0:

$ grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[1-9][0-9]*' file 
129.130.100.100
1.160.118.30
91.121.120.228
62.210.111.59
52.90.253.169

The trick is \.[1-9][0-9]*, which means: match a ., then one digit greater than 0 (no IP can end in 019 or similar), then zero or more digits from 0 to 9.

I would also use grep -E to simplify the syntax:

grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[1-9][0-9]*' file

Or, simpler still (note that \d is not part of POSIX ERE, so [0-9] is used here):

grep -Eo '([0-9]{1,3}\.){3}[1-9][0-9]*' file

And, if your grep supports -P, simplify further:

grep -Po '(\d{1,3}\.){3}[1-9]\d*' file
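A quick way to convince yourself that the BRE and ERE forms agree is to run both on the same few lines. This is only a sketch: the sample lines and variable names are invented, -o is assumed to be available (it is in GNU and BSD grep), and the -E form uses [0-9] rather than \d because \d is a Perl-style escape that POSIX ERE does not define.

```shell
# Made-up sample: two host addresses, one network ending in .0, one IPv6 prefix.
sample=$(printf '%s\n' \
  '129.130.100.100 ; SBL1' \
  '10.0.0.0/8 ; SBL2' \
  '2001:db8::/32' \
  '52.90.253.169')

# Basic regular expression form.
bre=$(printf '%s\n' "$sample" |
  grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[1-9][0-9]*')

# Extended regular expression form with the repeated group.
ere=$(printf '%s\n' "$sample" |
  grep -Eo '([0-9]{1,3}\.){3}[1-9][0-9]*')

printf '%s\n' "$bre"
# 129.130.100.100
# 52.90.253.169
[ "$bre" = "$ere" ] && echo 'BRE and ERE output is identical'
```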

Question 3

$ sed -E -e 's/[[:space:];#\/].*//;
             /\.0$|[0-9a-f]{1,4}:|^[[:space:]]*$/d' spamhaus.txt 
129.130.100.100
1.160.118.30
91.121.120.228
62.210.111.59
52.90.253.169

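(Line breaks and indentation added for readability.)

Delete comments and everything from the first whitespace character, ";", "#" or "/" onward (i.e. replace it with the empty string).
Delete lines that contain:
- .0 at the end of the line (any /CIDR suffix has already been stripped by the substitution)
- 1-4 hex digits followed by ":"
- empty and whitespace-only lines
Print everything else.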
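The steps above can be watched one at a time on a tiny input. This is just a sketch: the sample lines and the variable names step1 and result are made up, and the two sed commands are given as separate -e expressions so that the first stage can be run alone.

```shell
# Made-up sample resembling a blocklist: a comment, two hosts,
# an IPv4 network, an IPv6 prefix and an empty line.
sample=$(printf '%s\n' \
  '; Spamhaus DROP list' \
  '129.130.100.100 ; SBL1' \
  '10.0.0.0/8 ; SBL2' \
  '2001:db8::/32 ; SBL3' \
  '' \
  '52.90.253.169')

# Stage 1 only: cut each line at the first whitespace, ';', '#' or '/'.
# The comment line becomes empty; '10.0.0.0/8 ; SBL2' becomes '10.0.0.0'.
step1=$(printf '%s\n' "$sample" | sed -E -e 's/[[:space:];#\/].*//')

# Stages 1 and 2: additionally delete lines ending in .0, lines with
# hex digits followed by ':', and blank lines.
result=$(printf '%s\n' "$sample" |
  sed -E -e 's/[[:space:];#\/].*//' \
         -e '/\.0$|[0-9a-f]{1,4}:|^[[:space:]]*$/d')

printf '%s\n' "$result"
# 129.130.100.100
# 52.90.253.169
```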

The same algorithm in perl:

perl -lne 's/[[:space:];#\/].*//;
           next if (m/\.0$|[0-9a-f]{1,4}:|^\s*$/o);
           print'

Output of a timing test script that runs each method 10 times in succession on the complete files downloaded from their respective hosts:

$ ./timing.sh 
input file sizes:
24K drop.txt
72K base_90days.txt
120K    sinokoreacidr.txt
216K    total

input file line count:
   793 drop.txt
  4997 base_90days.txt
  5400 sinokoreacidr.txt
 11190 total

tdickey.awk: real 0m0.367s  user 0m0.305s   sys 0m0.027s
terdon.grep: real 0m0.550s  user 0m0.514s   sys 0m0.029s
cas.sed    : real 0m0.531s  user 0m0.484s   sys 0m0.035s
cas.perl   : real 0m0.379s  user 0m0.341s   sys 0m0.036s

output line counts:
  4990 out.cas.perl
  4990 out.cas.sed
  4990 out.tdickey.awk
  4990 out.terdon.grep

output differences (if any):

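(By the way, the timing.sh test script found a bug in my original sed script: some lines were printed with a trailing /CIDR. Fixed.)

All of them produce exactly the same output, which is good :)

I ran this several times on an AMD Phenom II 1090T. The sed and grep versions had relatively stable timings, with almost no variation between runs, a millisecond or two at most.

The awk and perl versions varied a bit more between runs, by up to 20 milliseconds or so, and were almost always within a few milliseconds of each other. Sometimes perl was slightly faster; usually awk was. That is probably because my system was running lots of other stuff at the same time.

On this CPU, given how little time any of these versions takes to run, there is no significant difference between them. On a slower CPU the difference might be more significant. I've included the timing script below so you can test it on your own system: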

#!/bin/bash

export TIMEFORMAT=$'real %3lR\tuser %3lU\tsys %3lS'
files=(drop.txt base_90days.txt sinokoreacidr.txt)

function timetest() {
  # first arg is title string, remaining args are executed.

  # prime the cache
  cat "${files[@]}" > /dev/null

  title="$1" ; shift
  printf '%-11s' "$title" >&2

  # 10 runs for each
  time for i in {1..10} ; do
    "$@" "${files[@]}" > "out.$title"
  done
  # unique sort the output, but don't include sort in timings
  sort -u "out.$title" > "out.tmp" ; mv -f out.tmp "out.$title"
}

echo 'input file sizes:'
du -sch "${files[@]}"
echo
echo 'input file line count:'
wc -l "${files[@]}"
echo

rm -f out.*

timetest tdickey.awk ./tdickey.awk
timetest terdon.grep grep -h -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[1-9][0-9]*'
timetest cas.sed sed -E -e 's/[[:space:];#\/].*//; /\.0$|[0-9a-f]{1,4}:|^[[:space:]]*$/d'
timetest cas.perl perl -lne 's/[[:space:];#\/].*//; next if (m/\.0$|[0-9a-f]{1,4}:|^\s*$/o); print'

echo
echo "output line counts:"
wc -l out.* | grep -v total

# check if they all produce exactly the same output
echo
echo "output differences (if any):"
diff -u out.cas.sed out.cas.perl
diff -u out.cas.sed out.tdickey.awk
diff -u out.cas.sed out.terdon.grep
