How can I filter data from a txt file using grep or sed?

Question 1

There are a few options here.

The KISS approach uses two greps:

$ grep 'Auctions were started for' file | grep -o '\S*\.com'
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com

More elegant, using Perl to capture the quoted list and split it on its separators:

$ perl -lne 'if (/"Auctions were started for (.*)"/) {print for split(/, | and /, $1)}' file
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com
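
To try the two-grep pipeline without the original log, here is a minimal sketch. The file contents are a guess at the input format (the question's actual file is not shown here), and extract_domains is a hypothetical helper name used only for illustration.

```shell
# Hypothetical sample input; the real file's layout is an assumption.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2017-05-20 10:00:01 "Auctions were started for wantit1.com, wantit2.com and wantit3.com"
2017-05-20 10:00:05 "Nothing to do for ignored.com"
2017-05-20 10:01:00 "Auctions were started for new-fun-boys.com"
EOF

# Keep only the relevant lines, then print every run of
# non-whitespace characters that ends in ".com".
extract_domains() {
    grep 'Auctions were started for' "$1" | grep -o '\S*\.com'
}

extract_domains "$sample"
```

On this sample the second line is dropped by the first grep, so ignored.com does not appear in the output.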

Question 2

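For your specific input, this will work:

grep -Po '\s[a-z0-9-]{2,}(?=\..{2,4})' file.txt

-P: enables Perl-compatible regular expressions, which lets us use a lookahead.
-o: prints only the matched part of each line.
\s: the match must start with a whitespace character.
[a-z0-9-]{2,}: followed by two or more lowercase letters, digits, or hyphens.
(?=\..{2,4}): the name must be followed by a dot and 2 to 4 characters (the domain suffix), which are not included in the match.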

The output is as follows (shown without the leading whitespace character that \s adds to each match):

wantit1
wantit2
wantit3
wantit4
sidefun
coffeetec
lifeout
new-fun-boys

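A better idea (based on your comment) is to first restrict the search to lines that contain both the date and the phrase:

awk '(/2017-05-20/ && /Auctions were started/)' file.txt | grep -Po '\s[a-z0-9-]{1,}(?=\..{2,4})'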
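
Because \s is part of the pattern, -o also prints the whitespace character in front of each name. One possible variant, sketched here on an assumed sample line, uses PCRE's \K to discard everything matched so far, so only the name is printed. The sample line is hypothetical, and the character class uses 0-9 so that names containing a zero also match.

```shell
# Hypothetical sample line; the real log format is an assumption.
line='2017-05-20 "Auctions were started for wantit1.com, sidefun.com and new-fun-boys.com"'

# \s\K still requires a whitespace character before the name,
# but drops it from the text that -o prints.
# Needs a grep built with PCRE support (-P), such as GNU grep.
names=$(printf '%s\n' "$line" | grep -Po '\s\K[a-z0-9-]{2,}(?=\..{2,4})')

printf '%s\n' "$names"
```

Words such as "were" or "for" are skipped because they are not followed by a dot, and the date at the start of the line is skipped because no whitespace precedes it.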

Question 3

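You can do this easily with grep, which finds every line of file.txt that contains the text "Auctions were started for", and sed, which extracts only the domain names without the TLD and prints one per line:

grep -Po '(?<="Auctions were started for ).*(?=")' file.txt | sed -r 's/and |,|\.com//g;y/ /\n/'

Here is what the command does in detail:

grep -Po '(?<="Auctions were started for ).*(?=")' file.txt

This scans file.txt line by line and matches anything (.*) that is preceded by the string "Auctions were started for (including the opening quote and the trailing space) and followed by another ". We need grep's -P option to enable PCRE regular expressions (otherwise we cannot use the (?<=...) and (?=...) lookarounds), and its -o option to print only the matched part of the line (excluding the lookarounds) instead of the whole line.

In the second step, we pipe the output of the first command into this sed command:

sed -r 's/and |,|\.com//g;y/ /\n/'

This sed line actually contains two commands, s/and |,|\.com//g and y/ /\n/.

First, s/PATTERN/REPLACEMENT/ searches for the pattern and |,|\.com (an extended regular expression, because of the -r option), which means "and ", "," or a literal ".com". The dot is escaped so that it does not match an arbitrary character. Each match is replaced with nothing, so these patterns are effectively deleted from the input line. The trailing g enables global search and replace instead of handling only the first match on each line.

Second, y/CHARACTERS/REPLACEMENTS/ translates every character in the first field into the corresponding character in the second field. Here I use it simply to turn all remaining spaces into newlines.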
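
The two steps can be tried separately on a single line. This is a minimal sketch: the sample line is an assumed input format, not taken from the original file, and the dot in .com is escaped so that it matches only a literal dot.

```shell
# Hypothetical sample line; the real log format is an assumption.
line='2017-05-20 "Auctions were started for wantit1.com, wantit2.com and coffeetec.com"'

# Step 1: keep only the text between the quoted prefix and the closing quote.
# Needs a grep built with PCRE support (-P), such as GNU grep.
inner=$(printf '%s\n' "$line" | grep -Po '(?<="Auctions were started for ).*(?=")')

# Step 2: delete "and ", "," and a literal ".com", then turn spaces into newlines.
# -E is used here as the equivalent of -r.
domains=$(printf '%s\n' "$inner" | sed -E 's/and |,|\.com//g;y/ /\n/')

printf '%s\n' "$domains"
```

After step 1, inner holds the comma-separated list without the surrounding quotes. After step 2, domains holds one bare name per line.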

Answer