How can I use grep or sed to filter data from a txt file?

I am trying to get data from Twitter. I can read each line, but I don't know which command to use to filter the data the way I want. Any suggestions?

Input file: file.txt

id,created_at,text
842433,2017-05-20 14:45:05,goldring.com was just registered https://t.co/xt9345d
336353,2017-05-20 14:45:04,stretch.com was just registered https://t.co/QBEX965hf
84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"

Expected output:

wantit1
wantit2
wantit3
wantit4
sidefun
coffeetec
lifeout
new-fun-boys

The code I have:

cat file.txt | while read line; 
do

echo "$line"  >> out1.txt

done

Answer 1

There are a few options here.

The KISS approach uses two greps:

$ grep 'Auctions were started for' file | grep -o '\S*\.com'
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com
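
The expected output in the question omits the .com suffix, while the two-grep pipeline keeps it. A minimal sketch of one way to drop it, assuming every domain of interest ends in .com and that GNU grep is available (for \S). The function name auction_names is hypothetical and only used for illustration:

```shell
# Recreate the sample input from the question.
cat > file.txt <<'EOF'
id,created_at,text
842433,2017-05-20 14:45:05,goldring.com was just registered https://t.co/xt9345d
336353,2017-05-20 14:45:04,stretch.com was just registered https://t.co/QBEX965hf
84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"
EOF

# Hypothetical helper: same two greps as above, plus a sed step
# that removes a trailing ".com" from every matched domain.
auction_names() {
    grep 'Auctions were started for' "$1" \
        | grep -o '\S*\.com' \
        | sed 's/\.com$//'
}

auction_names file.txt
```

Run against the sample file, this should print the eight names listed in the expected output, one per line.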

More elegant:

$ perl -lne 'if (/"Auctions were started for (.*)"/) {print for split(/, | and /, $1)}' file
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com

Answer 2

For your specific input, this will work:

grep -Po '\s[a-z1-9-]{2,}(?=\..{2,4})' file.txt
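  • -P: enables Perl-compatible regular expressions, which lets us use lookahead.
  • -o: prints only the matched part of each line.
  • \s: the match must start with a whitespace character (which is part of the match, so it is printed too).
  • [a-z1-9-]{2,}: followed by two or more lowercase letters, digits 1 to 9, or hyphens (0 is not in the class).
  • (?=\..{2,4}): the match must be followed by a dot and 2 to 4 characters (the domain suffix), which are not included in the match.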
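
Because \s is part of the match, each printed name starts with a space, and the class [a-z1-9-] skips names containing the digit 0. A hedged variant that addresses both, assuming GNU grep built with PCRE support: \K discards the whitespace already matched, and the class is widened to [a-z0-9-]. The function name names_without_space is hypothetical:

```shell
# Hypothetical helper: reads lines on stdin and prints the domain names
# without the leading whitespace and without the suffix.
#   \s\K          require a whitespace character, then drop it from the match
#   [a-z0-9-]{2,} two or more lowercase letters, digits (including 0) or hyphens
#   (?=\..{2,4})  must be followed by a dot and 2 to 4 characters
names_without_space() {
    grep -Po '\s\K[a-z0-9-]{2,}(?=\..{2,4})'
}

printf '%s\n' \
    '842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"' \
    | names_without_space
```

The same pattern can be applied to the whole file with names_without_space < file.txt.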

The output looks like this:

wantit1
wantit2
wantit3
wantit4
sidefun
coffeetec
lifeout
new-fun-boys

A better idea (based on your comment) is to use:

awk '(/2017-05-20/ && /Auctions were started/)' file.txt | grep -Po '\s[a-z1-9-]{1,}(?=\..{2,4})'

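Answer 3

You can do this easily with grep, which finds all lines in file.txt containing the text "Auctions were started for", and sed, which extracts only the domain names without the TLD and prints one per line: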

grep -Po '(?<="Auctions were started for ).*(?=")' file.txt | sed -r 's/and |,|.com//g;y/ /\n/'

Here is how the command works in detail:

grep -Po '(?<="Auctions were started for ).*(?=")' file.txt

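This scans file.txt line by line and matches anything (.*) that is preceded by the string "Auctions were started for (including the opening quote and the trailing space) and followed by a closing ". We need grep's -P option to enable PCRE regular expressions (otherwise we could not use the (?<=...) and (?=...) lookarounds), and its -o option to print only the matched part of the line (excluding the lookarounds) rather than the whole line.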

In the second step, we pipe the output of the first command into this sed command:

sed -r 's/and |,|.com//g;y/ /\n/'

This sed line actually contains two commands, s/and |,|.com//g and y/ /\n/.

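First, s/PATTERN/REPLACEMENT/ searches for the regular expression pattern and |,|.com (an extended regular expression, because of the -r option), which means "and " or "," or ".com". It replaces each match with nothing, so these patterns are effectively deleted from the input line. The trailing g makes the search and replace global instead of handling only the first match on each line. Note that the unescaped . matches any character, so .com also matches strings such as "xcom"; writing \.com would restrict it to a literal dot.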

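Second, y/CHARACTERS/REPLACEMENTS/ translates every character in the first field into the corresponding character in the second field. Here I use it simply to convert all remaining spaces into newlines.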
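
To see the two sed commands at work in isolation, here is a small sketch that feeds one already-extracted domain list through them. It assumes GNU sed (for \n in the y command), uses -E as the equivalent spelling of -r, and escapes the dot in \.com so that only a literal ".com" is removed. The function name split_names is hypothetical:

```shell
# Hypothetical helper: turns "a.com, b.com and c.com" into one name per line.
#   s/and |,|\.com//g  delete every "and ", "," and literal ".com"
#   y/ /\n/            turn each remaining space into a newline
split_names() {
    sed -E 's/and |,|\.com//g; y/ /\n/'
}

printf '%s\n' 'wantit1.com, wantit2.com, wantit3.com and wantit4.com' | split_names
```

After the s command the line reads "wantit1 wantit2 wantit3 wantit4", and the y command then puts each name on its own line.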
