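I'm trying to extract data from Twitter. I can read each line, but I don't know which command to use to filter the data the way I want. Any suggestions?
Input file: file.txt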
id,created_at,text
842433,2017-05-20 14:45:05,goldring.com was just registered https://t.co/xt9345d
336353,2017-05-20 14:45:04,stretch.com was just registered https://t.co/QBEX965hf
84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"
Expected output:
wantit1
wantit2
wantit3
wantit4
sidefun
coffeetec
lifeout
new-fun-boys
The code I have:
cat file.txt | while read line;
do
echo "$line" >> out1.txt
done
Answer 1
There are a few options here.
The KISS approach uses two greps:
$ grep 'Auctions were started for' file | grep -o '\S*\.com'
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com
A more elegant approach:
$ perl -lne 'if (/"Auctions were started for (.*)"/) {print for split(/, | and /, $1)}' file
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com
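Both commands above keep the .com suffix, whereas the expected output in the question omits it. A minimal sketch of one way to drop it, assuming GNU grep (for \S) and that every domain of interest ends in .com. The function name strip_com_domains is made up for illustration:

```shell
# Hypothetical helper: reads the CSV on stdin and prints the auctioned
# names without their ".com" suffix, one per line.
strip_com_domains() {
  grep 'Auctions were started for' |   # keep only the auction lines
    grep -o '\S*\.com' |               # one domain per line, as in the answer
    sed 's/\.com$//'                   # drop the trailing ".com"
}

# Usage: strip_com_domains < file.txt
```

The extra sed stage only touches the end of each line, so a name that happens to contain "com" elsewhere is left alone.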
Answer 2
For your specific input, this will work:
grep -Po '\s[a-z1-9-]{2,}(?=\..{2,4})' file.txt
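-P: enables Perl-compatible regular expressions, which lets us use a lookahead.
-o: prints only the matched text.
\s: matches only names preceded by a whitespace character. The whitespace is part of the match, so grep -o prints it in front of each name.
[a-z1-9-]{2,}: followed by two or more lowercase letters, digits 1 to 9, or hyphens. Note that 0 is not in the class; use 0-9 if names may contain it.
(?=\..{2,4}): followed by a dot and 2 to 4 characters (the domain suffix), which the lookahead checks for but does not include in the match.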
The output is:
wantit1
wantit2
wantit3
wantit4
sidefun
coffeetec
lifeout
new-fun-boys
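Because \s is part of the match, the command above prints each name with a leading whitespace character. A minimal sketch of a variant that avoids this by moving the whitespace check into a lookbehind, assuming GNU grep built with PCRE support. The function name names_after_space is hypothetical, and the tightened classes (0-9, and letters only in the suffix) are my own assumption rather than part of the original answer:

```shell
# Hypothetical helper: reads the CSV on stdin and prints every name that
# follows whitespace and precedes a dotted suffix, without the whitespace.
names_after_space() {
  # (?<=\s) requires whitespace before the name without consuming it.
  # 0-9 (instead of 1-9) also admits names containing the digit 0.
  # (?=\.[a-z]{2,4}) checks for the suffix without printing it.
  grep -Po '(?<=\s)[a-z0-9-]{2,}(?=\.[a-z]{2,4})'
}

# Usage: names_after_space < file.txt
```

Like the original, this relies on the "was just registered" lines having their domain preceded by a comma rather than a space, so it stays specific to this input.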
A better idea (based on your comment) is to use:
awk '(/2017-05-20/ && /Auctions were started/)' file.txt | grep -Po '\s[a-z1-9-]{1,}(?=\..{2,4})'
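Answer 3
You can easily achieve this by using grep to find all lines in file.txt that contain the text "Auctions were started for", and sed to extract only the domain names without the TLD and print them one per line: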
grep -Po '(?<="Auctions were started for ).*(?=")' file.txt | sed -r 's/and |,|.com//g;y/ /\n/'
Here is what the command does in detail:
grep -Po '(?<="Auctions were started for ).*(?=")' file.txt
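This scans file.txt line by line and matches any text (.*) that is preceded by the string "Auctions were started for (including the opening double quote and the trailing space) and followed by a closing double quote ("). We need grep's -P option to enable PCRE regular expressions (otherwise we cannot use the (?<=...) and (?=...) lookarounds), and its -o option to print only the matched part of each line (excluding the lookarounds) instead of the whole line.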
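To see what the lookarounds leave out, the first stage can be run on its own. A small sketch, assuming GNU grep with PCRE support; the wrapper name auction_list is made up for illustration:

```shell
# Hypothetical wrapper around the first stage; reads the CSV on stdin.
auction_list() {
  # The lookbehind and lookahead must match, but -o prints only the
  # text between them, so the quotes and the fixed phrase are dropped.
  grep -Po '(?<="Auctions were started for ).*(?=")'
}

# Usage: auction_list < file.txt
# For the first auction line of the sample input this prints:
#   wantit1.com, wantit2.com, wantit3.com and wantit4.com
```

Lines without the phrase produce no output at all, which is why no separate filtering step is needed.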
As a second step, we pipe the output of the first command into this sed command:
sed -r 's/and |,|.com//g;y/ /\n/'
This sed line actually contains two commands, s/and |,|.com//g and y/ /\n/.
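First, s/PATTERN/REPLACEMENT/ searches for the regular expression pattern and |,|.com (an extended regular expression, because of the -r option), which means "and " (with a trailing space), "," or ".com". It replaces each match with nothing, so these patterns are effectively deleted from the input line. The trailing g makes the search and replace global instead of handling only the first match on each line. Strictly speaking, the unescaped . matches any character, so \.com would be the more precise way to match a literal ".com".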
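Second, y/CHARACTERS/REPLACEMENTS/ translates every character listed in the first field into the corresponding character in the second field. Here I use it simply to turn all remaining spaces into newlines.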
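The two sed commands can be tried in isolation on a single list. The sketch below assumes GNU sed (for -r and for \n inside y) and escapes the dot, which is a small change from the command above. The function name split_names and the sample domains are hypothetical:

```shell
# Hypothetical helper: second stage with the dot escaped.
# Reads one comma/"and"-separated list per line on stdin.
split_names() {
  # s///g deletes every "and ", "," and literal ".com";
  # y then turns each remaining space into a newline.
  sed -r 's/and |,|\.com//g;y/ /\n/'
}

# Usage:
#   printf '%s\n' 'telecom.com, a-b.com and c.com' | split_names
# prints:
#   telecom
#   a-b
#   c
```

With the unescaped pattern .com, the same input would yield "tel" instead of "telecom", because "ecom" also matches "any character followed by com".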