I have only found examples of extracting substrings from formatted text such as HTML files, but in my case I need to output a list of URLs like this:
...
https://twitter.com/user1/status/xyza
https://twitter.com/user1/status/xyzb
https://twitter.com/user1/status/xyzc
https://twitter.com/user2/status/xyza
https://twitter.com/user2/status/xyzb
...
from an unstructured and very large file (100+ MB). This is my input:
n 3\\n \\n \\n \\n \\n \\n Retweeted\\n \\n \\n \\n 3\\n \\n \\n \\n\\n \\n \\n \\n \\n \\n \\n Like\\n \\n \\n \\n 5\\n \\n \\n \\n \\n \\n \\n \\n Liked\\n \\n \\n \\n 5\\n \\n \\n \\n\\n \\n\\n \\n \\n \\n \\n \\n More\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n Copy link to Tweet\\n \\n \\n Embed Tweet\\n \\n \\n \\n\\n\\n\\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n\\n \\n \\n\\n \\n\\n\\n \\n\\n\\n \\n \\n \\n\\n \\n \\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n \\n \\n \\n \\n \\n\\n \\n \\n\\n \\n\\n Back to top ↑\\n\\n \\n\\n\\n \\n \\n \\n \\n\\n\\n \\n\\n\\n \\n \\n Loading seems to be taking a while.\\n \\n Twitter may be over capacity or experiencing a momentary hiccup. Try again or visit Twitter Status for more information.\\n \\n \\n\\n\\n\\n \\n \\n \\n\\n \\n \\n\\n\\n\\n\\n\\n \\n\\n\\n \\n \\n Suggested by Twitter\\n \\n \\n \\n \\n \\n\\n \\n \\n \\n \\n \\n false\\n \\n \\n \\n \\n \\n\\n \\n\\n\\n\\n \\n \\n \\n \\n \\n © 2015 Twitter\\n About\\n Help\\n Terms\\n Privacy\\n Cookies\\n Ads info\\n \\n \\n \\n\\n\\n \\n\\n\\n\\n \\n \\n \\n\\n\\n \\n \\n \\n\\n\\n\\n \\n \\n \\n\\n \\n\\n \\n\\n \\n \\n\\n \\n \\n\\n\",\"meta_tags\":[{},{\"content\":\"0; URL=https://mobile.twitter.com/i/nojs_router?path=%2FTerriBauman%2Fstatus%2F680996161843380224\"},{\"name\":\"robots\",\"content\":\"NOODP\"},{\"name\":\"msapplication-TileImage\",\"content\":\"//abs.twimg.com/favicons/win8-tile-144.png\"},{\"name\":\"msapplication-TileColor\",\"content\":\"#00aced\"},{\"name\":\"swift-page-name\",\"content\":\"permalink\"},{\"content\":\"article\"},{\"content\":\"https://twitter.com/TerriBauman/status/680996161843380224\"},{\"content\":\"Terri Bauman on Twitter\"},{\"content\":\"https://pbs.twimg.com/media/BcaVtMKCEAAyz9f.jpg:large\"},{\"content\":\"true\"},{\"content\":\"“Social Media Jobs: https://t.co/NDDK4WaRA4 Please Retweet to spread words #OnlineJobs 
#Jobs”\"},{\"content\":\"Twitter\"},{\"content\":\"2231777543\"}],\"links\":[\"https://twitter.com/\",\"https://twitter.com/about\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/#supported_languages\",\"https://twitter.com/?lang=id\",\"https://twitter.com/?lang=msa\",\"https://twitter.com/?lang=cs\",\"https://twitter.com/?lang=da\",\"https://twitter.com/?lang=de\",\"https://twitter.com/?lang=en-gb\",\"https://twitter.com/?lang=es\",\"https://twitter.com/?lang=fil\",\"https://twitter.com/?lang=fr\",\"https://twitter.com/?lang=it\",\"https://twitter.com/?lang=hu\",\"https://twitter.com/?lang=nl\",\"https://twitter.com/?lang=no\",\"https://twitter.com/?lang=pl\",\"https://twitter.com/?lang=pt\",\"https://twitter.com/?lang=ro\",\"https://twitter.com/?lang=fi\",\"https://twitter.com/?lang=sv\",\"https://twitter.com/?lang=vi\",\"https://twitter.com/?lang=tr\",\"https://twitter.com/?lang=el\",\"https://twitter.com/?lang=ru\",\"https://twitter.com/?lang=uk\",\"https://twitter.com/?lang=he\",\"https://twitter.com/?lang=ar\",\"https://twitter.com/?lang=fa\",\"https://twitter.com/?lang=mr\",\"https://twitter.com/?lang=hi\",\"https://twitter.com/?lang=bn\",\"https://twitter.com/?lang=gu\",\"https://twitter.com/?lang=ta\",\"https://twitter.com/?lang=kn\",\"https://twitter.com/?lang=th\",\"https://twitter.com/?lang=ko\",\"https://twitter.com/?lang=ja\",\"https://twitter.com/?lang=zh-cn\",\"https://twitter.com/?lang=zh-tw\",\"https://twitter.com/login\",\"https://twitter.com/account/begin_password_reset\",\"https://tw
itter.com/signup\",\"https://twitter.com/TerriBauman\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/hashtag/Entrepreneur?src=hash\",\"https://twitter.com/hashtag/SocialMediaExpert?src=hash\",\"https://twitter.com/hashtag/SocialMediaMarketer?src=hash\",\"https://twitter.com/hashtag/BusinessOwner?src=hash\",\"https://twitter.com/hashtag/InternetMarketer?src=hash\",\"https://twitter.com/hashtag/SocialMediaJobs?src=hash\",\"https://t.co/ZciT91kZwP\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\",\"https://twitter.com/#\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"http://support.twitter.com/forums/2681
0/entries/78525\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"https://twitter.com/account/begin_password_reset\",\"https://twitter.com/signup\",\"https://twitter.com/signup\",\"https://twitter.com/login\",\"http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code\",\"https://twitter.com/TerriBauman/status/680996164058001408\",\"https://twitter.com/TerriBauman/status/680977383365578752\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://t.co/NDDK4WaRA4\",\"https://twitter.com/hashtag/OnlineJobs?src=hash\",\"https://twitter.com/hashtag/Jobs?src=hash\",\"https://t.co/SJvkM1yWUI\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/cakafete\",\"https://twitter.com/KassemAlYateem\",\"https://twitter.com/Worldspacetech1\",\"https://twitter.com/ElisaBW\",\"https://twitter.com/patrickarrelle\",\"https://twitter.com/AcousticsPro1\",\"https://twitter.com/#\",\"http://status.twitter.com\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\"]}"},{"url":"http://status.twitter.com/page/2","result":"{\"date_crawled\":\"2015-12-27T10:01:58Z\",\"title\":\"Twitter Status\",\"lossyHTML\":\"\\n\\n\\r\\n\\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n\\r\\n \\r\\n Twitter Status\\r\\n \\n\\r\\n \\r\\n \\r\\n\\r\\n \\r\\n\\r\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\r\\n \\r\\n\\r\\n\\r\\n\\r\\n\\r\\n \\r\\n\\r\\n\\r\\n\\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n Updates on the status of the Twitter service.\\r\\n\\r\\n\\r\\n\\r\\n\\r\\nRelated Links\\r\\nOfficial Company 
Blog\\r\\n\\r\\nOfficial Help Documents\\r\\n\\r\\nDeveloper Community\\r\\n\\r\\n\\r\\n\\r\\n Archive\\r\\n\\r\\n\\r\\n\\r\\n \\r\\n Powered by Tumblr\\r\\n \\r\\n\\r\\n \\r\\n \\r\\n \\r\\n\\r\\n\\r\\n \\r\\n \\r\\n \\r\\n
This is what I have been trying:
grep 'https://' input.txt | grep 'status' >> output.txt
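I have seen examples that use sed and awk, but besides being extremely hard to understand, they are almost always based on column selection, which is not possible in my case.
Answer 1
Try this with GNU grep to get URLs with exactly two slashes after the scheme: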
grep -o 'http[s]*://[^/][^\\]*' file
URLs with two or more slashes:
grep -o 'http[s]*://[^\\]*' file
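To see how the two patterns differ, here is a minimal sketch run against a small made-up sample rather than the real file. The variable names (sample, two, two_or_more) are hypothetical, and the sample only imitates the escaped JSON from the question, where every URL is followed by a literal backslash.

```shell
# Hypothetical sample imitating the escaped JSON in the question:
# each URL is terminated by a literal backslash (from \").
sample='\"https://twitter.com/about\",\"http:////support.twitter.com\",'

# Exactly two slashes: the first character after :// must not be "/".
two=$(printf '%s\n' "$sample" | grep -o 'http[s]*://[^/][^\\]*')

# Two or more slashes: also picks up the malformed http://// links.
two_or_more=$(printf '%s\n' "$sample" | grep -o 'http[s]*://[^\\]*')

printf '%s\n' "$two"
printf -- '---\n'
printf '%s\n' "$two_or_more"
```

With this sample, the first pattern should print only https://twitter.com/about, while the second should also print http:////support.twitter.com.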
Recommended reading: the Stack Overflow Regular Expressions FAQ
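[s]*: The asterisk quantifier (*) means that the preceding expression may match zero or more times. Here the preceding expression is a character class (marked by the brackets) that contains only the character s. Writing s* is simpler.
[^\\]*: Matches any character other than a backslash, zero or more times. I escaped the backslash with a backslash to prevent it from escaping the ]. (Strictly speaking, a backslash inside a POSIX bracket expression is already literal, so a single one would also work; doubling it is harmless.)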
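Building on those pieces, the pattern can be narrowed to the tweet permalinks the question asks for. This is only a sketch under assumptions that do not come from the answer: that status IDs are purely numeric (as in the sample input), that duplicates should be dropped, and that the hypothetical sample string below stands in for the real file.

```shell
# Hypothetical sample imitating the escaped JSON in the question.
sample='\"https://twitter.com/user1/status/111\",\"https://twitter.com/user1/status/111\",\"https://twitter.com/about\",\"https://twitter.com/user2/status/222\",'

# Keep only tweet permalinks: a user segment containing neither "/" nor "\",
# the literal /status/, then one or more digits (numeric IDs are an assumption).
# sort -u drops the repeated links that the crawl data contains.
status_urls=$(printf '%s\n' "$sample" |
  grep -o 'https://twitter\.com/[^/\\]*/status/[0-9][0-9]*' |
  sort -u)

printf '%s\n' "$status_urls"
```

Against the real file, the same pipeline would read from input.txt and redirect to output.txt, for example: grep -o '...' input.txt | sort -u > output.txt.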