Extracting URLs from unformatted text

I have only found examples that extract substrings from formatted text such as HTML files. In my case, I need to output a list of URLs like this:

... 
https://twitter.com/user1/status/xyza 
https://twitter.com/user1/status/xyzb
https://twitter.com/user1/status/xyzc
https://twitter.com/user2/status/xyza
https://twitter.com/user2/status/xyzb
...

The source is an unstructured and very large file (100+ MB). This is my input:

n          3\\n        \\n      \\n  \\n    \\n      \\n      Retweeted\\n    \\n      \\n        \\n          3\\n        \\n      \\n  \\n\\n      \\n  \\n    \\n      \\n        \\n      \\n      Like\\n    \\n      \\n        \\n          5\\n        \\n      \\n  \\n    \\n      \\n        \\n      \\n      Liked\\n    \\n      \\n        \\n          5\\n        \\n      \\n  \\n\\n      \\n\\n        \\n    \\n  \\n      \\n        \\n        More\\n      \\n  \\n  \\n  \\n    \\n    \\n  \\n  \\n    \\n      \\n        Copy link to Tweet\\n      \\n      \\n        Embed Tweet\\n      \\n        \\n  \\n\\n\\n\\n\\n  \\n\\n    \\n\\n      \\n\\n      \\n        \\n  \\n    \\n      \\n  \\n\\n      \\n    \\n\\n  \\n\\n\\n      \\n\\n\\n    \\n      \\n          \\n\\n    \\n        \\n          \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n        \\n        \\n  \\n    \\n  \\n      \\n\\n    \\n        \\n\\n    \\n\\n          Back to top ↑\\n\\n  \\n\\n\\n    \\n  \\n    \\n  \\n\\n\\n  \\n\\n\\n    \\n  \\n    Loading seems to be taking a while.\\n    \\n      Twitter may be over capacity or experiencing a momentary hiccup. 
Try again or visit Twitter Status for more information.\\n    \\n  \\n\\n\\n\\n      \\n    \\n  \\n\\n      \\n    \\n\\n\\n\\n\\n\\n  \\n\\n\\n  \\n    \\n      Suggested by Twitter\\n      \\n        \\n      \\n    \\n   \\n\\n    \\n  \\n    \\n  \\n    \\n    false\\n  \\n  \\n    \\n    \\n  \\n\\n  \\n\\n\\n\\n  \\n      \\n  \\n    \\n      \\n        © 2015 Twitter\\n        About\\n        Help\\n        Terms\\n        Privacy\\n        Cookies\\n        Ads info\\n      \\n    \\n  \\n\\n\\n  \\n\\n\\n\\n      \\n    \\n  \\n\\n\\n    \\n  \\n  \\n\\n\\n\\n    \\n    \\n  \\n\\n  \\n\\n  \\n\\n    \\n  \\n\\n  \\n    \\n\\n\",\"meta_tags\":[{},{\"content\":\"0; URL=https://mobile.twitter.com/i/nojs_router?path=%2FTerriBauman%2Fstatus%2F680996161843380224\"},{\"name\":\"robots\",\"content\":\"NOODP\"},{\"name\":\"msapplication-TileImage\",\"content\":\"//abs.twimg.com/favicons/win8-tile-144.png\"},{\"name\":\"msapplication-TileColor\",\"content\":\"#00aced\"},{\"name\":\"swift-page-name\",\"content\":\"permalink\"},{\"content\":\"article\"},{\"content\":\"https://twitter.com/TerriBauman/status/680996161843380224\"},{\"content\":\"Terri Bauman on Twitter\"},{\"content\":\"https://pbs.twimg.com/media/BcaVtMKCEAAyz9f.jpg:large\"},{\"content\":\"true\"},{\"content\":\"“Social Media Jobs: https://t.co/NDDK4WaRA4 Please Retweet to spread words #OnlineJobs 
#Jobs”\"},{\"content\":\"Twitter\"},{\"content\":\"2231777543\"}],\"links\":[\"https://twitter.com/\",\"https://twitter.com/about\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/#supported_languages\",\"https://twitter.com/?lang=id\",\"https://twitter.com/?lang=msa\",\"https://twitter.com/?lang=cs\",\"https://twitter.com/?lang=da\",\"https://twitter.com/?lang=de\",\"https://twitter.com/?lang=en-gb\",\"https://twitter.com/?lang=es\",\"https://twitter.com/?lang=fil\",\"https://twitter.com/?lang=fr\",\"https://twitter.com/?lang=it\",\"https://twitter.com/?lang=hu\",\"https://twitter.com/?lang=nl\",\"https://twitter.com/?lang=no\",\"https://twitter.com/?lang=pl\",\"https://twitter.com/?lang=pt\",\"https://twitter.com/?lang=ro\",\"https://twitter.com/?lang=fi\",\"https://twitter.com/?lang=sv\",\"https://twitter.com/?lang=vi\",\"https://twitter.com/?lang=tr\",\"https://twitter.com/?lang=el\",\"https://twitter.com/?lang=ru\",\"https://twitter.com/?lang=uk\",\"https://twitter.com/?lang=he\",\"https://twitter.com/?lang=ar\",\"https://twitter.com/?lang=fa\",\"https://twitter.com/?lang=mr\",\"https://twitter.com/?lang=hi\",\"https://twitter.com/?lang=bn\",\"https://twitter.com/?lang=gu\",\"https://twitter.com/?lang=ta\",\"https://twitter.com/?lang=kn\",\"https://twitter.com/?lang=th\",\"https://twitter.com/?lang=ko\",\"https://twitter.com/?lang=ja\",\"https://twitter.com/?lang=zh-cn\",\"https://twitter.com/?lang=zh-tw\",\"https://twitter.com/login\",\"https://twitter.com/account/begin_password_reset\",\"https://tw
itter.com/signup\",\"https://twitter.com/TerriBauman\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/hashtag/Entrepreneur?src=hash\",\"https://twitter.com/hashtag/SocialMediaExpert?src=hash\",\"https://twitter.com/hashtag/SocialMediaMarketer?src=hash\",\"https://twitter.com/hashtag/BusinessOwner?src=hash\",\"https://twitter.com/hashtag/InternetMarketer?src=hash\",\"https://twitter.com/hashtag/SocialMediaJobs?src=hash\",\"https://t.co/ZciT91kZwP\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\",\"https://twitter.com/#\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"http://support.twitter.com/forums/2681
0/entries/78525\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"https://twitter.com/account/begin_password_reset\",\"https://twitter.com/signup\",\"https://twitter.com/signup\",\"https://twitter.com/login\",\"http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code\",\"https://twitter.com/TerriBauman/status/680996164058001408\",\"https://twitter.com/TerriBauman/status/680977383365578752\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://t.co/NDDK4WaRA4\",\"https://twitter.com/hashtag/OnlineJobs?src=hash\",\"https://twitter.com/hashtag/Jobs?src=hash\",\"https://t.co/SJvkM1yWUI\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/cakafete\",\"https://twitter.com/KassemAlYateem\",\"https://twitter.com/Worldspacetech1\",\"https://twitter.com/ElisaBW\",\"https://twitter.com/patrickarrelle\",\"https://twitter.com/AcousticsPro1\",\"https://twitter.com/#\",\"http://status.twitter.com\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\"]}"},{"url":"http://status.twitter.com/page/2","result":"{\"date_crawled\":\"2015-12-27T10:01:58Z\",\"title\":\"Twitter Status\",\"lossyHTML\":\"\\n\\n\\r\\n\\r\\n    \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n            \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n                \\r\\n        \\r\\n\\r\\n        \\r\\n        Twitter Status\\r\\n        \\n\\r\\n        \\r\\n         \\r\\n\\r\\n        \\r\\n\\r\\n    \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\r\\n    
\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n        \\r\\n\\r\\n\\r\\n\\r\\n    \\r\\n    \\r\\n        \\r\\n            \\r\\n                \\r\\n                    Updates on the status of the Twitter service.\\r\\n\\r\\n\\r\\n\\r\\n\\r\\nRelated Links\\r\\nOfficial Company Blog\\r\\n\\r\\nOfficial Help Documents\\r\\n\\r\\nDeveloper Community\\r\\n\\r\\n\\r\\n\\r\\n                    Archive\\r\\n\\r\\n\\r\\n\\r\\n \\r\\n                    Powered by Tumblr\\r\\n                \\r\\n\\r\\n                \\r\\n            \\r\\n            \\r\\n\\r\\n\\r\\n            \\r\\n                \\r\\n                    \\r\\n       

This is what I have been trying:

grep 'https://' input.txt | grep 'status' >> output.txt

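I have seen examples that use sed and awk, but apart from being extremely hard to understand, they are almost always based on column selection, which is not possible in my case.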

Answer 1

With GNU grep, to get URLs that have exactly two slashes after the scheme:

grep -o 'http[s]*://[^/][^\\]*' file

To get URLs with two or more slashes:

grep -o 'http[s]*://[^\\]*' file
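
To make the difference between the two patterns concrete, here is a minimal sketch that runs both against a made-up one-line sample imitating the escaped JSON in the question. The sample text and the variable names (sample, strict, loose) are assumptions for illustration and do not come from the original post.

```shell
#!/bin/sh
# Made-up sample that mimics the backslash-escaped JSON from the question.
sample=$(mktemp)
printf '%s\n' '\"links\":[\"https://twitter.com/about\",\"http:////support.twitter.com\"]' > "$sample"

# Exactly two slashes: [^/] rejects a third slash right after "://".
strict=$(grep -o 'http[s]*://[^/][^\\]*' "$sample")

# Two or more slashes: everything up to the next backslash is accepted.
loose=$(grep -o 'http[s]*://[^\\]*' "$sample")

printf 'strict:\n%s\n' "$strict"
printf 'loose:\n%s\n' "$loose"
rm -f "$sample"
```

The first pattern should report only https://twitter.com/about, while the second should also report the malformed http:////support.twitter.com entry. Both stop at the backslash that precedes the closing quote.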

Recommended reading: the Stack Overflow Regular Expressions FAQ

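[s]*: The asterisk quantifier (*) means that the preceding expression may match zero or more times. Here the preceding expression is a character class (marked by the brackets) that contains only the character s. Writing s* is simpler.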

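[^\\]*: Matches any character except a backslash, zero or more times. I escaped the backslash with a backslash so that it would not escape the ]. (In a POSIX bracket expression, which grep uses, a backslash is already literal, so the doubling is harmless rather than required.)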
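
Putting the pieces together for the goal stated in the question, the following is a rough sketch rather than a definitive solution. The function name extract_status_urls and the sample data are hypothetical, and removing duplicates with sort -u is an assumption about what the asker wants.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct status URL found in the file "$1".
extract_status_urls() {
    # -o prints only the matched text, one match per line, so a very long
    # input line never reaches the output in full.
    grep -o 'http[s]*://[^/][^\\]*' "$1" |
        # Keep only URLs containing /status/; profile and hashtag links are skipped.
        grep '/status/' |
        # Assumption: duplicates are unwanted. Remove this stage to keep them.
        sort -u
}

# Made-up stand-in for the real 100+ MB input file.
sample=$(mktemp)
printf '%s\n' '[\"https://twitter.com/about\",\"https://twitter.com/u/status/2\",\"https://twitter.com/u/status/1\",\"https://twitter.com/u/status/2\"]' > "$sample"
extract_status_urls "$sample"
rm -f "$sample"
```

On the real file this would be called as extract_status_urls input.txt > output.txt. Filtering on /status/ instead of the bare word status also skips URLs where the path appears only in percent-encoded form, such as the nojs_router redirect in the sample input.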
