I am writing a shell script to generate a table of contents.
As input, I receive a long HTML string:
https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw","$type":"com.traver.voyager.feed.actions.Action"},
link to post","url":"https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO","$type":
article","$type":"com.traver.voyager.feed.actions.Action"},{"actionType":"SHARE_VIA","text":"Copy link to post","url":"https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T","$type":"com.traver.voyager
To keep the output easy to customize, the script should display only a table of URLs:
https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw
https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO
https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T
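The search pattern is: starts with "https://www.", followed by XXXXX characters (variable length), and ends at a " (the quote itself should not be extracted).
My current solution is based on cut -f, but the total input size varies, so fixed field positions cannot locate the pattern.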
Answer 1
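Your example data looks like a broken fragment of JSON, so you really should use jq to extract what you need from it, before doing whatever you did to the original input that left it looking like this.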
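As a rough sketch of the jq route, assuming the original input is still valid JSON and that the post links sit in string fields named "url" (the field name comes from the fragment in the question; the surrounding structure is unknown, and the helper name extract_urls_jq is made up for illustration):

```shell
#!/bin/sh
# extract_urls_jq: hypothetical helper. Assumes valid JSON input in which
# post links are stored in string fields named "url" at any nesting depth.
extract_urls_jq() {
    # ..       walks every value in the document recursively
    # objects  keeps only JSON objects
    # .url     reads the "url" field (null when the field is absent)
    # strings  keeps only string values, which drops the nulls
    # select   keeps only values that start with https://www
    jq -r '.. | objects | .url | strings | select(startswith("https://www"))' "$@"
}
```

It would be called as extract_urls_jq feed.json, where feed.json is a placeholder name for the intact JSON document, or with the JSON piped in on standard input. Because jq parses the structure, it will not be confused by quotes or URLs that appear inside other text fields.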
However, to extract the URLs that start with https://www and contain no double-quote characters from what you already have, you could use grep:
$ grep -o 'https://www[^"]*' input.txt
https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw
https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO
https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T
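If this needs to live inside a script rather than be typed at the prompt, the grep command above can be wrapped in a small function. This is only a sketch: the function name extract_urls and the shortened sample string are made up for illustration, and it relies on grep supporting -o, which GNU and BSD grep do.

```shell
#!/bin/sh
# extract_urls: hypothetical wrapper around the grep command shown above.
# Prints every substring that starts with https://www and runs up to,
# but does not include, the next double quote. Reads the files given as
# arguments, or standard input when there are none.
extract_urls() {
    grep -o 'https://www[^"]*' "$@"
}

# Made-up sample input, shortened from the data in the question.
sample='link to post","url":"https://www.mycompany.com/posts/a-1","$type":"x"},{"url":"https://www.mycompany.com/posts/b-2","$type":"y"}'

# Prints one URL per line:
#   https://www.mycompany.com/posts/a-1
#   https://www.mycompany.com/posts/b-2
printf '%s\n' "$sample" | extract_urls
```

Since each match is printed on its own line, the total length of the input and the number of URLs on a line do not matter, which is the limitation the cut -f approach ran into.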