sed

I have a file, filename.json. If I inspect it in the terminal with

file filename.json

the output is:

filename.json: UTF-8 Unicode text, with very long lines  

wc -l filename.json    
1 filename.json

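If I parse it as JSON using jq, I have to name the parts of the data I want printed, e.g. id, summary, author, etc. I have thousands of JSON files with a similar structure, but the data I want is stored under varying keys such as "summary", "description", "review", etc. Since there are thousands of JSON files, I don't want to check each one. But I know that the data I want lies between two patterns:

"title": and "url":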

$ cat filename.json

gives:

{"source":"PhoneArena","author":"","title":"Apple's US Black Friday shopping event has gift cards galore for select iPhones, iPads, and more","description":"As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ...","url":"https://www.phonearena.com/news/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more_id111287","urlToImage":"https://i-cdn.phonearena.com/images/article/111287-two_lead/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more.jpg","publishedAt":"2018-11-23T09:05:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},{"source":"PhoneArena","author":"","title":"Verizon's top Black Friday bargain is a free Moto G6, no trade-in required","description":"That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ...","url":"https://www.phonearena.com/news/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required_id111285","urlToImage":"https://i-cdn.phonearena.com/images/article/111285-two_lead/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required.jpg","publishedAt":"2018-11-23T07:54:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},

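So I want to print everything between the patterns, but in the terminal the file is only 1 line long and the patterns occur many times. The only way I can think of is to print between the two patterns, repeatedly, until the end of the file.

I tried using sed: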

sed -n '^/title/,/^url/p' filename.json

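But it prints nothing.

I want to feed the data onward for language analysis using machine-learning techniques.

Are there any other ways to print the text between two patterns when the patterns repeat many times? I want to print the data between each occurrence.

The expected result, printed as CSV or TSV: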

1 "As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ..."

2 "That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ..."

and so on, until the end of the file.

Answer 1

TL;DR

In ksh, bash, or zsh:

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'
         s,\1\\([^\2]*\\)\2[^\1]*,\\1\\\n,g' infile

sed

One-character delimiters.

The canonical solution for one-character delimiters, assuming the delimiters are @ and #, is:

sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g' infile

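For each line of the input file infile, this:

- removes every character from the start of the line that is not an @;
- extracts the characters between each @ and the first # that follows it, discarding whatever comes after that # up to the next @.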
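To make the two substitutions concrete, here is a minimal sketch that runs the same sed expression on a made-up one-line input. The function name between_at_hash and the sample string are illustrative assumptions, not part of the original answer.

```shell
# between_at_hash is a hypothetical helper name used only for this illustration.
# It applies the one-character-delimiter expression from above to standard input.
between_at_hash() {
    # 1st command: delete everything before the first @.
    # 2nd command: for each @...# pair keep the text in between, then drop
    #              whatever follows the # up to (not including) the next @.
    sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g'
}

# Made-up sample line: two @...# pairs surrounded by text we do not want.
printf '%s\n' 'junk@first#noise@second#tail' | between_at_hash
# prints "first second " (each extracted piece is followed by one space)
```

Note that a line containing no @ at all is reduced to an empty line, because the first command deletes everything that is not an @.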

Generic delimiters.

Any other delimiter can be reduced to the answer above by simply converting each delimiter string to a single character.

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1 /g' infile

In your case, you can use a newline instead of the space (the space after \1 in the replacement); with GNU sed a newline is simple to write (\1\n):

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\n/g' infile

For other (older) seds, add an explicit newline:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g' infile

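If there is a risk that the delimiters used above occur inside the file, choose other delimiters that do not occur in it. If that seems to be a problem, the start and end delimiters can be control characters such as Ctrl-A (also written ^A, hex 0x01, or octal \001). You can enter it in a shell console by typing Ctrl-V Ctrl-A. It is displayed on the command line as ^A, which stands for a single control character, not a caret followed by the letter A: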

sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A\([^^B]*\)^B[^^A]*,\1\n,g' infile

Or, if that is too cumbersome to type, use (ksh, bash, zsh):

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\\n,g' infile

Or, if your sed supports it:

sed -e 's,"title":,\o001,g' -e 's,"url":,\o002,g' -e 's,^[^\o001]*,,' -e 's,\o001\([^\o002]*\)\o002[^\o001]*,\1\o012,g' infile
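If neither the $'...' quoting nor the \o001 escape is available, the control characters can also be produced with printf. The following is a minimal sketch under that assumption; the function name extract_title_to_url and the variable names soh and stx are made up here, and it assumes a sed that accepts literal control characters in its patterns.

```shell
# extract_title_to_url is a hypothetical name; it reads JSON text on stdin and
# prints, one per line, whatever lies between each "title": and the next "url":.
extract_title_to_url() {
    soh=$(printf '\001')   # start marker (Ctrl-A), assumed absent from the data
    stx=$(printf '\002')   # end marker (Ctrl-B), assumed absent from the data
    # Inside double quotes, \\( becomes \( and \\1 becomes \1 for sed; the
    # backslash before the line break is the portable way to emit a newline.
    sed -e "s,\"title\":,${soh},g" \
        -e "s,\"url\":,${stx},g" \
        -e "s,^[^${soh}]*,," \
        -e "s,${soh}\\([^${stx}]*\\)${stx}[^${soh}]*,\\1\\
,g"
}

# Example call (infile as in the answer above):
#   extract_title_to_url < infile
```

Because every extracted piece is followed by a newline, the output ends with one extra empty line per input line.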

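If the delimiter is "description":

If the starting tag is actually "description": (judging from your sample of the expected output), just use it instead of "title":

The output of the above (from the file you previously linked in the question):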

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

If you need to number the lines, pipe the output through sed once more, using sed -n '=;p;g;p':

| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

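The =;p;g;p script puts each number on a line of its own. If the number and the text should share a line, separated by a tab as in the TSV layout the question asks for, awk is one option. This is a minimal sketch; the function name number_as_tsv is hypothetical, and it assumes the already extracted lines arrive on standard input.

```shell
# number_as_tsv is a hypothetical helper: it prefixes every non-blank input
# line with a running number and a tab, and silently skips blank lines
# (such as the empty line the sed extraction leaves at the end).
number_as_tsv() {
    awk 'NF { n++; print n "\t" $0 }'
}

# Example call, appended to the extraction pipeline:
#   ... | number_as_tsv
```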
AWK

Similar logic implemented in awk:

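awk -vone=$'\1' -vtwo=$'\2' '{
            gsub(/"title":/,one);           # mark every start tag with the \001 character
            gsub(/"url":/,two);             # mark every end tag with the \002 character
            sub("^[^"one"]*"one,"")         # drop everything up to and including the first start marker
            gsub(two"[^"one"]*"one,ORS)     # turn each gap from an end marker to the next start marker into a newline
            sub(two"[^"two"]*$","")         # drop the last end marker and everything after it
           } 1' infile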
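The same extraction can also be written with plain string functions, which avoids having to pick marker characters that are absent from the data. The following is a minimal sketch, not the method used in the answer: between_markers, from and upto are names invented here, and both markers are treated as literal strings. Note that awk interprets backslash escapes in -v assignments, so markers containing backslashes would need extra care.

```shell
# between_markers is a hypothetical helper.
#   $1 = start marker, $2 = end marker (literal strings, not regular expressions)
# It prints, one per line, the text between each start marker and the next
# end marker found on every input line.
between_markers() {
    awk -v from="$1" -v upto="$2" '{
        rest = $0
        while ((i = index(rest, from)) > 0) {
            rest = substr(rest, i + length(from))    # skip past the start marker
            j = index(rest, upto)
            if (j == 0) break                        # no closing marker: stop
            print substr(rest, 1, j - 1)             # the text in between
            rest = substr(rest, j + length(upto))    # continue after the end marker
        }
    }'
}

# Example call (infile as in the answer above):
#   between_markers '"description":' ',"url":' < infile
```

Like the sed versions, this works on the raw text and knows nothing about JSON quoting, so a description that itself contains the end marker would be cut short.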
