I have the output of a program that gives me some arbitrary text with .json content in it, for example:
blablablabla
blablab some more text
blablablabla
blablab some more text
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
blablablabla
blablab some more text
blablablabla
blablab some more text
I want to strip the text outside the .json so that I can parse it with "jq".
I only need this text:
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
Thanks!
Answer 1
sed '/^{/,/^}/!d' < input
would extract the part of the file between a line starting with { and the next line starting with }, both lines included.
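As a minimal sketch of the sed range above, the following builds a small sample file and runs the command on it. The sample content and the variable names input and json are made up for illustration. Note the assumption that the JSON is indented, so that only the outermost braces sit at the start of a line; an inner line starting with } would end the range early.

```shell
# Hypothetical sample: noise lines around an indented JSON object.
input=$(mktemp)
printf '%s\n' \
  'blablablabla' \
  '{' \
  '    "glossary": {' \
  '        "title": "example glossary"' \
  '    }' \
  '}' \
  'blablab some more text' > "$input"

# Delete (d) every line that is NOT (!) inside the range that starts
# at a line beginning with { and ends at a line beginning with }.
json=$(sed '/^{/,/^}/!d' < "$input")
printf '%s\n' "$json"

rm -f "$input"
```

The result can then be piped straight into jq, for instance with sed '/^{/,/^}/!d' < input | jq .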
pcregrep -Mo '(?s)(\{(?:[^{}"]++|"(?:\\.|[^"])*+"|(?1))*\})' < file
would extract the top-level {...} pairs wherever they are, and is smart enough to handle input such as {"x":{"y":1}} (nested {}), { "x}" } (a } inside a string), or { "x\"}" } (an escaped quote inside a string).
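If you don't have and can't install pcregrep (it ships with the PCRE library), but you have a GNU grep built with PCRE support, you can use grep -zo in its place (with -P to select PCRE matching), though it loads the whole file into memory. Or use perl -l -0777 -ne 'print for m{regexp-above}g'.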