Extracting JSON from a text file that contains arbitrary text


I have the output of a program that prints some arbitrary text with JSON content embedded in it, for example:

blablablabla
blablab some more text

blablablabla
blablab some more text
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}


blablablabla
blablab some more text


blablablabla
blablab some more text

I want to strip the text outside the JSON so that I can parse it with "jq".

I only need this text:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

Thanks!

Answer 1

sed '/^{/,/^}/!d' < input

This extracts the part of the file between a line that starts with { and the next line that starts with }, both lines included.
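To make the range behaviour concrete, here is a minimal sketch that runs the same sed command on a made-up sample (the `sample` and `extracted` variable names are illustrative, not from the question). It assumes the JSON is pretty-printed so that only the outermost braces sit at the start of a line.

```shell
# Hypothetical sample output: chatter around one pretty-printed JSON object.
sample='blablab some more text
{
    "glossary": {
        "title": "example glossary"
    }
}
blablab some more text'

# Delete every line that is NOT inside a /^{/ ... /^}/ range.
extracted=$(printf '%s\n' "$sample" | sed '/^{/,/^}/!d')
printf '%s\n' "$extracted"
```

The indented closing braces do not end the range because they do not start at column 0. If jq is installed, the result can be piped straight into it, for example `printf '%s\n' "$extracted" | jq .`.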

pcregrep -Mo '(?s)(\{(?:[^{}"]++|"(?:\\.|[^"])*+"|(?1))*\})' < file

This extracts the top-level {...} pairs wherever they appear, and is smart enough to handle input like {"x":{"y":1}} (nested {}), { "x}" } (a } inside a string), or { "x\"}" } (an escaped quote inside a string).

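If you don't have and can't install pcregrep (it ships with the PCRE library), but you do have a GNU grep built with PCRE support, you can replace pcregrep -Mo with grep -zPo, although that loads the whole file into memory. Or use perl -l -0777 -ne 'print for m{regexp-above}g'.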
