Script to convert XML to JSON

I have 5000 questions in a txt file, like this:

<quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>

I want to write a script in Ubuntu that converts all of the questions to look like this:

  {
   "text":"The question her",
   "answer1":"text",
   "answer2":"text",
   "answer3":"text",
   "answer4":"text"
  },

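Answer 1

In fact, you can get away without any Python programming here, using just 2 unix utilities:

  1. jtm - allows lossless xml <-> json conversion
  2. jtc - allows manipulating JSON

So, assuming your xml is in file.xml, jtm would convert it to the following json: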

bash $ jtm file.xml 
[
   {
      "quiz": [
         {
            "que": "The question her"
         },
         {
            "ca": "text"
         },
         {
            "ia": "text"
         },
         {
            "ia": "text"
         },
         {
            "ia": "text"
         }
      ]
   }
]
bash $ 

Then, by applying a series of JSON transformations, you can arrive at the desired result:

bash $ jtm file.xml | jtc -w'<quiz>l:[1:][-2]' -ei echo { '"answer[-]"': {} }\; -i'<quiz>l:[1:]' | jtc -w'<quiz>l:[-1][:][0]' -w'<quiz>l:[-1][:]' -s | jtc -w'<quiz>l:' -w'<quiz>l:[0]' -s | jtc -w'<quiz>l: <>v' -u'"text"'
[
   {
      "answer1": "text",
      "answer2": "text",
      "answer3": "text",
      "answer4": "text",
      "text": "The question her"
   }
]
bash $ 

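However, because shell scripting is involved (the echo command), it will be slower than Python: for 5000 questions I'd expect it to run for about a minute. (In a future version of jtc I plan to allow interpolation even in statically specified JSON, so that templating won't require an external shell script; the operation will then be very fast.)

If you're curious about the jtc syntax, you can find the user guide here: https://github.com/ldn-softdev/jtc/blob/master/User%20Guide.md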
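
For comparison, the reshaping that the jtc pipeline performs can be sketched in Python. This is only an illustration: it assumes the JSON has exactly the shape jtm printed above (a list of objects, each holding a "quiz" list of single-key objects), and the function name flatten_quiz is made up for this example.

```python
import json

# Shape copied from the jtm output shown above (assumed, not re-generated here).
jtm_output = [
    {"quiz": [
        {"que": "The question her"},
        {"ca": "text"},
        {"ia": "text"},
        {"ia": "text"},
        {"ia": "text"},
    ]}
]


def flatten_quiz(entry):
    """Turn one {"quiz": [...]} object into a flat question/answers dict."""
    flat = {}
    answer_no = 1
    for item in entry["quiz"]:
        # Each item is a single-key object such as {"que": "..."}.
        (tag, value), = item.items()
        if tag == "que":
            flat["text"] = value
        else:
            # Every other tag (ca, ia) becomes the next numbered answer.
            flat["answer%d" % answer_no] = value
            answer_no += 1
    return flat


result = [flatten_quiz(entry) for entry in jtm_output]
print(json.dumps(result, indent=3))
```

Because everything happens in one process, this avoids the per-question echo call that slows the shell pipeline down.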

Answer 2

The xq tool from https://kislyuk.github.io/yq/ turns your XML into

{
  "quiz": {
    "que": "The question her",
    "ca": "text",
    "ia": [
      "text",
      "text",
      "text"
    ]
  }
}

simply by using the identity filter (xq . file.xml).

We can massage this into something closer to the form you want using

xq '.quiz | { text: .que, answers: .ia }' file.xml

which outputs

{
  "text": "The question her",
  "answers": [
    "text",
    "text",
    "text"
  ]
}

To fix up the answers bit so that you get enumerated keys:

xq '.quiz |
    { text: .que } +
    (
        [
            range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
        ] | from_entries
    )' file.xml

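This adds the enumerated answer keys, with the values from the ia nodes, by iterating over the ia nodes and manually producing a set of keys and values. These are then turned into real key-value pairs using from_entries and added to the proto-object we created ({ text: .que }).

Output: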

{
  "text": "The question her",
  "answer1": "text",
  "answer2": "text",
  "answer3": "text"
}
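
For readers more at home in Python than in jq, the from_entries step described above can be sketched as follows. It assumes the dictionary shape that xq printed earlier; the variable names are only illustrative.

```python
# Shape copied from the xq output shown earlier (assumed).
doc = {
    "quiz": {
        "que": "The question her",
        "ca": "text",
        "ia": ["text", "text", "text"],
    }
}

quiz = doc["quiz"]

# Counterpart of: range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
entries = [
    {"key": "answer%d" % (i + 1), "value": quiz["ia"][i]}
    for i in range(len(quiz["ia"]))
]

# Counterpart of from_entries: build one object from the key/value pairs.
answers = {entry["key"]: entry["value"] for entry in entries}

# Counterpart of { text: .que } + (...): merge the two objects.
result = {"text": quiz["que"], **answers}
print(result)
```

As in the jq expression, only the ia values are enumerated here; the ca value is not carried over.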

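If your XML document contains multiple quiz nodes under some root node, change .quiz in the jq expression above to .[].quiz[] so that each node is transformed, and you may want to put the resulting objects into an array: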

xq '[ .[].quiz[] |
    { text: .que } +
    (
        [
            range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
        ] | from_entries
    ) ]' file.xml

Answer 3

I'm assuming your Ubuntu already has Python installed

#!/usr/bin/python3
import io
import json
import xml.etree.ElementTree

d = """<quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
"""

s = io.StringIO(d)
# root = xml.etree.ElementTree.parse("filename_here").getroot()
root = xml.etree.ElementTree.parse(s).getroot()
out = {}
i = 1
for child in root:
    name, value = child.tag, child.text
    if name == 'que':
        name = 'question'
    else:
        name = 'answer%s' % i
        i += 1
    out[name] = value

print(json.dumps(out))

Save it and chmod it to make it executable. You can easily modify it to take a file as input rather than just text.

EDIT: OK, here's a more complete script:

#!/usr/bin/python3
import json
import sys
import xml.etree.ElementTree


def read_file(filename):
    root = xml.etree.ElementTree.parse(filename).getroot()
    return root


# assume we have a list of <quiz>, contained in some other element
def parse_quiz(quiz_element, out):
    i = 1
    tmp = {}
    for child in quiz_element:

        name, value = child.tag, child.text
        if name == 'que':
            name = 'question'
        else:
            name = 'answer%s' % i
            i += 1
        tmp[name] = value
    out.append(tmp)


def parse_root(root_element, out):
    for child in root_element:
        if child.tag == 'quiz':
            parse_quiz(child, out)


def convert_xml_to_json(filename):
    root = read_file(filename)
    out = []
    parse_root(root, out)
    print(json.dumps(out))


if __name__ == '__main__':
    if len(sys.argv) > 1:
        convert_xml_to_json(sys.argv[1])
    else:
        print("Usage: script <filename_with_xml>")

I created a file with the following content, which I named xmltest

<questions>
    <quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
     <quiz>
            <que>Question number 1</que>
            <ca>blabla</ca>
            <ia>stuff</ia>
    </quiz>
</questions>

So you have a list of quiz elements inside some other container.

Now, I launch it like this: $ chmod u+x scratch.py, then scratch.py filenamewithxml

This gives me the answer:

$ ./scratch4.py xmltest
[{"answer3": "text", "answer2": "text", "question": "The question her", "answer4": "text", "answer1": "text"}, {"answer2": "stuff", "question": "Question number 1", "answer1": "blabla"}]
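
The script above expects the quiz elements to sit inside a container element. If the question file is just a run of <quiz> blocks with no enclosing element, as the snippet in the question suggests, ElementTree will refuse to parse it, because an XML document needs a single root. One possible workaround, sketched here, is to wrap the text in a synthetic root before parsing. The <questions> wrapper name and the load_quizzes helper are assumptions made for this example, and the sketch assumes the file carries no XML declaration line.

```python
import xml.etree.ElementTree as ET


def load_quizzes(text):
    """Parse a run of <quiz> blocks that has no enclosing root element."""
    # Wrap the fragments in a made-up root so the parser sees one document.
    root = ET.fromstring("<questions>" + text + "</questions>")
    return root.findall("quiz")


sample = """
<quiz><que>The question her</que><ca>text</ca><ia>text</ia></quiz>
<quiz><que>Question number 1</que><ca>blabla</ca><ia>stuff</ia></quiz>
"""

quizzes = load_quizzes(sample)
print(len(quizzes))
for quiz in quizzes:
    print(quiz.find("que").text)
```

With a real file, the text would come from something like open(path, encoding="utf-8").read(), and each returned element could be handed to parse_quiz unchanged.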

Answer 4

Thanks dgan, but your code (1) prints the output to the screen instead of writing it to a JSON file, and (2) doesn't support encoding = utf-8, so I changed it:

#!/usr/bin/python3
import json
import sys
import xml.etree.ElementTree


def read_file(filename):
    root = xml.etree.ElementTree.parse(filename).getroot()
    return root


# assume we have a list of <quiz>, contained in some other element
def parse_quiz(quiz_element, out):
    i = 1
    tmp = {}
    for child in quiz_element:

        name, value = child.tag, child.text
        if name == 'que':
            name = 'question'
        else:
            name = 'answer%s' % i
            i += 1
        tmp[name] = value
    out.append(tmp)


def parse_root(root_element, out):
    for child in root_element:
        if child.tag == 'quiz':
            parse_quiz(child, out)


def convert_xml_to_json(filename):
    root = read_file(filename)
    out = []
    parse_root(root, out)
    # In Python 3, open the file in text mode with an explicit UTF-8 encoding;
    # wrapping a text-mode file in codecs.getwriter() raises a TypeError.
    with open('data.json', 'w', encoding='utf-8') as outfile:
        json.dump(out, outfile, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    if len(sys.argv) > 1:
        convert_xml_to_json(sys.argv[1])
    else:
        print("Usage: script <filename_with_xml>")
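
The effect of ensure_ascii=False together with an explicit UTF-8 file encoding can be checked with a small round trip. This is a sketch only; the sample question text and the temporary file location are made up for the demonstration.

```python
import json
import os
import tempfile

# Made-up sample data containing non-ASCII characters.
data = [{"question": "Quelle est la capitale de la Côte d'Ivoire ?",
         "answer1": "Yamoussoukro"}]

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as they are, so the file has to be
# opened with an encoding that can represent them.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(data, outfile, sort_keys=True, ensure_ascii=False)

with open(path, encoding="utf-8") as infile:
    raw = infile.read()

print(escaped)
print(raw)
```

Both forms load back to the same data; the difference is only whether the file on disk stays readable for non-English text.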
