Answer 1

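OK, I found a solution that works for me. The biggest problem with it is that the XML plugin is... not exactly unstable, but either poorly documented and buggy, or poorly and incorrectly documented.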

TL;DR

Bash command line:

gzcat -d file.xml.gz | tr -d "\n\r" | xmllint --format - | logstash -f logstash-csv.conf

Logstash config:

input {
    stdin {}
}

filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => previous
    }
    # multiline filter adds the tag "multiline" only to lines spanning multiple lines
    # We _only_ want those here.
    if "multiline" in [tags] {
        # Add the encoding line here. Could in theory extract this from the
        # first line with a clever filter. Not worth the effort at the moment.
        mutate {
            replace => ["message",'<?xml version="1.0" encoding="UTF-8" ?>%{message}']
        }
        # This filter exports the hierarchy into the field "entry". This will
        # create a very deep structure that elasticsearch does not really like.
        # Which is why I used add_field to flatten it.
        xml {
            target => entry
            source => message
            add_field => {
                fieldx         => "%{[entry][fieldx]}"
                fieldy         => "%{[entry][fieldy]}"
                fieldz         => "%{[entry][fieldz]}"
                # With deeper nested fields, the xml converter actually creates
                # an array containing hashes, which is why you need the [0]
                # -- took me ages to find out.
                fielda         => "%{[entry][fieldarray][0][fielda]}"
                fieldb         => "%{[entry][fieldarray][0][fieldb]}"
                fieldc         => "%{[entry][fieldarray][0][fieldc]}"
            }
        }
        # Remove the intermediate fields before output. "message" contains the
        # original message (XML). You may or may-not want to keep that.
        mutate {
            remove_field => ["message", "entry"]
        }
    }
}

output {
    ...
}
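
The [0] in the add_field lookups is easier to see with the parsed structure in front of you. Below is a rough Python imitation of that structure, not the plugin's own code: the function name to_forced_arrays and the sample values are made up for illustration, and it assumes every child element is collected into a list, which matches what the comment in the config describes for nested fields.

```python
import xml.etree.ElementTree as ET


def to_forced_arrays(element):
    """Convert an element into nested dicts whose values are always lists."""
    children = list(element)
    if not children:
        # Leaf element: keep only its text content.
        return element.text or ""
    result = {}
    for child in children:
        # Every child tag maps to a list, even when it occurs only once.
        result.setdefault(child.tag, []).append(to_forced_arrays(child))
    return result


# Hypothetical sample entry; the values are placeholders.
sample = (
    "<entry>"
    "<fieldx>x-value</fieldx>"
    "<fieldarray><fielda>a-value</fielda><fieldb>b-value</fieldb></fieldarray>"
    "</entry>"
)
entry = to_forced_arrays(ET.fromstring(sample))

# The nested element is a list holding one dict, hence the [0] in the lookup.
fielda = entry["fieldarray"][0]["fielda"][0]
print(entry)
print(fielda)
```

In this imitation, entry["fieldarray"] is a list containing a single dict, so a lookup without the index cannot reach fielda.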

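Detailed

My solution works because, at least down to the entry level, my XML input is very uniform and can therefore be handled by some kind of pattern matching.

Since the export is basically one really long line of XML, and the logstash xml plugin essentially works only on fields (read: columns in lines) that contain XML data, I had to change the data into a more useful format.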

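Shell: Preparing the file

gzcat -d file.xml.gz |: The file was simply too much data to keep uncompressed; obviously you can skip this step.
tr -d "\n\r" |: Remove line breaks inside XML elements. Some of the elements can contain line breaks as character data, and the next step requires that these are removed or encoded in some way. Since the assumption is that all the XML is on one massive line at this point anyway, it does not matter that this command also removes any line breaks between elements.
xmllint --format - |: Format the XML with xmllint (comes with libxml).

Here the single huge spaghetti line of XML (<root><entry><fieldx>...</fieldx></entry></root>) is properly formatted: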

<root>
  <entry>
    <fieldx>...</fieldx>
    <fieldy>...</fieldy>
    <fieldz>...</fieldz>
    <fieldarray>
      <fielda>...</fielda>
      <fieldb>...</fieldb>
      ...
    </fieldarray>
  </entry>
  <entry>
    ...
  </entry>
  ...
</root>
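
If you want to experiment with the normalisation step without the shell tools, here is a minimal Python sketch of the same idea. It is a stand-in, not what I actually ran: ElementTree replaces xmllint, the function name normalise and the sample input are made up, ET.indent needs Python 3.9 or newer, and unlike xmllint this does not emit an XML declaration line.

```python
import xml.etree.ElementTree as ET


def normalise(raw_xml):
    """Strip line breaks, then re-indent with two spaces per level."""
    # Same effect as tr -d "\n\r": line breaks vanish, including those
    # inside character data.
    one_line = raw_xml.replace("\n", "").replace("\r", "")
    root = ET.fromstring(one_line)
    ET.indent(root, space="  ")  # requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


# Hypothetical input with a line break inside the character data.
raw = "<root><entry><fieldx>first\nsecond</fieldx></entry></root>"
formatted = normalise(raw)
print(formatted)
```

The point to notice is that each entry ends up indented by two spaces and everything inside it by four or more, which is what the pattern in the Logstash step relies on.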

Logstash

logstash -f logstash-csv.conf

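(See the full content of the .conf file in the TL;DR section.)

Here, the multiline filter does the trick. It can merge multiple lines into one single log message. And this is why the formatting with xmllint was necessary: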

filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => previous
    }
}

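This basically says that every line indented by more than two spaces (or a line that is </entry>; xmllint indents with two spaces by default) belongs to the previous line. It also means that character data must not contain newlines (stripped with tr in the shell) and that the XML must be normalised (xmllint).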
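
To check the pattern against your own data before running Logstash, the grouping can be approximated in a few lines of Python. This is only a sketch of the "what => previous" behaviour as I understand it, not the filter's implementation; the function name merge_into_previous and the sample lines are made up for illustration.

```python
import re

# Same pattern as in the config, without the redundant escapes.
PATTERN = re.compile(r"^\s\s(\s\s|</entry>)")


def merge_into_previous(lines):
    """Append every matching line to the event started by an earlier line."""
    events = []
    for line in lines:
        if events and PATTERN.search(line):
            events[-1].append(line)
        else:
            events.append([line])
    return events


# Hypothetical xmllint-style input, two spaces per indentation level.
formatted = """<root>
  <entry>
    <fieldx>...</fieldx>
  </entry>
  <entry>
    <fieldy>...</fieldy>
  </entry>
</root>""".splitlines()

events = merge_into_previous(formatted)
# Only merged events correspond to the ones tagged "multiline" in the config.
multiline_events = [event for event in events if len(event) > 1]
for event in multiline_events:
    print("\n".join(event))
```

The lines <root> and </root> stay as single-line events, which is why the config only processes events carrying the "multiline" tag.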


Answer 2

I had a similar case. To parse this xml:

<ROOT number="34">
  <EVENTLIST>
    <EVENT name="hey"/>
    <EVENT name="you"/>
  </EVENTLIST>
</ROOT>

I use this config for logstash:

input {
  file {
    path => "/path/events.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "<ROOT"
      negate => "true"
      what => "previous"
      auto_flush_interval => 1
    }
  }
}
filter {
  xml {
    source => "message"
    target => "xml_content"
  }
  # Field references use the bracketed [a][b] form, which newer
  # Logstash versions require for nested fields.
  split {
    field => "[xml_content][EVENTLIST]"
  }
  split {
    field => "[xml_content][EVENTLIST][EVENT]"
  }
  mutate {
    add_field => {
      "number" => "%{[xml_content][number]}"
      "name" => "%{[xml_content][EVENTLIST][EVENT][name]}"
    }
    remove_field => ['xml_content', 'message', 'path']
  }
}
output {
  stdout {
    codec => rubydebug
  }
}
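
For reference, this is the result I am after, written as a small Python sketch rather than Logstash config: one flat event per EVENT element, each carrying the number attribute of ROOT. It only illustrates the intended output using ElementTree and says nothing about how the split filter works internally.

```python
import xml.etree.ElementTree as ET

document = """<ROOT number="34">
  <EVENTLIST>
    <EVENT name="hey"/>
    <EVENT name="you"/>
  </EVENTLIST>
</ROOT>"""

root = ET.fromstring(document)
# One flat record per EVENT, mirroring the "number" and "name" fields
# that the mutate block adds.
events = [
    {"number": root.get("number"), "name": event.get("name")}
    for event_list in root.findall("EVENTLIST")
    for event in event_list.findall("EVENT")
]
for event in events:
    print(event)
```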

I hope this helps someone. It took me a long time to get it right.

