Extracting SMS data from XML

Question 1

Assuming the XML is well-formed and that all chat nodes occur under a single root node, you could use xq (part of the yq distribution, from https://kislyuk.github.io/yq/):

xq -r '["address","messageBody","messageTime"], (.root.chat[] | [.address,.messageBody,.messageTime]) | @csv' file.xml

With the broken XML from the question corrected by adding the missing start and end tags, this produces the following CSV output:

"address","messageBody","messageTime"
,,"1624297248761"
"447917504050","Yeah mate let's do lunch and catch up.","1629944007697"
"447917563330","You going now mate",

Question 2

Another xmlstarlet answer, this one outputting comma-separated data:

xmlstarlet sel -t -m //chat -v messageTime -o , -v address -o , -v messageBody -n file.xml

1624297248761,,
1629944007697,447917504050,Yeah mate let's do lunch and catch up.
,447917563330,You going now mate

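This puts the message body last, so that whatever consumes the comma-separated data can take everything from the third field to the end of the line as the body.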
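As a minimal sketch of such a consumer, the shell's own read builtin can do this: with IFS set to a comma, the last variable receives the rest of the line, embedded commas included. The function name parse_records and the second sample record (timestamp and body) are made up for illustration and are not from the question's data.

```shell
# parse_records is a hypothetical helper name, not part of any tool above.
# It expects lines laid out as messageTime,address,messageBody.
parse_records() {
  # With IFS=, the last variable (body) gets everything after the
  # second comma, so commas inside the message text are preserved.
  while IFS=, read -r time address body; do
    printf 'time=%s address=%s body=%s\n' "$time" "$address" "$body"
  done
}

# The first record is from the output above; the second is invented to
# show a body that itself contains a comma.
parsed=$(parse_records <<'EOF'
1629944007697,447917504050,Yeah mate let's do lunch and catch up.
1629944007698,447917504050,Sure, see you at noon
EOF
)
printf '%s\n' "$parsed"
```

In practice the here-document would be replaced by a pipe from the xmlstarlet command.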

The message time is the number of milliseconds since 1970-01-01 00:00:00 UTC. One way to handle that is with GNU awk:

xmlstarlet sel -t -m //chat -v messageTime -o , -v address -o , -v messageBody -n file.xml \
| TZ=UTC gawk 'BEGIN {FS = OFS = ","} {$1 = strftime("%F %T", $1 / 1000)} 1'

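Output (the third record has no messageTime, so its empty field is converted as 0 and shows up as the epoch):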

2021-06-21 17:40:48,,
2021-08-26 02:13:27,447917504050,Yeah mate let's do lunch and catch up.
1970-01-01 00:00:00,447917563330,You going now mate

This format can easily be sorted chronologically.
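A small sketch of that sort, assuming the converted lines are available on standard input: because the timestamp is written as year-month-day hour:minute:second, a plain byte-wise sort on the first comma-separated field is also a chronological sort. The variable names are arbitrary, and the sample lines are the ones shown above, deliberately shuffled.

```shell
# The converted records from above, out of chronological order.
converted=$(cat <<'EOF'
2021-08-26 02:13:27,447917504050,Yeah mate let's do lunch and catch up.
1970-01-01 00:00:00,447917563330,You going now mate
2021-06-21 17:40:48,,
EOF
)

# Sort on the first comma-separated field only; LC_ALL=C forces a
# byte-wise comparison, which matches time order for this format.
sorted=$(printf '%s\n' "$converted" | LC_ALL=C sort -t, -k1,1)
printf '%s\n' "$sorted"
```

The record without a messageTime sorts first because it was rendered as the epoch.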

Question 3

Since the XML is not well-formed, as noted in the comments, wrap all of the text in new tags, like this:

<?xml version="1.0"?>
<myxml>
  <chat>
  ....your data which already includes </chat><chat> 
  </chat>
</myxml>

Then you can use xmlstarlet like this (for example, to get the addresses):

xmlstarlet select --template --value-of /myxml/chat/address --nl input_file.xml

(input_file.xml should contain the data with the extra tags described above.)
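One possible way to produce such an input_file.xml from the shell is sketched below. It assumes the original data lacks exactly the outer chat pair and the root element; the file name fragment.xml and its contents are placeholders made up for this illustration, standing in for the question's data.

```shell
# Work in a scratch directory so nothing in the current one is touched.
workdir=$(mktemp -d)

# Hypothetical stand-in for the broken data: it already contains the
# inner </chat><chat> separator but no outer <chat> pair and no root.
cat > "$workdir/fragment.xml" <<'EOF'
  <address>447917504050</address>
  </chat><chat>
  <address>447917563330</address>
EOF

# Emit the XML declaration and opening tags, then the data unchanged,
# then the closing tags.
{
  printf '%s\n' '<?xml version="1.0"?>' '<myxml>' '<chat>'
  cat "$workdir/fragment.xml"
  printf '%s\n' '</chat>' '</myxml>'
} > "$workdir/input_file.xml"

cat "$workdir/input_file.xml"
```

The resulting file can then be passed to the xmlstarlet command above in place of input_file.xml.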

More examples here.
