Suppose I have the following file:
A random Title 1
BLOCK
1- a block of text that can contain any character
and it also can contain multiple lines
BLOCK
A random Title 2
BLOCK
2- a block of text that can contain any character
and it also can contain multiple lines
BLOCK
A random Title 3
BLOCK
3- a block of text that can contain any character
and it also can contain multiple lines
BLOCK
The file can contain any number of text blocks like these. I would like to turn the pieces of this text into the following JSON:
[
{
"title": "A random Title 1",
"body": "1- a block of text that can contain any character\nand it also can contain multiple lines"
},
{
"title": "A random Title 2",
"body": "2- a block of text that can contain any character\nand it also can contain multiple lines"
},
{
"title": "A random Title 3",
"body": "3- a block of text that can contain any character\nand it also can contain multiple lines"
}
]
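I know I could solve this with a loop that walks through the file character by character, plus the logic needed to sort everything into the right JSON fields. But I would like to know whether there is a simpler solution using the command line. Can I use AWK to turn the values found in the file into JSON output? Or am I misunderstanding what AWK is for?
Answer 1
You could try Miller: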
$ mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson --jvstack --jlistwrap \
put -S 'for(k,v in $*){$[k] = strip(v)}' then \
cut -f 1,2 then \
rename '1,title,2,body' file
[
{
"title": "A random Title 1",
"body": "1- a block of text that can contain any character\nand it also can contain multiple lines"
}
,{
"title": "A random Title 2",
"body": "2- a block of text that can contain any character\nand it also can contain multiple lines"
}
,{
"title": "A random Title 3",
"body": "3- a block of text that can contain any character\nand it also can contain multiple lines"
}
]
You can prettify the output by piping it through jq '.', or omit the --jvstack --jlistwrap options and pipe through jq -s '.':
$ mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson \
put -S 'for(k,v in $*){$[k] = strip(v)}' then \
cut -f 1,2 then \
rename '1,title,2,body' file | jq -s '.'
[
{
"title": "A random Title 1",
"body": "1- a block of text that can contain any character\nand it also can contain multiple lines"
},
{
"title": "A random Title 2",
"body": "2- a block of text that can contain any character\nand it also can contain multiple lines"
},
{
"title": "A random Title 3",
"body": "3- a block of text that can contain any character\nand it also can contain multiple lines"
}
]
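The cut -f 1,2 is only necessary because the second BLOCK marker implies a third (empty) field. If you prefer, you can replace it with the verb remove-empty-columns (although the latter is non-streaming).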
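To see where that empty third field comes from, here is a minimal Python sketch (not Miller itself) that splits one record on the literal marker, roughly the way --ifs 'BLOCK' does. The variable names are made up for illustration.

```python
# One blank-line-delimited record from the sample input.
record = (
    "A random Title 1\n"
    "BLOCK\n"
    "1- a block of text that can contain any character\n"
    "and it also can contain multiple lines\n"
    "BLOCK"
)

# Splitting on the marker mirrors the field separator 'BLOCK';
# strip() removes the newlines left around each piece.
fields = [part.strip() for part in record.split("BLOCK")]

# The closing marker leaves an empty trailing field, so only the
# first two fields are kept (the same effect as cut -f 1,2).
title, body = fields[:2]
print(len(fields), repr(fields[2]))
```

As with the Miller command, a body that itself contains the text BLOCK would be split in the wrong place.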
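Handling consecutive newlines in the body
Unfortunately, the above does not distinguish between the consecutive newlines that act as the input record separator and consecutive newlines that may occur in the body between the BLOCK ... BLOCK delimiters. As a workaround, you could preprocess the input to replace the newlines within each body with \n sequences, and then replace them back with literal newlines just before writing out the JSON (where Miller will escape them back to \n):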
sed '/^BLOCK/{:a;N;/BLOCK$/!ba;s/\n/\\n/g;}' file |
mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson put -S '$2 = gsub($2,"\\n","\n"); for(k,v in $*){$[k] = strip(v)}' then cut -f 1,2 then rename '1,title,2,body'
You could pass the sed filter to Miller as a --prepipe command, but the quoting gets tricky.
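The same protect-then-restore idea can be sketched in Python. This is an illustration under assumptions, not a port of the sed script: the helper name protect_body_newlines is hypothetical, and it treats only lines that are exactly BLOCK as markers, whereas the sed patterns above are looser.

```python
def protect_body_newlines(text):
    """Join each BLOCK...BLOCK span onto one line using a literal backslash-n."""
    out, buf = [], None              # buf is a list only while inside a body
    for line in text.split("\n"):
        if buf is None:
            if line == "BLOCK":
                buf = [line]         # opening marker starts a body
            else:
                out.append(line)
        else:
            buf.append(line)
            if line == "BLOCK":      # closing marker ends the body
                out.append("\\n".join(buf))
                buf = None
    return "\n".join(out)


# The first body contains a blank line that must not split the record.
text = "T1\nBLOCK\na\n\nb\nBLOCK\n\nT2\nBLOCK\nc\nBLOCK"
protected = protect_body_newlines(text)
records = protected.split("\n\n")

# Restore real newlines in the body of the first record.
first_body = records[0].split("BLOCK")[1].replace("\\n", "\n").strip()
```

A body that already contains a literal backslash followed by n would be turned into a real newline on restore, which is a limitation this sketch shares with the workaround it illustrates.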
Answer 2
In the TXR language, we can do it like this:
$ txr data.txr data
[{"title":"A random Title 1","body":"1- a block of text that can contain any character\nand it also can contain multiple lines"},
{"title":"A random Title 2","body":"2- a block of text that can contain any character\nand it also can contain multiple lines"},
{"title":"A random Title 3","body":"3- a block of text that can contain any character\nand it also can contain multiple lines"}]
where the code in data.txr is:
@(bind vec @(vec))
@(repeat)
@title
BLOCK
@(collect)
@lines
@(until)
BLOCK
@(end)
@(cat lines "\n")
@(do (vec-push vec #J^{"title" : ~title, "body" : ~lines}))
@(end)
@(do (put-jsonl vec))
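We build up a vector of hashes: the underlying data structure that corresponds to the desired JSON.
The #J prefix denotes a JSON literal embedded in Lisp. Here we have a ^ indicating that the literal is being quasiquoted; the ~ characters denote unquotes that insert values into the template: the title, and the body, which has been computed from the collected lines by joining them with newlines.
put-jsonl means put-json followed by a newline. By default it writes to the *stdout* stream.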
Indentation is recommended, like this:
@(bind vec @(vec))
@(repeat)
@  title
BLOCK
@  (collect)
@    lines
@  (until)
BLOCK
@  (end)
@  (cat lines "\n")
@  (do (vec-push vec #J^{"title" : ~title, "body" : ~lines}))
@(end)
@(do (put-jsonl vec))
This can also be done with a kind of awk: the awk macro in TXR Lisp:
$ txr data.tl data
[{"title":"A random Title 1","body":"1- a block of text that can contain any character\nand it also can contain multiple lines"},
{"title":"A random Title 2","body":"2- a block of text that can contain any character\nand it also can contain multiple lines"},
{"title":"A random Title 3","body":"3- a block of text that can contain any character\nand it also can contain multiple lines"}]
The code:
(awk
(:set rs "\n\n" fs "\n")
(:let (vec (vec)))
((and (equal [f 1] "BLOCK")
(equal [f -1] "BLOCK"))
(vec-push vec #J^{"title":~[f 0], "body":~(cat-str [f 2..-1])})
(next))
(t (error "bad data"))
(:end (put-jsonl vec)))
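The (:set ...) clause is for initialization; we use it to set the record separator rs and the field separator fs, which are analogous to RS and FS in the original Awk. With a newline field separator and a double-newline record separator, we get each block of information as one record whose fields look like this: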
"title" "BLOCK" "body1" "body2" ... "bodyn" "BLOCK"
Inside the awk macro, the fields are available as a list named f.
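For readers more familiar with Python than TXR, the record and field layout described above can be sketched as follows. This is an analogy only: the sample text is shortened, and the comments mapping Python slices to the TXR index expressions are approximate.

```python
# Two records separated by a blank line, shaped like the sample input.
text = (
    "A random Title 1\nBLOCK\nline one\nline two\nBLOCK\n"
    "\n"
    "A random Title 2\nBLOCK\nonly line\nBLOCK\n"
)

# Record separator "\n\n": one record per blank-line-separated chunk.
records = text.strip("\n").split("\n\n")

# Field separator "\n": one field per line of the record.
f = records[0].split("\n")

title = f[0]                # roughly [f 0]
markers = (f[1], f[-1])     # roughly [f 1] and [f -1]
body_lines = f[2:-1]        # roughly [f 2..-1]
```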
The main logic consists of (condition action) pairs. The condition is:
(and (equal [f 1] "BLOCK") (equal [f -1] "BLOCK"))
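is true if the second and last elements of f are the string "BLOCK". If it is true, the action is executed, which extracts the pieces and adds an item to vec with the help of a JSON quasiquote, just like in the first program. We also execute (next) to move on to the next record, so that we do not fall through to the next condition-action pair.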
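That next condition-action pair, (t (error ...)), always executes when it is reached, because t is true, and it throws an exception. Thanks to the (next) above, it is reached only by records that failed the first condition.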
We print the JSON in the (:end ...) clause, which is like END { ... } in classic Awk.
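Speaking of error checking, the first program is somewhat tolerant of bad data; there are ways to tighten it up so that it rejects bad input. For example, there can be junk between records, which is silently skipped, and it does not matter if the final closing BLOCK is missing.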
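The strict behaviour of the awk-macro version (every record must match, otherwise an error is raised) could look roughly like this in Python. It is a sketch under assumptions: the function name parse_strict is hypothetical, records are assumed to be separated by exactly one blank line, and, like the awk-style solutions, it cannot handle blank lines inside a body.

```python
import json


def parse_strict(text):
    """Parse blank-line-separated records; raise ValueError on a malformed one."""
    items = []
    for rec in text.strip("\n").split("\n\n"):
        f = rec.split("\n")
        # Expected layout: title, BLOCK, zero or more body lines, BLOCK.
        if len(f) < 3 or f[1] != "BLOCK" or f[-1] != "BLOCK":
            raise ValueError(f"bad data near {f[0]!r}")
        items.append({"title": f[0], "body": "\n".join(f[2:-1])})
    return items


good = "T1\nBLOCK\na\nb\nBLOCK\n\nT2\nBLOCK\nc\nBLOCK\n"
print(json.dumps(parse_strict(good), indent=4))
```

With this check, both junk between records and a missing final BLOCK raise an error instead of being skipped.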
Answer 3
Using any awk in any shell on every Unix box:
$ cat tst.awk
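BEGIN {
    RS = ""      # paragraph mode: records are separated by blank lines
    FS = "\n"    # each line of a record is one field
    printf "["
}
{
    gsub(/"/,"\\\\&")    # escape double quotes for JSON
    title = $1
    body = $3
    # fields 4..NF-1 are the remaining body lines; $NF is the closing BLOCK
    for (i=4; i<NF; i++) {
        body = body "\\n" $i
    }
    print (NR>1 ? "," : "")
    print " {"
    printf " \"title\": \"%s\",\n", title
    printf " \"body\": \"%s\"\n", body
    printf " }"
}
END {
    print "\n]"
}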
$ awk -f tst.awk file
[
{
"title": "A random Title 1",
"body": "1- a block of text that can contain any character\nand it also can contain multiple lines"
},
{
"title": "A random Title 2",
"body": "2- a block of text that can contain any character\nand it also can contain multiple lines"
},
{
"title": "A random Title 3",
"body": "3- a block of text that can contain any character\nand it also can contain multiple lines"
}
]
Answer 4
Use Python with the itertools module to group the input into blocks/paragraphs, then use the dumps method of the json module to print them as JSON.
python3 -c 'import sys, json, itertools
ifile,rs = sys.argv[1],chr(10)
lod = []
with open(ifile) as fh:
    # group consecutive lines; k is True for runs of blank lines
    for k,g in itertools.groupby(fh, lambda x: x == rs):
        if not k:
            para = list(g)
            # drop the closing BLOCK line; x receives the opening BLOCK
            title,x,*body = list(map(lambda x: x.rstrip(rs),para[0:-1]))
            lod.append({
                "title": title,
                "body": rs.join(body)
            })
print(json.dumps(lod, sort_keys=False, indent=4))
' file
Output:
[
{
"title": "A random Title 1",
"body": "1- a block of text that can contain any character\nand it also can contain multiple lines"
},
{
"title": "A random Title 2",
"body": "2- a block of text that can contain any character\nand it also can contain multiple lines"
},
{
"title": "A random Title 3",
"body": "3- a block of text that can contain any character\nand it also can contain multiple lines"
}
]