Handling consecutive newlines in the body

Suppose I have the following file:

A random Title 1
BLOCK
1- a block of text that can contain any character
and it  also can contain multiple lines
BLOCK

A random Title 2
BLOCK
2- a block of text that can contain any character
and it  also can contain multiple lines
BLOCK

A random Title 3
BLOCK
3- a block of text that can contain any character
and it  also can contain multiple lines
BLOCK

The file can contain many text blocks like these. I would like to distribute the parameters in this text into the following JSON:

[
    {
        "title": "A random Title 1",
        "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 2",
        "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 3",
        "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
    }
]

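I know I could solve this by writing a loop that walks through the file character by character, and then adding logic to sort everything into the right JSON variables. But I would like to know whether there is a simpler solution using the command line. Can I use AWK to distribute the parameters found in the file into JSON output? Or am I misunderstanding what AWK is for in this case?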

Answer 1

You could try Miller (mlr):

$ mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson --jvstack --jlistwrap \
    put -S 'for(k,v in $*){$[k] = strip(v)}' then \
    cut -f 1,2 then \
    rename '1,title,2,body' file
[
{
  "title": "A random Title 1",
  "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
}
,{
  "title": "A random Title 2",
  "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
}
,{
  "title": "A random Title 3",
  "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
}
]

You can prettify the output by piping it through jq '.', or omit the --jvstack --jlistwrap options and pipe through jq -s '.' instead:

$ mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson \
    put -S 'for(k,v in $*){$[k] = strip(v)}' then \
    cut -f 1,2 then \
    rename '1,title,2,body' file | jq -s '.'
[
  {
    "title": "A random Title 1",
    "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
  },
  {
    "title": "A random Title 2",
    "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
  },
  {
    "title": "A random Title 3",
    "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
  }
]

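The cut -f 1,2 is only necessary because the second BLOCK marker implies a third (empty) field. If you prefer, you could replace it with the verb remove-empty-columns (although the latter is non-streaming).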
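To see where that third field comes from, here is a minimal Python sketch. It is not Miller code; it only imitates splitting one record on the BLOCK string, and the record variable is simply the first sample record from the question.

```python
# One input record, roughly as Miller sees it with --irs '\n\n'.
record = (
    "A random Title 1\n"
    "BLOCK\n"
    "1- a block of text that can contain any character\n"
    "and it  also can contain multiple lines\n"
    "BLOCK"
)

# Splitting on the separator string imitates --ifs 'BLOCK';
# strip() plays the role of Miller's strip(v).
fields = [field.strip() for field in record.split("BLOCK")]

print(len(fields))       # title, body, and one trailing empty field
print(repr(fields[-1]))  # the empty field that cut -f 1,2 discards
```

The trailing separator leaves an empty string after it, which is the field that has to be dropped before renaming.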


Handling consecutive newlines in the body

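Unfortunately, the above does not distinguish between consecutive newlines acting as the input record separator and consecutive newlines that may occur in the body between the BLOCK...BLOCK delimiters. As a workaround, you can preprocess the input to replace newlines in the body with literal \n sequences, then turn them back into real newlines before writing out the JSON (where Miller will escape them as \n again):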

sed '/^BLOCK/{:a;N;/BLOCK$/!ba;s/\n/\\n/g;}' file | 
  mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson put -S '$2 = gsub($2,"\\n","\n"); for(k,v in $*){$[k] = strip(v)}' then cut -f 1,2 then rename '1,title,2,body'

You could pass the sed filter to Miller as a --prepipe command, but the quoting gets tricky.
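As an alternative to the sed preprocessing, the same idea (let the BLOCK markers, not blank lines, delimit the body) can be sketched in Python. This is only an illustration under the assumption that each BLOCK marker sits on a line of its own and that no body line consists of exactly BLOCK; parse_blocks and the sample list are made-up names and data.

```python
import json

def parse_blocks(lines):
    """Yield {"title", "body"} dicts, using BLOCK lines as body delimiters."""
    title, body, in_body = None, [], False
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "BLOCK":
            if in_body:                  # closing marker: emit the record
                yield {"title": title, "body": "\n".join(body)}
                title, body = None, []
            in_body = not in_body
        elif in_body:
            body.append(line)            # blank lines inside the body are kept
        elif line.strip():
            title = line                 # last non-blank line before BLOCK

# Hypothetical input whose first body contains consecutive newlines.
sample = [
    "A random Title 1\n", "BLOCK\n", "first paragraph\n", "\n", "\n",
    "second paragraph\n", "BLOCK\n", "\n", "A random Title 2\n",
    "BLOCK\n", "only line\n", "BLOCK\n",
]
records = list(parse_blocks(sample))
print(json.dumps(records, indent=4))
```

Because the body is collected only between markers, blank lines inside it never get mistaken for a record separator.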

Answer 2

In the TXR language, we can do this:

$ txr data.txr data
[{"title":"A random Title 1","body":"1- a block of text that can contain any character\nand it  also can contain multiple lines"},
 {"title":"A random Title 2","body":"2- a block of text that can contain any character\nand it  also can contain multiple lines"},
 {"title":"A random Title 3","body":"3- a block of text that can contain any character\nand it  also can contain multiple lines"}]

where the code in data.txr is:

@(bind vec @(vec))
@(repeat)
@title
BLOCK
@(collect)
@lines
@(until)
BLOCK
@(end)
@(cat lines "\n")
@(do (vec-push vec #J^{"title" : ~title, "body" : ~lines}))
@(end)
@(do (put-jsonl vec))

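We build a vector of hashes: the underlying data structure that corresponds to the desired JSON.

The #J prefix denotes a JSON literal embedded in Lisp. Here it is followed by ^, which indicates that the literal is quasiquoted; the ~ character denotes an unquote that inserts a value into the template: the title, and the body, which is the collected lines joined with newlines.

put-jsonl means put-json followed by a newline. By default, it writes to the *stdout* stream.

Indentation is advisable, like this: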

@(bind vec @(vec))
@(repeat)
@  title
BLOCK
@  (collect)
@    lines
@  (until)
BLOCK
@  (end)
@  (cat lines "\n")
@  (do (vec-push vec #J^{"title" : ~title, "body" : ~lines}))
@(end)
@(do (put-jsonl vec))

This can also be done with a kind of awk: the awk macro in TXR Lisp:

$ txr data.tl data
[{"title":"A random Title 1","body":"1- a block of text that can contain any character\nand it  also can contain multiple lines"},
 {"title":"A random Title 2","body":"2- a block of text that can contain any character\nand it  also can contain multiple lines"},
 {"title":"A random Title 3","body":"3- a block of text that can contain any character\nand it  also can contain multiple lines"}]

Code:

(awk
  (:set rs "\n\n" fs "\n")
  (:let (vec (vec)))
  ((and (equal [f 1] "BLOCK")
        (equal [f -1] "BLOCK"))
   (vec-push vec #J^{"title":~[f 0], "body":~(cat-str [f 2..-1])})
   (next))
  (t (error "bad data"))
  (:end (put-jsonl vec)))

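The (:set ...) clause is for initialization; we use it to set the record separator rs and the field separator fs, which are analogous to RS and FS in the original Awk. With a newline field separator and a double-newline record separator, each block of information becomes one record whose fields look like this: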

"title" "BLOCK" "body1" "body2" ... "bodyn" "BLOCK"

In the awk macro, the fields are available as a list named f.

The main logic is a (condition action) pair. The condition is:

(and (equal [f 1] "BLOCK") (equal [f -1] "BLOCK"))

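This is true if the second and the last elements of f are the string "BLOCK". If so, the action is executed: it extracts the pieces and adds an item to vec with the help of a JSON quasiquote, just as in the first program. We also execute (next) to move on to the next record, so that we do not fall through to the next condition-action pair.

The next condition-action pair, (t (error ...)), always executes when it is reached, because t is true, and it throws an exception.

We print the JSON in the (:end ...) clause, which is just like END { ... } in classic Awk.

Speaking of error checking, the first program is somewhat tolerant of bad data; there are ways to tighten it so that it rejects bad input. For instance, there can be junk between records that is silently skipped, and it does not matter if the last closing BLOCK is missing.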
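For readers who do not know TXR, the record and field logic described above can be sketched in Python. This is a rough analogue, not a translation of the TXR semantics: records_to_json and the text sample are made-up names, and the body lines are joined with a newline to match the output shown earlier.

```python
import json

def records_to_json(text):
    """Mirror the awk-macro logic: rs is a blank line, fs is a newline."""
    out = []
    for record in text.strip("\n").split("\n\n"):
        f = record.split("\n")           # title, BLOCK, body lines..., BLOCK
        if len(f) >= 3 and f[1] == "BLOCK" and f[-1] == "BLOCK":
            out.append({"title": f[0], "body": "\n".join(f[2:-1])})
        else:
            # Counterpart of the (t (error "bad data")) pair.
            raise ValueError("bad data")
    return out

# Hypothetical two-record input in the question's format.
text = (
    "A random Title 1\nBLOCK\nline a\nline b\nBLOCK\n"
    "\n"
    "A random Title 2\nBLOCK\nline c\nBLOCK\n"
)
result = records_to_json(text)
print(json.dumps(result))
```

As in the awk macro version, a record that does not have BLOCK as its second and last field is rejected rather than silently skipped.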

Answer 3

Using any awk in any shell on every Unix box:

$ cat tst.awk
BEGIN {
    RS = ""
    FS = "\n"
    printf "["
}
{
    gsub(/"/,"\\\\&")

    title = $1
    body  = $3
    for (i=4; i<NF; i++) {
        body = body "\\n" $i
    }

    print  (NR>1 ? "," : "")
    print  "    {"
    printf "        \"title\": \"%s\",\n", title
    printf "        \"body\": \"%s\"\n",   body
    printf "    }"
}
END {
    print "\n]"
}

$ awk -f tst.awk file
[
    {
        "title": "A random Title 1",
        "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 2",
        "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 3",
        "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
    }
]
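Note that the gsub call in the script escapes double quotes only. Since the body may contain any character, backslashes, tabs and other control characters would need escaping as well to keep the JSON valid. As a hedged illustration of what a JSON library takes care of, here is a small Python sketch; the body string is made-up sample data.

```python
import json

# Hypothetical body containing a double quote, a tab and a backslash.
body = 'say "hi"\tC:\\temp'

# json.dumps escapes every character that JSON requires to be escaped.
encoded = json.dumps({"body": body})
print(encoded)

# Decoding the result gives back the original string unchanged.
decoded = json.loads(encoded)["body"]
print(decoded == body)
```

If such characters can occur in the input, the awk script would need additional gsub calls for them, or the output could be produced by a tool that does the escaping itself.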

Answer 4

Python with the itertools module to group the input into blocks/paragraphs, then the json module's dumps method to print them as JSON.

python3 -c 'import sys, json, itertools
ifile,rs = sys.argv[1],chr(10)

lod = []
with open(ifile) as fh:
  for k,g in itertools.groupby(fh, lambda x: x == rs):
    if not k:
      para = list(g)
      title,x,*body = list(map(lambda x: x.rstrip(rs),para[0:-1]))
      lod.append({
        "title": title,
        "body": rs.join(body)
      })

print(json.dumps(lod, sort_keys=False, indent=4))
' file

Output:

[
    {
        "title": "A random Title 1",
        "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 2",
        "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 3",
        "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
    }
]
