Handling consecutive newlines in the body

Question 1

You could try Miller:

$ mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson --jvstack --jlistwrap \
    put -S 'for(k,v in $*){$[k] = strip(v)}' then \
    cut -f 1,2 then \
    rename '1,title,2,body' file
[
{
  "title": "A random Title 1",
  "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
}
,{
  "title": "A random Title 2",
  "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
}
,{
  "title": "A random Title 3",
  "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
}
]

You can prettify the output by piping it through jq '.', or omit the --jvstack --jlistwrap options and pipe it through jq -s '.':

$ mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson \
    put -S 'for(k,v in $*){$[k] = strip(v)}' then \
    cut -f 1,2 then \
    rename '1,title,2,body' file | jq -s '.'
[
  {
    "title": "A random Title 1",
    "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
  },
  {
    "title": "A random Title 2",
    "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
  },
  {
    "title": "A random Title 3",
    "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
  }
]
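
For readers without jq, the slurp step can be approximated in Python. This is only a sketch, assuming the stream consists of concatenated JSON objects (one per record) rather than a single array; the slurp helper and the sample stream are made up for illustration.

```python
import json

def slurp(text):
    """Collect concatenated JSON values into one list, like jq -s '.'."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while pos < len(text):
        # Skip whitespace between values; raw_decode does not accept it.
        if text[pos].isspace():
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values

# Hypothetical stream: one JSON object per record, no enclosing list.
stream = '{"title": "A", "body": "x"}\n{"title": "B", "body": "y"}\n'
print(json.dumps(slurp(stream), indent=2))
```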

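The cut -f 1,2 is only necessary because the second BLOCK marker implies a third (empty) field. If you prefer, you can replace it with the verb remove-empty-columns (although the latter is non-streaming).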
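
To see where the third field comes from, here is a small Python illustration (not Miller itself) of what splitting one record on the BLOCK marker produces. The sample record is made up.

```python
# One record, as it looks after the input has been split on blank lines.
record = "A random Title 1\nBLOCK\nline one\nline two\nBLOCK"

# Splitting on the marker yields three pieces: the title, the body, and
# an empty string after the closing BLOCK.
fields = record.split("BLOCK")
print(len(fields))                   # 3
print([f.strip() for f in fields])   # ['A random Title 1', 'line one\nline two', '']
```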

Handling consecutive newlines in the body

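Unfortunately, the above does not distinguish between consecutive newlines that act as the input record separator and consecutive newlines that may occur in the body between the BLOCK...BLOCK delimiters. As a workaround, you can preprocess the input to replace newlines in the body with literal \n sequences, then turn them back into real newlines before the JSON is written (at which point Miller escapes them as \n again):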

sed '/^BLOCK/{:a;N;/BLOCK$/!ba;s/\n/\\n/g;}' file | 
  mlr --inidx --irs '\n\n' --ifs 'BLOCK' --ojson put -S '$2 = gsub($2,"\\n","\n"); for(k,v in $*){$[k] = strip(v)}' then cut -f 1,2 then rename '1,title,2,body'

You could pass the sed filter to Miller as a --prepipe command, but the quoting gets tricky.
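
An alternative way to sidestep the ambiguity is to ignore blank lines as separators altogether and track the BLOCK markers instead. The following Python sketch shows that idea; parse_blocks and the sample data are hypothetical, and it assumes that a line consisting solely of BLOCK never occurs inside a body.

```python
import json

def parse_blocks(lines):
    """Yield {"title", "body"} dicts; blank lines inside a body are kept."""
    title, body, in_body = None, [], False
    for raw in lines:
        line = raw.rstrip("\n")
        if in_body:
            if line == "BLOCK":          # closing marker ends the record
                yield {"title": title, "body": "\n".join(body)}
                title, body, in_body = None, [], False
            else:
                body.append(line)        # keep everything, including ""
        elif line == "BLOCK":            # opening marker
            in_body = True
        elif line.strip():               # last non-blank line before BLOCK
            title = line

sample = [
    "A random Title 1\n", "BLOCK\n", "first paragraph\n", "\n",
    "second paragraph\n", "BLOCK\n", "\n",
    "A random Title 2\n", "BLOCK\n", "only line\n", "BLOCK\n",
]
records = list(parse_blocks(sample))
print(json.dumps(records, indent=2))
```

Because parse_blocks accepts any iterable of lines, it can be fed an open file object directly.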

Question 2

In the TXR language, we can do this:

$ txr data.txr data
[{"title":"A random Title 1","body":"1- a block of text that can contain any character\nand it  also can contain multiple lines"},
 {"title":"A random Title 2","body":"2- a block of text that can contain any character\nand it  also can contain multiple lines"},
 {"title":"A random Title 3","body":"3- a block of text that can contain any character\nand it  also can contain multiple lines"}]

where the code in data.txr is:

@(bind vec @(vec))
@(repeat)
@title
BLOCK
@(collect)
@lines
@(until)
BLOCK
@(end)
@(cat lines "\n")
@(do (vec-push vec #J^{"title" : ~title, "body" : ~lines}))
@(end)
@(do (put-jsonl vec))

We build a vector of hashes: the underlying data structure that corresponds to the desired JSON.

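The #J prefix denotes a JSON literal embedded in Lisp. Here, the ^ indicates that the literal is quasiquoted, and each ~ marks an unquote that inserts a value into the template: the title, and the body built from the collected lines joined with newline characters.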

put-jsonl means put-json followed by a newline. By default, it writes to the *stdout* stream.

Indentation is recommended, like this:

@(bind vec @(vec))
@(repeat)
@  title
BLOCK
@  (collect)
@    lines
@  (until)
BLOCK
@  (end)
@  (cat lines "\n")
@  (do (vec-push vec #J^{"title" : ~title, "body" : ~lines}))
@(end)
@(do (put-jsonl vec))

This can also be done with a kind of awk: the awk macro in TXR Lisp:

$ txr data.tl data
[{"title":"A random Title 1","body":"1- a block of text that can contain any character\nand it  also can contain multiple lines"},
 {"title":"A random Title 2","body":"2- a block of text that can contain any character\nand it  also can contain multiple lines"},
 {"title":"A random Title 3","body":"3- a block of text that can contain any character\nand it  also can contain multiple lines"}]

Code:

(awk
  (:set rs "\n\n" fs "\n")
  (:let (vec (vec)))
  ((and (equal [f 1] "BLOCK")
        (equal [f -1] "BLOCK"))
   (vec-push vec #J^{"title":~[f 0], "body":~(cat-str [f 2..-1])})
   (next))
  (t (error "bad data"))
  (:end (put-jsonl vec)))

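The (:set ...) clause is for initialization. We use it to set the record separator rs and the field separator fs, which are analogous to RS and FS in the original Awk. With a newline field separator and a double-newline record separator, we get each block of information as one record, whose fields look like this: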

"title" "BLOCK" "body1" "body2" ... "bodyn" "BLOCK"

In the awk macro, the fields are available as a list named f.

The main logic is a (condition action) pair. The condition is:

(and (equal [f 1] "BLOCK") (equal [f -1] "BLOCK"))

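which is true if the second and the last elements of f are the string "BLOCK". If so, the action is executed: it extracts the pieces and adds an item to vec with the help of a JSON quasiquote, just as in the first program. We also call (next) to move on to the next record, so that we do not fall through to the next condition-action pair.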

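The next condition-action pair, (t (error ...)), is reached only when the first condition fails. Since t is always true, its action then runs unconditionally and throws an exception.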

We print the JSON in the (:end ...) clause, which is like END { ... } in classic Awk.

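Speaking of error checking, the first program tolerates bad data to some extent; there are ways to fine-tune it so that it rejects bad input. For example, garbage between records is silently skipped, and a missing final closing BLOCK is also accepted.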
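
For comparison, the strict check performed by the awk macro version can be sketched in Python. This is an illustration only, not a translation of the TXR code; parse_strict and the sample strings are hypothetical. Like the awk version, it splits records on blank lines, so a body containing a blank line is rejected as bad data.

```python
def parse_strict(text):
    """Parse title/BLOCK/body/BLOCK records, raising ValueError on bad data."""
    records = []
    for chunk in text.strip("\n").split("\n\n"):
        fields = chunk.split("\n")
        # Same test as the awk macro: second and last fields must be BLOCK.
        if len(fields) < 3 or fields[1] != "BLOCK" or fields[-1] != "BLOCK":
            raise ValueError(f"bad record: {chunk!r}")
        records.append({"title": fields[0], "body": "\n".join(fields[2:-1])})
    return records

good = "T1\nBLOCK\na\nb\nBLOCK\n\nT2\nBLOCK\nc\nBLOCK\n"
print(parse_strict(good))

try:
    parse_strict("T1\nBLOCK\na\n")      # closing BLOCK is missing
except ValueError as exc:
    print("rejected:", exc)
```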

Question 3

Using any awk in any shell on every Unix box:

$ cat tst.awk
BEGIN {
    RS = ""
    FS = "\n"
    printf "["
}
{
    gsub(/"/,"\\\\&")

    title = $1
    body  = $3
    for (i=4; i<NF; i++) {
        body = body "\\n" $i
    }

    print  (NR>1 ? "," : "")
    print  "    {"
    printf "        \"title\": \"%s\",\n", title
    printf "        \"body\": \"%s\"\n",   body
    printf "    }"
}
END {
    print "\n]"
}

$ awk -f tst.awk file
[
    {
        "title": "A random Title 1",
        "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 2",
        "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 3",
        "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
    }
]
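
Note that the script above escapes double quotes only. JSON strings also require backslashes and control characters such as tabs to be escaped, so input containing those would need extra gsub calls. The short Python sketch below, using a made-up body string, shows what a JSON library does with such input.

```python
import json

# A hypothetical body containing a quote, a tab and a backslash.
body = 'say "hi"\tC:\\temp'

# json.dumps escapes all three so that the result is valid JSON.
encoded = json.dumps({"body": body})
print(encoded)

# Decoding the result gives back the original string.
decoded = json.loads(encoded)["body"]
print(decoded == body)
```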

Question 4

Python with the itertools module to group the input into blocks/paragraphs, then the json module with its dumps method to print them in JSON style.

python3 -c 'import sys, json, itertools
ifile,rs = sys.argv[1],chr(10)

lod = []
with open(ifile) as fh:
  for k,g in itertools.groupby(fh, lambda x: x == rs):
    if not k:
      para = list(g)
      title,x,*body = list(map(lambda x: x.rstrip(rs),para[0:-1]))
      lod.append({
        "title": title,
        "body": rs.join(body)
      })

print(json.dumps(lod, sort_keys=False, indent=4))
' file

Output:

[
    {
        "title": "A random Title 1",
        "body": "1- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 2",
        "body": "2- a block of text that can contain any character\nand it  also can contain multiple lines"
    },
    {
        "title": "A random Title 3",
        "body": "3- a block of text that can contain any character\nand it  also can contain multiple lines"
    }
]
