从 text+tsv 文件创建 json

从 text+tsv 文件创建 json

我是个初学者,一生中从未做过 JSon 只解析了一些 json 文本文件:
类别数量可能会变化

Category1
1111|2222|3333|444|5555
6666|7777|8888|9999|1111
0000|1111|22222|3333|4444
Category2
2222|3333||5555
2222|3333|4444||5555
Category3
AAAA|||BBBB
CCCC|DDDD|EEEE|FFFF|GGGG

预期的 JSON:类别 ---> 数组 ---> 对于每个 tsv -->Make Key:Value(Year|Title|Name|Format|Base) 如果 tsv 值不适合以上任何人,则添加 null Json 应该是这样的,我可以对每个类别进行数字索引并重新索引类别以获得 tsv 数组

Json 结构应该是这样的: 例如。

1) I could get list of category using jq
cat file | jq
Output:
Category1
Category2
Category3
2) Index that Category
jq '.| Category[0]'
Output:
Category1
1111|2222|3333|444|5555
6666|7777|8888|9999|1111
0000|1111|22222|3333|4444


3) Index each tsv 
jq Category1 | jq '.[0]' ---> 
{
"Year": "1111"
"Title": "2222"
"Name": "3333"
"Format": "4444"
"Base": "5555"
}

jq'.|类别[0]

答案1

我不太知道您期望最终的 JSON 文档是什么样子,但以下内容使用 Miller ( mlr) 和,根据您的 CSV 输入文件创建了一个 JSON 文档jq

mlr --c2j -N --ifs pipe --ragged \
        label Year,Title,Name,Format,Base then \
        put '
                if (NF == 1) {
                        @Category = string($Year)
                } else {
                        $Category = @Category
                }' then \
        filter -x 'is_absent($Category)' then \
        unsparsify file.csv |
jq -n 'reduce inputs as $a ({}; .[$a.Category] += [$a])'

这首先按照您在问题中提到的顺序添加您提到的标题。Year如果记录中只有一个字段,它会记住该字段的值(第一个字段)。Category如果有多个字段,则使用记住的值创建该字段。

然后,我们删除所有没有该Category字段的记录,最后向任何缺失的字段添加空值。

我们现在有一组 JSON 对象:

{ "Year": 1111, "Title": 2222, "Name": 3333, "Format": 444, "Base": 5555, "Category": "Category1" }
{ "Year": 6666, "Title": 7777, "Name": 8888, "Format": 9999, "Base": 1111, "Category": "Category1" }
{ "Year": 0000, "Title": 1111, "Name": 22222, "Format": 3333, "Base": 4444, "Category": "Category1" }
{ "Year": 2222, "Title": 3333, "Name": "", "Format": 5555, "Base": "", "Category": "Category2" }
{ "Year": 2222, "Title": 3333, "Name": 4444, "Format": "", "Base": 5555, "Category": "Category2" }
{ "Year": "AAAA", "Title": "", "Name": "", "Format": "BBBB", "Base": "", "Category": "Category3" }
{ "Year": "CCCC", "Title": "DDDD", "Name": "EEEE", "Format": "FFFF", "Base": "GGGG", "Category": "Category3" }

然后由 进行处理jq,它将根据字段中的值将对象收集到数组中Category

根据给定的数据,这会产生一些东西相当于

{
   "Category1": [
      { "Base": 5555, "Category": "Category1", "Format": 444, "Name": 3333, "Title": 2222, "Year": 1111 },
      { "Base": 1111, "Category": "Category1", "Format": 9999, "Name": 8888, "Title": 7777, "Year": 6666 },
      { "Base": 4444, "Category": "Category1", "Format": 3333, "Name": 22222, "Title": 1111, "Year": 0 }
   ],
   "Category2": [
      { "Base": "", "Category": "Category2", "Format": 5555, "Name": "", "Title": 3333, "Year": 2222 },
      { "Base": 5555, "Category": "Category2", "Format": "", "Name": 4444, "Title": 3333, "Year": 2222 }
   ],
   "Category3": [
      { "Base": "", "Category": "Category3", "Format": "BBBB", "Name": "", "Title": "", "Year": "AAAA" },
      { "Base": "GGGG", "Category": "Category3", "Format": "FFFF", "Name": "EEEE", "Title": "DDDD", "Year": "CCCC" }
   ]
}

通过这个jq '.Category1[0]'会给你

{
  "Year": 1111,
  "Title": 2222,
  "Name": 3333,
  "Format": 444,
  "Base": 5555,
  "Category": "Category1"
}

答案2

对于标准工具来说,您的数据中有太多自由形式。我建议使用 Perl,例如:

#!/bin/perl
use JSON::PP;

open(IN, "<file.tsv");
while(<IN>) {
  chomp;
  if (/Category1/) {
     $group = $_;
     @heads = ("Year", "Title", "Name", "Format", "Base");
  } elsif (/Category2/) {
     $group = "new name for Category2";
     @heads = ("Year", "Title", "Name");
  } elsif (/Category3/) {
     $group = $_;
     @heads = ("ABC", "EFD");
  } else {
     my @columns = split /\|/;
     my %row = ();
     for (my $i=0; $i < scalar(@heads); $i++) {
        $row{$heads[$i]} = $columns[$i];
     }
     push @{$data{$group}}, \%row;

  }
}
close(IN);

print encode_json \%data;

答案3

这是使用csplit和执行此操作的一种方法jq

您可以将数据拆分到临时文件中,如下所示:

csplit -z infile.csv '/^Category/' '{*}'

将文件转换xx*为 JSON 对象:

for f in xx*; do
  <$f jq -sRrc '
  split("\n")
  | .[0] as $category
  | .[1:-1]
  | map(split("|"))
  | map({"Year"  : .[0],
         "Title" : .[1],
         "Name"  : .[2],
         "Format": .[3],
         "Base"  : .[4],
        }) | { ($category): . }
  '
done

输出:

{"Category1":[{"Year":"1111","Title":"2222","Name":"3333","Format":"444","Base":"5555"},{"Year":"6666","Title":"7777","Name":"8888","Format":"9999","Base":"1111"},{"Year":"0000","Title":"1111","Name":"22222","Format":"3333","Base":"4444"}]}
{"Category2":[{"Year":"2222","Title":"3333","Name":"","Format":"5555","Base":null},{"Year":"2222","Title":"3333","Name":"4444","Format":"","Base":"5555"}]}
{"Category3":[{"Year":"AAAA","Title":"","Name":"","Format":"BBBB","Base":null},{"Year":"CCCC","Title":"DDDD","Name":"EEEE","Format":"FFFF","Base":"GGGG"}]}

您可以获得这样的类别:

jq -r 'keys[]'

输出:

Category1
Category2
Category3

如果您想要特定类别,例如:

n=2; cat xx$(printf "%02d" $((n-1)))

输出:

Category2
2222|3333||5555
2222|3333|4444||5555

按索引获取特定条目,例如第二个对象Category2

jq '.Category2 // empty | .[1]'

输出:

{
  "Year": "2222",
  "Title": "3333",
  "Name": "4444",
  "Format": "",
  "Base": "5555"
}

答案4

#!/usr/bin/perl

use JSON;
use strict;

my @fields = qw(Year Title Name Format Base);
my (%data, $category, $row);

while(<>) {
  chomp;
  my @line = split /\|/;

  if ($#line == 0) {
    # there are no field separators on this line, so
    # this must be a "category" line.
    $category = $_;
    $row = 0;

  } else {
    foreach my $i (0..$#fields) {
      # treat empty strings as null values
      undef $line[$i] if ($line[$i] eq "");

      $data{$category}[$row]{$fields[$i]} = $line[$i]
    };
    $row++;
  };
};

my $json = JSON->new->pretty->canonical;
print $json->encode(\%data) ;

顺便说一句,“pretty”在 json 字符串中启用换行和缩进。 'canonical' 导致键按排序顺序输出(否则每次运行时它们都会以半随机顺序输出,因为 perl 哈希本质上是无序的)。默认情况下,两者均处于禁用状态JSON模块,因为它们会产生额外的处理开销(即它们会减慢速度),并且仅在人类要读取输出时才需要 - 处理 json 数据的程序不关心键的顺序。

使用示例数据作为输入进行输出:

{
   "Category1" : [
      {
         "Base" : "5555",
         "Format" : "444",
         "Name" : "3333",
         "Title" : "2222",
         "Year" : "1111"
      },
      {
         "Base" : "1111",
         "Format" : "9999",
         "Name" : "8888",
         "Title" : "7777",
         "Year" : "6666"
      },
      {
         "Base" : "4444",
         "Format" : "3333",
         "Name" : "22222",
         "Title" : "1111",
         "Year" : "0000"
      }
   ],
   "Category2" : [
      {
         "Base" : null,
         "Format" : "5555",
         "Name" : null,
         "Title" : "3333",
         "Year" : "2222"
      },
      {
         "Base" : "5555",
         "Format" : null,
         "Name" : "4444",
         "Title" : "3333",
         "Year" : "2222"
      }
   ],
   "Category3" : [
      {
         "Base" : null,
         "Format" : "BBBB",
         "Name" : null,
         "Title" : null,
         "Year" : "AAAA"
      },
      {
         "Base" : "GGGG",
         "Format" : "FFFF",
         "Name" : "EEEE",
         "Title" : "DDDD",
         "Year" : "CCCC"
      }
   ]
}

相关内容