将数据格式化为表格

Question 1

在每个 Unix 机器上的任何 shell 中使用任何 awk：

$ cat tst.awk
BEGIN {
    numTags = split("Name City Age Couse",nums2tags)
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = nums2tags[tagNr]
        tags2nums[tag] = tagNr
        wids[tagNr] = ( length(tag) > length("null") ? length(tag) : length("null") )
    }
    OFS=" | "
}
(NR==1) || (prevTag=="Couse") {
    numRecs++
}
{
    gsub(/^"|"$/,"")
    tag = val = $0
    sub(/".*/,"",tag)
    sub(/[^"]+":"/,"",val)

    tagNr = tags2nums[tag]
    vals[numRecs,tagNr] = val

    wid = length(val)
    wids[tagNr] = ( wid > wids[tagNr] ? wid : wids[tagNr] )

    prevTag = tag
}
END {
    # Uncomment these 3 lines if youd like a header line printed:
    # for (tagNr=1; tagNr<=numTags; tagNr++) {
    #   printf "%-*s%s", wids[tagNr], nums2tags[tagNr], (tagNr<numTags ? OFS : ORS)
    # }

    for (recNr=1; recNr<=numRecs; recNr++) {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            val = ( (recNr,tagNr) in vals ? vals[recNr,tagNr] : "null" )
            printf "%-*s%s", wids[tagNr], val, (tagNr<numTags ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
asxadadad  ,aaf dsf | Mum  | 23  | BBS
null                | Ors  | 11  | MB
adad sf             | Kol  | 21  | BB
pqr                 | null | 21  | NN

或者如果您不想使用硬编码的标签列表（字段/列名称）：

$ cat tst.awk
BEGIN { OFS=" | " }
(NR==1) || (prevTag=="Couse") {
    numRecs++
}
{
    gsub(/^"|"$/,"")
    tag = val = $0
    sub(/".*/,"",tag)
    sub(/[^"]+":"/,"",val)

    if ( !(tag in tags2nums) ) {
        tagNr = ++numTags
        tags2nums[tag] = tagNr
        nums2tags[tagNr] = tag
        wids[tagNr] = ( length(tag) > length("null") ? length(tag) : length("null") )
    }

    tagNr = tags2nums[tag]
    vals[numRecs,tagNr] = val

    wid = length(val)
    wids[tagNr] = ( wid > wids[tagNr] ? wid : wids[tagNr] )

    prevTag = tag
}
END {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        printf "%-*s%s", wids[tagNr], nums2tags[tagNr], (tagNr<numTags ? OFS : ORS)
    }

    for (recNr=1; recNr<=numRecs; recNr++) {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            val = ( (recNr,tagNr) in vals ? vals[recNr,tagNr] : "null" )
            printf "%-*s%s", wids[tagNr], val, (tagNr<numTags ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
Name                | City | Age | Couse
asxadadad  ,aaf dsf | Mum  | 23  | BBS
null                | Ors  | 11  | MB
adad sf             | Kol  | 21  | BB
pqr                 | null | 21  | NN

请注意，第二个脚本的输出中的列顺序将是这些标签在输入中出现的顺序，这就是为什么它们需要标题行来标识值，除非保证所有标签都按照您在输入中出现的顺序想要他们输出。

Answer

在每个 Unix 机器上的任何 shell 中使用任何 awk：

$ cat tst.awk
BEGIN {
    numTags = split("Name City Age Couse",nums2tags)
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = nums2tags[tagNr]
        tags2nums[tag] = tagNr
        wids[tagNr] = ( length(tag) > length("null") ? length(tag) : length("null") )
    }
    OFS=" | "
}
(NR==1) || (prevTag=="Couse") {
    numRecs++
}
{
    gsub(/^"|"$/,"")
    tag = val = $0
    sub(/".*/,"",tag)
    sub(/[^"]+":"/,"",val)

    tagNr = tags2nums[tag]
    vals[numRecs,tagNr] = val

    wid = length(val)
    wids[tagNr] = ( wid > wids[tagNr] ? wid : wids[tagNr] )

    prevTag = tag
}
END {
    # Uncomment these 3 lines if youd like a header line printed:
    # for (tagNr=1; tagNr<=numTags; tagNr++) {
    #   printf "%-*s%s", wids[tagNr], nums2tags[tagNr], (tagNr<numTags ? OFS : ORS)
    # }

    for (recNr=1; recNr<=numRecs; recNr++) {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            val = ( (recNr,tagNr) in vals ? vals[recNr,tagNr] : "null" )
            printf "%-*s%s", wids[tagNr], val, (tagNr<numTags ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
asxadadad  ,aaf dsf | Mum  | 23  | BBS
null                | Ors  | 11  | MB
adad sf             | Kol  | 21  | BB
pqr                 | null | 21  | NN

或者如果您不想使用硬编码的标签列表（字段/列名称）：

$ cat tst.awk
BEGIN { OFS=" | " }
(NR==1) || (prevTag=="Couse") {
    numRecs++
}
{
    gsub(/^"|"$/,"")
    tag = val = $0
    sub(/".*/,"",tag)
    sub(/[^"]+":"/,"",val)

    if ( !(tag in tags2nums) ) {
        tagNr = ++numTags
        tags2nums[tag] = tagNr
        nums2tags[tagNr] = tag
        wids[tagNr] = ( length(tag) > length("null") ? length(tag) : length("null") )
    }

    tagNr = tags2nums[tag]
    vals[numRecs,tagNr] = val

    wid = length(val)
    wids[tagNr] = ( wid > wids[tagNr] ? wid : wids[tagNr] )

    prevTag = tag
}
END {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        printf "%-*s%s", wids[tagNr], nums2tags[tagNr], (tagNr<numTags ? OFS : ORS)
    }

    for (recNr=1; recNr<=numRecs; recNr++) {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            val = ( (recNr,tagNr) in vals ? vals[recNr,tagNr] : "null" )
            printf "%-*s%s", wids[tagNr], val, (tagNr<numTags ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
Name                | City | Age | Couse
asxadadad  ,aaf dsf | Mum  | 23  | BBS
null                | Ors  | 11  | MB
adad sf             | Kol  | 21  | BB
pqr                 | null | 21  | NN

请注意，第二个脚本的输出中的列顺序将是这些标签在输入中出现的顺序，这就是为什么它们需要标题行来标识值，除非保证所有标签都按照您在输入中出现的顺序想要他们输出。

Question 2

在 Perl 中。我会添加更多关于它的作用/工作原理的解释，但我认为代码中的注释涵盖了所有这些。

#!/usr/bin/perl

use strict;

my @people; # Array-of-Arrays (AoA) to hold each record
my %person; # hash to hold the current record as it's being read in.

# list of valid field names, in the order you want them printed
my @names   = qw(Name City Age Couse);
my $end_key = 'Couse';

# build a regex from the valid names
my $names    = join('|',@names);
my $names_re = qr/^(?:$names)$/;

# Initialise field widths, with a minimum of 4 (for 'null').
my %widths = map {$_ => (length > 4 ? length : 4) } @names;

while(<>) {
  chomp;

  s/^"|"$//g;                       # strip leading and trailing quotes
  my ($key,$val) = split /"?:"?/;   # split on :, with optional quotes.

  if ($key =~ m/$names_re/) {
    $widths{$key} = length($val) if ($widths{$key} < length($val) );

    $person{$key} = $val;

    if ($key eq $end_key) {
      # push an array into the @people array, containing the values of
      # the valid fields, in order.  Use null as the default value
      # if any field is empty/undefined.
      push @people, [ map { $person{$_} || 'null' } @names ];
      %person = ();
    };
  } else {
    print STDERR "Error on input line $.: unrecognised data\n";
  };
};

# build a printf format string, using the longest width of each field.
my $fmt = join(' | ', map { "%-$widths{$_}s" } @names) . "\n";

# optional header line, comment out if not wanted
printf $fmt, @names;

# optional ruler line, comment out if not wanted
print join('-|-', map { '-' x $widths{$_} } @names) . "\n";

foreach my $p (@people) {
  printf $fmt, @{ $p };
}

另存为，例如，columns.pl并使用 chmod +x 使其可执行。

输出：

$ chmod +x columns.pl 
$ ./columns.pl demo.txt 
Name                | City | Age  | Couse
--------------------|------|------|------
asxadadad  ,aaf dsf | Mum  | 23   | BBS 
null                | Ors  | 11   | MB  
adad sf             | Kol  | 21   | BB  
pqr                 | null | 21   | NN

Answer

在 Perl 中。我会添加更多关于它的作用/工作原理的解释，但我认为代码中的注释涵盖了所有这些。

#!/usr/bin/perl

use strict;

my @people; # Array-of-Arrays (AoA) to hold each record
my %person; # hash to hold the current record as it's being read in.

# list of valid field names, in the order you want them printed
my @names   = qw(Name City Age Couse);
my $end_key = 'Couse';

# build a regex from the valid names
my $names    = join('|',@names);
my $names_re = qr/^(?:$names)$/;

# Initialise field widths, with a minimum of 4 (for 'null').
my %widths = map {$_ => (length > 4 ? length : 4) } @names;

while(<>) {
  chomp;

  s/^"|"$//g;                       # strip leading and trailing quotes
  my ($key,$val) = split /"?:"?/;   # split on :, with optional quotes.

  if ($key =~ m/$names_re/) {
    $widths{$key} = length($val) if ($widths{$key} < length($val) );

    $person{$key} = $val;

    if ($key eq $end_key) {
      # push an array into the @people array, containing the values of
      # the valid fields, in order.  Use null as the default value
      # if any field is empty/undefined.
      push @people, [ map { $person{$_} || 'null' } @names ];
      %person = ();
    };
  } else {
    print STDERR "Error on input line $.: unrecognised data\n";
  };
};

# build a printf format string, using the longest width of each field.
my $fmt = join(' | ', map { "%-$widths{$_}s" } @names) . "\n";

# optional header line, comment out if not wanted
printf $fmt, @names;

# optional ruler line, comment out if not wanted
print join('-|-', map { '-' x $widths{$_} } @names) . "\n";

foreach my $p (@people) {
  printf $fmt, @{ $p };
}

另存为，例如，columns.pl并使用 chmod +x 使其可执行。

输出：

$ chmod +x columns.pl 
$ ./columns.pl demo.txt 
Name                | City | Age  | Couse
--------------------|------|------|------
asxadadad  ,aaf dsf | Mum  | 23   | BBS 
null                | Ors  | 11   | MB  
adad sf             | Kol  | 21   | BB  
pqr                 | null | 21   | NN

Question 3

一个简短的 GNU awk 兼容（对于定义为正则表达式的 RS）解决方案，它与记录块中标签所在的位置分开工作，并且在打印时我们遵循输入中的标签顺序；和最后一个标签Couse是记录结束标识符：

<infile awk -F'\n' -v tags='Name,City,Age,Couse' '
BEGIN{ tagsNum=split(tags, tgs, ","); RS="\n?\""tgs[tagsNum]"\":[^\n]*\n" }

function tbl(tag, field) {
    if(index(field, "\""tag"\"")==1 && !key[tag]++ || field==RT){
        gsub(/(^[^:]*:"|"\n?)/, "", field)
        key[tag]=field
    }
}
{ for(i=1; i<=NF; i++){ for(k in tgs) tbl(tgs[k], $i); tbl(RT, RT) }
  for(i=1; i<tagsNum; i++)
      printf "%s", (key[tgs[i]]!=""? key[tgs[i]]:"null") OFS; print key[RT]
  delete key
}' OFS='@|' |column -ts'@'

通过为数组中的每个标签名称调用该函数tgs，我们通过匹配它们出现的相关字段来重新填充它们的值，然后我们为每个记录打印它们，null如果它们没有任何值，则删除数组并对下一个块执行相同的操作，依此类推。

我们使用来column -ts'@'表格化输出，该@字符来自OFS='@|'，使用这种方式column将根据该字符调整输出字段，稍后将从输出中删除，因此假设该@字符不应出现在您的输入数据中（如果可以的话，换成另一个字符）。如果你有column从util-linux包装中，你可以改变OFS='@|' |column -ts'@'到OFS='|' |column -t -s'|' -o' | '。

asxadadad  ,aaf dsf  |Mum   |23  |BBS
null                 |Ors   |11  |MB
adad sf              |Kol   |21  |BB
pqr                  |null  |21  |NN

Answer

一个简短的 GNU awk 兼容（对于定义为正则表达式的 RS）解决方案，它与记录块中标签所在的位置分开工作，并且在打印时我们遵循输入中的标签顺序；和最后一个标签Couse是记录结束标识符：

<infile awk -F'\n' -v tags='Name,City,Age,Couse' '
BEGIN{ tagsNum=split(tags, tgs, ","); RS="\n?\""tgs[tagsNum]"\":[^\n]*\n" }

function tbl(tag, field) {
    if(index(field, "\""tag"\"")==1 && !key[tag]++ || field==RT){
        gsub(/(^[^:]*:"|"\n?)/, "", field)
        key[tag]=field
    }
}
{ for(i=1; i<=NF; i++){ for(k in tgs) tbl(tgs[k], $i); tbl(RT, RT) }
  for(i=1; i<tagsNum; i++)
      printf "%s", (key[tgs[i]]!=""? key[tgs[i]]:"null") OFS; print key[RT]
  delete key
}' OFS='@|' |column -ts'@'

通过为数组中的每个标签名称调用该函数tgs，我们通过匹配它们出现的相关字段来重新填充它们的值，然后我们为每个记录打印它们，null如果它们没有任何值，则删除数组并对下一个块执行相同的操作，依此类推。

我们使用来column -ts'@'表格化输出，该@字符来自OFS='@|'，使用这种方式column将根据该字符调整输出字段，稍后将从输出中删除，因此假设该@字符不应出现在您的输入数据中（如果可以的话，换成另一个字符）。如果你有column从util-linux包装中，你可以改变OFS='@|' |column -ts'@'到OFS='|' |column -t -s'|' -o' | '。

asxadadad  ,aaf dsf  |Mum   |23  |BBS
null                 |Ors   |11  |MB
adad sf              |Kol   |21  |BB
pqr                  |null  |21  |NN

Question 4

该数据看起来好像已经从最初的某个 JSON 文档进行了修改。

让我们恢复 JSON 文档结构

添加[{到文档的开头和}]结尾，
添加},{到以确切字符串开头的每行末尾"Couse"（但不是最后一行），以及
在每行的末尾添加逗号，该行的末尾未进行其他修改（即，该行末尾仍然有双引号）。

sed -e '1 s/^/[{/' -e '$ s/$/}]/' \
    -e '/^"Couse"/ { $! s/$/},{/; }' \
    -e 's/"$/&,/' file

通过漂亮打印，这会将我们的文档变成

[
  {
    "Name": "asxadadad  ,aaf dsf",
    "City": "Mum",
    "Age": "23",
    "Couse": "BBS"
  },
  {
    "City": "Ors",
    "Age": "11",
    "Couse": "MB"
  },
  {
    "Name": "adad sf",
    "City": "Kol",
    "Age": "21",
    "Couse": "BB"
  },
  {
    "Name": "pqr",
    "Age": "21",
    "Couse": "NN"
  }
]

然后，我们可以通过管道将其转换为 CSV jq（添加一些列标题并用字符串替换空值null）：

jq -r '    [ "Name", "City", "Age", "Couse" ],
    (.[] | [ .Name,  .City,  .Age,  .Couse  ]) |
    map(. // "null") | @csv'

这会生成

"Name","City","Age","Couse"
"asxadadad  ,aaf dsf","Mum","23","BBS"
"null","Ors","11","MB"
"adad sf","Kol","21","BB"
"pqr","null","21","NN"

然后我们可以使用工具包csvlook中的内容csvkit来生成一个漂亮的表格。

最终的管道看起来像

sed -e '1 s/^/[{/' -e '$ s/$/}]/' \
    -e '/^"Couse"/ { $! s/$/},{/; }' \
    -e 's/"$/&,/' file |
jq -r '    [ "Name", "City", "Age", "Couse" ],
    (.[] | [ .Name,  .City,  .Age,  .Couse  ]) |
    map(. // "null") | @csv' |
csvlook --blanks

我们使用csvlook它的--blanks选项来使其保持null字符串原样（否则它会删除这些字符串）。

结果将是

| Name                | City | Age | Couse |
| ------------------- | ---- | --- | ----- |
| asxadadad  ,aaf dsf | Mum  |  23 | BBS   |
| null                | Ors  |  11 | MB    |
| adad sf             | Kol  |  21 | BB    |
| pqr                 | null |  21 | NN    |

或者，呈现为 markdown：

姓名	城市	年龄	库斯
阿斯达达德，aaf dsf	妈妈	23	论坛
无效的	奥尔斯	11	MB
阿达德SF	科尔	21	BB
普格	无效的	21	神经网络

Answer