Merging multiple rows that share a common field

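I have a large file containing more than 9 lines of data, with fields separated by semicolons (;). I would like to merge the rows whose column 5 values match, joining their column 3 values with commas (,). The data is on a Linux machine with the usual awk/perl tools available, but I don't know how to use them.

File: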

Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
orion.uk.localhost.com;XY01123;Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

Expected output:

Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

Any ideas on how to do this merge would be appreciated.

Answer 1

There may be a more elegant way to do this in awk, but here is one possible script.

BEGIN { FS=";" ; OFS=";" }
NR==1 { print $0 }
NR>1 {
    if ( b[$5]=="" ) {
        a[$5]=$0
        b[$5]=$3
    }
    else {
        b[$5]=b[$5]","$3
        $3=b[$5]
        a[$5]=$0
    }
}
END {
    for (c in a) {
        print a[c]
    }
}

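Explanation:

  • BEGIN sets the semicolon as both the input and the output field separator
  • NR==1 simply prints the first line (the header) unchanged
  • NR>1 handles all other lines:
    • b[$5] is an array indexed by the field 5 value, holding the (growing) comma-separated list of field 3 entries
    • a[$5] is an array indexed by the field 5 value, holding the modified line (i.e. the line containing the comma-separated field 3 values)
    • if b[$5] is not set (first occurrence of this value), set a[$5] to the line and b[$5] to field 3
    • otherwise (b[$5] is already set), append field 3 to b[$5] with a comma separator, replace field 3 of the current line with that list, then replace a[$5] with the changed line, so the remaining fields come from the most recent row of the group
  • END: for every index value c of array a, print the array element (i.e. the desired line)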

I don't really know how awk orders the output, but this is my result:

Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
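
The row order above comes from `for (c in a)`, for which awk does not guarantee any particular order. What follows is a minimal sketch of a variant that records the order in which each field 5 value first appears and keeps the remaining fields from the first row of each group, which is what the expected output in the question shows for the uat.matrix row. The wrapper name `merge_machines` and the array names `order`, `first` and `machines` are my own choices for illustration, not part of the answer.

```shell
# Hypothetical wrapper; pass the semicolon-separated file as the first argument.
merge_machines() {
    awk '
    BEGIN { FS = OFS = ";" }
    NR == 1 { print; next }                  # header passes through unchanged
    !($5 in first) {                         # first time this field 5 value is seen
        order[++n] = $5                      # remember first-seen order
        first[$5] = $0                       # keep the whole first row of the group
        machines[$5] = $3                    # start the machine list
        next
    }
    { machines[$5] = machines[$5] "," $3 }   # later rows only contribute field 3
    END {
        for (i = 1; i <= n; i++) {
            $0 = first[order[i]]             # re-split the stored row into fields
            $3 = machines[order[i]]          # swap in the merged machine list
            print
        }
    }
    ' "$1"
}
```

Called as `merge_machines file.txt`, this should print the header followed by one row per distinct field 5 value, in the order those values first occur in the input. It still holds one row per group in memory, as the original script does.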

Answer 2

Is sqlite available there? Have I understood correctly how you want the rows joined?

sqlite> .separator ;
sqlite> .import file.txt alldata
sqlite> select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";
dev.matrix.localhost.com;Machine-apache-dev1,Machine-apache-dev2
per.Upgrade.uk.localhost.com;Machine-apache-pf01,Machine-apache-pf02
test.matrix.localhost.com;Machine-apache-bcp1,Machine-apache-prd1
uat.matrix.localhost.com;Machine-apache-uat1,Machine-apache-uat2
uat.orion.uk.localhost.com;Machine-apache-ua01,Machine-apache-ua02

Or non-interactively:

echo 'select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";' \
  | sqlite3 -separator ";" -cmd ".import file.txt alldata" -batch

Answer 3

In perl, using a hash of arrays (with splice used to remove and reinsert the aggregated field on each merge):

$ perl -F\; -alne '

  if($.==1) {
    print;
    next;
  }

  if(!exists $HoA{$F[4]}) {
    $HoA{$F[4]} = [ @F ];
  }
  else {
    splice @{ $HoA{$F[4]} }, 2, 0, join ",", (splice @{ $HoA{$F[4]} }, 2, 1), $F[2];
  }

  END {
    for $k (keys %HoA) {
      print join ";", @{ $HoA{$k} };
    }
  }
  ' data
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

Alternatively, using GNU datamash (with a cut to remove the extra groupby field):

$ datamash -Hst\; groupby 5 unique 1-2 collapse 3 unique 4-9 < data | cut -d\; -f2-
unique(Domain Name);unique(ID);collapse(Machine);unique(Environment);unique(ENV URL);unique(Start Date);unique(End Date);unique(Disk Size);unique(Used)
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00,22 March 2013 06:16:10 GMT;16 April 2018 07:36:33 GMT+01:00,22 March 2018 06:46:10 GMT;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
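
The uat.matrix row in the datamash output carries two start dates and two end dates because its two source rows disagree in those fields. As a rough sketch, assuming you want to find such groups before deciding how to merge them, the awk below prints each field 5 value whose rows differ anywhere other than field 3. The function name `report_mismatched_groups` and the array names `seen` and `reported` are hypothetical.

```shell
# Hypothetical helper; pass the semicolon-separated file as the first argument.
report_mismatched_groups() {
    awk '
    BEGIN { FS = OFS = ";" }
    NR == 1 { next }                         # skip the header
    {
        key = $5
        $3 = ""                              # blank out the field that gets merged
        if (!(key in seen)) {
            seen[key] = $0                   # first row of the group is the reference
        } else if (seen[key] != $0 && !(key in reported)) {
            reported[key] = 1                # report each group only once
            print key
        }
    }
    ' "$1"
}
```

On the sample file from the question this should flag only uat.matrix.localhost.com, the one group whose dates differ between rows.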
