How can I put fixed-width, vertically oriented key-value pairs into a CSV file?


I have a text file with the following kind of content:

OPERATION_CONTEXT VMD1HTE1A71_ns:.oc.GJ_OAD2 alarm_object 1130 On director: VMD1HTE1A71_ns:.temip.VMD1HTE1A71_director AT Fri 18 Oct 2013 06:56:39 All Attributes

                         Identifier = 1130
                              State = Terminated
                     Problem Status = Closed
              Clearance Report Flag = True
                    Escalated Alarm = False
              Close User Identifier = "Auto-Clear"
        Termination User Identifier = "Auto-Clear"
                   Close Time Stamp = Fri 18 Oct 2013 05:01:46
             Termination Time Stamp = Fri 18 Oct 2013 05:01:46
                 Creation Timestamp = Fri 18 Oct 2013 04:37:29
               Clearance Time Stamp = Fri 18 Oct 2013 05:01:40
        Last Modification Timestamp = Fri 18 Oct 2013 05:01:46
                     Previous State = Outstanding
                     Managed Object = Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD "AMS" Node "INGJJMGRJMTSNB0001AG2OLT001"
                    Target Entities = { Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD "AMS" Node "INGJJMGRJMTSNB0001AG2OLT001" }
                         Alarm Type = CommunicationsAlarm
                         Event Time = Fri 18 Oct 2013 05:01:40
                     Probable Cause = Unknown
                  Specific Problems = {  }
            Notification Identifier = 160315
                             Domain = Domain VMD1HTE1A71_ns:.dm.GJ_OAD2
                       Alarm Origin = IncomingAlarm
                 Perceived Severity = Major
                    Additional Text = "
                                                                  nativeProbableCause: Attempt Threshold Crossed
                                                                  osTime: 20131018163727.250+0530
                                                                  neTime: 20131011174021.0+0530
                                                                  notificationId: AMS:160315
                                                                  portNumber:
                                                                  ftpNumber:
                                                                  meNm: INGJJMGRJMTSNB0001AG2OLT001
                                                                  mdNm: AMS
                                                                  objectType: OT_MANAGED_ELEMENT
                                                                  aliasValue: MGMT Security
                                      Access:INGJJMGRJMTSNB0001AG2OLT001:IP10.70.6.6.T0.S841 "
                  Original Severity = Major
                Original Event Time = Fri 11 Oct 2013 05:40:21

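I want to create a CSV file from this text file. The column headers should be the text before the = on each line (e.g. Identifier, State, Problem Status), and the row below should hold the value for each header (e.g. 1130, Terminated, Closed, and so on). Apart from the lines containing "=", I don't want anything else extracted into the CSV file.

Another complication is that some fields contain line breaks, e.g. Additional Text. I want the whole Additional Text value to end up in a single cell under the "Additional Text" column.

Since I'm new to Linux/Unix, I haven't been able to find a way to do this. What is the best way to do it?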

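Answer 1

Well, if every record always has the same number of fields and there is nothing between records (assumptions I'm making based on your post, which may or may not be correct), then you can go the awk route. This preserves column order and embedded newlines. Assume the following is in parse.awk: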

BEGIN {
    # Split on " = " or on a newline plus indentation (a regex RS with \s needs GNU awk)
    RS       = "( = |\n\\s+)";
    isHeader = 0;
    Sep      = "\",\"";
    Q        = "\"";
    # WinEOL   = "\r"; # enable this if your CSV will be used on Windows
    Headers  = Fields = Q;
}

function sanitise (Entry) {
    gsub(/(^[ "]*|[" \n]*$)/, "", Entry); # Trim leading/trailing double quotes and white space
    gsub(/"/, "\"\"", Entry); # Escape double quotes
    return Entry;
}

function addField (Field) {
    Fields    = Fields FieldsSep sanitise(Field);
    isHeader  = 1;
    FieldsSep = Sep;
    FieldCounter++
}

function addHeader (Header) {
    Headers = Headers HeadersSep sanitise(Header); # use the argument, not $0
    isHeader = 0;
    HeadersSep = Sep;
}

1 == NR {                   # Special case of first header
    addHeader($0);          # pass the whole record so multi-word headers survive
    next;
}

$0 == "\"" {                # Fields with newlines
    LongField    = $0;
    LongFieldSep = "";
    while (getline > 0) {
        LongField    = LongField LongFieldSep $0;
        LongFieldSep = "\n";
        if ($NF ~ /"$/) {
            addField(LongField);
            next;
        }
    }
}
{
    if (isHeader) {
        addHeader($0);
    }
    else {
        addField($0);
    }

    if (FieldsPerRecord == FieldCounter) {
        if (!HeadersPrinted) {
            print Headers Q WinEOL;
            HeadersPrinted = 1
        }
        print Fields Q WinEOL;
        Fields = Q; FieldsSep = "";   # start the next row with an opening quote
        FieldCounter = 0
    }
}

You can then call it with FieldsPerRecord set on the command line:

$ awk -v FieldsPerRecord=26 -f parse.awk data.csv

This produces the following CSV-encoded data, which LibreOffice Calc seems to accept without problems:

"Identifier","State","Problem Status","Clearance Report Flag","Escalated Alarm","Close User Identifier","Termination User Identifier","Close Time Stamp","Termination Time Stamp","Creation Timestamp","Clearance Time Stamp","Last Modification Timestamp","Previous State","Managed Object","Target Entities","Alarm Type","Event Time","Probable Cause","Specific Problems","Notification Identifier","Domain","Alarm Origin","Perceived Severity","Additional Text","Original Severity","Original Event Time"
"1130","Terminated","Closed","True","False","Auto-Clear","Auto-Clear","Fri 18 Oct 2013 05:01:46","Fri 18 Oct 2013 05:01:46","Fri 18 Oct 2013 04:37:29","Fri 18 Oct 2013 05:01:40","Fri 18 Oct 2013 05:01:46","Outstanding","Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD ""AMS"" Node ""INGJJMGRJMTSNB0001AG2OLT001","{ Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD ""AMS"" Node ""INGJJMGRJMTSNB0001AG2OLT001"" }","CommunicationsAlarm","Fri 18 Oct 2013 05:01:40","Unknown","{  }","160315","Domain VMD1HTE1A71_ns:.dm.GJ_OAD2","IncomingAlarm","Major","nativeProbableCause: Attempt Threshold Crossed
osTime: 20131018163727.250+0530
neTime: 20131011174021.0+0530
notificationId: AMS:160315
portNumber:
ftpNumber:
meNm: INGJJMGRJMTSNB0001AG2OLT001
mdNm: AMS
objectType: OT_MANAGED_ELEMENT
aliasValue: MGMT Security
Access:INGJJMGRJMTSNB0001AG2OLT001:IP10.70.6.6.T0.S841","Major","Fri 11 Oct 2013 05:40:21"

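Note that I took the quote-everything approach. For me at least, it produces fewer surprises on import, but you can disable it by setting Q = "" and Sep = "," and dropping the two gsub() lines in sanitise().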
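
As an illustration of the quote-everything convention on its own, here is a minimal Python sketch. Python and its csv module are assumptions of this example, not something the answer uses, and to_csv_row is a hypothetical helper name. It shows that embedded double quotes are doubled and that a value containing a newline stays inside one quoted field, whether or not every field is quoted.

```python
import csv
import io


def to_csv_row(values, quote_all=True):
    """Return one CSV line for `values` (hypothetical helper for illustration)."""
    buf = io.StringIO()
    # QUOTE_ALL mirrors the quote-everything approach; QUOTE_MINIMAL quotes
    # only fields that contain a delimiter, a quote, or a line break.
    quoting = csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(values)
    return buf.getvalue()


# Sample values loosely based on the record above; the last one spans two lines.
row = ["1130", 'MD "AMS"', "osTime: 20131018163727.250+0530\nmdNm: AMS"]

print(to_csv_row(row), end="")                   # every field quoted
print(to_csv_row(row, quote_all=False), end="")  # only fields that need it
```

Either variant reads back to the same three values with csv.reader, so the choice mostly affects how forgiving the importing program is.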

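However, I'm not convinced this is best treated as a regex problem. The data is fixed-width, so Perl's unpack looks like it might be the best approach. I haven't been able to work that out myself, but this could be a good opportunity for me to learn, if anyone wants to show a way of doing it with unpack.

Update

I'm no Perl Hacker™, but the following seems to work well. It makes no assumptions about the content of multi-line fields, preserves field order and all the original spacing inside fields (but strips leading whitespace from headers), and even looks pretty to my untrained Perl eye: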

BEGIN{
    our (@headers, @fields);
    our $headers_printed = 0;
}
my ($header, $field) = unpack("A36x2A*", $_); # magic!

if ("" eq $header) {            # Fields with newlines
    $fields[$#fields] .= "\n" . $field;
    next;
}

push(@headers, $header =~ s/^\s*//gr);
push(@fields, $field);

if (26 == $#headers + 1) {      # Print complete record
    printf "%s\n", join ",", @headers  unless $headers_printed;
    $headers_printed = 1;
    printf "%s\n", join ",", @fields;
    @fields = @headers = ();
}

Just invoke it with:

$ perl -nf /tmp/parse.pl /tmp/data.txt
Identifier,State,Problem Status,Clearance Report Flag,Escalated Alarm,Close User Identifier,Termination User Identifier,Close Time Stamp,Termination Time Stamp,Creation Timestamp,Clearance Time Stamp,Last Modification Timestamp,Previous State,Managed Object,Target Entities,Alarm Type,Event Time,Probable Cause,Specific Problems,Notification Identifier,Domain,Alarm Origin,Perceived Severity,Additional Text,Original Severity,Original Event Time
1130,Terminated,Closed,True,False,"Auto-Clear","Auto-Clear",Fri 18 Oct 2013 05:01:46,Fri 18 Oct 2013 05:01:46,Fri 18 Oct 2013 04:37:29,Fri 18 Oct 2013 05:01:40,Fri 18 Oct 2013 05:01:46,Outstanding,Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD "AMS" Node "INGJJMGRJMTSNB0001AG2OLT001",{ Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD "AMS" Node "INGJJMGRJMTSNB0001AG2OLT001" },CommunicationsAlarm,Fri 18 Oct 2013 05:01:40,Unknown,{  },160315,Domain VMD1HTE1A71_ns:.dm.GJ_OAD2,IncomingAlarm,Major,"
                            nativeProbableCause: Attempt Threshold Crossed
                            osTime: 20131018163727.250+0530
                            neTime: 20131011174021.0+0530
                            notificationId: AMS:160315
                            portNumber:
                            ftpNumber:
                            meNm: INGJJMGRJMTSNB0001AG2OLT001
                            mdNm: AMS
                            objectType: OT_MANAGED_ELEMENT
                            aliasValue: MGMT Security
Access:INGJJMGRJMTSNB0001AG2OLT001:IP10.70.6.6.T0.S841 ",Major,Fri 11 Oct 2013 05:40:21

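It would probably be better to use Text::CSV, but I was more interested in learning how unpack works. For fixed-width data it seems more readable and robust than regular expressions.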
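
For readers who want to see the positional idea behind unpack("A36x2A*") spelled out, here is a rough Python sketch of the same slicing. It is only an illustration under assumptions: the names HEADER_WIDTH, SKIP, split_line and parse_record are hypothetical, the sample lines are constructed to match the 36-column key layout of the data above, and, like the Perl script, it assumes the input contains only key/value lines and their continuation lines.

```python
import csv
import sys

HEADER_WIDTH = 36  # plays the role of A36: the space-padded key column
SKIP = 2           # plays the role of x2: the "= " after the key column


def split_line(line):
    """Split one fixed-width line into (header, value) purely by position."""
    line = line.rstrip("\n")
    header = line[:HEADER_WIDTH].strip()
    value = line[HEADER_WIDTH + SKIP:].rstrip()
    return header, value


def parse_record(lines):
    """Collect headers and values; an empty key column marks a continuation line."""
    headers, values = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        header, value = split_line(line)
        if header == "":
            if values:
                values[-1] += "\n" + value  # fold into the previous field
            continue
        headers.append(header)
        values.append(value)
    return headers, values


# Constructed sample: keys right-aligned in 35 columns, then " = ", then the value.
sample = [
    f'{"Identifier":>35} = 1130',
    f'{"Additional Text":>35} = "',
    " " * 38 + "mdNm: AMS",
    " " * 38 + 'objectType: OT_MANAGED_ELEMENT "',
    f'{"Original Severity":>35} = Major',
]

headers, values = parse_record(sample)
writer = csv.writer(sys.stdout, lineterminator="\n")
writer.writerow(headers)
writer.writerow(values)
```

Handing the rows to a CSV writer, rather than joining them with commas, takes care of quoting the multi-line field.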

Answer 2

Or you can use Perl's regex subroutines (named patterns inside a DEFINE block):

my $grammar = qr!
    (?(DEFINE)
       (?<Identifier> [^=\n]+ )
       (?<Statement>
           (?: # Begin alternation
               " #Opening quotes
               [^"]+? # Any non-quotes (including a new line)
               " # Closing quotes
              | [^\n]+ # Or a single line
           )   # End alternation
        )   

   )

!x;

my $file = do { local $/; <> }; #Slurp file named on command line
my %columns;
while( $file =~ 
   m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc )
{ 
   my ($header,$value) = ($1,$2);

       # Remove leading spaces and quote variable if it contains commas:
   for($header,$value) { s/^\s+//mg; /,/ and s/^|$/"/g }

       # Substitute \n with \\n to make multi-line values single-line:
   for($value) { chomp; s/\n/\\n/g }

   $columns{$header}=$value
}

print join "," => sort keys %columns; # Print column headers
print "\n";
print join "," => map { $columns{$_} } sort keys %columns; # Column content
print "\n";

Call it like this:

[user@host]$ /path/to/script.pl /path/to/file.txt

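It prints the table in CSV format to standard output.

This assumes that multi-line statements contain no double quotes (") other than the opening and closing ones.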
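
The same matching idea, a key followed by either a quoted value that may span lines or the rest of a single line, can be sketched in Python as well. This is an illustration under the assumption just stated (no inner double quotes in multi-line values); PAIR and parse_pairs are hypothetical names, and the sample text is a shortened stand-in for the real file.

```python
import re

# A key, then "=", then either a quoted value (which may span lines)
# or the remainder of the current line.
PAIR = re.compile(
    r'''
    ^[ \t]*(?P<key>[^=\n]+?)[ \t]*=[ \t]*   # key and the padded equals sign
    (?P<value>
        "[^"]*"                              # quoted value, newlines allowed
      | [^\n]*                               # or a single-line value
    )
    ''',
    re.VERBOSE | re.MULTILINE,
)


def parse_pairs(text):
    """Return {key: value} in file order; multi-line values keep their newlines."""
    pairs = {}
    for match in PAIR.finditer(text):
        pairs[match.group("key")] = match.group("value").strip()
    return pairs


# Shortened, made-up sample in the same shape as the data above.
text = (
    'Identifier = 1130\n'
    '   Close User Identifier = "Auto-Clear"\n'
    '   Additional Text = "\n'
    '        mdNm: AMS\n'
    '        aliasValue: MGMT Security "\n'
    '   Original Severity = Major\n'
)

for key, value in parse_pairs(text).items():
    print(repr(key), "->", repr(value))
```

Unlike the script above, this sketch keeps the keys in file order instead of sorting them, because Python dictionaries preserve insertion order.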

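Answer 3

Well, it's not pretty, but it does what you asked. I wrote a script in Perl that takes the file above and parses it, then uses the Text::CSV module to convert it to CSV format.

Script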

#!/usr/bin/env perl

use Text::CSV;

open(my $fh, "<", "data.txt") or die "Cannot open data.txt: $!";
my @lines = <$fh>;
close($fh);

my (%csv, $name, $val);

foreach my $line (@lines) {
  if ($line =~ m/=/) {
    chomp($line);
        $line =~ s/^\s+//g;
    ($name, $val) = split(/ = /, $line);
        $val =~ s/^"$//;
        $csv{$name} = $val;
  } else {
        $line =~ s/^\s+//g;
        $line =~ s/\s+$/\\n/g;
        $line =~ s/ "\\n$//;
        $csv{$name} .= $line;
  }
}

my @vals;
foreach my $i (sort keys %csv) {
  push(@vals, $csv{$i});
}

my $ccsv = Text::CSV->new();
$ccsv->combine(sort keys %csv);
$ccsv->parse($ccsv->string());
print $ccsv->string() . "\n";
$ccsv->combine(@vals);
$ccsv->parse($ccsv->string());
print $ccsv->string() . "\n";

Example

Just run it like this:

$ ./csv.pl
"Additional Text","Alarm Origin","Alarm Type","Clearance Time Stamp","Close Time Stamp","Creation Timestamp",Domain,"Event Time","Last Modification Timestamp","Managed Object","Notification Identifier","Original Event Time","Original Severity","Perceived Severity","Previous State","Probable Cause","Specific Problems","Target Entities","Termination Time Stamp"
"nativeProbableCause: Attempt Threshold Crossed\nosTime: 20131018163727.250+0530\nneTime: 20131011174021.0+0530\nnotificationId: AMS:160315\nportNumber:\nftpNumber:\nmeNm: INGJJMGRJMTSNB0001AG2OLT001\nmdNm: AMS\nobjectType: OT_MANAGED_ELEMENT\naliasValue: MGMT Security\nAccess:INGJJMGRJMTSNB0001AG2OLT001:IP10.70.6.6.T0.S841",IncomingAlarm,CommunicationsAlarm,"Fri 18 Oct 2013 05:01:40","Fri 18 Oct 2013 05:01:46","Fri 18 Oct 2013 04:37:29","Domain VMD1HTE1A71_ns:.dm.GJ_OAD2","Fri 18 Oct 2013 05:01:40","Fri 18 Oct 2013 05:01:46","Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD ""AMS"" Node ""INGJJMGRJMTSNB0001AG2OLT001""",160315,"Fri 11 Oct 2013 05:40:21",Major,Major,Outstanding,Unknown,"{  }","{ Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD ""AMS"" Node ""INGJJMGRJMTSNB0001AG2OLT001"" }","Fri 18 Oct 2013 05:01:46"

Let me know what you think, or if you have problems running it. If it does what you need, I'll fill in the details of how it works.
