Processing text between two patterns in awk to produce selective, unique output


I have the following input file:

Policy Name:       KE15-LOCALHOST-APP-RADIX-DAILY

  Policy Type:         Standard
  Active:              yes
  Include:  /appussd
            /home/ussd2ke
            /var/log
            /etc
            /usr

  Schedule:              Montlhy_Full
    Type:                Full Backup
    PFI Recovery:        0
    Maximum MPX:         16
    Retention Level:     5 (3 months)
    Daily Windows:
          Sunday     00:00:00  -->  Sunday     07:00:00
          Monday     00:00:00  -->  Monday     07:00:00
          Tuesday    00:00:00  -->  Tuesday    07:00:00
          Wednesday  00:00:00  -->  Wednesday  07:00:00
          Thursday   00:00:00  -->  Thursday   07:00:00
          Friday     00:00:00  -->  Friday     07:00:00
          Saturday   00:00:00  -->  Saturday   07:00:00

  Schedule:              Weekly_Full
    Type:                Full Backup
    PFI Recovery:        0
    Maximum MPX:         16
    Retention Level:     3 (1 month)
    Daily Windows:
          Wednesday  00:00:00  -->  Wednesday  10:00:00

  Schedule:              Daily_Inc
    Type:                Differential Incremental Backup
    PFI Recovery:        0
    Maximum MPX:         16
    Retention Level:     2 (3 weeks)
    Daily Windows:
          Sunday     01:00:00  -->  Sunday     16:00:00
          Monday     01:00:00  -->  Monday     16:00:00
          Tuesday    01:00:00  -->  Tuesday    16:00:00
          Wednesday  01:00:00  -->  Wednesday  16:00:00
          Thursday   01:00:00  -->  Thursday   16:00:00
          Friday     01:00:00  -->  Friday     16:00:00
          Saturday   01:00:00  -->  Saturday   16:00:00

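Now I need the values of Type (under each Schedule), Retention Level, and Daily Windows. The three groups should be separated by commas, and multiple entries within a group by semicolons.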

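Here is the command I tried. The problem is the Daily Windows part: I can extract the lines in between and remove the "Daily Windows:" line itself, but I still need to strip the weekday names and keep only the unique time ranges.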

awk '
  BEGIN { SEP = "" }
  $1 == "Type:" { $1 = ""; T = T SEP $0 }
  $1 == "Retention" && $2 == "Level:" {
    sub(/^.*\(/," ")
    sub(/\).*/,"")
    L = L SEP $0
    if (SEP == "") {
      SEP = ";"
    }
  }
  /Daily Windows:/,/^$/ {
  sub(/^.*Daily.*/,"")
  sub(/^[^A-Z][a-z]+y$/,"")
  S = S SEP $0}
  END {
  sub(/^ */,"",T)
  print T "," L "," S
}'

This is the output:

Full Backup; Full Backup; Differential Incremental Backup, 3 months; 1 month; 3 weeks,;;          Sunday     00:00:00  -->  Sunday     07:00:00;          Monday     00:00:00  -->  Monday     07:00:00;          Tuesday    00:00:00  -->  Tuesday    07:00:00;          Wednesday  00:00:00  -->  Wednesday  07:00:00;          Thursday   00:00:00  -->  Thursday   07:00:00;          Friday     00:00:00  -->  Friday     07:00:00;          Saturday   00:00:00  -->  Saturday   07:00:00;;;          Wednesday  00:00:00  -->  Wednesday  10:00:00;;;          Sunday     01:00:00  -->  Sunday     16:00:00;          Monday     01:00:00  -->  Monday     16:00:00;          Tuesday    01:00:00  -->  Tuesday    16:00:00;          Wednesday  01:00:00  -->  Wednesday  16:00:00;          Thursday   01:00:00  -->  Thursday   16:00:00;          Friday     01:00:00  -->  Friday     16:00:00;          Saturday   01:00:00  -->  Saturday   16:00:00

However, the desired output is:

Full Backup; Full Backup; Differential Incremental Backup, 3 months; 1 month; 3 weeks, 00:00:00  -->  07:00:00; 00:00:00  -->  10:00:00; 01:00:00  -->  16:00:00

Answer 1

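It looks like using an optional colon followed by at least two spaces as the FS (FS = ":?   *") isolates most of the main fields this task needs, without the hassle of dealing with the extra whitespace: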

$ cat t20.awk
BEGIN { FS=":?   *"; OFS = ", "; SEP = "; "; }

# if $2 is "Type", append $3 to T
$2 == "Type" { T = (T ? T SEP : "") $3;}

# if $2 is "Retention Level", append the sub-string in parentheses to L
$2 == "Retention Level" && match($0, /\([^)]*\)/) {
    L = (L ? L SEP : "") substr($0, RSTART+1, RLENGTH-2)
}

# in the Daily Windows block, skip all lines without " --> "
# use an associative array "a" to keep each time range unique
/Daily Windows:/,/^[[:space:]]*$/ {
    if (!/ --> /) next
    key = $3 " --> " $6
    if (!a[key]++) S = (S ? S SEP : "") key
}

END { print T, L, S }
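
To see what this FS does to the input, a quick check like the one below may help. It is only a sketch: the helper name show_fields is made up for illustration, and it assumes a POSIX awk. Each field is printed in brackets, so the empty $1 produced by the leading indentation becomes visible.

```shell
# show_fields: hypothetical helper (not part of the answer) that prints every
# field awk sees when FS is an optional colon followed by two or more spaces.
show_fields() {
    awk 'BEGIN { FS = ":?   *" }
    {
        for (i = 1; i <= NF; i++) printf "%s[%s]", (i > 1 ? " " : ""), $i
        print ""
    }'
}

printf '%s\n' \
    '    Retention Level:     5 (3 months)' \
    '          Sunday     00:00:00  -->  Sunday     07:00:00' | show_fields
# prints:
# [] [Retention Level] [5 (3 months)]
# [] [Sunday] [00:00:00] [-->] [Sunday] [07:00:00]
```

The single space inside "Retention Level" does not split the field because the separator needs at least two spaces, which is why the script can compare $2 against the full label and read the times from $3 and $6.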

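Notes:

  1. In S = (S ? S SEP : "") key, the ternary (S ? S SEP : "") avoids a leading SEP when concatenating the strings; T and L are built the same way.

  2. In substr($0, RSTART+1, RLENGTH-2), RSTART+1 skips the leading ( and RLENGTH-2 drops both parentheses from the length.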
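
The second note can be tried in isolation with a sketch like the following. The helper name paren_text is an assumption for illustration only; the pattern uses [^)]* so it does not depend on any awk-specific handling of .*? and should behave the same in any POSIX awk.

```shell
# paren_text: hypothetical helper; prints the text inside the first (...)
# found on each input line and silently skips lines without parentheses.
paren_text() {
    awk 'match($0, /\([^)]*\)/) {
        # RSTART is the position of "(", RLENGTH counts both parentheses
        print substr($0, RSTART + 1, RLENGTH - 2)
    }'
}

printf '%s\n' '    Retention Level:     5 (3 months)' | paren_text
# prints: 3 months
```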

Running the code:

$ awk -f t20.awk file.txt
#Full Backup; Full Backup; Differential Incremental Backup, 3 months; 1 month; 3 weeks, 00:00:00 --> 07:00:00; 00:00:00 --> 10:00:00; 01:00:00 --> 16:00:00

Update:

Based on your description in the comments, I made the following adjustments to the Daily Windows part of the code:

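  • Added a flag dw_on to mark the start and end of a Daily Windows block. Every line seen while dw_on == 1 that matches the pattern / --> / is checked. The flag is reset to 0 as soon as the next EMPTY line (/^\s*$/) is detected.
  • Added a variable cnt_DW to count the number of Daily Windows entries in each Schedule. It is reset at the start of each Daily Windows block.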

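Uniqueness is maintained by the hash (associative array) a, which is reset at the start of each Daily Windows block. The key of this hash is key = $3 " --> " $6, the window you want to retrieve. The statement if (!a[key]++) S = (S ? S SEP : "") key has the same effect as: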

  if (!a[key]) { 
      a[key] = a[key] + 1
      S = (S ? S SEP : "") key 
  }

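So key is appended to S only when it has not been seen before (a[key] is still empty). The second time the same key is processed, a[key] is already 1 and the block above is skipped. This is one of the common awk idioms for checking uniqueness.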
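
The idiom can be tried on its own with a minimal sketch like this one. The helper name uniq_join and the array name seen are assumptions chosen for illustration; the logic is the same !a[key]++ test, applied to whole lines instead of a key built from $3 and $6.

```shell
# uniq_join: hypothetical helper; joins the distinct input lines with "; "
# while keeping the order in which each line was first seen.
uniq_join() {
    awk '!seen[$0]++ { out = (out == "" ? "" : out "; ") $0 }
         END { print out }'
}

printf '%s\n' \
    '00:00:00 --> 07:00:00' \
    '01:00:00 --> 16:00:00' \
    '00:00:00 --> 07:00:00' | uniq_join
# prints: 00:00:00 --> 07:00:00; 01:00:00 --> 16:00:00
```

Testing out == "" rather than relying on truthiness is a small deviation from the answer's (S ? S SEP : "") form; it only matters if the accumulated string could itself evaluate as false, for example a lone "0".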

$ cat t20.1.awk
BEGIN { FS=":?   *"; OFS = ", "; SEP = "; "; }

# if $2 is "Type", append $3 to T
$2 == "Type" { T = (T ? T SEP : "") $3;}

# if $2 is "Retention Level", append the sub-string in parentheses to L
$2 == "Retention Level" && match($0, /\([^)]*\)/) {
    L = (L ? L SEP : "") substr($0, RSTART+1, RLENGTH-2)
}

/Daily Windows:/ {
    # turn on the dw_on flag and reset cnt_DW (number of DW entries in a section)
    dw_on = 1; cnt_DW=0;
    # reset the hash 'a' for uniqueness check
    # if you need the uniqueness across all Schedules, then comment it out
    delete a; 
    next;
}

# if dw_on flag is true, i.e. "dw_on == 1"
dw_on {
    # match " --> ", then increase cnt_DW, check the unique window
    # and then append qualified entry to "S"
    if (/ --> /) {
        cnt_DW++
        key = $3 " --> " $6
        if (!a[key]++) S = (S ? S SEP : "") key
    # else if EMPTY line, reset dw_on flag; if cnt_DW is 0, append "No Window" to S
    } else if (/^[[:space:]]*$/) {
        dw_on = 0;
        if (!cnt_DW) S = (S ? S SEP : "") "No Window"
    }
}

END { 
    # the last Schedule section does not end with an EMPTY line, so we need
    # to check cnt_DW for the last Schedule section in the "END" block
    if(dw_on && !cnt_DW) S = (S ? S SEP : "") "No Window";

    # print the result.
    print T, L, S 
}

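I made the following small changes to your original data to test the code above:

  1. Removed the single entry under Daily Windows in the second Schedule section.
  2. Replaced the line Friday 00:00:00 --> Friday 07:00:00 in the first Schedule with Friday 01:00:00 --> Friday 16:00:00, the same line as in the third Schedule section.

So now the 1st Schedule has 2 unique windows, the 2nd Schedule has no window, and the 3rd Schedule has 1 unique window, which is the same as one of the windows in the 1st Schedule.

Running the updated code on this data gives: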

$ awk -f t20.1.awk file.txt
#Full Backup; Full Backup; Differential Incremental Backup, 3 months; 1 month; 3 weeks, 00:00:00 --> 07:00:00; 01:00:00 --> 16:00:00; No Window; 01:00:00 --> 16:00:00

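Note that 01:00:00 --> 16:00:00 appears twice because the two entries belong to different Schedules. If you want to drop the last 01:00:00 --> 16:00:00, comment out the delete a line as noted in the code, and you will get the following result: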

#Full Backup; Full Backup; Differential Incremental Backup, 3 months; 1 month; 3 weeks, 00:00:00 --> 07:00:00; 01:00:00 --> 16:00:00; No Window

Answer 2

You can do all of it in awk:

awk -F'(: *|[)(])' '
    /^ *Type/     { type=type==""?$2 : type ";" $2 }
    /^ *Retention/{ Retention=Retention==""?$3 : Retention ";" $3}
    /^ *Wednesday/{ gsub(/ +Wednesday/,"",$0); day=day==""?$0 : day ";" $0}
END{ print type, Retention, day }' OFS=, infile

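You may need to adjust the /.../ condition parts so that they match the field values more precisely.

For the sample input this produces the values you asked for, although the spacing around the separators differs slightly. Note that it takes each time range from the Wednesday row, so it relies on every Schedule having one.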
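
As one possible way to tighten those conditions, the sketch below compares $1 against the complete label and requires the Wednesday row to have the full "time --> time" shape. It is an illustration under assumptions, not a tested replacement: the function name summarize and the variable name ret are made up here, and it still depends on a Wednesday row being present in each Schedule.

```shell
# summarize: hypothetical variant of the command above with anchored conditions.
# Reads the files given as arguments, or standard input when there are none.
summarize() {
    awk -F'(: *|[)(])' -v OFS=, '
        # $1 must be exactly the label, apart from leading indentation
        $1 ~ /^ *Type$/            { type = (type == "" ? "" : type ";") $2 }
        $1 ~ /^ *Retention Level$/ { ret  = (ret  == "" ? "" : ret  ";") $3 }
        # only accept complete "Wednesday time --> Wednesday time" rows
        /^ *Wednesday +[0-9:]+ +--> +Wednesday +[0-9:]+ *$/ {
            gsub(/ +Wednesday/, "")
            day = (day == "" ? "" : day ";") $0
        }
        END { print type, ret, day }' "$@"
}

# usage, with infile being the input file from the question:
# summarize infile
```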
