Text processing between two patterns in awk to provide selective unique output

Question 1

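It looks like, if we use an optional colon followed by at least two spaces as FS (FS = ":?   *"), most of the main fields used in this task can be isolated without running into trouble with the extra spaces: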

$ cat t20.awk
BEGIN { FS=":?   *"; OFS = ", "; SEP = "; "; }

# if $2 is "Type", append $3 to T
$2 == "Type" { T = (T ? T SEP : "") $3;}

# if $2 is "Retention Level", append the sub-string in parentheses to L
# (awk regexes have no lazy quantifiers, so [^)]* is used to stop at the first ")")
$2 == "Retention Level" && match($0, /\([^)]*\)/) {
    L = (L ? L SEP : "") substr($0, RSTART+1, RLENGTH-2)
}

# in the Daily Windows block, skip all lines without " --> "
# use an associative array "a" to keep each time range unique
/Daily Windows:/,/^\s*$/ {
    if (!/ --> /) next
    key = $3 " --> " $6
    if (!a[key]++) S = (S ? S SEP : "") key
}

END { print T, L, S }
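
The field splitting that the script relies on can be checked in isolation. The sketch below is only an illustration: the two input lines are hypothetical (the real file.txt is not shown here) and merely assume the indented "Name:   value" layout described above.

```shell
# Hypothetical sample lines; the real file.txt may be laid out differently.
# FS is an optional colon followed by two or more spaces, so the single
# space inside "Full Backup" and the colons inside "00:00:00" do not split.
fields=$(printf '    Type:                Full Backup\n' |
  awk 'BEGIN { FS = ":?   *" } { print NF "|" $2 "|" $3 }')
echo "$fields"

# On a window line, the start time is $3 and the end time is $6.
window=$(printf '          Sunday     00:00:00  -->  Sunday     07:00:00\n' |
  awk 'BEGIN { FS = ":?   *" } { print $3 " --> " $6 }')
echo "$window"
```

Because the lines start with spaces, the leading indentation itself matches FS and $1 is empty, which is why the script tests $2 rather than $1.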

Notes:

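In S = (S ? S SEP : "") key, the ternary (S ? S SEP : "") avoids a leading SEP when concatenating the strings; the same applies when building T and L.
In substr($0, RSTART+1, RLENGTH-2), RSTART+1 skips the leading ( and RLENGTH-2 shortens the length by two so that both parentheses are dropped.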
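
The two notes above can be tried out on their own. This is a minimal sketch, assuming "Retention Level" lines that carry a single parenthesized label; the numbers 9 and 8 and the exact line layout are made up for illustration.

```shell
# Hypothetical input lines, not taken from the real file.txt.
levels=$(printf 'Retention Level: 9 (3 months)\nRetention Level: 8 (1 month)\n' |
  awk -v SEP='; ' 'match($0, /\([^)]*\)/) {
      # RSTART+1 skips "(", RLENGTH-2 removes both parentheses from the length
      inner = substr($0, RSTART + 1, RLENGTH - 2)
      # SEP is prepended only when L already holds an earlier value
      L = (L ? L SEP : "") inner
  }
  END { print L }')
echo "$levels"
```

The result has no leading or trailing separator, which is the point of the ternary.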

Run the code:

$ awk -f t20.awk file.txt
#Full Backup; Full Backup; Differential Incremental Backup, 3 months; 1 month; 3 weeks, 00:00:00 --> 07:00:00; 00:00:00 --> 10:00:00; 01:00:00 --> 16:00:00

Update:

Based on your description in the comments, I made the following adjustments to the code for the Daily Windows section:

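Added a flag dw_on to mark the start and end of a Daily Windows block. Every line seen while dw_on == 1 that matches the pattern / --> / is checked for S. The flag is reset to 0 as soon as the next empty line (/^\s*$/) is detected.
Added a variable cnt_DW to count the number of Daily Windows entries in each Schedule. It is reset at the start of each Daily Windows block.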

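Uniqueness is maintained by the hash (associative array) a, which is reset at the start of each Daily Windows block. The key of this hash is key = $3 " --> " $6, i.e. the window you want to retrieve. The statement if (!a[key]++) S = (S ? S SEP : "") key works the same way as the following: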

  if (!a[key]) { 
      a[key] = a[key] + 1
      S = (S ? S SEP : "") key 
  }

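So key is appended to S only when it has not been seen before (a[key] is still empty). The second time the same key is processed, a[key] == 1 already, so the block above is skipped. This is one of the common ways to check uniqueness in awk.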
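
As a stand-alone illustration of this idiom, the sketch below keeps only the first occurrence of each input line. The array name seen and the one-letter input values are placeholders chosen for the example.

```shell
# !seen[$0]++ is true only the first time a given line is encountered:
# the test reads the old count (empty, i.e. false) and then increments it.
unique=$(printf 'x\ny\nx\nz\ny\n' |
  awk '!seen[$0]++ { out = (out ? out "; " : "") $0 } END { print out }')
echo "$unique"
```

The order of first appearance is preserved, which is what the script needs when it appends windows to S.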

$ cat t20.1.awk
BEGIN { FS=":?   *"; OFS = ", "; SEP = "; "; }

# if $2 is "Type", append $3 to T
$2 == "Type" { T = (T ? T SEP : "") $3;}

# if $2 is "Retention Level", append the sub-string in parentheses to L
# (awk regexes have no lazy quantifiers, so [^)]* is used to stop at the first ")")
$2 == "Retention Level" && match($0, /\([^)]*\)/) {
    L = (L ? L SEP : "") substr($0, RSTART+1, RLENGTH-2)
}

/Daily Windows:/ {
    # turn on the dw_on flag and reset cnt_DW (number of DW entries in a section)
    dw_on = 1; cnt_DW=0;
    # reset the hash 'a' for uniqueness check
    # if you need the uniqueness across all Schedules, then comment it out
    delete a; 
    next;
}

# if dw_on flag is true, i.e. "dw_on == 1"
dw_on {
    # match " --> ", then increase cnt_DW, check the unique window
    # and then append qualified entry to "S"
    if (/ --> /) {
        cnt_DW++
        key = $3 " --> " $6
        if (!a[key]++) S = (S ? S SEP : "") key
    # else if EMPTY line, reset dw_on flag, if cnt_DW is 0, append "No Window" to S
    } else if (/^\s*$/) {
        dw_on = 0;
        if (!cnt_DW) S = (S ? S SEP : "") "No Window"
    }
}

END { 
    # the last Schedule section is not followed by an empty line, so cnt_DW
    # for that section has to be checked here in the "END" block
    if(dw_on && !cnt_DW) S = (S ? S SEP : "") "No Window";

    # print the result.
    print T, L, S 
}

I made the following small modifications to your original data to test the code above:

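Removed the single entry under Daily Windows in the 2nd Schedule section.
Replaced the line Friday 00:00:00 --> Friday 07:00:00 in the 1st Schedule with Friday 01:00:00 --> Friday 16:00:00, the same line as in the 3rd Schedule section.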

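So now the 1st Schedule has 2 unique windows, the 2nd Schedule has no window, and the 3rd Schedule has 1 unique window, which is identical to one of the windows in the 1st Schedule.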

Running the updated code on the data above gives:

$ awk -f t20.1.awk file.txt
#Full Backup; Full Backup; Differential Incremental Backup, 3 months; 1 month; 3 weeks, 00:00:00 --> 07:00:00; 01:00:00 --> 16:00:00; No Window; 01:00:00 --> 16:00:00

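Note that 01:00:00 --> 16:00:00 appears twice because the two entries belong to different Schedules. If you want to drop the last 01:00:00 --> 16:00:00, comment out the delete a line as indicated in the code, and you will get the following: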

#Full Backup; Full Backup; Differential Incremental Backup, 3 months; 1 month; 3 weeks, 00:00:00 --> 07:00:00; 01:00:00 --> 16:00:00; No Window
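
The effect of delete a can be seen on a much smaller input. The following is a simplified sketch, not the script above: it assumes blocks separated by an empty line and uses one-letter placeholder values, with "x" occurring in both blocks.

```shell
# Per-block uniqueness: the array is cleared at every empty line,
# so "x" is reported again in the second block.
per_block=$(printf 'x\ny\n\nx\n' |
  awk '/^$/ { delete a; next }
       !a[$0]++ { out = (out ? out "; " : "") $0 }
       END { print out }')
echo "$per_block"

# Global uniqueness: the array is never cleared, so the second "x" is skipped.
global=$(printf 'x\ny\n\nx\n' |
  awk '/^$/ { next }
       !a[$0]++ { out = (out ? out "; " : "") $0 }
       END { print out }')
echo "$global"
```

Which variant is appropriate depends on whether a window repeated in a later Schedule should be listed again.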

Answer