Using AWK to filter duplicates that differ only by timestamp

Question 1

$ tac file | awk '!seen[substr($0,1,length()-25)]++'
archive-daily/document-deb-report-2022-07-18-10-04-21.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-sell-report-2022-07-13-23-15-34.html
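The one-liner above depends on the timestamp suffix always being exactly 25 characters long. A minimal sketch of the same `!seen[key]++` idiom that derives the key with a regular expression instead is shown below; the function name `keep_first_per_prefix` is hypothetical, and the pattern assumes every name ends in six hyphen-separated numeric groups followed by `.html`.

```shell
# keep_first_per_prefix is a hypothetical helper name, not part of the original answer.
keep_first_per_prefix() {
  # Reverse the file so the last entry of each group is seen first, then keep
  # only the first line for each prefix. The six numeric groups are spelled out
  # rather than written as {6} because some awk implementations lack interval support.
  tac "$1" | awk '
    {
      key = $0
      sub(/-[0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+\.html$/, "", key)
    }
    !seen[key]++
  '
}
```

Called as `keep_first_per_prefix file`, it prints one line per prefix, in the reversed order that `tac` produces.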

Question 2

Using sed and tac

$ sed -En 'G;/^(([^-]*-){3}).*\n.*\n\1/d;H;P' <(tac input_file)
archive-daily/document-sell-report-2022-07-13-23-15-34.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-deb-report-2022-07-18-10-04-21.html
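For readers unfamiliar with the hold space, here is the same sed script spread over several lines with a comment per command. This is only an annotated sketch: the wrapper name `dedupe_by_prefix` is hypothetical, a pipe replaces the process substitution, and it assumes GNU sed (for `-E` and `\n` inside the regular expression).

```shell
# dedupe_by_prefix is a hypothetical wrapper name used for illustration.
dedupe_by_prefix() {
  tac "$1" | sed -En '
    # Append the hold space (every line kept so far) to the current line.
    G
    # Capture the first three hyphen-terminated chunks as the prefix; if the
    # same prefix also starts a previously kept line, delete the current line.
    /^(([^-]*-){3}).*\n.*\n\1/d
    # Otherwise remember this line by appending the pattern space to the hold space
    H
    # and print only the first line of the pattern space, i.e. the current line.
    P
  '
}
```

Because the input is reversed first, the surviving lines are printed in the reversed order that `tac` produces.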

Question 3

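Using AWK: the program remembers the prefix of the previous file name, extracts the prefix of the current one, and compares the two. When the prefix changes, it prints the previous file name, so the last entry of each run of consecutive lines sharing a prefix is kept.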

# myuniq.awk

BEGIN {
    last_prefix = 0
    last_line = 0
}

{
    if (match($0, /(-[[:digit:]]+){6}\.html$/) == 0)
        next

    prefix = substr($0, 1, RSTART - 1)
    if (last_prefix != 0 && prefix != last_prefix)
        print last_line

    last_prefix = prefix
    last_line = $0
}

END {
    if (last_line != 0)
        print last_line
}

$ cat files.txt
archive-daily/document-sell-report-2022-07-12-23-21-02.html
archive-daily/document-sell-report-2022-07-13-23-15-34.html
archive-daily/document-loan-report-2022-07-18-05-12-16.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-deb-report-2022-07-18-13-17-40.html
archive-daily/document-deb-report-2022-07-18-10-04-21.html
$ awk -f myuniq.awk < files.txt
archive-daily/document-sell-report-2022-07-13-23-15-34.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-deb-report-2022-07-18-10-04-21.html

Question 4

Using Raku (formerly known as Perl_6)

~$ raku -e 'my @a = lines>>.split(/ "/" | [ <?after report> \- ]/);  \
            my %h.=append: [Z=>] @a>>[1], @a.map(*.[2]);  \
            for %h.sort {say (.key, .value.max).join("-")};'   file.txt

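This answer simplifies the problem slightly and assumes that all files to be analyzed are in the same directory, so the split in the first step drops the directory. Note that the code works just as well if the directory is included (the key simply gets longer). The sample output is returned in alphabetical "report" order:

Sample Input: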

archive-daily/document-sell-report-2022-07-12-23-21-02.html
archive-daily/document-sell-report-2022-07-13-23-15-34.html
archive-daily/document-loan-report-2022-07-18-05-12-16.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-deb-report-2022-07-18-13-17-40.html
archive-daily/document-deb-report-2022-07-18-10-04-21.html

Sample Output:

document-deb-report-2022-07-18-13-17-40.html
document-loan-report-2022-07-18-17-07-26.html
document-sell-report-2022-07-13-23-15-34.html

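In the first statement, lines are read in, destructively split on either a / slash or the - hyphen that follows the word "report", and stored in the @a array. Each line therefore yields three elements: the directory, the report name, and the timestamp. A hash %h is declared and the elements are appended to it: the [Z] "zip-reduction" pulls the corresponding items out of the two lists, and the => "fat-arrow" pairs them in a key-value relationship, hence the compound [Z=>] meta-operator. The report-name element thus becomes the key and the timestamp element its value. Finally, the max value for each key is computed and the joined result is printed.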
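For comparison, the same group-by-key and keep-the-maximum idea can be sketched in awk. This is only an illustration under stated assumptions: the function name `latest_per_report` is hypothetical, every name is assumed to end in six hyphen-separated numeric groups plus `.html`, and plain string comparison is used, which orders the timestamps correctly only because they are zero-padded and fixed-width.

```shell
# latest_per_report is a hypothetical helper name used for illustration.
latest_per_report() {
  awk '
    {
      name = $0
      sub(/^.*\//, "", name)                 # drop the directory part
      # Locate the timestamp suffix; the six numeric groups are spelled out
      # because some awk implementations lack {6} interval support.
      if (!match(name, /-[0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+\.html$/))
        next                                 # skip lines without a timestamp
      key   = substr(name, 1, RSTART - 1)    # e.g. document-deb-report
      stamp = substr(name, RSTART + 1)       # e.g. 2022-07-18-13-17-40.html
      # Keep the lexicographically largest timestamp seen for this key.
      if (!(key in best) || stamp > best[key])
        best[key] = stamp
    }
    END {
      for (key in best)
        print key "-" best[key]
    }
  ' "$@" | sort
}
```

Because every line is compared against the stored maximum, the input does not need to be sorted or grouped; the final `sort` only makes the output order deterministic.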

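Here is where it gets interesting. Raku has ISO-8601 DateTimes built in, so the timestamp element can be rewritten with subst into a form that is recognized as a DateTime object. That way max is taken over actual DateTime values: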

~$ raku -e 'my @a = lines>>.split(/ "/" | [ <?after report> \- ]/);  \
            my %h.=append: [Z=>] @a>>[1], @a.map(*.[2].subst(/ \- (\d**2) \- (\d**2) \- (\d**2) \.html $/, {"T$0:$1:$2"} ).DateTime); \
            "".put; for %h.sort {say (.key => .value.max)};'  file.txt

document-deb-report => 2022-07-18T13:17:40Z
document-loan-report => 2022-07-18T17:07:26Z
document-sell-report => 2022-07-13T23:15:34Z

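More information is available at the links below. Note that max returns the latest DateTime for each report. The result differs from simply taking the last line only because the OP's data is out of order (as pointed out by @QuartzCristal).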

https://docs.raku.org/language/hashmap#Mutable_hashes_and_immutable_maps
https://docs.raku.org/language/operators#index-entry-[]_(reduction_metaoperators)
https://docs.raku.org/type/DateTime
https://raku.org

Answer