Text processing - How to uniquely sort find output with duplicate filenames from different directories?

I want to uniquely sort the output of a find command so that no filename appears more than once, regardless of which directory it is in.

find /path/to/first_directory/* /path/to/second_directory/* /path/to/third_directory/* -mtime -1 -name "filename_pattern*"

Sample output:

/path/to/first_directory/sample_file1_2017Dec25.dat
/path/to/first_directory/sample_file2_2017Nov01.dat
/path/to/first_directory/sample_file3_2017Oct08.dat
/path/to/first_directory/archive/sample_file1_2017Dec25.dat.Z
/path/to/first_directory/archive/sample_file2_2017Nov01.dat.Z
/path/to/second_directory/sample_file4_2017Sep11.dat
/path/to/second_directory/sample_file5_2017Oct05.dat
/path/to/third_directory/sample_file1_2017Dec25.dat
/path/to/third_directory/sample_file2_2017Nov01.dat
/path/to/third_directory/sample_file3_2017Oct08.dat
/path/to/third_directory/sample_file4_2017Sep11.dat
/path/to/third_directory/sample_file5_2017Oct05.dat
/path/to/third_directory/sample_file6_2017July04.dat
/path/to/third_directory/sample_file6_2017June12.dat
/path/to/third_directory/sample_file7_2017May01.dat

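From the output you can see that there are duplicate filenames in /first_directory/ and /first_directory/archive/, and that all files from /first_directory/* and /second_directory/* are also in /third_directory/*. This means that /third_directory/* is the archive directory for all files found in /first_directory/* and /second_directory/*, but there are also files that can only be found in /third_directory/* (see sample_file6 and sample_file7).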

What I want to print is only the files from /first_directory/, /first_directory/archive/, /second_directory/ and /third_directory/, in that order of priority, without duplicates, and also sorted by date.

Desired output:

/path/to/first_directory/sample_file1_2017Dec25.dat
/path/to/first_directory/sample_file2_2017Nov01.dat
/path/to/first_directory/sample_file3_2017Oct08.dat
/path/to/second_directory/sample_file4_2017Sep11.dat
/path/to/second_directory/sample_file5_2017Oct05.dat
/path/to/third_directory/sample_file6_2017July04.dat
/path/to/third_directory/sample_file6_2017June12.dat
/path/to/third_directory/sample_file7_2017May01.dat

Answer 1

If the output of the find command is saved in a file named filelist, then try:

$ awk -F/ '{f=$NF; sub(/\.Z$/,"",f)} !a[f]++' filelist
/path/to/first_directory/sample_file1_2017Dec25.dat
/path/to/first_directory/sample_file2_2017Nov01.dat
/path/to/first_directory/sample_file3_2017Oct08.dat
/path/to/second_directory/sample_file4_2017Sep11.dat
/path/to/second_directory/sample_file5_2017Oct05.dat
/path/to/third_directory/sample_file6_2017July04.dat
/path/to/third_directory/sample_file6_2017June12.dat
/path/to/third_directory/sample_file7_2017May01.dat

If you want to do the same thing without creating a file:

find /path/to/first_directory/* /path/to/second_directory/* /path/to/third_directory/* -mtime -1 -name "filename_pattern*" | awk -F/ '{f=$NF; sub(/\.Z$/,"",f)} !a[f]++'

Or, if you prefer to spread the command over multiple lines, use:

find /path/to/first_directory/* /path/to/second_directory/* \
  /path/to/third_directory/* -mtime -1 -name "filename_pattern*" |
    awk -F/ '{f=$NF; sub(/\.Z$/,"",f)} !a[f]++'

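We add \ to the end of the first line because that is bash's line-continuation character. Because the second line ends with |, no line-continuation character is needed there.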

How it works

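First, it is important to list the directories in the find command in order of priority, because only the first path seen for each filename is kept. I see that you have already done that.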

  1. -F/

    This tells awk to use / as the field separator. This means that the filename will be the last field, $NF.

  2. f=$NF; sub(/\.Z$/,"",f)

    This assigns the filename to the variable f and then removes a trailing .Z from f (if present).

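  3. !a[f]++

    If f has not been seen before, print this line. a[f] counts how many times f has been seen so far; ! makes the condition true only while that count is still zero, and ++ increments it afterwards. Since the condition has no action attached, awk applies its default action and prints the line.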
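
To make the three steps above concrete, here is a minimal sketch that writes the same logic out in long form and runs it on a shortened version of the sample data. The function name dedup_by_name and the array name seen are illustrative only and do not appear in the original one-liner.

```shell
#!/bin/sh
# dedup_by_name (hypothetical helper name): reads paths on stdin and prints
# only the first path seen for each filename, ignoring a trailing .Z.
dedup_by_name() {
    awk -F/ '{
        f = $NF                 # step 1: last /-separated field is the filename
        sub(/\.Z$/, "", f)      # step 2: strip a trailing .Z, if any
        if (!seen[f]++)         # step 3: true only the first time f appears
            print               # print the original, unmodified path
    }'
}

# Input is listed in priority order, as find would emit it.
printf '%s\n' \
    /path/to/first_directory/sample_file1_2017Dec25.dat \
    /path/to/first_directory/archive/sample_file1_2017Dec25.dat.Z \
    /path/to/second_directory/sample_file4_2017Sep11.dat \
    /path/to/third_directory/sample_file1_2017Dec25.dat \
    /path/to/third_directory/sample_file4_2017Sep11.dat \
    /path/to/third_directory/sample_file7_2017May01.dat |
    dedup_by_name
```

With this input, only the first_directory copy of sample_file1, the second_directory copy of sample_file4, and the third_directory copy of sample_file7 are printed. Note that the path printed is always the original line; f is used only as the lookup key.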

Update 1: Removing other extensions

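According to the comments, .Z is not the only extension that needs to be removed. There may be other extensions, such as .dat.edi or .dat.bak, that should simply be replaced with .dat. In that case: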

awk -F/ '{f=$NF; sub(/\.dat.*/,".dat",f)} !a[f]++' filelist
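
As a quick check of this variant, the sketch below feeds it a few made-up paths with different suffixes. The function name dedup_dat and the sample filenames are assumptions for illustration, not taken from the question.

```shell
#!/bin/sh
# dedup_dat (hypothetical helper name): like the command above, but reads
# from stdin. Everything from the first ".dat" onward is normalised to ".dat"
# before the duplicate check.
dedup_dat() {
    awk -F/ '{f=$NF; sub(/\.dat.*/,".dat",f)} !a[f]++'
}

printf '%s\n' \
    /path/to/first_directory/sample_file1.dat.edi \
    /path/to/second_directory/sample_file1.dat \
    /path/to/second_directory/sample_file2.dat.bak \
    /path/to/third_directory/sample_file2.dat.Z \
    /path/to/third_directory/sample_file3.dat |
    dedup_dat
```

This prints the .dat.edi, .dat.bak and sample_file3.dat paths and drops the other two as duplicates. One caveat: the pattern matches the first ".dat" anywhere in the filename, so a name that contains ".dat" earlier on (for example a hypothetical my.data_v2.dat) would be cut short at that point and might collide with other names.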

Update 2: Displaying the files sorted by timestamp:

awk -F/ '{f=$NF; sub(/\.dat.*/,".dat",f)} !a[f]++' filelist | xargs -d'\n' -r ls -t
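
The following is a small, self-contained sketch of that pipeline on a throwaway directory tree, assuming GNU xargs (the -d and -r options are GNU extensions) and a system that provides mktemp -d. The directory and file names are invented for the demo. Keep in mind that ls -t orders by each file's modification time, newest first, not by the date embedded in the filename.

```shell
#!/bin/sh
# Demo only: build a temporary tree with known modification times.
tmp=$(mktemp -d)
mkdir "$tmp/first" "$tmp/third"
touch -t 201710080000 "$tmp/first/sample_file3.dat"   # older file
touch -t 201712250000 "$tmp/third/sample_file1.dat"   # newer file
touch -t 201712250000 "$tmp/third/sample_file3.dat"   # duplicate name, dropped

# Directories are given to find in priority order; awk drops duplicate
# names; ls -t then sorts the surviving paths by modification time.
sorted=$(
    find "$tmp/first" "$tmp/third" -type f |
        awk -F/ '{f=$NF; sub(/\.dat.*/,".dat",f)} !a[f]++' |
        xargs -d'\n' -r ls -t
)
printf '%s\n' "$sorted"

rm -rf "$tmp"
```

Here third/sample_file1.dat is listed before first/sample_file3.dat because it is newer. For very long file lists, xargs may split the arguments across several ls invocations, in which case each batch is sorted on its own rather than as a whole.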
