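How to sort a list of file names (txt file) by substrings at multiple levels of the file name/path. Special challenge: two types of file name conventions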

I want to sort the following list of file names/paths.

L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE
L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE
L1_Data/level1/191027/LC08_L1TP_191027_20201221_20210310_01_T1 DONE
L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE

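Each line contains a file name (including its path) and its work status (QUEUED/DONE). Each file name carries information about the satellite image data, such as the satellite type, the recording date, the footprint, and so on.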

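Now I want to re-sort the list according to the following priorities:

  1. Work status --> QUEUED first. As a single step this is not a problem for me, but the solution for the following steps has to be combined with it (a more detailed description of my problem follows after the image below):
  2. Satellite type (S2A = Sentinel 2A; S2B = Sentinel 2B; LC08 = Landsat 8; LE07 = Landsat 7) --> S2A/B first (no matter whether A or B), then LC08, then LE07. In other words: I want to distinguish between Sentinel 2, Landsat 8 and Landsat 7, but not between Sentinel 2A and Sentinel 2B.
  3. Recording date, ascending
  4. Footprint, ascending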

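The following image shows the positions of the respective substrings; the description of my problem comes after it.

[Image: positions of the sort keys within the two kinds of file names]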

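Apart from having only very basic knowledge of the sort command, my specific problems are:

  • a) addressing the substrings correctly, in
  • b) two different file name types (conventions),
  • c) the underscore cannot simply be used as the delimiter, because there are five underscores in the Sentinel file names and six in the Landsat file names, and on top of that the order of the substrings differs between the two.
  • d) the order S2A/B, LC08, LE07 is unfortunately not alphabetical, and
  • e) addressing the S2A and S2B satellites as one group. This could of course be done by matching only S2; however, since that is just two characters, there is a certain risk of confusing it with other parts of the whole file name string (in reality the list is much longer and is updated from time to time, so it may contain "false" S2s in other or future lines).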

In the end, the re-sorted list should look like this:

L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE
L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE
L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE

Can anyone help me?

Answer 1

The problem is that the sort fields are not in the same columns in every line.
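To make that concrete, here is a small sketch that splits one Landsat and one Sentinel base name from the question on underscores and prints a few of the resulting fields. The helper name `show_fields` and the two variable names are only illustrative.

```shell
# Two base names taken from the question, one per naming convention.
landsat='LC08_L1TP_192027_20201126_20210316_01_T1'
sentinel='S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE'

# Print the underscore-separated fields 3, 4 and 6 of a name.
show_fields() {
  printf '%s\n' "$1" | awk -F_ '{ print "3=" $3, "4=" $4, "6=" $6 }'
}

show_fields "$landsat"    # 3=192027 4=20201126 6=01
show_fields "$sentinel"   # 3=20200101T100319 4=N0208 6=T32TQR
```

In the Landsat name the footprint is field 3 and the date is field 4; in the Sentinel name the date is the first eight characters of field 3 and the footprint is field 6. A single `sort -t_ -k...` key specification therefore cannot address both conventions.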

I'm using perl here for maximum flexibility: this is "custom_sort.pl"

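#!/usr/bin/env perl
use strict;
use warnings;

my @data;
while (<>) {
    # Landsat ("L") names: capture satellite, footprint, record date, status
    if (/.*\/(L...)_.*?_(\d+)_(\d+)\S+\s+(.*)/) {
        push @data, [$_, $4, $1, $3, $2];
    }
    # Sentinel ("S") names: capture satellite, record date, footprint, status
    elsif (/.*\/(S..)_.*?_(\d{8}).*?_.*?_.*?_(.*?)_\S+\s+(.*)/) {
        push @data, [$_, $4, $1, $2, $3];
    }
}

# each element of @data is [line, status, satellite, date, footprint]
sub mysort {
    -($a->[1] cmp $b->[1])              # work status, descending (QUEUED first)
    || cmp_satellite($a->[2], $b->[2])  # satellite
    || $a->[3] <=> $b->[3]              # record date
    || $a->[4] cmp $b->[4]              # footprint
}
sub cmp_satellite {
    my ($x, $y) = @_;
    # S2A and S2B are one group: two Sentinel records must tie here,
    # otherwise date and footprint are never compared for them
    return 0  if $x =~ /^S/ && $y =~ /^S/;
    return -1 if $x =~ /^S/;
    return +1 if $y =~ /^S/;
    return $x cmp $y;                   # LC08 before LE07
}

print $_->[0] for sort mysort @data;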

Run it:

perl custom_sort.pl file

Answer 2

Using awk, sort and cut:

awk -F'[/ ]' -v OFS='\t' '
{
  status=$NF # this is the last field

  split($(NF-1), parts, "_") # split filename into array `parts`

  if (parts[1]=="S2A" || parts[1]=="S2B") type=1
  else if (parts[1]=="LC08"){ type=2 }
  else if (parts[1]=="LE07"){ type=3 }
  else { print "error, got unknown type " parts[1]; exit 1 }

  date=(type==1 ? substr(parts[3], 1, 8) : parts[4])
  footprint=(type==1 ? parts[6] : parts[3])
  
  print status, type, date, footprint, $0
}
' file | sort -k1,1r -k2,2n -k3,3 -k4,4 | cut -f5-

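The idea is to extract the work status, satellite type, record date and footprint from each record and store them in four variables; the type is replaced by a number to define the custom order.

Those four variables are then printed (tab-separated, followed by the original record), the output is sorted as required, and finally the first four fields are removed with cut.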
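To illustrate the decoration step on its own, the following sketch runs a trimmed-down copy of the awk program on two records from the question and prints what sort would receive. The function name `decorate` and the variables `sample` and `decorated` are only illustrative, and for brevity the sketch assumes that any name that is neither Sentinel nor LC08 is LE07.

```shell
# Print the four sort keys (status, type, date, footprint) in front of each record.
decorate() {
  awk -F'[/ ]' -v OFS='\t' '
  {
    split($(NF-1), parts, "_")
    if (parts[1] == "S2A" || parts[1] == "S2B") type = 1
    else if (parts[1] == "LC08") type = 2
    else type = 3   # assumption for this demo: anything else is LE07
    date = (type == 1 ? substr(parts[3], 1, 8) : parts[4])
    footprint = (type == 1 ? parts[6] : parts[3])
    print $NF, type, date, footprint, $0
  }'
}

sample='L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED'

decorated=$(printf '%s\n' "$sample" | decorate)
printf '%s\n' "$decorated"
```

The first record is prefixed with the keys `DONE`, `3`, `20201127`, `191027` and the second with `QUEUED`, `1`, `20200101`, `T32TQR`. With the keys in fixed columns, `-k1,1r` puts QUEUED before DONE (reverse lexical order), `-k2,2n` orders the numeric type, and `-k3,3 -k4,4` break ties by date and footprint.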

Output:

L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE
L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
L1_Data/level1/191027/LC08_L1TP_191027_20201221_20210310_01_T1 DONE
L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE
L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE
