My CSV file has 10 columns, and some records are duplicates with respect to columns 1, 3, and 4. I need to identify these duplicate records and store them in a separate TXT file. The file is delimited by ","
For example:
14479305,I,101013000000,I,31/03/2019,TD01,16573.34,0,C,6/03/2015
14479305,I,101013000000,I,31/03/2019,TD01,16573.34,0,C,6/03/2015
14479305,I,101013000000,I,31/03/2019,TD01,6591.5,0,O,23/03/2015
14479305,I,101013000000,I,31/03/2019,TD01,305592.33,0,C,27/07/2017
13256433,I,101095000000,I,31/03/2019,TD21,102992.06,0,C,20/03/2017
13256492,I,101095000000,I,31/03/2019,TD21,102992.06,0,C,20/03/2017
11595044,I,101225000000,I,31/03/2019,TD01,70000,0,O,14/10/2016
13222989,I,101296000000,I,31/03/2019,TD01,53850,0,O,21/09/2018
13222989,I,101296000000,I,31/03/2019,TD01,534850,0,O,21/09/2019
Output file:
14479305,I,101013000000,I,31/03/2019,TD01,16573.34,0,C,6/03/2015
14479305,I,101013000000,I,31/03/2019,TD01,6591.5,0,O,23/03/2015
14479305,I,101013000000,I,31/03/2019,TD01,305592.33,0,C,27/07/2017
13222989,I,101296000000,I,31/03/2019,TD01,53850,0,O,21/09/2018
13222989,I,101296000000,I,31/03/2019,TD01,534850,0,O,21/09/2019
The AWK command I wrote, which didn't work, is:
awk -F, 'dups{a[$1,$3,$4]++; next} a[$1,$3,$4]>1' source.CSV > Dups.txt
Please advise.
Answer 1
You're almost there! The pattern dups in your command is an uninitialized variable, so it is always false and the counting block never runs. Beyond that fix, you need to process the input file twice: the first pass counts records grouped by your criteria (columns #1, #3, #4), and the second pass prints the lines whose count is >1 to the output file. The NR==FNR condition is true only while awk is reading the first copy of the file:
awk -F, 'NR==FNR{ dups[$1,$3,$4]++; next} dups[$1,$3,$4]>1' source.CSV source.CSV > Dups.txt
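If reading the file twice isn't practical (for example, when the data arrives on a pipe), a minimal single-pass sketch is to buffer every record in memory and filter in the END block; this trades memory for the second read. The names cnt, rec, and recKey are illustrative, not part of the original answer:

awk -F, '
    { key = $1 FS $3 FS $4 }            # composite key: columns 1, 3, and 4
    { cnt[key]++ }                      # count occurrences of each key
    { rec[NR] = $0; recKey[NR] = key }  # buffer every record and its key
    END {
        for (i = 1; i <= NR; i++)       # replay the records in input order,
            if (cnt[recKey[i]] > 1)     # keeping only those with duplicated keys
                print rec[i]
    }' source.CSV > Dups.txt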
Answer 2
Since your input is grouped by its key values, you can do this in a single pass of the input, keeping at most one line in memory at a time:
$ cat tst.awk
BEGIN { FS = "," }              # fields are comma-separated
{ key = $1 FS $3 FS $4 }        # composite key: columns 1, 3, and 4
key == prevKey {                # same key as the previous line:
    print prevRec $0            #   print the buffered line (if any) plus this one
    prevRec = ""                #   clear the buffer so it prints only once per group
    next
}
{                               # new key: buffer this line in case the key repeats
    prevRec = $0 ORS
    prevKey = key
}
$ awk -f tst.awk file
14479305,I,101013000000,I,31/03/2019,TD01,16573.34,0,C,6/03/2015
14479305,I,101013000000,I,31/03/2019,TD01,16573.34,0,C,6/03/2015
14479305,I,101013000000,I,31/03/2019,TD01,6591.5,0,O,23/03/2015
14479305,I,101013000000,I,31/03/2019,TD01,305592.33,0,C,27/07/2017
13222989,I,101296000000,I,31/03/2019,TD01,53850,0,O,21/09/2018
13222989,I,101296000000,I,31/03/2019,TD01,534850,0,O,21/09/2019
If the input isn't grouped like that, then do sort -t, -k1,1 -k3,4 file | awk -f tst.awk instead.
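Putting that together with the requirement to store the duplicates in a separate file, the complete pipeline for unsorted input would be something like:

sort -t, -k1,1 -k3,4 file | awk -f tst.awk > Dups.txt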
The above assumes you want to print every line that has a duplicated key, not just the lines after the first occurrence of each key.
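For contrast, a sketch of the other interpretation, which prints only the second and later occurrences within each group of sorted input (so the first line of each duplicate group is dropped):

awk -F, '{ key = $1 FS $3 FS $4 }   # same composite key as tst.awk
         key == prevKey             # bare pattern: prints the line when true
         { prevKey = key }' file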