I have two csv files in the following format:
column1,column2,user,column4,column5,column...column14
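I want to sort each file using user as the first key and then by column14, a number representing a timestamp, in descending order.
The result would look something like this:
File 1: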
A,AA,jim,XX,YY,....,1485771395
A,AA,jim,XX,YY,....,1485771395
etc.
Then I want to keep only the first line of each cluster from these files, i.e. only
A,AA,jim,XX,YY,....,1485771395
How can I do this?
Update:
Sample input:
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311
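The last column is the epoch time of the first column, so that the lines can be sorted numerically rather than as strings.
So the expected output would be:
a) Sorted by name, then by timestamp in reverse order, which gives: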
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311
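So we have a cluster of lines for jim.smith and one for tim.jones, each sorted in reverse order by the last column (i.e. the line with the latest date comes first in the cluster), and the line for john.doe, which has only 1 record.
Then I want to keep only the first line of each cluster, i.e.: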
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
Answer 1
Given the file input as
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311
then
sort -t, -k3,3 -k5,5rn input | awk -F, '!seen[$3]++'
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
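The pipeline above works in two stages: sort groups the lines by user (field 3) and, within each user, orders them by field 5 numerically in descending order; awk then prints a line only the first time its field 3 value is seen. A minimal self-contained sketch of that, assuming the sample is saved under the hypothetical name sample.csv and that no field contains an embedded comma (splitting on , ignores CSV quoting):

```shell
# Hypothetical file name; the contents are the sample input from the question.
cat > sample.csv <<'EOF'
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311
EOF

# Stage 1: group by user (field 3), newest timestamp (field 5) first per user.
# Stage 2: keep only the first line seen for each user.
result=$(sort -t, -k3,3 -k5,5rn sample.csv | awk -F, '!seen[$3]++')
printf '%s\n' "$result"
```

For the 14-column files described at the top of the question, the timestamp key would presumably be -k14,14rn instead of -k5,5rn, with the user still in field 3.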
Answer 2
datamash -t, -s -f -g 3 max 5 <infile
Note, however, that you need to strip all trailing whitespace from every line of the input before running it.
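One possible way to do that stripping is with sed. This is only a sketch: the file name infile.trimmed and the demo lines are assumptions made for illustration, and it relies on your sed supporting POSIX character classes.

```shell
# Demo input with trailing spaces and a trailing tab (hypothetical data
# in the question's format).
printf '%s  \n' '"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295' > infile
printf '%s\t\n' '"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571' >> infile

# Remove any run of whitespace at the end of each line.
sed 's/[[:space:]]*$//' infile > infile.trimmed

# The trimmed file can then be used as the input of the datamash command:
#   datamash -t, -s -f -g 3 max 5 < infile.trimmed
cat infile.trimmed
```

Because [[:space:]] also matches a carriage return, the same command cleans up Windows-style line endings.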
awk only
Using two arrays:
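awk -F, '!z[$3]{x[$3]=$0;z[$3]=$5;next}$5>z[$3]{x[$3]=$0;z[$3]=$5}
END{for (i in z){print x[i]}}' infile
Here z holds the largest timestamp seen so far for each user and x holds the corresponding line; z must be updated together with x, otherwise a later line that is newer than the first but older than the current maximum would overwrite the result. The order in which for (i in z) visits the users is unspecified, so pipe the output through sort if ordering matters.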
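Since the question mentions two files, the awk-only approach can be applied to each of them in a loop. The following is a sketch under assumptions: the file names file1.csv and file2.csv, the .latest output suffix, and the array names max and line are all hypothetical, and the demo data reuses lines from the question's sample.

```shell
# Hypothetical demo files in the question's 5-column sample format.
cat > file1.csv <<'EOF'
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
EOF
cat > file2.csv <<'EOF'
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311
EOF

for f in file1.csv file2.csv; do
    # Per user (field 3), remember the largest timestamp (field 5) and its line;
    # the final sort only makes the output order deterministic.
    awk -F, '!($3 in max) || $5+0 > max[$3] { max[$3] = $5+0; line[$3] = $0 }
             END { for (u in max) print line[u] }' "$f" |
        sort -t, -k3,3 > "${f%.csv}.latest"
done

cat file1.latest file2.latest
```

This reads each file once and does not need the input to be sorted first; the trade-off is that one line per user is held in memory until the end of the file.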