How do I sort by 2 columns and keep the top row of each resulting group?

I have two CSV files in the following format:

column1,column2,user,column4,column5,column...column14  

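I want to sort each file with user as the first key and then by column14 in descending order; column14 is a number representing a timestamp.

The result would look like this:

File 1: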

A,AA,jim,XX,YY,....,1485771395    
A,AA,jim,XX,YY,....,1485771395    

etc.

Then I want to keep only the first row of each cluster from these files, i.e. only

A,AA,jim,XX,YY,....,1485771395  

How can I do this?

Update:
Sample input:

"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295      
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571  
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321  
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070  
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976  
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290      
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311  

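The last column is the epoch time of the first column, so that the rows can be sorted numerically rather than as strings.
So the expected output would be:
a) Sorted by name, then by timestamp in reverse order, which gives: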

"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295    
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290        
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321  
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571  
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976  
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070  
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311  

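So we have one cluster of rows for jim.smith and one for tim.jones, each sorted by the last column in reverse order (i.e. the row with the latest date comes first in its cluster), plus john.doe, who has only 1 record.

Then I want to keep only the first row of each cluster, i.e.: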

"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295    
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571    
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976  

Answer 1

Given the file input:

"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295      
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571  
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321  
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070  
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976  
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290      
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311  

then:

sort -t, -k3,3 -k5,5rn input | awk -F, '!seen[$3]++'
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295      
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571  
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976  

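The pipeline above can be read in two stages: sort groups the rows by user and puts the newest row first within each group, then awk prints only the first row it sees for each user. The following is a minimal sketch of that reading, assuming a POSIX sort and awk; the file name input.csv and the function name latest_per_user are made up for illustration. It also relies on no quoted field containing a comma, which holds for the sample data but is an assumption in general.

```shell
#!/bin/sh
# Hypothetical sample file holding the rows from the question.
cat > input.csv <<'EOF'
"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
"1/30/2017 11:14:50 AM",Valid customer,jim.smith,NY,1485771290
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311
EOF

# latest_per_user FILE: print the newest row for each user (field 3).
latest_per_user() {
    # -t,      fields are separated by commas
    # -k3,3    primary key: field 3 only (the user)
    # -k5,5rn  secondary key: field 5, numeric, reversed (newest first)
    sort -t, -k3,3 -k5,5rn "$1" |
        # seen[$3]++ evaluates to 0 the first time a user appears, so
        # !seen[$3]++ is true only for the first row of each group.
        awk -F, '!seen[$3]++'
}

latest_per_user input.csv
```

For the 14-column files described at the top of the question, the timestamp key would presumably become -k14,14rn, with the user still in field 3.
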
Answer 2

With GNU datamash:

datamash -t, -s -f -g 3 max 5 <infile

Note, however, that you need to strip all trailing whitespace from every line of the input before running it.
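
One way to do that cleanup is with sed. This is only a sketch: the helper name strip_trailing and the file names padded.csv and clean.csv are hypothetical, and the datamash call is guarded so the script still runs where datamash is not installed.

```shell
#!/bin/sh
# strip_trailing FILE: remove trailing whitespace from every line.
strip_trailing() {
    sed 's/[[:space:]]*$//' "$1"
}

# Hypothetical sample with trailing blanks, like the input in the question.
printf '%s\n' \
    '"1/30/2017 11:14:55 AM",Valid customer,jim.smith,NY,1485771295      ' \
    '"1/26/2017 8:02:01 PM",Valid customer,jim.smith,NY,1485457321  ' \
    > padded.csv

strip_trailing padded.csv > clean.csv

# Run the same datamash invocation on the cleaned file, if available.
if command -v datamash >/dev/null 2>&1; then
    datamash -t, -s -f -g 3 max 5 < clean.csv
fi
```
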
With awk, using two arrays:

awk -F, '!z[$3]{x[$3]=$0;z[$3]=$5;next}$5>z[$3]{x[$3]=$0;z[$3]=$5}
END{for (i in z){print x[i]}}' infile
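
Because awk does not guarantee any particular order for a for (i in array) loop, the rows may come out in any user order. If the order matters, a variant of the same single-pass idea can pipe its result through sort. The sketch below assumes that; the function name latest_one_pass and the array names best and row are invented for illustration, and the sample rows are a subset of those in the question.

```shell
#!/bin/sh
# Hypothetical sample: a subset of the rows from the question.
cat > infile <<'EOF'
"1/30/2017 4:14:30 AM",New customer,tim.jones,CO,1485746070
"1/26/2017 5:06:11 AM",New customer,john.doe,CA,1485403571
"1/30/2017 11:09:36 AM",New customer,tim.jones,CO,1485770976
"1/22/2017 11:51:51 AM",New customer,tim.jones,CO,1485082311
EOF

# latest_one_pass FILE: newest row per user without pre-sorting the input.
latest_one_pass() {
    awk -F, '
        # best[u] is the largest timestamp seen so far for user u,
        # row[u] is the line that carried it. $5 + 0 forces a numeric
        # comparison even if the field has trailing blanks.
        !($3 in best) || $5 + 0 > best[$3] { best[$3] = $5 + 0; row[$3] = $0 }
        END { for (u in row) print row[u] }
    ' "$1" | sort -t, -k3,3   # make the output order deterministic
}

latest_one_pass infile
```

This reads the file once and sorts only one row per user, which may be preferable when the input is large and the number of users is small.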
