I am working with a CSV dataset that looks like this:
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
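It contains many duplicates that I need to remove, keeping one instance of each record. Based on "Remove duplicate entries from a CSV file", I have been using
sort -u file.csv --o deduped-file.csv
which works well for exact duplicates such as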
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
but it does not catch cases like
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
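where the data is incomplete but the rows represent the same thing.
Is it possible to remove duplicates based on specified fields, e.g. year, manufacturer, brand, series, variation?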
Answer 1
I would build a "key" from the first 5 fields and print a line only the first time that key is seen:
awk -F, '
{key = $1 FS $2 FS $3 FS $4 FS $5}
!seen[key]++
' file
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
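The `!seen[key]++` idiom is compact, so it may help to see the same logic written out long-hand. The following is only a sketch under assumptions: the temporary sample file and the variable names `tmpdir`, `sample`, and `out` are made up for illustration and are not part of the original answer.

```shell
# Hypothetical sample data, a small subset of the rows in the question.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.csv"
cat > "$sample" <<'EOF'
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
EOF

out=$(awk -F, '
{
    # Build the key from the first five fields.
    key = $1 FS $2 FS $3 FS $4 FS $5
    # Print the line only if this key has not been seen before.
    if (!(key in seen)) {
        seen[key] = 1
        print
    }
}' "$sample")

printf '%s\n' "$out"
```

With this sample, the header, the complete John Amoth row, and the Duke Johnson row are printed. Because the first occurrence wins, the incomplete row would be the one kept if it happened to come first in the file.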
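Answer 2
Using any awk+sort+cut with a decorate-sort-undecorate approach: first decorate each line with awk so that the header line stays at the top during the sort, then sort on the decorated key and cut the decoration off again: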
$ awk -v OFS=',' '{print (NR>1), $0}' file | sort -u -t, -k1,6 | cut -d, -f2-
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
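The decorate step is easier to follow when its intermediate output is visible. A minimal sketch, assuming a hypothetical sample file; the names `tmpdir`, `sample`, `decorated`, and `deduped` are illustrative only, and `LC_ALL=C` is added here purely so the sort order does not depend on the locale:

```shell
# Hypothetical sample data, a small subset of the rows in the question.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.csv"
cat > "$sample" <<'EOF'
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
EOF

# Decorate: prefix the header with 0 and every data row with 1.
decorated=$(awk -v OFS=',' '{print (NR>1), $0}' "$sample")
printf '%s\n' "$decorated"

# Sort on the decoration plus the first five real fields (fields 1-6),
# keep one line per key, then cut the decoration off again.
deduped=$(printf '%s\n' "$decorated" | LC_ALL=C sort -u -t, -k1,6 | cut -d, -f2-)
printf '%s\n' "$deduped"
```

Since the header is the only line decorated with 0, it sorts ahead of every data row, and the sort key has to span fields 1 to 6 because the decoration shifts each original field one position to the right.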
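Answer 3
Non-awk solution: you can simply use sort, defining fields 1 through 5 as the comparison key with -k, using , as the field separator (-t), and -u to select only unique entries: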
sort -t, -k1,5 -u infile
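However, this puts the header line last. Either use -r to sort in reverse order (which also reverses the order of the data rows), or split the header line off first, e.g.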
{ sed 1q infile ; sed 1d infile | sort -k1,5 -t, -u ; }
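The difference between the two commands can be checked on a small input. This is a sketch under assumptions: the temporary file and the variable names `tmpdir`, `infile`, `plain`, and `split` are hypothetical, and `LC_ALL=C` is added only to make the ordering reproducible across locales.

```shell
# Hypothetical sample data: the header plus two rows sharing the same first five fields.
tmpdir=$(mktemp -d)
infile="$tmpdir/infile"
cat > "$infile" <<'EOF'
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
EOF

# Plain sort: the header ends up after the data rows, since "y" sorts after "2".
plain=$(LC_ALL=C sort -t, -k1,5 -u "$infile")

# Header printed first, then only the remaining lines are sorted and deduplicated.
split=$( { sed 1q "$infile"; sed 1d "$infile" | LC_ALL=C sort -t, -k1,5 -u; } )

printf 'plain sort:\n%s\n\n' "$plain"
printf 'header split off:\n%s\n' "$split"
```

Both variants leave two lines here; only the position of the header differs.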