Find and replace duplicates

Answer 1

Here is a sed solution that works for your exact input format and should also run reasonably fast.

sed -rz 's:[ \t]+:,:g;s:$:,:mg;:l;s:,([^,]+),(.*),\1,:,\1,\2,:;tl;s:,$::mg;s:^([^,]+),:\1\t:mg' file.csv

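How it works:

The `-z` flag makes sed use NUL instead of newline as the record separator, so a file without NUL bytes is loaded as a single record. The script below is therefore applied once to the whole file, rather than once per line as it would be by default.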
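
To make the effect of `-z` concrete, here is a minimal sketch that counts how many records sed sees with and without the flag. It assumes GNU sed (the `-z` option is a GNU extension); the three-line sample input is made up for illustration.

```shell
# Hypothetical three-line input, used only for illustration.
sample='a
b
c'

# '$=' prints the number of the last record sed read.
lines=$(printf '%s\n' "$sample" | sed -n '$=')       # newline-separated records
records=$(printf '%s\n' "$sample" | sed -z -n '$=')  # NUL-separated records (GNU sed)

echo "default: $lines record(s), with -z: $records record(s)"
```

With `-z` the whole input counts as one record, which is why the `M` flag is needed on the `s` commands that anchor on `^` and `$` for each line.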

#transform input format to actual CSV format
s:[ \t]+:,:g;s:$:,:mg;
#loop while the s command can still find and replace
:l;
    #main code: find two identical cell values anywhere and delete the latter
    #on a very big file this can suffer from backtracking nightmare
    s:,([^,]+),(.*),\1,:,\1,\2,:;
tl;
#transform format back
s:,$::mg;s:^([^,]+),:\1\t:mg
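
The loop in the middle can be tried in isolation. The sketch below runs only the `:l` ... `tl` part on a single hypothetical line that is already in the intermediate format (every cell surrounded by commas); the cell values `a`, `b`, `c` are placeholders. The label and branch are passed as separate `-e` expressions, which is an assumption made for portability and not how the original one-liner is written.

```shell
# Hypothetical line in the intermediate ",cell,cell,...," format.
line=',a,b,a,c,b,'

# Each pass removes one later copy of a cell; 't' jumps back to label 'l'
# for as long as the substitution keeps succeeding.
result=$(printf '%s\n' "$line" |
    sed -E -e ':l' -e 's:,([^,]+),(.*),\1,:,\1,\2,:' -e 'tl')

echo "$result"
```

Here the first pass drops the second `a`, the second pass drops the second `b`, and the third pass finds nothing to replace, so the loop ends.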

Answer 2

If your file is a real CSV file (simple CSV) like the one below, you can use the following awk command:

Input:

[email protected]
[email protected]
[email protected],[email protected],[email protected]

Command:

awk -F, '{ sep = ""
           for (i = 1; i <= NF; i++)
               # print each value only the first time it is seen anywhere in the file
               if (!seen[$i]++) { printf "%s%s", sep, $i; sep = "," }
           print ""
}' infile.csv

Output:

[email protected]
[email protected]
[email protected],[email protected]
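
The deduplication rests on the `!seen[x]++` idiom: `seen[x]++` evaluates to 0 the first time `x` occurs, so the negation is true exactly once per value, and because the array is never reset the check spans all lines of the file. Below is a minimal sketch of that idiom on its own. The `example.com` addresses and the variable names are hypothetical placeholders, not the data from the question.

```shell
# Hypothetical sample data; the addresses are placeholders.
input='a@example.com
b@example.com
a@example.com,c@example.com,b@example.com'

result=$(printf '%s\n' "$input" | awk -F, '{
    sep = ""
    for (i = 1; i <= NF; i++)
        # true only the first time this value appears in the whole input
        if (!seen[$i]++) { printf "%s%s", sep, $i; sep = "," }
    print ""
}')

printf '%s\n' "$result"
```

On the third line only `c@example.com` survives, since the other two addresses were already printed on earlier lines. The separator is set only after a value has actually been printed, so a line whose first cell is a duplicate does not start with a stray comma.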

If not, and the input looks like the one given in your question, you can use the following instead:

awk 'NR==1; NR>1 { id = $1; $1 = ""; sep = ""
    n = split($0, ar, /[, \t]+/)    # numeric loop below keeps the original cell order
    printf "%s\t", id
    for (i = 1; i <= n; i++)
        if (ar[i] != "" && !seen[ar[i]]++) { printf "%s%s", sep, ar[i]; sep = "," }
    print "" }' infile

Output:

id  emails
1       [email protected]
2       [email protected]
3       [email protected],[email protected]
