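I am trying to find duplicates in a file and, once a match is found, mark the first matching entry with a character or word at the end of the line.
For example, my file (test.html) contains the following entries: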
host= alpha-sfserver1
host= alphacrest3
host= alphacrest4
host= alphactn1
host= alphactn2
host= alphactn3
host= alphactn4
down alphacrest4
I can find the duplicates with the following command (I use $2 because the duplicate is always in column 2):
awk '{if (++dup[$2] == 1) print $0;}' test.html
It removes the last entry (down alphacrest4), but what I want is to mark the duplicated entry with a word or character, for example:
host= alphacrest4 acked
Any help is very welcome.
Answer 1
You need to process the file twice. In the first run, you write the duplicated values to a file:
awk '{if (++dup[$2] == 2) print $2;}' test.html > dupes.txt
The second run compares every line against the contents of that file:
awk 'BEGIN { while ((getline var < "dupes.txt") > 0) { dup2[var]=1; }};
{ num=++dup[$2]
if (num == 1) { if (1 == dup2[$2]) print $0 " acked"; else print $0;} }' \
test.html
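The two runs can be combined into one small script. This is only a sketch: the function name mark_two_runs is made up for illustration, the list of duplicates goes to a temporary file instead of dupes.txt, and it assumes the first run should collect only values that occur at least twice.

```shell
#!/bin/sh
# mark_two_runs is a hypothetical helper; $1 is the file to process.
mark_two_runs() {
    input=$1
    dupes=$(mktemp) || return 1

    # Run 1: write a column-2 value once, at the moment it is seen a second time.
    awk '++seen[$2] == 2 { print $2 }' "$input" > "$dupes"

    # Run 2: load the list, then print only the first line for each value,
    # appending " acked" when that value is in the list.
    awk -v list="$dupes" '
        BEGIN { while ((getline v < list) > 0) dup2[v] = 1 }
        ++seen[$2] == 1 { print (($2 in dup2) ? $0 " acked" : $0) }
    ' "$input"

    rm -f "$dupes"
}

# Shortened sample in the format used in the question.
sample=$(mktemp)
printf '%s\n' 'host= alphacrest3' 'host= alphacrest4' 'down alphacrest4' > "$sample"
mark_two_runs "$sample"
rm -f "$sample"
# Prints:
#   host= alphacrest3
#   host= alphacrest4 acked
```

As in the question's own command, later occurrences of a value (here the down line) are not printed.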
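Answer 2
This would be much easier if we had the entire file. Are you only interested in lines starting with host=,
or in any second field? For a general solution, try the following: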
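perl -e '@file=<>;
# Count how often each value after the first field occurs
foreach(map{/.+?\s+(.+)/;}@file){$dup{$_}++};
foreach(@file){
    chomp;
    /.+?\s+(.+)/;
    # Mark only the first line whose value occurs more than once
    if($dup{$1}>1 && !defined($p{$1})){
        print "$_ acked\n";
        $p{$1}++;}
    else{print "$_\n"}
}' test.html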
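The script above first reads the entire file and counts each value, then prints every line, appending "acked" to the first line of each duplicated value.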
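For comparison, the same read-everything-first idea can be sketched in awk, which also works on piped input because the data is read only once. The function name mark_first_dupe is made up for illustration, and this sketch keys on column 2 rather than on everything after the first field as the Perl version does.

```shell
#!/bin/sh
# mark_first_dupe is a hypothetical helper; it reads standard input once.
mark_first_dupe() {
    awk '
        # Buffer every line and count each column-2 value.
        { line[NR] = $0; key[NR] = $2; count[$2]++ }
        END {
            for (i = 1; i <= NR; i++) {
                # Append the marker to the first line of a repeated value only.
                if (count[key[i]] > 1 && !(key[i] in marked)) {
                    marked[key[i]] = 1
                    print line[i] " acked"
                } else {
                    print line[i]
                }
            }
        }
    '
}

printf '%s\n' 'host= alphacrest3' 'host= alphacrest4' 'down alphacrest4' | mark_first_dupe
# Prints:
#   host= alphacrest3
#   host= alphacrest4 acked
#   down alphacrest4
```

The whole input is held in memory, so this suits files of modest size.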
If we can assume that you are only interested in the names that appear on lines of the form down X, the whole thing becomes much simpler:
grep '^down' test.html | awk '{print $2}' |
perl -e 'while(<>){chomp; $dup{$_}++} open(A,"test.html");
while(<A>){
if(/host=\s+(.+)/ && defined($dup{$1})){
chomp; print "$_ acked\n"}
else{print}}'
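The same restriction to down lines can also be written in awk alone by reading the file twice. A minimal sketch, assuming the two line formats shown in the question (host= NAME and down NAME); the function name mark_down_hosts is made up for illustration.

```shell
#!/bin/sh
# mark_down_hosts is a hypothetical helper; $1 is the file to process.
mark_down_hosts() {
    awk '
        # Pass 1: remember every name reported on a "down" line.
        NR == FNR { if ($1 == "down") down[$2] = 1; next }
        # Pass 2: mark host= lines whose name was reported down.
        $1 == "host=" && ($2 in down) { print $0 " acked"; next }
        # Everything else is printed unchanged.
        { print }
    ' "$1" "$1"
}
```

Called as mark_down_hosts test.html, it prints every line of the file and appends " acked" to each host= line whose name also appears on a down line.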
Answer 3
This may help:
One-liner:
awk 'NR==FNR{b[$2]++; next} $2 in b { if (b[$2]>1) { print $0" acked" ; delete b[$2]} else print $0}' inputFile inputFile
Explanation:
awk '
NR==FNR {
## Loop through the file and check which line is repeated based on column 2
b[$2]++
## Skip the rest of the actions until complete file is scanned
next
}
## Once the scan is complete, look for second column in the array
$2 in b {
## If the count of the column is greater than 1 it means there is duplicate.
if (b[$2]>1) {
## So print that line with "acked" marker
print $0" acked"
## and delete the array element so that later occurrences are skipped
delete b[$2]
}
## If count is 1 it means there was no duplicate so print the line
else
print $0
}' inputFile inputFile
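As a quick check, the one-liner can be run on the sample data from the question. This is a sketch only: the function name mark_dupes and the temporary file are made up for illustration.

```shell
#!/bin/sh
# mark_dupes wraps the one-liner; the function name is hypothetical.
mark_dupes() {
    awk 'NR==FNR{b[$2]++; next} $2 in b { if (b[$2]>1) { print $0" acked" ; delete b[$2]} else print $0}' "$1" "$1"
}

# Sample data from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
host= alpha-sfserver1
host= alphacrest3
host= alphacrest4
host= alphactn1
host= alphactn2
host= alphactn3
host= alphactn4
down alphacrest4
EOF

mark_dupes "$sample"
rm -f "$sample"
# Prints:
#   host= alpha-sfserver1
#   host= alphacrest3
#   host= alphacrest4 acked
#   host= alphactn1
#   host= alphactn2
#   host= alphactn3
#   host= alphactn4
```

The down alphacrest4 line does not appear in the output because its key has already been deleted from the array by the time that line is reached.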