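I have a tab-delimited file like the one below and would like to merge rows whenever a value in any column matches. There are usually 2 columns, but in some cases there may be 3.
Input: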
AMAZON NILE
ALASKA NILE
HELLO MY
MANGROVE AMAZON
MY NAME
IS NAME
Desired output:
AMAZON NILE ALASKA MANGROVE
HELLO MY NAME IS
How can I solve this with awk?
Would it also work for the following file? Input:
apple_bin2file strawberry_24files
mango2files strawberry_39files
apple_bin8file strawberry_39files
dastool_bin6files strawberry_40files
apple_bin6file strawberry_40files
orange_bin004file dastool_bin004files
orange_bin005file dastool_bin005files
apple_bin3file dastool_bin3files
apple_bin5file dastool_bin5files
apple_bin6file dastool_bin6files
apple_bin7file dastool_bin7files
apple_bin8file mango2files
Expected output, tab-delimited:
apple_bin2file strawberry_24files
mango2files strawberry_39files apple_bin8file
dastool_bin6files strawberry_40files apple_bin6file
orange_bin004file dastool_bin004files
orange_bin005file dastool_bin005files
apple_bin3file dastool_bin3files
apple_bin5file dastool_bin5files
apple_bin7file dastool_bin7files
Apologies to those who have already answered: I have updated the input file!
Answer 1
Using GNU awk:
gawk '
{
    grp = 0
    # see if any of these words already have a group
    for (i=1; i<=NF; i++) {
        if (group[$i]) {
            grp = group[$i]
            break
        }
    }
    # no words have been seen before: new group
    if (!grp) {
        grp = ++n
    }
    # if we have not seen this word, add it to the output
    for (i=1; i<=NF; i++) {
        if (!group[$i]) {
            line[grp] = line[grp] $i OFS
        }
        group[$i] = grp
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (n in line) {
        print line[n]
    }
}
' input.file
With the first input:
AMAZON NILE ALASKA MANGROVE
HELLO MY NAME IS
With the second input (piping the output through column -t):
apple_bin2file strawberry_24files
mango2files strawberry_39files apple_bin8file
dastool_bin6files strawberry_40files apple_bin6file
orange_bin004file dastool_bin004files
orange_bin005file dastool_bin005files
apple_bin3file dastool_bin3files
apple_bin5file dastool_bin5files
apple_bin7file dastool_bin7files
Answer 2
For the examples you gave, try
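awk '
{
    for (j = 1; j <= MX; j++) {
        # stop at the first field that is found in output line j
        for (i = 1; i <= NF && !(m = match(LN[j], $i)); i++)
            ;
        if (m) {
            $i = ""              # already present: drop it from this record
            break
        }
    }
    # j is the matching output line, or MX+1 when nothing matched
    LN[j] = LN[j] $0 " "
    if (j > MX) MX = j
}
END {
    for (l in LN) print LN[l]
}
' file3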
AMAZON NILE ALASKA MANGROVE
HELLO MY NAME IS
Edit: with the new data, this should work:
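awk '
{
    # look for an output line that already contains one of the fields
    for (j = 1; j <= MX; j++) {
        m = 0
        for (i = 1; i <= NF; i++) {
            if (match(LN[j], $i)) {
                $i = ""          # already present: drop it from this record
                m = 1
            }
        }
        if (m) break
    }
    # j is the matching output line, or MX+1 when nothing matched
    LN[j] = LN[j] $0 OFS
    if (j > MX) MX = j
}
END {
    for (l in LN) {
        # collapse the gaps left by the blanked fields
        gsub(/ +/, OFS, LN[l])
        gsub(OFS "+", OFS, LN[l])
        print LN[l]
    }
}
' OFS="\t" file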
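One caveat with the scripts above: match() treats the field as a regular expression and finds it anywhere in the accumulated line, so a field that is merely a substring of a stored word also counts as a match. That does not seem to affect the sample data, but it could matter for other inputs. A small sketch of the difference, using a hypothetical helper has_word that prints the result of the substring test followed by a whole-word test done with index() on space-padded strings:

```shell
# Hypothetical helper: $1 = accumulated line, $2 = candidate word.
# Prints two flags: substring (regex) hit, then whole-word hit.
has_word() {
    awk -v line="$1" -v word="$2" 'BEGIN {
        part  = (match(line, word) > 0)                  # regex match anywhere
        whole = (index(" " line " ", " " word " ") > 0)  # literal, whole word only
        print part, whole
    }'
}
```

For example, has_word "AMAZON NILE" "NIL" prints 1 0, while has_word "AMAZON NILE" "NILE" prints 1 1. If partial matches are a concern, the index() form (with the padding adapted to the separator in use) could replace the match() test.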