I have a file like the following:
a3 v2c
v5 a7
a9 v2c
v1c a3 a7c
Desired output (no duplicates within a line):
a3 a7c a9 v1c v2c
a7 v5
What I want is to merge the lines that share at least one element. On line 2 both elements are unique, so that line is output as is (in sorted order). Line 1 shares "v2c" with line 3 and "a3" with line 4, so those 3 lines are combined and sorted. Shared elements can appear in different columns.
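In graph terms, the rule asks for the connected components of the values: every line links all of its values together, and each output line is one component printed in sorted order. Below is a minimal union-find sketch of that rule, shown only as an illustration; merge.awk and find() are made-up names, and GNU awk is assumed for PROCINFO["sorted_in"]:
$ cat merge.awk
function find(x) {                            # walk parent links up to the set root
    while (parent[x] != x)
        x = parent[x] = parent[parent[x]]     # path compression
    return x
}
{
    for (i = 1; i <= NF; i++)
        if (!($i in parent)) parent[$i] = $i  # a new value starts as its own root
    for (i = 2; i <= NF; i++)
        parent[find($i)] = find($1)           # union every field with field 1
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"    # visit values in sorted order
    for (v in parent) {
        r = find(v)
        groups[r] = (r in groups) ? groups[r] OFS v : v
    }
    for (r in groups)                         # one output line per merged group
        print groups[r]
}
$ awk -f merge.awk file
a3 a7c a9 v1c v2c
a7 v5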
For a large file (200,000 lines) my code is very slow:
Lines=$(awk 'END {print NR}' $1)    # total number of lines in the input file
bank=$1
while [ $Lines -ge 1 ]
do
    echo "Processing line $Lines"
    # split the current line into one value per line, sorted and de-duplicated
    awk -v line=$Lines 'NR == line' $bank | awk NF=NF RS= OFS="\n" | sort | uniq > Query.$Lines
    k=0
    while [[ $k != 1 ]]
    do
        if [[ $k != "" ]]
        then
            # grep for the current value set and collect every value found on the
            # matching lines, until the set stops growing
            grep -f Query.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query1.$Lines
            grep -f Query1.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query2.$Lines
            grep -f Query2.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query3.$Lines
            k=$(diff Query2.$Lines Query3.$Lines)
            if [[ $k != "" ]]
            then mv Query3.$Lines Query.$Lines
            fi
        else
            # the set converged: emit it as one cluster and drop its lines from the bank
            awk NF=NF RS= OFS=" " Query3.$Lines >> $1.output.clusters
            grep -v -f Query3.$Lines $bank > NotFound.$Lines
            bank=NotFound.$Lines
            k=1
        fi
    done
    rm Query*
    Lines=$(( $Lines - 1 ))
done
exit
# clean-up of temporary files (not reached because of the exit above)
find . -maxdepth 1 -type f -size 0 -delete
rm NotFound.* Query.* Query1.* Query2.* Query3.*
I believe there must be a simpler and more efficient solution using bash or awk. Thanks in advance!
Answer 1
Using GNU awk for arrays of arrays and sorted_in:
$ cat tst.awk
{
    # record, in both directions, every pair of values that appears on the same line
    for ( fldNrA=1; fldNrA<NF; fldNrA++ ) {
        fldValA = $fldNrA
        for ( fldNrB=fldNrA+1; fldNrB<=NF; fldNrB++ ) {
            fldValB = $fldNrB
            val_pairs[fldValA][fldValB]
            val_pairs[fldValB][fldValA]
        }
    }
}

function descend(fldValA,   fldValB) {   # fldValB is a local variable
    if ( !seen[fldValA]++ ) {
        all_vals[fldValA]
        # visit every value that ever shared a line with fldValA
        for ( fldValB in val_pairs[fldValA] ) {
            descend(fldValB)
        }
    }
}

END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for ( fldValA in val_pairs ) {
        delete all_vals
        descend(fldValA)
        # seen[] persists across iterations, so all_vals is only populated the
        # first time a value of each group is reached - one output line per group
        if ( fldValA in all_vals ) {
            sep = ""
            for ( fldValB in all_vals ) {
                printf "%s%s", sep, fldValB
                sep = OFS
            }
            print ""
        }
    }
}
$ awk -f tst.awk file
a3 a7c a9 v1c v2c
a7 v5
Original answer:
Here's a start, using GNU awk for arrays of arrays:
$ cat tst.awk
{
    # map each value to the rows it appears on, and each row to its values
    for ( fldNr=1; fldNr<=NF; fldNr++ ) {
        fldVal = $fldNr
        fldVals_rowNrs[fldVal][NR]
        rowNrs_fldVals[NR][fldVal]
    }
}
END {
    # start by assuming no row overlaps any other row
    for ( rowNr=1; rowNr<=NR; rowNr++ ) {
        noOverlap[rowNr]
    }
    for ( rowNrA in rowNrs_fldVals ) {
        for ( fldVal in rowNrs_fldVals[rowNrA] ) {
            for ( rowNrB in fldVals_rowNrs[fldVal] ) {
                if ( rowNrB > rowNrA ) {
                    # rows A and B share the value fldVal
                    overlap[rowNrA][rowNrB]
                    delete noOverlap[rowNrA]
                    delete noOverlap[rowNrB]
                }
            }
        }
    }
    for ( rowNrA in overlap ) {
        for ( rowNrB in overlap[rowNrA] ) {
            print "Values overlap between lines:", rowNrA, rowNrB
        }
    }
    for ( rowNr in noOverlap ) {
        print "All unique values in line:", rowNr
    }
}
$ awk -f tst.awk file
Values overlap between lines: 1 3
Values overlap between lines: 1 4
All unique values in line: 2
From there I expect you'd need to implement a (recursive descent?) function (which I'm not going to do), called at the point where the line print "Values overlap between lines:", rowNrA, rowNrB currently is, to find all of the values in common between all of the lines that have overlapping values, and then use PROCINFO["sorted_in"] to print them in a specific order. The descend() script at the top of this answer now does exactly that.
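If PROCINFO["sorted_in"] is unfamiliar, here is a tiny self-contained gawk illustration (the values are made up) of how it makes a for-in loop visit array indices in ascending string order:
$ awk 'BEGIN {
    split("v5 a7 a3", tmp)                  # toy values, in arbitrary order
    for (i in tmp) vals[tmp[i]]             # use those values as array indices
    PROCINFO["sorted_in"] = "@ind_str_asc"  # from here on, for-in visits indices sorted
    for (v in vals) printf "%s ", v
    print ""
}'
a3 a7 v5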
Since you asked about recursive functions in a comment, here are some examples of recursive awk functions written for other purposes (they all happen to be named descend(), but the name doesn't matter):
- https://stackoverflow.com/a/46063483/1745001
- https://stackoverflow.com/a/42736174/1745001
- https://stackoverflow.com/a/32020697/1745001
- https://stackoverflow.com/a/47834902/1745001
Hopefully these give you an idea of how to write such a function for this task.
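As a self-contained illustration of the pattern those links share, here is a toy recursive descend() walking a hand-built gawk array of arrays (the graph data is made up for the demo):
$ cat descend_demo.awk
function descend(node,   child) {   # the second parameter only makes "child" local
    if ( seen[node]++ )             # stop if this node was already visited
        return
    printf "%s ", node
    for ( child in edges[node] )    # recurse into every neighbour
        descend(child)
}
BEGIN {
    # toy graph: a-b, b-c and d-e
    edges["a"]["b"]; edges["b"]["a"]
    edges["b"]["c"]; edges["c"]["b"]
    edges["d"]["e"]; edges["e"]["d"]
    descend("a")                    # prints the component containing "a"
    print ""
}
$ awk -f descend_demo.awk
a b c
The script at the top of this answer applies the same pattern to the real val_pairs[][] array instead of a toy edges[][].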
Answer 2
Here's a Ruby one-liner to do it:
ruby -e '
require "set"
line_map=Hash.new { |h,k| h[k]=[] }        # value -> list of line numbers it appears on
num_map=Hash.new { |h,k| h[k]=Set.new() }  # line number -> set of values on that line
bucket=Hash.new { |h,k| h[k]=Set.new() }   # false/true -> values of merged / isolated lines
$<.each {|line| line_all=line.chomp.split
    line_all.each{|sym| line_map[sym] << $. }
    num_map[$.].merge(line_all)
}
line_map.each{|k,v|
    # true only if k occurs on a single line and every value on that line first occurs
    # there, i.e. the line shares nothing with any other line
    bucket[num_map[v[0]].all?{|ks| v.length==1 && line_map[ks][0]==v[0]}] << k
}
puts bucket[false].sort.join(" ")
puts bucket[true].sort.join(" ")
' file
Prints:
a3 a7c a9 v1c v2c
a7 v5