Grouping based on multiple rows and columns

Sample data is shown below. The first two columns are IDs, the third column is a frequency.

1 2 99
2 3 62
4 5 80
4 4 98
5 5 79
6 1 98

The first and second columns are the same person's or duplicate IDs. For example, 1, 2, 3, and 6 are the same person: 1==3 because 1==2 & 2==3, and so on. So the data can be split like this.

Person 1

1 2 99
2 3 62
6 1 98

Person 2

4 5 80
4 4 98
5 5 79

How can I split the data like above? The comparison has to be made across rows here, which is the confusing part for me. Then, within each group, I want to select IDs based on the frequency in the third column. Here, I take the animals with the lowest frequency in order to eliminate those IDs from another file. The preferred final output is as follows.

2 3 62
6 1 98
4 5 80
5 5 79 

I looked for an answer, but it seems complicated to me. Maybe there is a better way than splitting the data. Any ideas, please.

Answer 1

To solve your first problem: here is how to split the input using any awk+sort in any bourne-derived shell on every Unix box (I use bash in my shebang, but it does not need to be bash):

$ cat tst.sh
#!/usr/bin/env bash

awk '{ print $0 ORS $2, $1, $3 }' "${@:--}" |
sort -n -k1,1 -k2,2 |
awk '
    # the key orders each ID pair high-first, so "1 2" and "2 1"
    # collapse to the same entry and each pair is processed once
    !seen[($1 > $2 ? $1 FS $2 : $2 FS $1)]++ {
        out = ""

        # reuse the output file already assigned to either ID
        for ( i=1; i<=2; i++ ) {
            if ( $i in map ) {
                out = map[$i]
                break
            }
        }

        # otherwise this pair starts a new person
        if ( out == "" ) {
            out = "person_" (++numPeople)
        }

        # remember the file name for both IDs
        for ( i=1; i<=2; i++ ) {
            map[$i] = out
        }

        # append the line to its person file
        print >> out
        close(out)
    }
'
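To see what the first stage does, here is a minimal sketch with a made-up record (not from the question's data): every line comes out twice, once as-is and once with the two ID columns swapped, so the later grouping stage can match either orientation:

```shell
# Each input line is printed twice: unchanged, then with columns 1 and 2 swapped.
printf '6 1 98\n' | awk '{ print $0 ORS $2, $1, $3 }'
# prints:
# 6 1 98
# 1 6 98
```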

We need to modify the sample input you posted to include the lines described in my comments, to really test whether the splitting works:

$ cat file
1 2 99
2 3 62
4 5 80
4 4 98
5 5 79
6 1 98
7 8 99
9 10 98
9 7 97

$ ./tst.sh file

$ head person*
==> person_1 <==
1 2 99
1 6 98
2 3 62

==> person_2 <==
4 4 98
4 5 80
5 5 79

==> person_3 <==
7 8 99
7 9 97
9 10 98

The above assumes the order of the first 2 IDs on each line does not matter, since 1 2 x is equivalent to 2 1 x.
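That order-insensitivity comes from the seen[] key in the second awk stage, which always puts the larger ID first. A minimal sketch with made-up input, showing that 1 2 x and 2 1 x collapse to a single record:

```shell
# The key ($1 > $2 ? $1 FS $2 : $2 FS $1) is "2 1" for both lines,
# so the second occurrence of the pair is skipped.
printf '1 2 99\n2 1 99\n' | awk '!seen[($1 > $2 ? $1 FS $2 : $2 FS $1)]++'
# prints only:
# 1 2 99
```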

Answer 2

A partial answer for the "split the data" part only. It uses the GNU awk feature "arrays of arrays", so it is a gawk-only solution.

It groups by overlapping IDs in columns 1 and 2, gives each person a unique ID, and then prints to files named person_UID:

gawk '
#id=array of arrays with unique ids UID and alias
#IDs aID as taken from the file: id[UID][aID]

#manually create first entry in line 1
NR==1 { id[1][$1]=1 ; id[1][$2]=1 ; next }

#on other input: scan array id for a match in aIDs
#use the related UID if a match is found
FNR==NR {
  for (i=1 ; i<=length(id) ; i++ ) {
    if ($1 in id[i] || $2 in id[i] ) {
      id[i][$1]=i ; id[i][$2]=i ; next }
  }
#if no match was found, create a new UID:
  new=length(id)+1
  id[new][$1]=new ; id[new][$2]=new
}

#rerun through id arrays to check for doubles:
FNR!=NR && !b {
  for ( i in id ) {
    for ( j in id[i] ) {
      if ( seen[j] ) {
        for ( k in id[i] ) { id[seen[j]][k]=id[seen[j]][k] }
        delete id[i]
        }
       else { seen[j]=i }
    }
  }
  delete seen
#adjust UIDs as they may be out of order now to new id array nid,
#delete old id array:
  for ( i in id ) {++n ; for (j in id[i]) { nid[n][j]=id[i][j] } }
  delete id
  b=!b
}

#write to separate files per UID
FNR!=NR {
  for (i in nid) {
    if ($1 in nid[i] || $2 in nid[i] ) { print > "person_"i }
  }
}

#This is just to print the aID vs UID map
END {
  for (i in nid) {
    print "aIDs for person UID=",i ; b=1
    for (j in nid[i]) {
      if (b) {printf j ; b=0}
      else {printf ","j}
    }
  print ""
  }
}
' infile infile

Now, for the line-elimination problem, I suggest this very simple approach:

Use the person_i files created above and, for each file, select the line with the smallest value in field 3. Write these lines to a delete_me file and use an inverted grep on the original file:

for file in person_* ; do
  sort -n -k3 ${file} | head -n1
done > delete_me
grep -xvf delete_me original

The minimum selection via sort is done without any refinement when equal or similar numbers are involved. Using -x for grep ensures a match has to cover the complete line (otherwise 1 2 3 would match both 1 2 3 and e.g. 1 2 33).
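A minimal illustration of that -x point (made-up lines, not the question's data): without -x the shorter line also matches as a substring of the longer one, so the inverted match throws away too much:

```shell
# with -x, only the exact full-line match is excluded
printf '1 2 3\n1 2 33\n' | grep -xv '1 2 3'
# prints: 1 2 33

# without -x, "1 2 3" also matches inside "1 2 33",
# so nothing survives the inverted match ("|| true" just masks
# grep's non-zero exit status when it prints no lines)
printf '1 2 3\n1 2 33\n' | grep -v '1 2 3' || true
# prints nothing
```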


The following variant will only group people by their IDs and, per unique person found, select the line with the maximum value in column 3. On the second read of the input file, only the lines that are not those maximum lines are printed - so this is a one-script solution without extra files:

#id=array of arrays with unique ids UID and alias
#IDs aID as taken from the file: id[UID][aID]

#manually create the first grouping entry and the first
#line-deletion candidate on line 1 (both before "next",
#otherwise the second block would never run)
NR==1 { id[1][$1]=1 ; id[1][$2]=1 ; max[1]=$3 ; line[1]=$0 ; next }

#on other input: scan array id for a match in aIDs
#use the related UID if a match is found
FNR==NR {
  for (i=1 ; i<=length(id) ; i++ ) {
    if ($1 in id[i] || $2 in id[i] ) {
      id[i][$1]=i ; id[i][$2]=i
#select line for deletion
      if ($3>=max[i]) { max[i]=$3 ; line[i]=$0 }
      ; next
    }
  }
#if no match was found, create a new UID:
  new=length(id)+1
  id[new][$1]=new ; id[new][$2]=new
  max[new]=$3 ; line[new]=$0
}

#rerun through id arrays to check for doubles:
FNR!=NR && !b {
  for ( i in id ) {
    for ( j in id[i] ) {
      if ( seen[j] ) {
        for ( k in id[i] ) { id[seen[j]][k]=id[seen[j]][k] }
        delete id[i]
#adjust line deletion selection:
        if (max[seen[j]]>=max[i]) { delete line[i]} else {delete line[seen[j]]}
      }
       else { seen[j]=i }
    }
  }
  delete seen
#adjust UIDs as they may be out of order now:
  for ( i in id ) {++n ; for (j in id[i]) { nid[n][j]=id[i][j] } }
  delete id
#swap line deletion marker from array element to index for better processability
  for (i in line) { line[line[i]]="" ; delete line[i]}
  delete max
#set flag for running this block once only
  b=!b
}

#write to separate files per UID (commented out)
#FNR!=NR {
#  for (i in nid) {
#    if ($1 in nid[i] || $2 in nid[i] ) {print > "person_"i}
#  }
#}
#print lines that have not been selected for deletion
#to STDOUT use the alternative to print it to a separate file
FNR!=NR && ! ($0 in line)
#alternative:
#FNR!=NR && ! ($0 in line) {print > "myoutfile"}


#This is just to print the aID vs UID map
END { print "\n------\nUID vs aID map:\n"
  for (i in nid) {
    print "aIDs for person UID=",i ; b=1
    for (j in nid[i]) {
      if (b) {printf j ; b=0}
      else {printf ","j}
    }
  print ""
  }
}

Run it as awk -f script.awk infile infile, i.e. reading infile twice.
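Reading the file twice works because of the FNR==NR idiom used in the script: FNR resets for each input file while NR keeps counting across files, so FNR==NR holds only during the first pass. A minimal sketch (the temp file name is a throwaway assumption):

```shell
# FNR==NR is true only while reading the first file argument
tmp=$(mktemp)
printf 'x\ny\n' > "$tmp"
awk 'FNR==NR { first++ ; next } { second++ } END { print first, second }' "$tmp" "$tmp"
# prints: 2 2
rm -f "$tmp"
```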
