Filter multiple files to keep only the rows that share a common first column

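I have 24 files, and I want to filter them so that each file keeps only the rows whose first-column string occurs in every file (in the example, geneA and geneF are the only Column1 strings common to all files). The output should keep all 3 columns. The files are tab-delimited.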

My files look like:

File1.txt

Column1 Column2 Column3
geneA   11  C
geneB   34  T
geneC   22  A
geneD   23  A
geneE   2   G
geneF   34  A

File2.txt

Column1 Column2 Column3
geneA   34  A
geneF   67  G
geneG   77  A
geneZ   45  G
geneY   99  T

File24.txt

Column1 Column2 Column3
geneA   22  A
geneF   7   T
geneL   34  C
geneK   66  A
geneM   34  T
geneP   47  G

My desired output is:

File1.txt

Column1 Column2 Column3
geneA   11  C
geneF   34  A

File2.txt

Column1 Column2 Column3
geneA   34  A
geneF   67  G

File24.txt

Column1 Column2 Column3
geneA   22  A
geneF   7   T

Answer 1

Using GNU awk for "inplace" editing; this works even if a given Column1 value appears more than once in an input file:

$ cat tst.awk
BEGIN {
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        file = ARGV[fileNr]
        delete thisFile
        while ( (getline < file) > 0 ) {
            thisFile[$1]
            if ( fileNr == 1 ) {
                common[$1]
            }
        }
        close(file)
        for ( val in common ) {
            if ( !(val in thisFile) ) {
                delete common[val]
            }
        }
    }
}
(FNR == 1) || ($1 in common)

$ awk -i inplace -f tst.awk file{1..3}

$ tail -n +1 file{1..3}
==> file1 <==
Column1 Column2 Column3
geneA   11  C
geneF   34  A

==> file2 <==
Column1 Column2 Column3
geneA   34  A
geneF   67  G

==> file3 <==
Column1 Column2 Column3
geneA   22  A
geneF   7   T

But if a Column1 value can appear only once in each file, it can be shorter:

$ awk -i inplace -v comm="$(cut -f1 file{1..3} | sort | uniq -c | awk '$1==3{print $2}')" '
    BEGIN{split(comm,tmp); for (i in tmp) common[tmp[i]]} (FNR == 1) || ($1 in common)
' file{1..3}

Or, if your awk does not support in-place editing (`-i inplace` needs GNU awk 4.1.0 or later):

$ comm="$(cut -f1 file{1..3} | sort | uniq -c | awk '$1==3{print $2}')"
$ for file in file{1..3}; do
    awk -v comm="$comm" '
        BEGIN{split(comm,tmp); for (i in tmp) common[tmp[i]]} (FNR == 1) || ($1 in common)
    ' "$file" > tmp && mv tmp "$file"
done
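
For the real case of 24 files, the hard-coded 3 above has to match the number of files. The following is a minimal sketch, not part of the original answer, that derives the count from the argument list and de-duplicates each file's first column before counting, so repeated Column1 values are tolerated without GNU awk's inplace extension. The function name filter_common is hypothetical, and the sketch assumes tab-separated files that each start with a header line.

```shell
#!/bin/bash
# Sketch only: filter_common is a hypothetical helper, not an existing tool.
# It overwrites every file it is given, so try it on copies first.
filter_common() {
    local n=$# keys tmp f
    keys=$(mktemp) || return 1
    tmp=$(mktemp) || return 1
    # Take each file's unique first-column values, then keep the values
    # that were seen in all n files.
    for f in "$@"; do
        cut -f1 "$f" | sort -u
    done | awk -v n="$n" '
        { seen[$0]++ }
        END { for (k in seen) if (seen[k] == n) print k }
    ' > "$keys"
    for f in "$@"; do
        # First input (the key list) fills common[]; from the data file,
        # print the header line and every row whose key is common.
        awk -F'\t' '
            FILENAME == ARGV[1] { common[$1]; next }
            FNR == 1 || ($1 in common)
        ' "$keys" "$f" > "$tmp" && cat "$tmp" > "$f"
    done
    rm -f "$keys" "$tmp"
}
# Example call (file names as in the question): filter_common File{1..24}.txt
```

Writing the result back with cat rather than mv is a design choice here: it keeps each file's original permissions instead of those of the temporary file.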

Answer 2

A one-liner using join (the process substitution and brace expansion it relies on need bash rather than a strictly POSIX /bin/sh):

tmp=$(cat file1); for f in file{2..3}; do tmp=$(join -j1 -o1.1 <(echo "${tmp}"|sort) <(sort $f)); done; for f in file{1..3}; do echo "> File $f:"; for i in $(echo $tmp); do grep "^$i\s" $f; done; done

Output:

> File file1:
Column1 Column2 Column3
geneA   11      C
geneF   34      A
> File file2:
Column1 Column2 Column3
geneA   34      A
geneF   67      G
> File file3:
Column1 Column2 Column3
geneA   22      A
geneF   7       T

Explanation:

#!/bin/bash
# find first column members existing in each file
tmp=$(cat file1);
for f in file{2..3}; do
   tmp=$(join -j1 -o1.1 <(echo "${tmp}"|sort) <(sort $f)); 
done;
#
# going through files and printing lines containing found members
for f in file{1..3}; do
    echo "> File $f:";
    for i in $(echo $tmp); do
        grep "^$i\s" $f;
    done;
done

P.S. This only prints the result and does not rewrite the files, but that is easy to change.
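
One way that change could look is sketched below. It is an illustration rather than the answer author's code: it keeps the join-based intersection, but replaces the per-key grep loop with a single awk pass per file, which also preserves the original row order. The function name rewrite_common is hypothetical, and tab-separated files with a header line are assumed.

```shell
#!/bin/bash
# Sketch only: rewrite_common is a hypothetical helper name.
# It overwrites every file it is given, so try it on copies first.
rewrite_common() {
    local keys next tmp tab f
    keys=$(mktemp) && next=$(mktemp) && tmp=$(mktemp) || return 1
    tab=$(printf '\t')
    # Start from the sorted, unique first column of the first file.
    cut -f1 "$1" | LC_ALL=C sort -u > "$keys"
    # Intersect with every file in turn (the first file with itself is a
    # no-op). Each input is a single sorted column, so join prints just
    # the keys present in both; LC_ALL=C keeps sort and join consistent.
    for f in "$@"; do
        cut -f1 "$f" | LC_ALL=C sort -u |
            LC_ALL=C join -t "$tab" "$keys" - > "$next"
        cp "$next" "$keys"
    done
    # Rewrite each file: keep the header line and rows with a common key.
    for f in "$@"; do
        awk -F'\t' '
            FILENAME == ARGV[1] { common[$1]; next }
            FNR == 1 || ($1 in common)
        ' "$keys" "$f" > "$tmp" && cat "$tmp" > "$f"
    done
    rm -f "$keys" "$next" "$tmp"
}
# Example call: rewrite_common file1 file2 file3
```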
