Excluding matching rows based on the first column of two files

Answer 1

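Assuming the files are sorted on their first field, you can extract every record from the first file whose first field does not appear in the second file with: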

$ join -v 1 file1 file2
1856473 1 rs6684487 G A A 12387 1.02222 0.0836593 0.262689 0.79279
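
join expects both inputs to be sorted on the join field as strings, not as numbers, so a numerically sorted column is not necessarily acceptable. As a rough way to check that assumption before running the command, the Python sketch below tests whether the first whitespace-separated field of each line is in non-decreasing order. The function name `is_sorted_on_first_field` and its `skip_header` parameter are made up for this illustration, and the comparison is plain code-point string comparison, which for ASCII keys should match the ordering under `LC_ALL=C`; other locales may order keys differently.

```python
def is_sorted_on_first_field(lines, skip_header=False):
    """Return True if the first field of each line is in non-decreasing order.

    Keys are compared as strings (by code point), not as numbers.
    """
    previous = None
    for index, line in enumerate(lines):
        if skip_header and index == 0:
            continue  # optionally ignore a header line
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        key = fields[0]
        if previous is not None and key < previous:
            return False
        previous = key
    return True


if __name__ == "__main__":
    # Invented keys: numerically ascending, but "10" sorts before "9" as a string.
    print(is_sorted_on_first_field(["9 x", "10 y"]))
```

The same function can be called on an open file object, for example `is_sorted_on_first_field(open("file1"), skip_header=True)`.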

To preserve the tab delimiters and the header:

$ head -n 1 file1; join -t $'\t' -v 1 file1 file2
BP      CHR     SNP     REF     ALT     A1      OBS_CT  OR      LOG(OR)_SE      Z_STAT  P
1856473 1       rs6684487       G       A       A       12387   1.02222 0.0836593       0.262689        0.79279
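
Conceptually, `join -v 1` is an anti-join: it keeps the lines of the first file that have no partner in the second. Because both inputs are sorted, this can be done in a single pass by stepping through the two files together. The Python sketch below illustrates that idea only; it is not how join itself is implemented, the function name `anti_join_sorted` is hypothetical, and it assumes both sequences are sorted by string comparison on their first field.

```python
def anti_join_sorted(left_lines, right_lines, sep=None):
    """Yield lines of left_lines whose first field is absent from right_lines.

    Both inputs must be sorted (string order) on their first field.
    With sep=None, fields are split on any run of whitespace.
    """
    right_keys = (line.split(sep)[0] for line in right_lines if line.strip())
    right_key = next(right_keys, None)
    for line in left_lines:
        if not line.strip():
            continue  # ignore blank lines
        key = line.split(sep)[0]
        # Advance the right side until it catches up with the left key.
        while right_key is not None and right_key < key:
            right_key = next(right_keys, None)
        if right_key != key:
            yield line


if __name__ == "__main__":
    left = ["a\t1", "b\t2", "d\t4"]    # invented sample rows
    right = ["b\tx", "c\ty"]
    for row in anti_join_sorted(left, right, sep="\t"):
        print(row)
```

If the inputs are not sorted, the pass can skip over a matching key on the right and wrongly keep a line, which is why the sorting assumption matters.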

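To match on both the first and second fields at the same time, create a new combined first field from those two fields in both files, join on that field, and then delete the temporary join field. This is essentially a decorate-sort-undecorate pattern, but with a relational JOIN operation in place of the sort.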

The code below assumes a shell that supports process substitution with <(...), such as bash, zsh, or ksh93.

$ head -n 1 file1; join -t $'\t' -v 1 <( awk -F '\t' 'BEGIN { OFS=FS } { print $1 "_" $2, $0 }' file1 ) <( awk -F '\t' 'BEGIN { OFS=FS } { print $1 "_" $2, $0 }' file2 ) | cut -f 2-
BP      CHR     SNP     REF     ALT     A1      OBS_CT  OR      LOG(OR)_SE      Z_STAT  P
1856473 1       rs6684487       G       A       A       12387   1.02222 0.0836593       0.262689        0.79279

Alternatively, use a helper shell function to make the command easier to read:

$ decorate () {  awk -F '\t' 'BEGIN { OFS=FS } { print $1 "_" $2, $0 }' "$1"; }
$ head -n 1 file1; join -t $'\t' -v 1 <( decorate file1 ) <( decorate file2 ) | cut -f 2-
BP      CHR     SNP     REF     ALT     A1      OBS_CT  OR      LOG(OR)_SE      Z_STAT  P
1856473 1       rs6684487       G       A       A       12387   1.02222 0.0836593       0.262689        0.79279
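
The decorate, anti-join, undecorate sequence can also be written out step by step. The Python sketch below mirrors the pipeline above under a few assumptions: `decorate` borrows its name from the shell function, `exclude_on_two_fields` is a hypothetical name, the lookup uses a set instead of join's sorted merge (so the inputs need not be sorted), and header handling is left out because the shell version prints the header separately with `head -n 1`.

```python
def decorate(lines, sep="\t"):
    """Pair each line with a combined key built from its first two fields."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 2:
            yield fields[0] + "_" + fields[1], line


def exclude_on_two_fields(left_lines, right_lines, sep="\t"):
    """Return the left lines whose (field 1, field 2) pair is absent on the right."""
    right_keys = {key for key, _ in decorate(right_lines, sep)}
    # Anti-join on the temporary key, then drop the key again (undecorate).
    return [line for key, line in decorate(left_lines, sep) if key not in right_keys]


if __name__ == "__main__":
    left = ["1\tA\tx", "2\tB\ty"]     # invented sample rows
    right = ["1\tA\tz", "2\tC\tw"]
    for row in exclude_on_two_fields(left, right):
        print(row)
```

Only the second left row survives here, because its first field matches a right row but the combination of first and second field does not.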

Answer 2

One option is to use awk:

awk '
    # set the input Field Separator to a tab
    BEGIN  { FS="\t" }

    # store column 1 and column 2 of file2 as keys of the associative array bp_file2
    NR==FNR{ bp_file2[$1, $2]; next }

    # print lines of file1 only if their (column 1, column 2) pair is not in the array;
    # the FNR==1 test also prints the header line
    !(($1, $2) in bp_file2) || FNR==1

' file2 file1
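
In awk, `bp_file2[$1, $2]` stores the pair as one string key, with the two values joined by the SUBSEP separator ("\034" by default in gawk). A Python tuple plays a similar role and avoids the ambiguity that a hand-built "_"-joined key can have when a field itself contains an underscore. The sketch below shows the difference; the helper names and the field values are invented purely for illustration.

```python
def underscore_key(fields):
    """Combine the first two fields into one '_'-joined string."""
    return fields[0] + "_" + fields[1]


def tuple_key(fields):
    """Keep the first two fields separate, as a tuple."""
    return (fields[0], fields[1])


row_a = ["1_2", "3"]   # invented values: first field contains an underscore
row_b = ["1", "2_3"]   # invented values: second field contains an underscore

# The joined strings coincide, so the two rows would be treated as a match.
print(underscore_key(row_a) == underscore_key(row_b))
# The tuples stay distinct, so the rows are told apart.
print(tuple_key(row_a) == tuple_key(row_b))
```

For purely numeric columns such as BP and CHR this collision cannot occur, so either style of key is adequate there.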

Answer 3

A Python solution. It is probably longer than the others, but also more readable (only to me?):

file1 = '1'
file2 = '2'
separator = '\t'

ids2 = set()  # a set gives fast membership tests

# first build up the set of IDs (first column) from file2:
with open(file2) as f:
    for line in f:
        if line[0].isdigit():  # only process lines that start with a digit
            ids2.add(line.split(separator)[0])

# then go through file1 and check each ID against that set
with open(file1) as f:
    print(f.readline(), end='')  # print the header
    for line in f:
        if line.split(separator)[0] not in ids2:
            # print lines whose ID is not in ids2 (i.e. not in file2)
            print(line, end='')

And run it like this:

python3 file.py > 3

Answer 4

awk 'BEGIN{print "BP  CHR SNP REF ALT A1  OBS_CT  OR  LOG(OR)_SE  Z_STAT  P"}NR==FNR{a[$1];next}!($1 in a){print $0}' file2 file1

Output

BP  CHR SNP REF ALT A1  OBS_CT  OR  LOG(OR)_SE  Z_STAT  P
1856473 1   rs6684487   G   A   A   12387   1.02222 0.0836593   0.262689    0.79279

Python

#!/usr/bin/python3
header = "BP  CHR SNP REF ALT A1  OBS_CT  OR  LOG(OR)_SE  Z_STAT  P"

# collect the first-column values of file2 (skipping its header line)
with open('/home/praveen/file2', 'r') as f2:
    f2.readline()
    keys = {line.split()[0] for line in f2 if line.strip()}

# write the header once, then every line of file1 whose first column
# does not occur in file2
with open('/home/praveen/file1', 'r') as f1, open('/home/praveen/file3', 'w') as f3:
    f1.readline()
    f3.write(header + "\n")
    for line in f1:
        fields = line.split()
        if fields and fields[0] not in keys:
            f3.write(line.strip() + "\n")

Output

BP  CHR SNP REF ALT A1  OBS_CT  OR  LOG(OR)_SE  Z_STAT  P
1856473 1   rs6684487   G   A   A   12387   1.02222 0.0836593   0.262689    0.79279
