我想比较文件一和文件二中的列。其中 file1 的第 2 列应与 file2 的第 1 列或第 2 列不匹配,并打印文件 1 的输出。
文件1。
猫测试.head20.R2.fastq.tab
@0_1_2367_1112_211 ENSG00000165837 GAAATTAAGTTATAATTTTCATGGGACATTTTCATCACTGTTGACACAGTTTCAAGCATTCCATCATGTTATTTTGACTCTTTTTCTTTTTTTTTTCTTT + @6@CDCFFEDEIJIIJJJFBFHIIJJJJJGC?CDDDDDDFEDGBFFFFHEDFFBBBDDDDDDDDBDDD@@@@CDDDDDEHHJJJGJIIIGIJJJIIIFCH
@10000000_0_0_0_0 rupesh TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@10000001_0_3150_2465_134 ENSG00000137860 GCCTCTCAAGTAGCTGGGATTACAGGCACCTGCCACCACGCCCAGCCAATTTTTGTATTTTTAGTAGAGACAATTTCACTATGTTGGCCAGGCTGGTCTT + DEDDB>HJIGHFJJJIGFFFHJJJJJJJJIIGHHFFDDCCCIIJJJJJJJJJJJJJIGIJJJHJIFHHGJJIIHEEEDDDDDC>?@DDDEEEDFFFFFFC
@10000002_0_2947_952_158 ENSG00000028203 CCCCCAGGACCAGCTGCTGTTTTGTGATGACTGCGATCGGGGTTACCACATGTACTGCCTGAGTCCCCCCATGGCGGAGCCCCCGGAAGGGAGCTGGAGC + JFHHEDDB;;63JJJIJJJHHFIIJIHGHHHHGHHHJJIIIEEEIJJHHHJJHHFFJIJJJJJJJJJJJJJJHDDDDDDFGBB?8BDDDEDDDDDDDDCC
@10000003_0_8902_3193_186 ENSG00000177051 CAAGGCCAGAGAGACAAATAATGCCTCATGTCCCACTGCTTTAAAATTACATTAATTTATAAAATGGCCACTATGGGCTCTTTTTGACTGTTTCTCGGAG + HGEDDDDDDDDDDDCDDB@<DDDD<>CDDDDB@>DDDEIJJJIGJJIIIIGIJHHHGEEIJJJJJJJJIHJJJJJJIIJIIIJJIJJJJIIIIHEDDDCC
@5000345_0_3_0_0 ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?
文件2
猫 fusions.head16.R2.fastq.tab
ENSG00000137860 ENSG00000165837 1431 1598 0:0:0 0:0:0 0/2 CAGGTCATCTGCTCCTATCTCCTAAGGCCCATGGTTTTCATGATGGGTGTAGAGTGGACAGACTGTCCAATGGTGGCTGAGATGGTGGGAATCAAGTTCT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000177051 ENSG00000134905 277 433 0:0:0 0:0:0 451/2 CTTCACTGCACAGCCAGGGTGAGCCTCGCTGGGAAGGTGCAGGTGACTCGTGCCTGTCGGGGAGCCCGTCCTGTCCGTACAAAACATGTGCCAGGCAAGG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000137860 ENSG00000165837 2761 2951 0:0:0 0:0:0 2/2 AAACAATCTTACGGATTAAGAGGAGACGTGAAGCTCAAAAGTTAACAGAGATGACCAGTTTCACATTTCATTTAATGAGCAAACCAACACCTGAGAAGCC + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000028203 ENSG00000157766 183 411 0:0:0 0:0:0 101/2 TTCTTTGTCACCAAAAACAGAAAAATGCACAACAGAGGGACAACAAAAGCCTCCTACAAGAGTCCTACCAAAATACCTGGGATATAGTAATCACTCAATG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
期望的输出:
@10000000_0_0_0_0 rupesh TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@5000345_0_3_0_0 ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?
到目前为止已尝试从第二个文件打印不匹配的条目,但不知道如何从 file1 打印?
awk '{k=$2} NR==FNR{a[k]; next} !(k in a)' test.head20.R2.fastq.tab fusions.head16.R2.fastq.tab
上面代码的输出是我不想要的:
ENSG00000177051 ENSG00000134905 277 433 0:0:0 0:0:0 451/2 CTTCACTGCACAGCCAGGGTGAGCCTCGCTGGGAAGGTGCAGGTGACTCGTGCCTGTCGGGGAGCCCGTCCTGTCCGTACAAAACATGTGCCAGGCAAGG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000028203 ENSG00000157766 183 411 0:0:0 0:0:0 101/2 TTCTTTGTCACCAAAAACAGAAAAATGCACAACAGAGGGACAACAAAAGCCTCCTACAAGAGTCCTACCAAAATACCTGGGATATAGTAATCACTCAATG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
答案1
$ awk '
NR==FNR {a[$1]++; a[$2]++; next};
!($2 in a)' fusions.head16.R2.fastq.tab test.head20.R2.fastq.tab
@10000000_0_0_0_0 rupesh TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@5000345_0_3_0_0 ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?
如果您阅读排除文件 ( fusions.head16.R2.fastq.tab
) ,这比我最初想象的更简单、更容易前数据文件 ( test.head20.R2.fastq.tab
)。
这会读取第一个文件并使用数组来存储在字段和a
中找到的标识符。$1
$2
然后,对于第二个文件(以及后续文件,如果有)的每一行,如果字段 $2 不在 array 中a
,它将打印该行。
答案2
该解决方案可行,但需要几个步骤和一个中间文件。
步骤 1:检索要从文件 1 中删除其记录的 ID 列表:
awk -F' ' '{print $1 "\n" $2}' fusions.head16.R2.fastq.tab > remove_list
步骤 2:从文件 1 中检索不包含 remove_list 中的 ID 的条目。
awk -F' ' 'NR==FNR{a[$1];next} !($2 in a)' remove_list test.head20.R.fastq.tab
输出:
@10000000_0_0_0_0 rupesh TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@5000345_0_3_0_0 ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?