I have a file that looks like this. The first line is the header.
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
"chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
"chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
"chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
"chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
How can I identify the empty fields in the last column? I tried the following commands:
awk '$7==" "' file.txt > blanks.txt
awk '{if($7==" ") print}' file.txt > blanks.txt
Both produced an empty file.
The content of blanks.txt should be:
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
Answer 1
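The last alternative in this answer is stricter about what it accepts, and it does not depend on whether the fields are separated by tabs and/or spaces.
But first:
If the last field is empty, the line has only 6 fields (when fields are separated by spaces or tabs). If you want to print those lines, you can do this: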
$ awk ' NF<7 {print}' infile
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
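The {print} action is not actually required, because by default awk prints every line for which the pattern is true. It is dropped in the next solutions (thanks, FelixJN).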
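As a side note on why the two commands in the question print nothing, here is a minimal sketch. With default whitespace splitting, a missing trailing field does not exist at all, so $7 evaluates to the empty string and never to a single space. The temporary sample file and the variable names below are made up for this illustration; the rows are copied from the question.

```shell
# Hypothetical sample: header, one complete row, one row missing hg19_pos.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
EOF

# Comparing with a single space never matches: the field is absent, not " ".
space_hits=$(awk '$7 == " "' "$sample")

# Comparing with the empty string matches the short row.
empty_hits=$(awk '$7 == ""' "$sample")

# NF is the number of fields awk actually sees on each line.
field_counts=$(awk '{ print NF }' "$sample" | tr '\n' ' ')

printf 'space: [%s]\nempty: [%s]\ncounts: %s\n' \
    "$space_hits" "$empty_hits" "$field_counts"
rm -f "$sample"
```

On this sample, the $7 == "" test selects the same line as NF<7. The two would differ on a line that has lost more than one field, or on input with a different number of columns.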
If you also need the header, add:
$ awk '(NF<7) || (NR==1)' infile
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
And, if you want to keep the lines that do have enough fields, do:
$ awk '(NF>=7) || (NR==1)' infile
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
"chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
"chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
"chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
"chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
If you need a solution that does not rely on the last field being missing, but instead checks that the line ends with a trailing number, use:
$ awk '/[0-9]+[ \t]*$/ || (NR==1)' infile
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
"chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
"chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
"chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
"chr10 100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
"chr10 100009013 G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
"chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
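This is not affected by any other field being missing, and it does not depend on which field separator (spaces and/or tabs) is used.
It assumes that the last field is a number that is not enclosed in double quotes, but that is easy to change if needed.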
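One possible way to make that change, as a sketch only: if the last column were quoted like the others, the pattern could require an opening quote directly before the digits. Without that opening quote, the trailing 10" of "chr10" would be mistaken for a quoted number. The quoted sample data and the variable names here are assumptions for illustration, not part of the original file.

```shell
# Hypothetical variant of the input in which hg19_pos is double-quoted.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" "101759992"
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
EOF

# Keep the header plus every line that ends in a quoted, all-digit field.
complete=$(awk '/"[0-9]+"[ \t]*$/ || NR == 1' "$sample")

# Inverse selection: data lines that do not end in a quoted number.
missing=$(awk '!/"[0-9]+"[ \t]*$/ && NR > 1' "$sample")

printf 'complete:\n%s\nmissing:\n%s\n' "$complete" "$missing"
rm -f "$sample"
```

This variant would wrongly accept a short line if the chromosome column held a bare quoted number such as "10", so it only suits data where that column always carries the chr prefix.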
And, to produce exactly the output your question asks for:
$ awk '!/[0-9]+[ \t]*$/ && NR>1' infile
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"