我有一个制表符分隔的多列文件,其中第 19 列如下所示:
gaA
gGg
Att
gtC
gGa
gcC
ccG
cTc
.
.
.
and so on
我只想 grep 大写字符,所以我使用了:
cut -f19 1.table | grep -e '[[:upper:]]' -o
输出是:
A
G
A
C
G
C
G
T
.
.
.
and so on
但我不想在 grep 之前使用 cut 。我现在有两个问题:
- 有什么方法可以从第 19 列开始 grep 而不是使用 cut 吗?或者 grep 中是否有任何选项或参数来指定列?
- 我想将 grep 结果输出作为新列放入 1.table 文件中?或者如何将 grep 结果输出作为 1.table 文件中的新列(如第 20 列)进行管道传输?
以下是 1.table 的输入行(1.table 也有标题):
#CHROM POS ID REF ALT QUAL AC AN AF DP ExonicFunc.refGene Func.refGene AAChange.refGene Gene.refGene GeneDetail.refGene GENEINFO EFF refcodon ExAC_AFR ExAC_ALL ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS gnomAD_exome_AFR gnomAD_exome_ALL gnomAD_exome_AMR gnomAD_exome_ASJ gnomAD_exome_EAS gnomAD_exome_FIN gnomAD_exome_NFE gnomAD_exome_OTH gnomAD_exome_SAS gnomAD_genome_AFR gnomAD_genome_ALL gnomAD_genome_AMR gnomAD_genome_ASJ gnomAD_genome_EAS gnomAD_genome_FIN gnomAD_genome_NFE gnomAD_genome_OTH 1000g2015aug_all esp6500siv2_all CADD_phred CADD_raw CADD_raw_rankscore CAF DANN_rankscore DANN_score Eigen Eigen-PC-raw Eigen-raw Eigen_coding_or_noncoding FATHMM_coding FATHMM_converted_rankscore FATHMM_noncoding FATHMM_pred FATHMM_score FS GTEx_V6_gene GTEx_V6_tissue GWAVA_region_scoreGWAVA_tss_score GWAVA_unmatched_score GenoCanyon_score GenoCanyon_score_rankscore Interpro_domain LRT_converted_rankscore LRT_pred LRT_score MetaLR_pred MetaLR_rankscore MetaLR_score MetaSVM_pred MetaSVM_rankscore MetaSVM_score MutationAssessor_pred MutationAssessor_score MutationAssessor_score_rankscore MutationTaster_converted_rankscore MutationTaster_pred MutationTaster_score PROVEAN_converted_rankscore PROVEAN_pred PROVEAN_score Polyphen2_HDIV_pred Polyphen2_HDIV_rankscore Polyphen2_HDIV_score Polyphen2_HVAR_pred Polyphen2_HVAR_rankscore Polyphen2_HVAR_score QD SIFT_converted_rankscore SIFT_pred SIFT_score SiPhy_29way_logOdds SiPhy_29way_logOdds_rankscore VC VEST3_rankscore VEST3_score WGT avsnp147=rs28410799 integrated_confidence_value integrated_fitCons_score integrated_fitCons_score_rankscorephastCons100way_vertebrate phastCons100way_vertebrate_rankscore phastCons20way_mammalian phastCons20way_mammalian_rankscorephyloP100way_vertebrate phyloP100way_vertebrate_rankscore phyloP20way_mammalian phyloP20way_mammalian_rankscore CLINSIG CLNACC CLNDBN CLNDSDB CLNDSDBID GT AD DP GQ PL GT AD DP GQ PL GT AD DP GQPL
chr1 13115765 rs141111983 C T 2280.92 3 6 0.5 153 synonymous_SNV exonic HNRNPCL2:NM_001136561:exon2:c.G636A:p.E212E HNRNPCL2 0 HNRNPCL2:440563 SYNONYMOUS_CODING(LOW|SILENT|gaG/gaA|E212|293|HNRNPCL2|protein_coding|CODING|ENST00000621994|2|T),NEXT_PROT[coiled-coil_region](LOW||||293|HNRNPCL2|protein_coding|CODING||2|T),INTRON(MODIFIER||||478|WI2-3308P17.2|protein_coding|CODING|ENST00000622351|1|T),INTRON(MODIFIER||||478|PRAMEF26|protein_coding|CODING|ENST00000621259|4|T) gaG gaA E212 0.4772 0.4933 0.4993 0.497 0.5 0.4918 0.4967 0.4996 0.4846 0.4959 0.4998 0.4969 0.499 0.4999 0.4939 0.4969 0.4998 0.4939 0.4125 0.4867 0.1888 0.4981 0.4997 0.3321 0.4604 0 0 0 0 0 0 0 0-0.3847 -0.3847-PC-raw -0.3847-raw 0 0.02308 0 0.87915 0 0 18.131 0 0 0.43 0.23 182 000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 000 0 0 0 0 0 14.91 0 0 0 0 0 SNV 0 0 1 rs141111983=rs28410799 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B00H7EW=0/1 38,18 56 99 722,0,1577 B00H7EX=0/1 31,29 60 99 1166,0,1211 B00H7EY=0/1 26,11 3799 423,0,1098
chr1 13115766 rs150951326 T C 2325.92 3 6 0.5 155 nonsynonymous_SNV exonic HNRNPCL2:NM_001136561:exon2:c.A635G:p.E212G HNRNPCL2 0 HNRNPCL2:440563 NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gAg/gGg|E212G|293|HNRNPCL2|protein_coding|CODING|ENST00000621994|2|C),NEXT_PROT[coiled-coil_region](LOW||||293|HNRNPCL2|protein_coding|CODING||2|C),INTRON(MODIFIER||||478|WI2-3308P17.2|protein_coding|CODING|ENST00000622351|1|C),INTRON(MODIFIER||||478|PRAMEF26|protein_coding|CODING|ENST00000621259|4|C) gAg gGg E212G 0.4775 0.4934 0.4993 0.4972 0.5 0.4919 0.4967 0.4996 0.4851 0.496 0.4998 0.4969 0.4991 0.4999 0.494 0.4969 0.4998 0.494 0.4127 0.4867 0.1875 0.4981 0.4997 0.3323 0.4603 0 0 0.286 -0.453 0.058 0 0.019 0.324 -0.4897 -0.4897-PC-raw -0.4897-raw n 0.01015 0 0.80402 0 0 22.504 0 00.43 0.23 182 0 0.029 0 0 0 0 0 0 0 0 0 0 0 0 000 0 0 0 0 B 0.026 0 B 0.013 0 15.01 0 0 0 0 0 SNV 0.089 0.091 1 rs150951326=rs28410799 0 0.075 0.013 0.947 0.327 0.005 0.09 -0.854 0.044 -0.972 0.023 0 0 0 0 0 B00H7EW=0/1 38,19 57 99 764,0,1574 B00H7EX=0/1 31,30 61 991166,0,1211 B00H7EY=0/1 26,11 37 99 426,0,1056
chr1 13392320 rs767291041 C A 96.12 1 4 0.25 10 nonsynonymous_SNV exonic PRAMEF16:NM_001045480:exon3:c.C1243A:p.P415T,PRAMEF17:NM_001099851:exon3:c.C1243A:p.P415T PRAMEF16,PRAMEF17 0 PRAMEF17:391004 NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cct/Act|P415T|474|PRAMEF17|protein_coding|CODING|ENST00000376098|3|A) Cct Act P415T 0 0.001 0 0 0 0.002 0 0 0 0.0006 0.0016 0 0 0 0.0009 0.0008 0 0 0.0002 0 0 0 0 0.0007 0 0 0 0 0 0 0 0 0 -0.1177 -0.1177-PC-raw -0.1177-raw 0 0.02548 0 0.24739 0 0 0 0 0 0 0 0 0 0 000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 000 0 0 0 10.68 0 0 0 0 0 SNV 0 0 1 rs782058522=rs28410799 000 0 0 0 0 0 0 0 0 0 0 0 0 0 B00H7EW=0/1 4,5 999 123,0,100 B00H7EX=0/0 1,0 1 3 0,3,30 B00H7EY=./. 0,0 0 0 0,0,0
chr1 13392320 rs767291041 C A 70.13 1 6 0.167 37 nonsynonymous_SNV exonic PRAMEF17:NM_001099851:exon3:c.C1243A:p.P415T PRAMEF17 0 PRAMEF17:391004 NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cct/Act|P415T|474|PRAMEF17|protein_coding|CODING|ENST00000376098|3|A) Cct Act P415T 0.0048 0.0006 0 0 0 0 0 00.0002 0.0005 0.0009 0 0 0 0.0007 0.0006 0 0 0.0002 0 0 0 0 0.0004 0 0022.7 3.18 0.442 0 0.453 0.988 -0.1208 -0.1208-PC-raw -0.1208-raw c 0.0207 0.12 0.11274 T 2.72 000 0 0 0 0 0.061 0 0.843 D 0 T 0.356 0.094 T 0.045 -1.097 H 3.83 0.957 0.09 N 1 0.954 D -7.63 D 0.899 1 D 0.875 0.998 11.69 0.784 D 0.001 5.599 0.165 SNV 0.353 0.293 1 rs767291041=rs28410799 0 0.487 0.133 0.019 0.194 0.031 0.148 1.655 0.367 0.621 0.289 0 0 0 0 0 B00H7EW=0/1 3,3 6 94 101,0,94 B00H7EX=0/0 13,0 13 39 0,39,442 B00H7EY=0/0 18,0 18 48 0,48,720
答案1
grep 中有没有选项或参数来指定列?
grep没有字段分隔符选项。
使用以下内容awk相反的方法:
awk -F'\t' -v OFS='\t' '{match($19,/[A-Z]+/); $20=substr($19,RSTART,RLENGTH) FS $20}1' 1.table
match($19,/[A-Z]+/)
- 捕获第 19 字段内的大写字母
$20=substr($19,RSTART,RLENGTH) FS $20
- 从中提取匹配的大写字母19th 字段并将其插入为20第一个字段值
答案2
回答你关于如何做到这一点的字面问题grep
独自的。即使grep
没有为此设计,但使用 GNUgrep
并使用 PCRE 支持构建,您可以这样做:
grep -Po '(?:^(?:[^\t]*\t){18}|\G)[^\t]*?\K[[:upper:]]'
即搜索<not-TABs><tab>
行首或上一个匹配项末尾的 18 个序列 ( \G
),后跟尽可能少的非制表符(因此我们仍在第 19 个字段),后跟大写字母角色,但是\K
我们重置了匹配的大写字符之前的部分。
所以对于这样的输入:
X<tab>X<tab>....<tab>AbC<tab>X<tab>...
它会报告:
A
C
就像你的cut | grep
做法一样。
如果您只对第 19 字段中的第一个大写字符感兴趣,可以将其简化为:
grep -Po '^(?:[^\t]*\t){18}[^\t]*?\K[[:upper:]]'
将其插入为第 20 个柱子,你可以这样做:
paste <(cut -f1-19 < file) <(grep ...above < file) <(cut -f20- < file) > newfile
或者将其插入为最后一列:
grep... < file | paste file - > newfile
答案3
和sed
你一起可以做到
sed '/^#/!s/\([^ ]* *\)\{18\}[a-z]*\([A-Z]\).*/& \2/'
也就是说,对于所有不以#
(/^#/!
选择器)开头的行,在 18 个非空格和空格的组合之后,将大写字母标记为 以便\(\)
以后引用它,将整行本身“替换”,并用找到的大写字母附加空格信。
如果您喜欢扩展正则表达式,也可以使用
sed -E '/^#/!s/([^ ]* *){18}[a-z]*([A-Z]).*/& \2/'
如果列之间用制表符而不是空格分隔,则可以
sed -E '/^#/!s/([^\t]*\t){18}[a-z]*([A-Z]).*/&\t\2/'