I would like to fill the empty space between consecutive tabs with the character "|". The file can be downloaded from here:
wget http://download.cbioportal.org/cancerhotspots/cancerhotspots.v2.maf.gz
cat cancerhotspots.v2.maf | grep -v version | head -3
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POMOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel 000236 NORMAL C C c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att 1 -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 MODERATE 1 SNV ACC . . acyc exome RETAIN V334I 334 V I NA WARS2 334 360 ACC headandneck saca acyc 000236
OPN3 23596 . GRCh37 1 241761094 241761094 + Missense_Mutation SNP G G A rs780348058 000236 NORMAL G G c.899C>T p.Ser300Leu p.S300L ENST00000366554 3/4 0 . . 0 . . OPN3,missense_variant,p.Ser300Leu,ENST00000366554,NM_014322.2;OPN3,missense_variant,p.Ser221Leu,ENST00000331838,;KMO,downstream_gene_variant,,ENST00000366559,NM_003679.4;KMO,downstream_gene_variant,,ENST00000366557,;KMO,downstream_gene_variant,,ENST00000366555,;OPN3,non_coding_transcript_exon_variant,,ENST00000469376,;OPN3,non_coding_transcript_exon_variant,,ENST00000490673,;OPN3,non_coding_transcript_exon_variant,,ENST00000478849,;OPN3,non_coding_transcript_exon_variant,,ENST00000463155,;OPN3,non_coding_transcript_exon_variant,,ENST00000462265,; A ENSG00000054277 ENST00000366554 Transcript missense_variant 1006/2620 899/1209 300/402 S/L tCg/tTg rs780348058 1 -1 OPN3 HGNC 14007 protein_coding YES CCDS31072.1 ENSP00000355512 Q9H1Y3 UPI000000165B NM_014322.2 deleterious(0.02) possibly_damaging(0.692) 3/4 Transmembrane_helices:TMhelix,PROSITE_profiles:PS50262,hmmpanther:PTHR24240:SF64,hmmpanther:PTHR24240,PROSITE_patterns:PS00238,Gene3D:1.20.1070.10,Pfam_domain:PF00001,Superfamily_domains:SSF81321,Prints_domain:PR00237 MODERATE 1 SNV 9.415e-06 0 0 0.0001278 0 0 0 0 . CGA . . 9.426e-06 1/106086 1/106208 0/9066 0/11158 1/7822 0/6612 0/54326 0/694 0/16408 PASS acyc exome RETAIN S300L 300 NA OPN3 300 402 TCG headandneck saca acyc 000236
If a column has no value, there is nothing between the two tabs around it. We can see this when counting the number of columns:
cat cancerhotspots.v2.maf | grep -v version | head -4 | awk '{ print NF }'
148
80
99
81
Desired output: when a column has no value, replace the empty space with the character "|".
cat cancerhotspots.v2.maf | grep -v version | head -2 | sed 's/\t\t/\t|\t/g'
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POMOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel | 000236 NORMAL C C | | | | | | | | c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att | 1 | -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 | Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 | | | | | | | | MODERATE 1 SNV | | | | ACC . . | | | | | | | | | | acyc exome RETAIN V334I 334 | V I NA WARS2 334 360 ACC headandneck saca acyc 000236
cat cancerhotspots.v2.maf | grep -v version | head -4 | sed 's/\t\t/\t|\t/g' | awk '{ print NF }'
148
118
128
118
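Every row should have 148 columns, but the row counts still differ from the header, which has 148.
How can I get every empty column filled with "|" so that all rows have the same number of columns?
Thanks!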
Answer 1
It looks like what you might want is:
awk 'BEGIN{FS=OFS="\t"} {for (i=1;i<=NF;i++) if ($i == "") $i="|"; print}' file
or:
sed 's/\t\t/\t|\t/g; s/\t\t/\t|\t/g' file
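but it's hard to tell from the provided example.
Using commas instead of tabs for visibility, this demonstrates why you need two passes of the substitution with sed: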
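As a rough check of the awk approach, here is a minimal sketch on a made-up four-column sample (the `sample`, `filled` and `counts` names and the data are hypothetical, not taken from the MAF file). It also counts fields with an explicit tab separator: `awk '{ print NF }'` without `-F` splits on runs of whitespace, so consecutive tabs collapse into one separator, which is likely why the counts in the question differ from row to row.

```shell
# Hypothetical tab-separated sample: a header plus two rows with empty fields,
# including an empty first and last field on the final row.
sample=$(printf 'h1\th2\th3\th4\na\t\t\td\n\tb\tc\t\n')

# Fill every empty tab-separated field with "|".
filled=$(printf '%s\n' "$sample" |
  awk 'BEGIN{FS=OFS="\t"} {for (i=1;i<=NF;i++) if ($i == "") $i="|"; print}')
printf '%s\n' "$filled"

# Count fields per row using a tab as the field separator,
# so that empty fields would still be counted.
counts=$(printf '%s\n' "$filled" | awk -F'\t' '{ print NF }')
printf '%s\n' "$counts"
```

With this sample every row should report 4 fields after filling. On the real file, the same `-F'\t'` count can be run before and after the substitution to confirm that each row matches the header.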
$ printf 'a,,,,b\n' | sed 's/,,/,|,/g'
a,|,,|,b
$ printf 'a,,,,b\n' | sed 's/,,/,|,/g; s/,,/,|,/g'
a,|,|,|,b
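because the regexp ,, consumes both commas of each pair it matches and the scan resumes after them. In a run of commas it therefore matches the 1st, 3rd, ... pairs, while the overlapping pairs in between stay unmatched until the second pass. Another example: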
$ printf '12345678\n' | sed 's/\([0-9]\)\([0-9]\)/\1|\2/g'
1|23|45|67|8
$ printf '12345678\n' | sed 's/\([0-9]\)\([0-9]\)/\1|\2/g; s/\([0-9]\)\([0-9]\)/\1|\2/g'
1|2|3|4|5|6|7|8
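One caveat with the sed variant: it only fills empty fields that sit between two separators, so an empty first or last field is left alone, whereas the awk loop handles those too. The sketch below, again using commas for visibility and a made-up input line (the `line`, `interior` and `full` names are hypothetical), adds two anchored substitutions as one possible way to cover those cases.

```shell
# Hypothetical line with empty first, middle and last fields.
line=',a,,,b,'

# Two passes fill the interior empty fields only.
interior=$(printf '%s\n' "$line" | sed 's/,,/,|,/g; s/,,/,|,/g')
printf '%s\n' "$interior"    # ,a,|,|,b,

# Anchored substitutions fill an empty first or last field as well.
full=$(printf '%s\n' "$line" |
  sed 's/^,/|,/; s/,$/,|/; s/,,/,|,/g; s/,,/,|,/g')
printf '%s\n' "$full"        # |,a,|,|,b,|
```

For the tab-separated file the same idea would use \t in place of the commas, assuming a sed that understands \t, as the original command already does.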