删除按列组织的数据文件中包含“_XX”的所有子字段

删除按列组织的数据文件中包含“_XX”的所有子字段

我有这种类型的文件

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    69511   rs2691305       A       G       .       PASS    AC=70482;AN=83312;AF=0.846001;popmax=eas;faf95_popmax=0.975176;AC_non_v2_XX=28734;AN_non_v2_XX=33774;AF_non_v2_XX=0.850773;nhomalt_non_v2_XX=13253;AC_non_cancer_fin_XX=1080;AN_non_cancer_fin_XX=1090;AF_non_cancer_fin_XX=0.990826;nhomalt_non_cancer_fin_XX=537;AC_non_neuro_nfe=32992;AN_non_neuro_nfe=34106;AF_non_neuro_nfe=0.967337;nhomalt_non_neuro_nfe=16243;AC_non_neuro_afr_XY=5275;AN_non_neuro_afr_XY=8862;AF_non_neuro_afr_XY=0.595238;nhomalt_non_neuro_afr_XY=1908;AC_non_neuro_nfe_XY=13529;AN_non_neuro_nfe_XY=13954;AF_non_neuro_nfe_XY=0.969543;nhomalt_non_neuro_nfe_XY=6668;AC_controls_and_biobanks_eas_XY=1209;AN_controls_and_biobanks_eas_XY=1210;AF_controls_and_biobanks_eas_XY=0.999174;nhomalt_controls_and_biobanks_eas_XY=604;AC_non_neuro_sas_XX=623;AN_non_neuro_sas_XX=646;AF_non_neuro_sas_XX=0.964396;nhomalt_non_neuro_sas_XX=304;AC_non_v2=53211;AN_non_v2=62346;AF_non_v2=0.853479;nhomalt_non_v2=24615;AC_non_topmed_nfe_XX=4123;AN_non_topmed_nfe_XX=4274;AF_non_topmed_nfe_XX=0.96467;nhomalt_non_topmed_nfe_XX=2022;AC_non_v2_mid=132;AN_non_v2_mid=146;AF_non_v2_mid=0.90411;nhomalt_non_v2_mid=63;AC_non_topmed_sas=2561;AN_non_topmed_sas=2620;AF_non_topmed_sas=0.977481;nhomalt_non_topmed_sas=1262;AC_non_cancer_eas_XX=1882;AN_non_cancer_eas_XX=1882;AC_nfe=34637;AN_nfe=35806;AF_nfe=0.967352;nhomalt_nfe=17050;AC_popmax=4423;AN_popmax=4424;AF_popmax=0.999774;nhomalt_popmax=2211;faf95_sas=0.945756;faf99_sas=0.93297;faf95_eas=0.975176;faf99_eas=0.965135;faf95_amr=0.876957;faf99_amr=0.869526;faf95_afr=0.586993;faf99_afr=0.583779;faf95=0.840765;faf99=0.838605;faf95_nfe=0.958818;faf99_nfe=0.955301;age_hist_het_bin_freq=123|146|149|217|287|242|200|187|124|72;age_hist_het_n_smaller=460;age_hist_het_n_larger=24;age_hist_hom_bin_freq=569|655|765|1405|2028|1746|1691|1518|1005|654;age_hist_hom_n_smaller=1206;age_hist_hom_n_larger=215;FS=0;MQ=41.6849;MQRankSum=-3.98;QD=26.0414;ReadPosRankSum=0.51;VarDP=1761397;QUALapprox=45869299;AS_FS=0;AS_MQ=41.6847;AS_MQRankSum=-3.972;AS_pab_max=1;AS_QD=26.0413;AS_ReadPosRankSum=0.51;AS_SOR=0.865485;InbreedingCoeff=0.521183;AS_SB_TABLE=99037,96244|856526,709549;AS_VQSLOD=0.4266;AS_culprit=AS_MQRankSum;NEGATIVE_TRAIN_SITE;allele_type=snv;n_alt_alleles=2;variant_type=multi-snv;segdup;gq_hist_alt_bin_freq=0|0|0|0|39|89|5016|5173|2494|4247|3343|1353|2059|1764|858|1286|1197|599|785|7691;gq_hist_all_bin_freq=0|0|0|0|2833|660|5232|5237|2506|4249|3343|1354|2060|1764|859|1287|1197|599|785|7691;dp_hist_alt_bin_freq=0|0|12706|9449|5594|4053|2707|1532|828|486|262|166|89|45|36|12|6|7|2|3;dp_hist_alt_n_larger=10;dp_hist_all_bin_freq=0|0|13838|11344|6042|4209|2735|1535|828|487|262|166|89|45|36|12|6|7|2|3;dp_hist_all_n_smaller=0;dp_hist_all_n_larger=10;ab_hist_alt_bin_freq=0|0|0|0|193|367|513|688|876|676|772|504|439|263|124|56|25|8|0|0;cadd_raw_score=0.112916;cadd_phred=2.209;revel_score=0.053;splice_ai_max_ds=0.02;splice_ai_consequence=donor_gain;primate_ai_score=0.632586;vep=G|missense_variant|MODERATE|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding|1/1||ENST00000335137.4:c.421A>G|ENSP00000334393.3:p.Thr141Ala|457|421|141|T/A|Aca/Gca|1||1|SNV||HGNC|HGNC:14825|YES||P1|CCDS30547.1|ENSP00000334393|||||tolerated(0.820)|benign(0.000)|Gene3D:1&Pfam:PF13853&PROSITE_profiles:PS50262&Superfamily:SSF81321&Transmembrane_helices:TMhelix&CDD:cd15226|||||||||,G|missense_variant|MODERATE|OR4F5|ENSG00000186092|Transcript|ENST00000641515|protein_coding|3/3||ENST00000641515.2:c.484A>G|ENSP00000493376.2:p.Thr162Ala|544|484|162|T/A|Aca/Gca|1||1|SNV||HGNC|HGNC:14825|||||ENSP00000493376|||||tolerated(0.850)|benign(0.000)|Transmembrane_helices:TMhelix&CDD:cd15226&PANTHER:PTHR26451&PANTHER:PTHR26451&Pfam:PF13853&PROSITE_profiles:PS50262&Gene3D:1&Superfamily:SSF81321|||||||||,G|missense_variant|MODERATE|OR4F5|79501|Transcript|NM_001005484.1|protein_coding|1/1||NM_001005484.1:c.421A>G|NP_001005484.1:p.Thr141Ala|421|421|141|T/A|Aca/Gca|1||1|SNV||EntrezGene|HGNC:14825|YES||||NP_001005484.1|||||tolerated(0.820)|benign(0.000)||||||||||

并且所有包含在分号之间的字符串都_XX应该被删除,以便从

AF_controls_and_biobanks_eas_XY=0.999174;nhomalt_controls_and_biobanks_eas_XY=604;AC_non_neuro_sas_XX=623;AN_non_neuro_sas_XX=646;

仅有的

AF_controls_and_biobanks_eas_XY=0.999174;nhomalt_controls_and_biobanks_eas_XY=604;

留下来。

我尝试使用sed,但没有成功:

s/[^ ]*_XX[^ ]*//ig

sed -e 's/[^ ]*_XX[^ ]*//ig'

sed '/^_XX/s//'
   
sed 's/*XX//' 

sed 's/^[0-9]*_XX|//'

sed 's/;.*_XX//'

任何帮助拯救我的生命的帮助将不胜感激。

谢谢你!

答案1

这是解决方案:

sed 's/;[^;]*_XX[^;]*//g'

_XX您需要在两个 s 内进行查找;,因此,您应该让每个其他字符都通过。

答案2

Unxnut已经给你了一个可行的解决方案,但为了使其在 VCF 文件中安全工作,您还需要确保跳过标题行(以 开头的行#)并仅将更改应用于 INFO 字段,即该行的第 8 个制表符分隔字段。最后,您还可以使其_XX在子字段是 INFO 字段的第一个子字段时也能起作用。将所有这些放在一起,您可以执行以下操作:

sed -E '/^[^#]/{s/(\t|;)[^;]+_XX=[^;\t]+/\1/g}' file.vcf

答案3

要操作制表符分隔数据的第 8 个字段中的文本,并且仅在不以 开头的行上操作#

awk 'BEGIN { FS = OFS = "\t" }
!/^#/ {
    gsub(/;[^;=]*_XX=[^;]*/, "", $8)
    sub(/^[^;=]*_XX=[^;]*(;|$)/, "", $8)
} 1
' file

答案4

使用磨坊主,将INFO字段拆分为子字段,正则表达式过滤子键,然后将其重新连接在一起:

mlr --tsv put -S '
  m = splitkvx($INFO,"=",";"); 
  for(k,v in m){if(string(k) =~ "_XX$"){m = mapexcept(m,k)}}; 
  $INFO = joinkv(m,"=",";")
' file.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    69511   rs2691305       A       G       .       PASS    AC=70482;AN=83312;AF=0.846001;popmax=eas;faf95_popmax=0.975176;AC_non_neuro_nfe=32992;AN_non_neuro_nfe=34106;AF_non_neuro_nfe=0.967337;nhomalt_non_neuro_nfe=16243;AC_non_neuro_afr_XY=5275;AN_non_neuro_afr_XY=8862;AF_non_neuro_afr_XY=0.595238;nhomalt_non_neuro_afr_XY=1908;AC_non_neuro_nfe_XY=13529;AN_non_neuro_nfe_XY=13954;AF_non_neuro_nfe_XY=0.969543;nhomalt_non_neuro_nfe_XY=6668;AC_controls_and_biobanks_eas_XY=1209;AN_controls_and_biobanks_eas_XY=1210;AF_controls_and_biobanks_eas_XY=0.999174;nhomalt_controls_and_biobanks_eas_XY=604;AC_non_v2=53211;AN_non_v2=62346;AF_non_v2=0.853479;nhomalt_non_v2=24615;AC_non_v2_mid=132;AN_non_v2_mid=146;AF_non_v2_mid=0.90411;nhomalt_non_v2_mid=63;AC_non_topmed_sas=2561;47=AN_non_t

或者(也许更有效)将要排除的元素的键组装到映射中,然后使用mapdiff将它们一起删除:

$ mlr --tsv put -S '
  m = splitkvx($INFO,"=",";");
  e = {};
  for(k,v in m){if(string(k) =~ "_XX$"){e[k] = 1}};
  $INFO = joinkv(mapdiff(m,e),"=",";")
' file.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    69511   rs2691305       A       G       .       PASS    AC=70482;AN=83312;AF=0.846001;popmax=eas;faf95_popmax=0.975176;AC_non_neuro_nfe=32992;AN_non_neuro_nfe=34106;AF_non_neuro_nfe=0.967337;nhomalt_non_neuro_nfe=16243;AC_non_neuro_afr_XY=5275;AN_non_neuro_afr_XY=8862;AF_non_neuro_afr_XY=0.595238;nhomalt_non_neuro_afr_XY=1908;AC_non_neuro_nfe_XY=13529;AN_non_neuro_nfe_XY=13954;AF_non_neuro_nfe_XY=0.969543;nhomalt_non_neuro_nfe_XY=6668;AC_controls_and_biobanks_eas_XY=1209;AN_controls_and_biobanks_eas_XY=1210;AF_controls_and_biobanks_eas_XY=0.999174;nhomalt_controls_and_biobanks_eas_XY=604;AC_non_v2=53211;AN_non_v2=62346;AF_non_v2=0.853479;nhomalt_non_v2=24615;AC_non_v2_mid=132;AN_non_v2_mid=146;AF_non_v2_mid=0.90411;nhomalt_non_v2_mid=63;AC_non_topmed_sas=2561;47=AN_non_t

相关内容