How to remove the comma and print the whole line again for each word after the comma

File:

chr1_156186369  chr1_156186369_A_C,T    A   C,T  33150.29  1/2:0,4,6:10:88:272
chr19_27732257  chr19_27732257_G_C      G   C    262.29    1/2:1,10,7:18:99:414,167
chrM_2619       chrM_2619_A_G,T         A   G,T  33023.29  1/2:0,5,5:10:99:293,144,129
chr9_119375271  chr9_119375271_T_A,G    T   A,G  248.29    1/2:1,11,5:17:99:359,107,113

I just need to remove the commas from columns 2 and 4, and print the whole line once for each word separated by the comma.

Expected output:

chr1_156186369  chr1_156186369_A_C  A   C   33150.29  1/2:0,4,6:10:88:272
chr1_156186369  chr1_156186369_A_T  A   T   33150.29  1/2:0,4,6:10:88:272 
chr19_27732257  chr19_27732257_G_C  G   C   262.29    1/2:1,10,7:18:99:414,167
chrM_2619       chrM_2619_A_G       A   G   33023.29  1/2:0,5,5:10:99:293,144,129
chrM_2619       chrM_2619_A_T       A   T   33023.29  1/2:0,5,5:10:99:293,144,129
chr9_119375271  chr9_119375271_T_A  T   A   248.29    1/2:1,11,5:17:99:359,107,113
chr9_119375271  chr9_119375271_T_G  T   G   248.29    1/2:1,11,5:17:99:359,107,113 

I tried awk but did not get any result. I also read a similar question here: How to extract lines from a file under specific conditions

Answer 1

Using awk:

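awk '{
  n = split($4, w4, ",")   # alternatives in column 4, e.g. C and T
  split($2, w2, ",")       # w2[1] is the ID up to the first comma
  for (i = 1; i <= n; i++) {
    # substr() positions start at 1: drop the old suffix from w2[1], then append w4[i]
    print $1, substr(w2[1], 1, length(w2[1]) - length(w4[i])) w4[i], $3, w4[i], $5, $6
  }
}' file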

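Note that there is no error handling for the case where the values after the comma in column 2 and column 4 do not match.

Answer 2

With sed, assuming single-character comma-separated values such as C,T that are repeated in the line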
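The missing error handling could be sketched roughly as follows. This is only one possible approach, not the answerer's code: the helper name split_alleles is hypothetical, and it assumes the allele in column 2 is whatever follows the last underscore. Rows where columns 2 and 4 disagree are reported on stderr and skipped.

```shell
# split_alleles is a hypothetical helper name; it reads rows on stdin.
split_alleles() {
  awk '{
    n2 = split($2, ids, ",")     # e.g. chr1_156186369_A_C and T
    n4 = split($4, alts, ",")    # e.g. C and T
    prefix = ids[1]
    sub(/[^_]*$/, "", prefix)    # everything up to and including the last underscore
    ids[1] = substr(ids[1], length(prefix) + 1)   # allele part of the first ID
    ok = (n2 == n4)
    for (i = 1; ok && i <= n4; i++)
      if (ids[i] != alts[i]) ok = 0
    if (!ok) {
      print "line " NR ": columns 2 and 4 disagree, skipped" > "/dev/stderr"
      next
    }
    for (i = 1; i <= n4; i++)
      print $1, prefix alts[i], $3, alts[i], $5, $6
  }'
}

# Example: split_alleles < file
```

Validating the whole row before printing anything avoids emitting a partial set of lines for a bad row.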

$ sed -E 's/^(.*)([A-Z]),([A-Z])(.*)\2,\3(.*)/\1\2\4\2\5\n\1\3\4\3\5/' ip.txt 
chr1_156186369  chr1_156186369_A_C    A   C  33150.29  1/2:0,4,6:10:88:272
chr1_156186369  chr1_156186369_A_T    A   T  33150.29  1/2:0,4,6:10:88:272
chr19_27732257  chr19_27732257_G_C      G   C    262.29    1/2:1,10,7:18:99:414,167
chrM_2619       chrM_2619_A_G         A   G  33023.29  1/2:0,5,5:10:99:293,144,129
chrM_2619       chrM_2619_A_T         A   T  33023.29  1/2:0,5,5:10:99:293,144,129
chr9_119375271  chr9_119375271_T_A    T   A  248.29    1/2:1,11,5:17:99:359,107,113
chr9_119375271  chr9_119375271_T_G    T   G  248.29    1/2:1,11,5:17:99:359,107,113
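  • ^(.*) the text at the start of the line
  • ([A-Z]),([A-Z]) two single characters separated by a comma
  • (.*) the text between the two occurrences
  • \2,\3 matches the same comma-separated pair again
  • (.*) the rest of the line
  • \1\2\4\2\5\n\1\3\4\3\5 the required output format: the line once with the first character, then again with the second
  • Note that the spacing does not exactly match the expected output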
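If the uneven spacing matters, one option is to normalise it afterwards. The sketch below assumes that single-space-separated output is acceptable (the question does not say so), and squeeze_spaces is a hypothetical helper name. Assigning a field to itself makes awk rebuild the line with single spaces between fields.

```shell
# squeeze_spaces is a hypothetical helper name; it reads lines on stdin.
squeeze_spaces() {
  # $1 = $1 forces awk to rebuild the record using the output separator (one space)
  awk '{ $1 = $1 } 1'
}

# Example: sed -E '...' ip.txt | squeeze_spaces
```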

Answer 3

I don't know how to do this with a single command, but it works with the following bash loop

while IFS= read -r line
do
  if echo "${line}" | grep -q '[[:alpha:]],[[:alpha:]]'
  then
    # First letter pair on the line, such as C,T
    letters=$(echo "${line}" | grep -o '[[:alpha:]],[[:alpha:]]' | head -n 1)
    for letter in $(echo "${letters}" | sed 's/,/ /g')
    do
      # Replace every occurrence of the pair with one letter, padded to keep the columns aligned
      echo "${line}" | sed 's/'"${letters}"'/'"${letter}"'  /g'
    done
  else
    echo "${line}"
  fi
done < data.dat

Answer 4

Split the fourth field on commas, then for each slice replace the trailing X,Y part of the second field with that slice, if there is one:

awk '{
      n=split($4,slices,",")
      for(i=1;i<=n;i++) {
        res=$2
        sub(/.,.*/,slices[i],res)
        print $1, res, $3, slices[i], $5, $6
      }
     }' file

I'm not very fond of the way the fields are printed, since I explicitly list fields 1 through 6, so hopefully the number of fields is fixed.

$ awk '{n=split($4,slices,","); for(i=1;i<=n;i++) {res=$2; sub(/.,.*/,slices[i],res); print $1, res, $3, slices[i], $5, $6}}' a
chr1_156186369 chr1_156186369_A_C A C 33150.29 1/2:0,4,6:10:88:272
chr1_156186369 chr1_156186369_A_T A T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G A G 33023.29 1/2:0,5,5:10:99:293,144,129
chrM_2619 chrM_2619_A_T A T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A T A 248.29 1/2:1,11,5:17:99:359,107,113
chr9_119375271 chr9_119375271_T_G T G 248.29 1/2:1,11,5:17:99:359,107,113
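
If the number of fields is not fixed, a variant that rewrites fields 2 and 4 in place and prints the whole record might be preferable. This is a sketch under that assumption, using the same sub() pattern as above; split_alt_any_width is a hypothetical helper name.

```shell
# split_alt_any_width is a hypothetical helper name; it reads rows on stdin.
split_alt_any_width() {
  awk '{
    n = split($4, slices, ",")
    id = $2                         # keep the original ID before overwriting $2
    for (i = 1; i <= n; i++) {
      res = id
      sub(/.,.*/, slices[i], res)   # replace the X,Y tail with the current slice
      $2 = res
      $4 = slices[i]
      print                         # prints every field, however many there are
    }
  }'
}

# Example: split_alt_any_width < file
```

Because the fields are assigned, awk rebuilds each line with single spaces, as in the output shown above.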
