Assign a unique identifier or serial number to duplicate or unique values based on any column


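I have an input file like the one below, which contains duplicate values. I want to assign a unique identifier to each unique value in the first column (col1). Rows that repeat a col1 value should share the same identifier before the decimal point, with the part after the decimal point counting the occurrences. Any help, guys? Thanks in advance.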

    Ca3CNSNP431180  2428    2435    0   TTTATttt    AT-Hook 1
    Ca3CNSNP431179  2429    2437    0   TTATTttat   AT-Hook 1
    Ca3CNSNP431178  2428    2436    0   TTTATttta   AT-Hook 1
    Ca4CNSNP431177  1384    1388    0   ATTGA   NF-YB;NF-YA;NF-YC   1
    Ca4CNSNP431176  1382    1386    0   AGATT   Myb/SANT;MYB;ARR-B  1
    Ca4CNSNP431175  1382    1386    0   AGATT   GATA;tify   1
    Ca4CNSNP431174  1386    1398    0   tgaAATTTtcatt   TCR;CPP 2
    Ca4CNSNP431174  1386    1398    0   tgaAATTTtcatt   TCR;CPP 2
    Ca4CNSNP431172  1383    1395    0   gattgAAATTttc   TCR;CPP 2
    Ca4CNSNP431172  1383    1395    0   gattgAAATTttc   TCR;CPP 2
    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3
    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3
    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3

Desired output:

    identifier  col1    col2    col3    col4    col5    col6    col7
    000001.1    Ca3CNSNP431180  2428    2435    0   TTTATttt    AT-Hook 1
    000002.1    Ca3CNSNP431179  2429    2437    0   TTATTttat   AT-Hook 1
    000003.1    Ca3CNSNP431178  2428    2436    0   TTTATttta   AT-Hook 1
    000004.1    Ca4CNSNP431177  1384    1388    0   ATTGA   NF-YB;NF-YA;NF-YC   1
    000005.1    Ca4CNSNP431176  1382    1386    0   AGATT   Myb/SANT;MYB;ARR-B  1
    000006.1    Ca4CNSNP431175  1382    1386    0   AGATT   GATA;tify   1
    000007.1    Ca4CNSNP431174  1386    1398    0   tgaAATTTtcatt   TCR;CPP 2
    000007.2    Ca4CNSNP431174  1386    1398    0   tgaAATTTtcatt   TCR;CPP 2
    000008.1    Ca4CNSNP431172  1383    1395    0   gattgAAATTttc   TCR;CPP 2
    000008.2    Ca4CNSNP431172  1383    1395    0   gattgAAATTttc   TCR;CPP 2
    000009.1    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3
    000009.2    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3
    000009.3    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3

Answer 1

A short awk solution:

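    awk '{ printf "%06d.%d\t%s\n",(!a[$1]++? ++c:c),a[$1],$0 }' file

  • !a[$1]++ - tests whether the value of the first field $1 is being seen for the first time ($1 is used as the index into array a), then increments its count

  • ++c - c is the prefix (the part before the decimal point); it is incremented for each new unique entry

  • a[$1] - the suffix (the part after the decimal point); it holds the number of times the first-field value $1 has occurred so far

  • %06d.%d - the output format specifier: %06d prints the prefix as a decimal integer padded to 6 digits with leading zeros, and .%d prints the occurrence count after the decimal point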
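The one-liner relies on duplicates being adjacent, as they are in the sample input: a repeated key that reappears after other keys would get whatever prefix c currently holds rather than its original one. A minimal sketch, assuming that case matters for your data, stores the prefix per key instead. The function name assign_ids and the array names id and n are hypothetical, not part of the original answer.

```shell
# Hypothetical helper: number rows by their first column, remembering the
# prefix assigned to each key so non-adjacent duplicates still share it.
assign_ids() {
  awk '{
    # First time this key is seen: allocate the next prefix.
    if (!($1 in id)) id[$1] = ++c
    # n[$1] counts occurrences of the key and becomes the suffix.
    printf "%06d.%d\t%s\n", id[$1], ++n[$1], $0
  }' "$@"
}

# Example input with a non-adjacent duplicate (A, B, A).
printf 'A x\nB y\nA z\n' | assign_ids
```

With this input the third row gets 000001.2, whereas the one-liner would label it 000002.2. The trade-off is one extra array entry per distinct key.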


Output:

    000001.1    Ca3CNSNP431180  2428    2435    0   TTTATttt    AT-Hook 1
    000002.1    Ca3CNSNP431179  2429    2437    0   TTATTttat   AT-Hook 1
    000003.1    Ca3CNSNP431178  2428    2436    0   TTTATttta   AT-Hook 1
    000004.1    Ca4CNSNP431177  1384    1388    0   ATTGA   NF-YB;NF-YA;NF-YC   1
    000005.1    Ca4CNSNP431176  1382    1386    0   AGATT   Myb/SANT;MYB;ARR-B  1
    000006.1    Ca4CNSNP431175  1382    1386    0   AGATT   GATA;tify   1
    000007.1    Ca4CNSNP431174  1386    1398    0   tgaAATTTtcatt   TCR;CPP 2
    000007.2    Ca4CNSNP431174  1386    1398    0   tgaAATTTtcatt   TCR;CPP 2
    000008.1    Ca4CNSNP431172  1383    1395    0   gattgAAATTttc   TCR;CPP 2
    000008.2    Ca4CNSNP431172  1383    1395    0   gattgAAATTttc   TCR;CPP 2
    000009.1    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3
    000009.2    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3
    000009.3    Ca3CNSNP430205  3334    3343    0   tATATAtata  AT-Hook 3
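
The desired output in the question also starts with a header row, which the command in this answer does not print. A minimal sketch, assuming a tab-separated header is acceptable and the input always has seven columns, adds a BEGIN block. The wrapper name with_header is hypothetical.

```shell
# Hypothetical wrapper: same numbering logic as the one-liner in the answer,
# preceded by a header row.
with_header() {
  awk 'BEGIN {
         # Column names taken from the desired output in the question.
         print "identifier\tcol1\tcol2\tcol3\tcol4\tcol5\tcol6\tcol7"
       }
       { printf "%06d.%d\t%s\n", (!a[$1]++ ? ++c : c), a[$1], $0 }' "$@"
}

# Small stand-in input; pass a file name instead to process real data.
printf 'K1 a\nK1 b\nK2 c\n' | with_header
```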
