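I have an input file like the one below, which contains repeated values. I want to assign a unique identifier to each unique value in the first column (col1). Rows that repeat the same col1 value should share the same identifier before the decimal point, with the number after the decimal point counting the repeats. Any help is appreciated, thanks in advance.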
Ca3CNSNP431180 2428 2435 0 TTTATttt AT-Hook 1
Ca3CNSNP431179 2429 2437 0 TTATTttat AT-Hook 1
Ca3CNSNP431178 2428 2436 0 TTTATttta AT-Hook 1
Ca4CNSNP431177 1384 1388 0 ATTGA NF-YB;NF-YA;NF-YC 1
Ca4CNSNP431176 1382 1386 0 AGATT Myb/SANT;MYB;ARR-B 1
Ca4CNSNP431175 1382 1386 0 AGATT GATA;tify 1
Ca4CNSNP431174 1386 1398 0 tgaAATTTtcatt TCR;CPP 2
Ca4CNSNP431174 1386 1398 0 tgaAATTTtcatt TCR;CPP 2
Ca4CNSNP431172 1383 1395 0 gattgAAATTttc TCR;CPP 2
Ca4CNSNP431172 1383 1395 0 gattgAAATTttc TCR;CPP 2
Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
Desired output:
identifier col1 col2 col3 col4 col5 col6 col7
000001.1 Ca3CNSNP431180 2428 2435 0 TTTATttt AT-Hook 1
000002.1 Ca3CNSNP431179 2429 2437 0 TTATTttat AT-Hook 1
000003.1 Ca3CNSNP431178 2428 2436 0 TTTATttta AT-Hook 1
000004.1 Ca4CNSNP431177 1384 1388 0 ATTGA NF-YB;NF-YA;NF-YC 1
000005.1 Ca4CNSNP431176 1382 1386 0 AGATT Myb/SANT;MYB;ARR-B 1
000006.1 Ca4CNSNP431175 1382 1386 0 AGATT GATA;tify 1
000007.1 Ca4CNSNP431174 1386 1398 0 tgaAATTTtcatt TCR;CPP 2
000007.2 Ca4CNSNP431174 1386 1398 0 tgaAATTTtcatt TCR;CPP 2
000008.1 Ca4CNSNP431172 1383 1395 0 gattgAAATTttc TCR;CPP 2
000008.2 Ca4CNSNP431172 1383 1395 0 gattgAAATTttc TCR;CPP 2
000009.1 Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
000009.2 Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
000009.3 Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
Answer 1
Short awk solution:
awk '{ printf "%06d.%d\t%s\n",(!a[$1]++? ++c:c),a[$1],$0 }' file
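!a[$1]++ - true only the first time a given value of the first field $1 is seen ($1 is used as the index of array a)
++c - c is the prefix value (before the decimal point); it is incremented for each new unique entry
a[$1] - the suffix value (after the decimal point): the number of times the first field value $1 has occurred so far
%06d.%d - output format specifier, where %06d prints the integer part zero-padded to a width of 6 digits (the 06) and .%d prints the dot followed by the fractional part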
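To make the counter logic easier to follow, here is a spelled-out sketch of the same one-liner. The names number_rows, seen, and id are illustrative only (seen plays the role of a, id the role of c), and the two-column sample input is made up for the demonstration.

```shell
# Spelled-out equivalent of the one-liner above.
# number_rows, seen and id are hypothetical names, not part of the original answer.
number_rows() {
  awk '{
    if (!($1 in seen)) {   # first time this col1 value appears
      id++                 # start a new integer part
    }
    seen[$1]++             # occurrences of this col1 value so far
    printf "%06d.%d\t%s\n", id, seen[$1], $0
  }' "$@"
}

# Small made-up sample: two distinct keys, the second one repeated.
printf '%s\n' 'k1 10' 'k2 20' 'k2 20' | number_rows
```

On this sample the identifiers come out as 000001.1, 000002.1 and 000002.2, each followed by a tab and the original row.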
Output:
000001.1 Ca3CNSNP431180 2428 2435 0 TTTATttt AT-Hook 1
000002.1 Ca3CNSNP431179 2429 2437 0 TTATTttat AT-Hook 1
000003.1 Ca3CNSNP431178 2428 2436 0 TTTATttta AT-Hook 1
000004.1 Ca4CNSNP431177 1384 1388 0 ATTGA NF-YB;NF-YA;NF-YC 1
000005.1 Ca4CNSNP431176 1382 1386 0 AGATT Myb/SANT;MYB;ARR-B 1
000006.1 Ca4CNSNP431175 1382 1386 0 AGATT GATA;tify 1
000007.1 Ca4CNSNP431174 1386 1398 0 tgaAATTTtcatt TCR;CPP 2
000007.2 Ca4CNSNP431174 1386 1398 0 tgaAATTTtcatt TCR;CPP 2
000008.1 Ca4CNSNP431172 1383 1395 0 gattgAAATTttc TCR;CPP 2
000008.2 Ca4CNSNP431172 1383 1395 0 gattgAAATTttc TCR;CPP 2
000009.1 Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
000009.2 Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
000009.3 Ca3CNSNP430205 3334 3343 0 tATATAtata AT-Hook 3
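Two things are worth noting about the output above. The one-liner reuses the current value of c for a repeated row, so it relies on repeated col1 values sitting on adjacent lines, as they do in the sample. It also does not print the header line shown in the desired output. A possible variant that covers both cases is sketched below; number_rows_grouped, id, count, and n are hypothetical names, and the header text is simply copied from the question.

```shell
# Variant sketch: remembers the integer part per col1 value, so repeats
# that are not adjacent still get the identifier of their first occurrence.
# number_rows_grouped, id, count and n are illustrative names.
number_rows_grouped() {
  awk 'BEGIN {
    # Header copied from the desired output; tab after the identifier
    # column to mirror the data lines printed below.
    print "identifier\tcol1 col2 col3 col4 col5 col6 col7"
  }
  {
    if (!($1 in id)) id[$1] = ++n   # assign the next integer part once per key
    printf "%06d.%d\t%s\n", id[$1], ++count[$1], $0
  }' "$@"
}

# Made-up sample where the repeat of k1 is NOT adjacent to its first occurrence.
printf '%s\n' 'k1 10' 'k2 20' 'k1 10' | number_rows_grouped
```

Here the third row gets 000001.2 rather than 000002.2, because the identifier is looked up by key instead of taken from the running counter. For input where repeats are always adjacent, this behaves the same as the original one-liner apart from the added header.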