I need to reformat KEGG Reconstruct Pathway
output. file1 contains something like this:
00550 Peptidoglycan biosynthesis (2)
K01000
K02563
00511 Other glycan degradation (8) K01190 K01191
K01192
K01201
K01227
K12309
I need something like this in file2:
00550 Peptidoglycan biosynthesis (2) K01000 K02563
00511 Other glycan degradation (6) K01190 K01191 K01192 K01201 K01227 K12309
How can I reformat it in Linux or Python?
Thanks
Answer 1
How far would this get you:
awk '
!NF        {next                                   # skip empty lines
           }
/^[0-9]+ / {sub (/\([0-9]*\)/, "(" CNT ")", PRT)   # pathway header line (leading number):
                                                   # correct the count in parentheses of the buffered record
            if (PRT) print PRT                     # print the PRT buffer (still empty at the first header, so nothing is printed)
            PRT = ""                               # empty it after printing
            CNT = gsub (/K[0-9]+/, "&") - 1        # count the "K..." IDs on this header line; -1 offsets the increment below
           }
           {PRT = sprintf ("%s%s%s", PRT, PRT?" ":"", $0)   # append this line to the buffer
            CNT++                                  # increment the "K..." count
           }
END        {sub (/\([0-9]*\)/, "(" CNT ")", PRT)   # correct the count of the last record, see above
            print PRT
           }
' file1
00550 Peptidoglycan biosynthesis (2) K01000 K02563
00511 Other glycan degradation (6) K01190 K01191 K01192 K01201 K01227 K12309
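Since the question also mentions Python, here is a minimal sketch of the same idea. It assumes every pathway line starts with a five-digit map ID and that KO identifiers always look like K followed by five digits. The function name reformat and the three regex names are made up for illustration. As in the awk version, the count in parentheses is recomputed from the K numbers actually found.

```python
import re

# Assumption: pathway header lines begin with a 5-digit KEGG map ID.
HEADER = re.compile(r"^\d{5}\s")
# Assumption: KO identifiers are "K" followed by exactly five digits.
KO_ID = re.compile(r"\bK\d{5}\b")
# The "(n)" count at the end of a header once its K numbers are stripped.
COUNT = re.compile(r"\(\d+\)$")


def reformat(lines):
    """Join each pathway header with its K numbers and fix the count."""
    records = []  # list of [header_text, [ko_ids]]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if HEADER.match(line):
            # New pathway: split off any K numbers sitting on the header line.
            kos = KO_ID.findall(line)
            header = KO_ID.sub("", line).strip()
            records.append([header, kos])
        elif records:
            # Continuation line: add its K numbers to the current pathway.
            records[-1][1].extend(KO_ID.findall(line))

    out = []
    for header, kos in records:
        # Replace the stated count with the number of K numbers collected.
        header = COUNT.sub("(%d)" % len(kos), header)
        out.append(" ".join([header] + kos))
    return out
```

To use it on the files from the question, read file1 with open("file1") and pass the file object to reformat, then write "\n".join(result) to file2.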