I have a tab-delimited file containing rows with different numbers of columns. Like this:
Bin_37:_Pelotomaculum_sp._DTU098 GH3 GH57 GH15 GH18 GT2 GT4 GT28
Bin_45_1:_Thiopseudomonas_denitrificans GH3 GH57 GT2 GT9 CBM48
...
My question is: how can I generate another tab-separated file in which the rows are compared column by column, with each value placed in its own column? Missing values should be filled with NA. For example, like this:
Bin_37:_Pelotomaculum_sp._DTU098 GH3 GH57 GH15 GH18 GT2 GT4 GT28 NA NA
Bin_45_1:_Thiopseudomonas_denitrificans GH3 GH57 NA NA GT2 NA NA GT9 CBM48
...
Answer 1
Perhaps not the most efficient for very large files, but it works.
Input file file1:
Bin_37:_Pelotomaculum_sp._DTU098 GH3 GH57 GH15 GH18 GT2 GT4 GT28
Bin_45_1:_Thiopseudomonas_denitrificans GH3 GH57 GT2 GT9 CBM48
Bin_99:_to_make_sure_no_columns_is_ok
Script (/bin/sh or /bin/bash):
#!/bin/bash
# bash rather than plain sh: $'\t' quoting is not supported by every /bin/sh (e.g. dash)
F="file1";
COLS=$(cat "${F}"|sed 's/^[^\t]*//g;s/\t/\n/g'|sort|uniq|xargs);
# list of all available unique columns in SORTED order
echo "All available columns: [${COLS}]";
echo
# reading from the file line by line
cat "${F}"|while read L; do
    # assign to A the first column (cut splits on tabs by default)
    A=$(echo "${L}"|cut -f1);
    # if A is not empty
    [ -n "${A}" ] &&
    {
        # take one by one all possible column values
        for C in ${COLS}; do
            # if the current line has such a column, add it to A,
            # otherwise add NA to A
            echo "${L} "|grep "\s${C}\s" >/dev/null &&
            A="$A"$'\t'"${C}" ||
            A="$A"$'\tNA';
        done;
        # print result line
        echo "${A}";
    };
done
Output:
All available columns: [CBM48 GH15 GH18 GH3 GH57 GT2 GT28 GT4 GT9]
Bin_37:_Pelotomaculum_sp._DTU098 NA GH15 GH18 GH3 GH57 GT2 GT28 GT4 NA
Bin_45_1:_Thiopseudomonas_denitrificans CBM48 NA NA GH3 GH57 GT2 NA NA GT9
Bin_99:_to_make_sure_no_columns_is_ok NA NA NA NA NA NA NA NA NA
The same (without the list of available columns at the start) as a one-liner:
F="file1"; COLS=$(cat "${F}"|sed 's/^[^\t]*//g;s/\t/\n/g'|sort|uniq|xargs); cat "${F}"|while read L; do A=$(echo "${L}"|cut -f1); [ -n "${A}" ] && { for C in ${COLS}; do echo "${L} "|grep "\s${C}\s" >/dev/null && A="$A"$'\t'"${C}" || A="$A"$'\tNA'; done; echo "${A}"; }; done
Update. An optimized, more efficient version, based on suggestions in the comments (requires /bin/bash):
F="file1"; IFS=$'\n'; COLS=($(sed 's/^[^\t]*//g;s/\t/\n/g' "${F}"|sort -u)); while read -r L; do A="${L%%$'\t'*}"; [ -n "${A}" ] && for C in "${COLS[@]}"; do [[ "${L}"$'\t' == *$'\t'"${C}"$'\t'* ]] && A="$A"$'\t'"${C}" || A="$A"$'\tNA'; done && echo "${A}"; done <"${F}"; unset IFS
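The same approach (collect the sorted union of all values, then print each value or NA for every row) can also be sketched with one awk lookup per row instead of one test per cell, which may scale better to large files. This is only a sketch under assumptions: the function name presence_matrix and the sample data are made up for illustration, and LC_ALL=C is assumed so that the sort order is reproducible.

```shell
#!/bin/sh
# presence_matrix FILE (hypothetical helper): print one row per input line,
# with one column per distinct value found anywhere in FILE (sorted),
# or NA where the row does not contain that value.
presence_matrix() {
    # Pass 1: sorted union of every value after the first column.
    # cut -s skips lines that contain no tab at all.
    cols=$(cut -s -f2- "$1" | tr '\t' '\n' | grep -v '^$' | LC_ALL=C sort -u | tr '\n' ' ')
    # Pass 2: for each row, remember its values, then walk the union.
    awk -v cols="$cols" '
    BEGIN { FS = OFS = "\t"; n = split(cols, col, " ") }
    $1 != "" {
        split("", has)                          # clear the per-row lookup
        for (i = 2; i <= NF; i++) has[$i] = 1
        line = $1
        for (j = 1; j <= n; j++)
            line = line OFS ((col[j] in has) ? col[j] : "NA")
        print line
    }' "$1"
}

# Made-up sample input, not the data from the question.
tmp=$(mktemp)
printf 'a\tGH3\tGT2\nb\tGT2\tCBM48\nc\n' > "$tmp"
presence_matrix "$tmp"
rm -f "$tmp"
```

With this sample, the union is CBM48, GH3, GT2, so row c (which has no values) comes out as three NA columns.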
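Answer 2
Like all the other answers so far, this does not produce the expected output you posted from the input you provided, but if your input actually contains empty tab-separated fields, then this will fill those fields with NAs: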
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == FNR {
    gsub(/\t+$/,"")
    maxNF = (NF>maxNF ? NF : maxNF)
    next
}
{
    for (i=1; i<=maxNF; i++) {
        printf "%s%s", ($i == "" ? "NA" : $i), (i < maxNF ? OFS : ORS)
    }
}
$ awk -f tst.awk file file
Bin_37:_Pelotomaculum_sp._DTU098 GH3 GH57 GH15 GH18 GT2 GT4 GT28
Bin_45_1:_Thiopseudomonas_denitrificans GH3 GH57 GT2 GT9 CBM48 NA NA
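To illustrate the case this script is meant for, here is a small sketch that runs the same two-pass logic as tst.awk on input that does contain an empty tab-separated field. The function name pad_na and the sample rows are assumptions made for the example, not part of the original answer.

```shell
#!/bin/sh
# pad_na FILE (hypothetical helper): same logic as tst.awk, read FILE twice.
pad_na() {
    awk '
    BEGIN { FS = OFS = "\t" }
    # First pass: strip trailing tabs and record the widest row.
    NR == FNR { gsub(/\t+$/, ""); if (NF > maxNF) maxNF = NF; next }
    # Second pass: print every row padded to maxNF, empty fields as NA.
    {
        for (i = 1; i <= maxNF; i++)
            printf "%s%s", ($i == "" ? "NA" : $i), (i < maxNF ? OFS : ORS)
    }' "$1" "$1"
}

# Made-up sample: row a has an empty third field, row b is shorter.
tmp=$(mktemp)
printf 'a\tGH3\t\tGT2\nb\tGH3\n' > "$tmp"
pad_na "$tmp"
rm -f "$tmp"
```

Here the empty field in row a and the two missing trailing fields in row b are all printed as NA, so both rows end up with four columns.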
Answer 3
If your input file is tab-delimited, you can use this GNU awk script:
awk 'BEGIN{RS="[\t\n]"} !NF{$1="NA"} {printf "%s%s", $0, RT}' file
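The record separator RS is set to either a tab or a newline, so that every field becomes its own record and NF holds the number of words in it.
If NF is zero, meaning that there is no word between two tabs, the string NA is assigned to the record.
The script prints each resulting record followed by its record terminator RT (either a \t or a \n).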
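As a quick check of that explanation, the following sketch pipes two made-up rows with empty fields through the same one-liner. The wrapper name fill_empty and the sample rows are assumptions for illustration; gawk must be installed, because RT is specific to GNU awk.

```shell
#!/bin/sh
# fill_empty (hypothetical wrapper): read tab-separated text on stdin and
# print it with empty fields replaced by NA. Requires GNU awk for RT.
fill_empty() {
    gawk 'BEGIN { RS = "[\t\n]" } !NF { $1 = "NA" } { printf "%s%s", $0, RT }'
}

# Made-up sample: one empty field in each row.
printf 'a\tGH3\t\tGT2\nb\t\tGT9\n' | fill_empty
```

Note that this only fills fields that are empty; it does not pad shorter rows to the width of the longest one.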