I have many files that I want to merge using the common ID in column 1.
File1:
MYORGANISM_I_05140.t1 Atypical/PIKK/FRAP
MYORGANISM_I_06518.t1 CAMK/MLCK
MYORGANISM_I_00854.t1 TK-assoc/SH2/SH2-R
MYORGANISM_I_12755.t1 TK-assoc/SH2/Unique
File2:
MYORGANISM_I_05140.t1 VALUES to be taken
MYORGANISM_I_12766.t1 what
File3:
MYORGANISM_I_16941.t1 OK
MYORGANISM_I_93484.t1 LET IT BE
I want to merge many files and insert "-NA-" wherever a value is missing. My desired output:
MYORGANISM_I_05140.t1 Atypical/PIKK/FRAP VALUES to be taken -NA-
MYORGANISM_I_06518.t1 CAMK/MLCK -NA- -NA-
MYORGANISM_I_00854.t1 TK-assoc/SH2/SH2-R -NA- -NA-
MYORGANISM_I_12755.t1 TK-assoc/SH2/Unique -NA- -NA-
MYORGANISM_I_12766.t1 -NA- what -NA-
MYORGANISM_I_16941.t1 -NA- -NA- OK
MYORGANISM_I_93484.t1 -NA- -NA- LET IT BE
Answer 1
This may not be possible with a single tool. Here is a script-based suggestion that involves a call to sort and two temporary external files.
#!/bin/bash
# The number of columns is equal to the number of input files, which is
# equal to the number of command-line arguments.
NUMCOLS=$#
# Use associative container to record all "IDs" and associated fields
declare -A entries
col=0
# Read the fields from all files and store them so that the field values can be
# associated with the file they came from (= the column they belong to).
for FILE in "$@"
do
    # -r keeps backslashes in the data literal
    while read -r id value
    do
        SORTKEY="$id"__"$col"
        entries[$SORTKEY]="$value"
        echo "$id" >> "tmp.ids"
    done < "$FILE"
    col=$((col + 1))
done
# Sort the IDs
sort -u "tmp.ids" > "tmp.ids.sorted"
# Read the sorted IDs back in and generate output lines, where the
# column fields are taken from the associative container "entries" and
# tab-separated.
# If "entries" doesn't contain a value for a given key, output "-NA-" instead.
while read -r id
do
    LINE="$id"
    for (( col=0; col<NUMCOLS; col++ ))
    do
        SORTKEY="$id"__"$col"
        if [[ -z "${entries[$SORTKEY]}" ]]
        then
            LINE=$(printf "%s\t-NA-" "$LINE")
        else
            LINE=$(printf "%s\t%s" "$LINE" "${entries[$SORTKEY]}")
        fi
    done
    echo "$LINE" >> "outfile.txt"
done < "tmp.ids.sorted"
rm tmp.ids tmp.ids.sorted
You can call this as ./sortscript.sh <file1> <file2> ... <fileN>.
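This builds an associative container, entries, and stores every field read from the input files under a key generated from the "ID" field and the column number. The IDs are written to an external file, tmp.ids, so that they can be sorted, which seems to be what you want.
After sorting, the IDs are read back in. Then, for each ID, all available fields belonging to that key are read from the container entries and placed on the output line (variable LINE). If no value is available for a particular ID/column combination, -NA- is written instead.
The output line is then written to the file outfile.txt.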
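The same keying scheme ("ID" plus column number) can also be sketched without the temporary files, by keeping the set of IDs in a second associative array and sorting it through a pipe. This is only a sketch under assumptions: merge_by_id is a hypothetical name, bash 4 or later is assumed for associative arrays, and the result goes to standard output rather than outfile.txt.

```shell
#!/bin/bash
# Sketch only: merge_by_id is a hypothetical helper, not part of the script above.
# Assumes bash 4+ (associative arrays) and whitespace-separated "ID value..." lines.
merge_by_id() {
    local -A entries=()   # key "<id>__<column>" -> value
    local -A seen=()      # set of every ID encountered
    local col=0 file id value line key

    for file in "$@"; do
        # "|| [[ -n $id ]]" also picks up a last line without a trailing newline
        while read -r id value || [[ -n $id ]]; do
            [[ -n $id ]] || continue          # skip blank lines
            entries["${id}__${col}"]=$value
            seen["$id"]=1
        done < "$file"
        col=$((col + 1))
    done

    (( ${#seen[@]} )) || return 0             # nothing was read

    # Sort the collected IDs, then emit one tab-separated line per ID.
    while read -r id; do
        line=$id
        for (( col = 0; col < $#; col++ )); do
            key="${id}__${col}"
            # Fall back to -NA- when the key is unset or empty
            line+=$'\t'"${entries[$key]:--NA-}"
        done
        printf '%s\n' "$line"
    done < <(printf '%s\n' "${!seen[@]}" | LC_ALL=C sort)
}
```

Calling merge_by_id File1 File2 File3 > outfile.txt would then give the same kind of table as the script, one column per input file.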
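Answer 2
You can use the join utility twice to perform two "outer joins" over the three files. Assuming all three files are tab-delimited, start with the first two files: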
$ join -a 1 -a 2 -o 0,1.2,2.2 -e '-NA-' -t $'\t' <( sort File1 ) <( sort File2 )
MYORGANISM_I_00854.t1 TK-assoc/SH2/SH2-R -NA-
MYORGANISM_I_05140.t1 Atypical/PIKK/FRAP VALUES to be taken
MYORGANISM_I_06518.t1 CAMK/MLCK -NA-
MYORGANISM_I_12755.t1 TK-assoc/SH2/Unique -NA-
MYORGANISM_I_12766.t1 -NA- what
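This asks the join utility to join the sorted files on the first field (the default). With -a 1 -a 2 we explicitly ask for all lines from both files, even the ones that do not match, and with -o 0,1.2,2.2 we request that the output contain the join field (the first column) followed by the second column of each file. The -e '-NA-' option specifies the string used to fill in empty fields.
The above gives us a new dataset that we can use in a second join with the third file. For simplicity, assume that the result above is available in tmpdata (after redirecting it there); then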
$ join -a 1 -a 2 -o 0,1.2,1.3,2.2 -e '-NA-' -t $'\t' tmpdata <( sort FILE3 )
MYORGANISM_I_00854.t1 TK-assoc/SH2/SH2-R -NA- -NA-
MYORGANISM_I_05140.t1 Atypical/PIKK/FRAP VALUES to be taken -NA-
MYORGANISM_I_06518.t1 CAMK/MLCK -NA- -NA-
MYORGANISM_I_12755.t1 TK-assoc/SH2/Unique -NA- -NA-
MYORGANISM_I_12766.t1 -NA- what -NA-
MYORGANISM_I_16941.t1 -NA- -NA- OK
MYORGANISM_I_93484.t1 -NA- -NA- LET IT BE
This more or less repeats the previous "outer join", but adds one extra column with the -o option.
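Since each further file only adds one more 1.N entry to the -o list, the two steps above could be folded into a loop over any number of files. The following is a rough sketch, not part of the answer: join_all is a hypothetical helper, and it assumes bash, tab-delimited two-column files, and unique IDs within each file.

```shell
#!/bin/bash
# Sketch only: join_all is a hypothetical helper that repeats the outer join
# shown above for every remaining file. Assumes tab-delimited two-column input.
join_all() {
    local tab=$'\t' acc next fmt n=1 i f
    acc=$(mktemp) next=$(mktemp)

    # Start with the first file, sorted on the ID column.
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$acc"
    shift

    for f in "$@"; do
        # Output format: join field, the n value columns gathered so far,
        # then the value column of the newly joined file.
        fmt=0
        for (( i = 2; i <= n + 1; i++ )); do fmt+=",1.$i"; done
        fmt+=",2.2"
        LC_ALL=C join -t "$tab" -a 1 -a 2 -e '-NA-' -o "$fmt" \
            "$acc" <(LC_ALL=C sort -t "$tab" -k1,1 "$f") > "$next"
        mv "$next" "$acc"
        n=$((n + 1))
    done

    cat "$acc"
    rm -f "$acc" "$next"
}
```

With three input files the loop builds the format lists 0,1.2,2.2 and then 0,1.2,1.3,2.2, the same ones used in the two commands above. Both sort and join run under LC_ALL=C here so that they agree on the ordering.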
Answer 3
A one-liner to run in a shell (/bin/sh or /bin/bash):
FILES="File1 File2 FILE3"; LIST=$(for F in ${FILES}; do cat ${F}|awk '{print $1}'; done|sort|uniq|xargs); for i in ${LIST}; do echo -n "$i"; for F in ${FILES}; do L=$(grep "^${i}\s" ${F}|head -1|sed 's/\t/ /'|cut -d' ' -f 2-|sed 's/^\s*//g'); [ -z "${L}" ] && echo -n " -NA-" || echo -n " ${L}" ; done; echo; done|sort
Output:
MYORGANISM_I_00854.t1 TK-assoc/SH2/SH2-R -NA- -NA-
MYORGANISM_I_05140.t1 Atypical/PIKK/FRAP VALUES to be taken -NA-
MYORGANISM_I_06518.t1 CAMK/MLCK -NA- -NA-
MYORGANISM_I_12755.t1 TK-assoc/SH2/Unique -NA- -NA-
MYORGANISM_I_12766.t1 -NA- what -NA-
MYORGANISM_I_16941.t1 -NA- -NA- OK
MYORGANISM_I_93484.t1 -NA- -NA- LET IT BE
Explanation:
# create the list of files
# it can be created based on a search like
# find . -type f -name filename.txt
# or something different
FILES="File1 File2 FILE3";
# create a list of the unique first fields (IDs) from all files in the list FILES
LIST=$(for F in ${FILES}; do cat ${F}|awk '{print $1}'; done|sort|uniq|xargs);
# take each ID one by one,
# go through all the files to find the corresponding line endings
# and put them together,
# or take '-NA-' where there is no matching line
for i in ${LIST}; do
    echo -n "$i";
    for F in ${FILES}; do
        #
        # old version of the line, commented out
        # L=$(grep "^${i}\s" ${F}|head -1|cut -d' ' -f 2-|sed 's/^\s*//g');
        # new version of the line, so that a tab separator also works
        L=$(grep "^${i}\s" ${F}|head -1|sed 's/\t/ /'|cut -d' ' -f 2-|sed 's/^\s*//g');
        #
        [ -z "${L}" ] && echo -n " -NA-" || echo -n " ${L}" ;
    done;
    echo;
done|sort
# sorted results printed
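The loop above runs grep once per ID and per file, which can get slow for many IDs. As a possible alternative, and a different technique from the one-liner, the same table could be built in a single awk pass. This is only a sketch: merge_awk is a hypothetical name, columns are separated by tabs in the output, and it assumes no input file is empty (an empty file would not start a new column).

```shell
#!/bin/bash
# Sketch only: merge_awk is a hypothetical helper, not part of the one-liner.
# Assumes every input file has at least one line.
merge_awk() {
    [ "$#" -gt 0 ] || return 0
    awk -v n="$#" '
        FNR == 1 { col++ }                   # first line of a file: next column
        NF == 0  { next }                    # skip blank lines
        {
            id = $1
            sub(/^[ \t]*[^ \t]+[ \t]*/, "")  # drop the ID and the separator after it
            val[id, col] = $0                # remember the rest of the line
            ids[id] = 1
        }
        END {
            for (id in ids) {
                line = id
                for (c = 1; c <= n; c++)
                    line = line "\t" (((id, c) in val && val[id, c] != "") ? val[id, c] : "-NA-")
                print line
            }
        }
    ' "$@" | LC_ALL=C sort
}
```

It would be called as merge_awk File1 File2 FILE3. The final sort is still needed, because awk does not guarantee any order when iterating over the ids array.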