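I have several files in a directory: file1.txt file2.txt file3.txt
For each file in the directory, I only want to print the unique columns. Column 1 will match either column 3 or column 4; I want to print columns 1 and 2 plus whichever of columns 3 and 4 does not repeat column 1, and save the result as f_parsed.txt
file1.txt: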
gene1 description1 gene1 gene88
gene56 description2 gene67 gene56
gene6 description3 gene95 gene6
file1_parsed.txt:
gene1 description1 gene88
gene56 description2 gene67
gene6 description3 gene95
This is the code I have so far:
for f in *.txt ; do while IFS= read -r line; do awk -F "," '{if ($3 = $1) {print $1, $2, $3} else {print $1, $2, $4}}' > $f_parsed.txt;
done
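Then, for each parsed file, I want to take the gene in column 3 of f_parsed.txt, look it up in file_B.fasta with grep, and return the matching line together with the line that follows it. All lines belonging to a match are saved as match1.txt (for the next file this becomes match2.txt).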
file_B.fasta looks like this:
>gene88 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene6 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene4 | jeiai | dhdhd
GTCAGTTTTTA...
>gene67 | vdiic | behej
GTCAGTTTTTA...
>gene95 | siis | ahifehhniniuh
TAGTCTTTCAAAAGA...
...
cat f_parsed.txt | while IFS= read -r line; do grep "$3" file_B.fasta |awk '{x=NR+1}(NR<=x){print}' > match1.txt ; done
The final output for the example file I started with should be called match1.txt and look like this:
>gene88 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene67 | vdiic | behej
GTCAGTTTTTA...
>gene95 | siis | ahifehhniniuh
TAGTCTTTCAAAAGA...
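Thanks in advance! I know the code is rough, but I'm a beginner.
Answer 1
One approach is as follows. We first read the fasta file and build an array keyed by gene name. The value stored under each key is the header line and the line that follows it, separated by a newline.
The output is saved in match*.txt files.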
awk -F '|' '
    # At the start of each file, work out its type from the current FS:
    # "|" means the fasta file, anything else means a gene (CSV) file.
    FNR == 1 { inCsv = !(inFasta = (FS == "|")) }

    # Fasta header line: save it, extract the gene name, and note
    # the number of the line that follows.
    inFasta && /^>/ {
        t = $0; gene = $1
        gsub(/^.|[[:space:]]*$/, "", gene)    # strip leading ">" and trailing blanks
        nxtln = NR + 1
    }

    # Line after a header: store header + sequence line under the gene name.
    inFasta && NR == nxtln { a[gene] = t ORS $0 }

    # CSV file: on its first line, close the previous output file and
    # open the next match<N>.txt; then print the stored record for
    # whichever of fields 3 and 4 differs from field 1.
    inCsv && NF > 3 {
        if (FNR == 1) {
            close(outf)
            outf = "match" (++k) ".txt"
        }
        print a[$($1 == $3 ? 4 : 3)] > outf
    }
' file_B.fasta FS=, file*.txt
$ cat match1.txt
>gene88 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene67 | vdiic | behej
GTCAGTTTTTA...
>gene95 | siis | ahifehhniniuh
TAGTCTTTCAAAAGA...
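The command above passes FS=, and so expects comma-separated gene files, as in the question's awk -F ",". The sample rows in the question are shown separated by spaces, so here is a minimal sketch of the same lookup idea for whitespace-separated input. It is not the code from this answer: it swaps the FS-based file detection for the common NR == FNR test and skips genes that are missing from the fasta file. The scratch directory workdir and the variable names rec, header and other are illustrative assumptions; the sample data is copied from the question.

```shell
#!/bin/sh
# Scratch directory for the demo (assumption: any empty directory will do).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample fasta file, copied from the question.
cat > file_B.fasta <<'EOF'
>gene88 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene6 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene4 | jeiai | dhdhd
GTCAGTTTTTA...
>gene67 | vdiic | behej
GTCAGTTTTTA...
>gene95 | siis | ahifehhniniuh
TAGTCTTTCAAAAGA...
EOF

# Sample gene file, whitespace-separated as shown in the question.
cat > file1.txt <<'EOF'
gene1 description1 gene1 gene88
gene56 description2 gene67 gene56
gene6 description3 gene95 gene6
EOF

awk '
    # First file (the fasta): remember each header plus the line after it.
    NR == FNR {
        if (/^>/) { gene = substr($1, 2); header = $0 }
        else if (gene != "") { rec[gene] = header ORS $0; gene = "" }
        next
    }
    # First line of each gene file: switch to the next match<N>.txt.
    FNR == 1 { if (outf != "") close(outf); outf = "match" (++k) ".txt" }
    # Pick the column that differs from column 1; print its record if known.
    NF > 3 {
        other = ($1 == $3) ? $4 : $3
        if (other in rec) print rec[other] > outf
    }
' file_B.fasta file1.txt

cat match1.txt
```

With more gene files, list them after file_B.fasta (for example file1.txt file2.txt file3.txt); each one then gets its own match<N>.txt in the order given.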
Answer 2
awk '{if($1 == $3) {print $1,$2,$NF}else{if($1 == $NF){print $1,$2,$3}}}' filename
Output
gene1 description1 gene88
gene56 description2 gene67
gene6 description3 gene95
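To run this over every file and save each result, as the question asks, the command can be wrapped in a loop. The following is a rough sketch, assuming whitespace-separated files named file*.txt; workdir and out are illustrative names. Note that in the question's `> $f_parsed.txt` the shell reads `$f_parsed` as a variable called f_parsed, which is unset, so the sketch builds the output name with `${f%.txt}_parsed.txt` instead.

```shell
#!/bin/sh
# Scratch directory for the demo (assumption: any empty directory will do).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample gene file, copied from the question.
cat > file1.txt <<'EOF'
gene1 description1 gene1 gene88
gene56 description2 gene67 gene56
gene6 description3 gene95 gene6
EOF

for f in file*.txt; do
    # Skip output left over from an earlier run.
    case $f in *_parsed.txt) continue ;; esac

    # file1.txt -> file1_parsed.txt
    out="${f%.txt}_parsed.txt"

    # Same test as above: keep whichever of column 3 and the last
    # column does not repeat column 1.
    awk '$1 == $3  { print $1, $2, $NF; next }
         $1 == $NF { print $1, $2, $3 }' "$f" > "$out"
done

cat file1_parsed.txt
```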