Extract names from file_a using information from 2 columns in file_b

Answer 1

# save this as script.awk (any file name works)
# note: gensub() below is a GNU awk (gawk) extension

function within_range(val, lower, upper, proximity) {
    # you can specify the "proximity" as required
    return val > lower - proximity && val < upper + proximity
}

BEGIN {
    OFS="\t"
}

$1 == id && within_range(pos, $4, $5, 100) {
    name = gensub(/.*Name=([^\t]*).*/, "\\1", 1)
    if (name ~ /[^[:space:]]+/)
        print id, pos, name
}

Then run:

while read -r id pos
do
    # quote the expansions so unusual characters in id/pos are passed through intact
    awk -v id="$id" -v pos="$pos" -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Make sure the fields in the .tsv files are tab-separated before you process them. My output:

MT  4050    mt-nd2
groupIII    7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375    bnip4
groupVII    17978350    si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2, not mt-nd1.

I would still recommend using Python for data processing.
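
For anyone who wants to follow that suggestion, here is a rough Python sketch of the same idea. It is not a drop-in replacement: it assumes file_a is tab-separated with the ID in column 1, start and end in columns 4 and 5, and a Name=... value somewhere on the line, and that file_b has ID and position in its first two columns. The function names (load_positions, find_names, main) are made up for this example. Unlike the awk loop, it reads file_a only once, so hits come out in file_a order rather than file_b order.

```python
import re
from collections import defaultdict

# Same capture as the awk regex: everything after "Name=" up to the next tab.
NAME_RE = re.compile(r"Name=([^\t]*)")


def within_range(val, lower, upper, proximity=100):
    """True if val lies strictly inside (lower - proximity, upper + proximity)."""
    return lower - proximity < val < upper + proximity


def load_positions(b_lines):
    """Group the query positions from file_b (id<TAB>pos) by id."""
    positions = defaultdict(list)
    for line in b_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            positions[fields[0]].append(int(fields[1]))
    return positions


def find_names(a_lines, positions, proximity=100):
    """Return (id, pos, name) for every file_a row whose range covers a query."""
    hits = []
    for line in a_lines:
        row = line.rstrip("\n")
        fields = row.split("\t")
        # Skip comment/short lines and ids that file_b never asks about.
        if len(fields) < 5 or fields[0] not in positions:
            continue
        match = NAME_RE.search(row)
        if not match or not match.group(1).strip():
            continue  # no usable Name= value on this row
        start, end = int(fields[3]), int(fields[4])
        for pos in positions[fields[0]]:
            if within_range(pos, start, end, proximity):
                hits.append((fields[0], pos, match.group(1)))
    return hits


def main(a_path, b_path, out_path):
    """Read both files and write tab-separated hits to out_path."""
    with open(b_path) as fb:
        positions = load_positions(fb)
    with open(a_path) as fa:
        hits = find_names(fa, positions)
    with open(out_path, "w") as out:
        for chrom, pos, name in hits:
            out.write(f"{chrom}\t{pos}\t{name}\n")
```

Calling main("file_a.tsv", "file_b.tsv", "output.tsv") would then play the role of the shell loop above. Rows without a Name= value are skipped here, which is a deliberate simplification.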

Answer 2

Your expected output looks inconsistent to me (2 lines -> the 1st and 3rd lines). If that is a typo, could you try the following?

awk 'FNR==NR{a[$1]=$2;next} ($1 in a) && (a[$1]>=$4 && a[$1]<=$5){sub("Name=","",$10);print $1,a[$1],$10}'  b.tsv a.tsv > output.tsv
