I have the following dataset of SNP IDs:
POS ID
78599583 rs987435
33395779 rs345783
189807684 rs955894
33907909 rs6088791
75664046 rs11180435
218890658 rs17571465
127630276 rs17011450
90919465 rs6919430
and a gene reference file:
genename name chrom strand txstart txend
CDK1 NM_001786 chr10 + 62208217 62224616
CALB2 NM_001740 chr16 + 69950116 69981843
STK38 NM_007271 chr6 - 36569637 36623271
YWHAE NM_006761 chr17 - 1194583 1250306
SYT1 NM_005639 chr12 + 77782579 78369919
ARHGAP22 NM_001347736 chr10 - 49452323 49534316
PRMT2 NM_001535 chr21 + 46879934 46909464
CELSR3 NM_001407 chr3 - 48648899 48675352
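I am trying to match genes to SNP positions, i.e. to keep every SNP/gene pair for which
POS >= txstart and POS <= txend
For example, I would like a dataset with the following columns:
genename SNPID chrom POS txstart txend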
Answer 1
As far as I can tell, your sample files do not contain any matches of the kind you describe.
If we modify the first file to
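One way to check this is sketched below. The file names snps_orig.txt, generef_orig.txt and check.txt are hypothetical, and the rows are copied from the question. The script tests every SNP position against every gene range and counts the hits.

```shell
# Hypothetical file names; the data rows are the ones posted in the question.
cat > snps_orig.txt <<'EOF'
POS ID
78599583 rs987435
33395779 rs345783
189807684 rs955894
33907909 rs6088791
75664046 rs11180435
218890658 rs17571465
127630276 rs17011450
90919465 rs6919430
EOF
cat > generef_orig.txt <<'EOF'
genename name chrom strand txstart txend
CDK1 NM_001786 chr10 + 62208217 62224616
CALB2 NM_001740 chr16 + 69950116 69981843
STK38 NM_007271 chr6 - 36569637 36623271
YWHAE NM_006761 chr17 - 1194583 1250306
SYT1 NM_005639 chr12 + 77782579 78369919
ARHGAP22 NM_001347736 chr10 - 49452323 49534316
PRMT2 NM_001535 chr21 + 46879934 46909464
CELSR3 NM_001407 chr3 - 48648899 48675352
EOF

awk '
# first file (two columns: POS ID): remember the ID stored under each position
NR == FNR { if (FNR > 1) snp[$1] = $2; next }
# second file: compare every stored position with the range of this gene
FNR > 1 {
    for (p in snp)
        if (p+0 >= $5+0 && p+0 <= $6+0) { print $1, snp[p], p; found++ }
}
END { print found+0, "matches" }
' snps_orig.txt generef_orig.txt > check.txt
cat check.txt
```

With the rows as posted, no position falls inside any txstart..txend range, so the only line written is the final count, `0 matches`.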
CHROM POS ID
chr7 78599583 rs987435
chr15 33395779 rs345783
chr1 189807684 rs955894
chr20 33907909 rs6088791
chrx 1234567 rsMadeUp
chr12 75664046 rs11180435
chr1 218890658 rs17571465
chr4 127630276 rs17011450
chr6 90919465 rs6919430
so that the made-up entry (rsMadeUp) falls within the range of one of the genes in
genename name chrom strand txstart txend
CDK1 NM_001786 chr10 + 62208217 62224616
CALB2 NM_001740 chr16 + 69950116 69981843
STK38 NM_007271 chr6 - 36569637 36623271
YWHAE NM_006761 chr17 - 1194583 1250306
SYT1 NM_005639 chr12 + 77782579 78369919
ARHGAP22 NM_001347736 chr10 - 49452323 49534316
PRMT2 NM_001535 chr21 + 46879934 46909464
CELSR3 NM_001407 chr3 - 48648899 48675352
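then
awk '
# first file: remember the ID stored under each position (header line is skipped)
NR == FNR && FNR > 1 {snp[$2]=$3; next}
FNR > 1 {
# p is an array index and therefore a string; p+0 forces a numeric comparison
# note: only positions are compared here, not chromosomes
for (p in snp) {if (p+0>=$5 && p+0<=$6) print $1, snp[p], $3, p, $5, $6}
}
' snpsid generef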
YWHAE rsMadeUp chr17 1234567 1194583 1250306
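The match above pairs a SNP listed on chrx with a gene on chr17, because only positions are compared. If the chromosome should also agree, a variant along the following lines may help. This is a minimal sketch under assumptions, not part of the original answer: the file names snps_chrom.txt, generef_sample.txt and matches.txt are hypothetical, only a few sample rows are used, and the made-up SNP has been moved to chr17 so that it can still match.

```shell
# Hypothetical file names and a reduced sample; rsMadeUp is placed on chr17
# here (an assumption for illustration) so that the chromosome test can pass.
cat > snps_chrom.txt <<'EOF'
CHROM POS ID
chr7 78599583 rs987435
chr17 1234567 rsMadeUp
chr12 75664046 rs11180435
EOF
cat > generef_sample.txt <<'EOF'
genename name chrom strand txstart txend
YWHAE NM_006761 chr17 - 1194583 1250306
SYT1 NM_005639 chr12 + 77782579 78369919
EOF

awk '
# first file: store chromosome, position and ID of every SNP
NR == FNR {
    if (FNR > 1) { n++; chrom[n] = $1; pos[n] = $2; id[n] = $3 }
    next
}
# second file, header line: print the requested column names instead
FNR == 1 { print "genename", "SNPID", "chrom", "POS", "txstart", "txend"; next }
{
    for (i = 1; i <= n; i++)
        # +0 forces a numeric comparison; the chromosome must match as well
        if (chrom[i] == $3 && pos[i]+0 >= $5+0 && pos[i]+0 <= $6+0)
            print $1, id[i], $3, pos[i], $5, $6
}
' snps_chrom.txt generef_sample.txt > matches.txt
cat matches.txt
```

Storing the SNPs in numbered arrays rather than keying them by position also keeps two SNPs that share a position on different chromosomes apart. The chromosome comparison is a plain string test, so chrx and chrX would not be treated as equal.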
Answer 2
You can use awk for this:
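awk 'FNR==1 {next}   # skip the header line of each file
# first file (CHROM POS ID layout assumed): remember every SNP position and ID
FILENAME=="snipsid" {k++; POS[k]=$2; ID[k]=$3; next}
# second file: test every stored SNP against the txstart..txend range of this gene
FILENAME=="gene" {for (i=1; i<=k; i++) if (POS[i]+0>=$5+0 && POS[i]+0<=$6+0)
print $1, ID[i], $3, POS[i], $5, $6}
' snipsid gene >out_file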