我有一个名为snp_data
包含SNP(单核苷酸多态性)染色体数据。这是一个 3 列、以空格分隔的 CSV 文件,其格式如下:
user@host:~$ cat snp_data
snp_id chromosome position
Chr01__912 1 912 1
Chr01__944 1 944 1
Chr01__1107 1 1107 1
Chr01__1118 1 1118 1
Chr01__1146 1 1146 1
Chr01__1160 1 1160 1
...
...
...
Chr17__214708367 17 214708367
Chr17__214708424 17 214708424
Chr17__214708451 17 214708451
Chr17__214708484 17 214708484
Chr17__214708508 17 214708508
请注意,对于每一行,该字段都具有对应和值的snp_id
形式。Chr{chromosome}__{position}
chromosome
position
window
我有另一个名为包含辅助数据的文件。这是一个 5 列、以空格分隔的 CSV 文件,其格式如下:
user@host:~$ cat window
seqname chromosome start end width
1 Chr1 1 15000 15000
2 Chr1 15001 30000 15000
3 Chr1 30001 45000 15000
4 Chr1 45001 60000 15000
5 Chr1 60001 75000 15000
6 Chr1 75001 90000 15000
...
...
...
199954 Chr17 214620001 214635000 15000
199955 Chr17 214635001 214650000 15000
199956 Chr17 214650001 214665000 15000
199957 Chr17 214665001 214680000 15000
199958 Chr17 214680001 214695000 15000
199959 Chr17 214695001 214708580 13580
window
注意和文件之间的对应关系是由文件中字段snp_data
的值和文件中和字段的值决定的,例如,值为in的行对应于值为for的行,并且其行以 a 开头的前缀。chromosome
window
chromosome
snp_id
snp_data
"Chr1"
window
snp_data
1
chromosome
snp_id
Chr01__
对于snp_data
文件中的每一行(每个染色体内的每个 snp),如果该行的值落在该特定染色体的任何行的和值position
给定的范围内,则将文件中的附加到文件中的行。start
end
window
seqname
window
snp_data
对于上面给出的输入,这将是我想要的输出:
user@host:~$ cat desired_output
snp_id chromosome position window
Chr01__912 1 912 1
Chr01__944 1 944 1
Chr01__1107 1 1107 1
...
...
...
Chr17__214708367 17 214708367 199959
Chr17__214708424 17 214708424 199959
Chr17__214708451 17 214708451 199959
Chr17__214708484 17 214708484 199959
Chr17__214708508 17 214708508 199959
要点是位置仅在每个染色体内是唯一的,因此我需要逐个染色体地比较这两个文件(即分别针对每个染色体)。我怎样才能做到这一点?
答案1
这是一个可以执行您想要的操作的 Python 脚本:
#!/usr/bin/env python2
# -*- coding: ascii -*-
"""intersect_snp.py"""
import sys
# Read data from the SNP file into a list
snp_list = []
with open(sys.argv[1], 'r') as snp_file:
for line in snp_file:
snp_row = line.split()
snp_list.append(snp_row)
# Read data from the "window" file into a dictionary of lists
win_dict = {}
with open(sys.argv[2], 'r') as win_file:
for line in win_file:
seqnames, chromosome, start, end, width = win_row = line.split()
if chromosome not in win_dict:
win_dict[chromosome] = []
win_dict[chromosome].append(win_row)
# Compare data and compute results
results = []
# Iterate over snp data rows
for snp_row in snp_list:
# Extract the field values for each snp row
snp_id, chromosome, position = snp_row
# Convert the chromosome to match the format in the "window" file
# i.e. `1` -> `Chr1`
chromosome_name = "Chr{}".format(chromosome)
# Iterate over the "window" rows corresponding to that chromosome
for win_row in win_dict.get(chromosome_name, []):
# Extract the field values for each "window" row
seqnames, chromosome, start, end, width = win_row
# Perform the desired comparison
if int(start) <= int(position) <= int(end):
# If the comparison returns true, construct the result row
result = [snp_id, chromosome, position, seqnames]
results.append(result)
break
# Print the output column headers
columns = ["snp_id", "chromosome", "position", "window"]
print(" ".join(columns))
# Print the results
for row in results:
print(' '.join(row))
请注意,此脚本假定所有行都是数据行。如果您的输入文件已命名snp_data
,window
那么您可以像这样执行它:
python intersect_snp.py snp_data window
如果您的文件有标题行,那么您可以使用tail
跳过/删除标题并像这样执行它:
python intersect_snp.py <(tail -n+2 snp_data) <(tail -n+2 window)
假设这是您的snp_data
文件:
snp_id chromosome position
Chr01__912 1 912
Chr01__944 1 944
Chr01__1107 1 1107
...
...
...
Chr17__214708367 17 214708367
Chr17__214708424 17 214708424
Chr17__214708451 17 214708451
Chr17__214708484 17 214708484
Chr17__214708508 17 214708508
这是你的window
文件:
seqnames chromosome start end width
1 Chr1 1 15000 15000
2 Chr1 15001 30000 15000
3 Chr1 30001 45000 15000
4 Chr1 45001 60000 15000
5 Chr1 60001 75000 15000
...
...
...
199954 Chr17 214620001 214635000 15000
199955 Chr17 214635001 214650000 15000
199956 Chr17 214650001 214665000 15000
199957 Chr17 214665001 214680000 15000
199958 Chr17 214680001 214695000 15000
199959 Chr17 214695001 214708580 13580
然后如果我们运行这个命令:
python intersect_snp.py <(tail -n+2 snp_data) <(tail -n+2 window)
我们得到以下输出:
snp_id chromosome position window
Chr01__912 Chr1 912 1
Chr01__944 Chr1 944 1
Chr01__1107 Chr1 1107 1
...
...
...
Chr17__214708367 Chr17 214708367 199959
Chr17__214708424 Chr17 214708424 199959
Chr17__214708451 Chr17 214708451 199959
Chr17__214708484 Chr17 214708484 199959
Chr17__214708508 Chr17 214708508 199959
答案2
为了避免长时间的等待,您可以使用 Linux 中经常预安装的简约 SQL 引擎 SQLite 来完成此操作。它不运行服务器,而是使用存储在文件中的 SQL 数据库。
在你的 snp_data & window 目录中执行以下操作:
cat snp_data | tr -s " " > snp_data.csv
sed 's#Chr##g' window | tr -s " " > window.csv
这会标准化字段之间的空间并为导入做好准备。
然后将此数据导入 SQL 并运行查询以获取输出:
cat > task.sql
CREATE TABLE snp_data (snp_id text,chromosome int, position int);
CREATE TABLE window (seqname int,chromosome int, c_start int , c_end int, c_width int);
.mode csv
.separator " "
.import snp_data.csv snp_data
.import window.csv window
.mode column
.header on
SELECT D.snp_id, D.chromosome, D.position, W.seqname FROM snp_data D, window W WHERE W.chromosome=D.chromosome AND D.position BETWEN W.c_start AND W.c_end;
[CTRL+D此处停止输入]
最后:
cat task.sql | sqlite3 my_database.db
一般来说,对于大文件来说这应该更快。