I have a huge .csv file in this format:
"acc","lineage"
"MT993865","B.1.509"
"MW483477","B.1.402"
"MW517757","B.1.2"
"MW517758","B.1.2"
"MW592770","B.1.564"
...
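That is, the first column (accession_id) is a string identifying the data sample, and the second column is the COVID variant lineage. I want to extract the accession_ids and their lineages for some specific variants of interest, for example Omicron, i.e. B.1.1.529. I tried grep -w on the file, but since . is a non-word character, it also returns variants that extend Omicron, for example B.1.1.529.1.
For a detailed discussion, please have a look at this bash script I wrote: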
# filter data based on the selected lineages (refer to variants_lineage.txt for more info) as given below.
# File with metadata
metadata_file="$HOME/thesis/SARS-CoV2-data/metadata.csv"
cat "$metadata_file" | tr -d '"' | tr ',' $'\t' > adj_metadata.tsv
# list of lineages of interest
selected_lineages=("B.1.1.7" "B.1.351" "P.1" "B.1.617.2" "B.1.1.5290" "C.37" "B.1.621" "B.1.429" "B.1.427" "CAL.20C" "P.2" "B.1.525" "P.3" "B.1.526" "B.1.617.1" )
pattern=$(echo ${selected_lineages[*]}|tr ' ' '|')
if [ -f "adj_metadata.tsv" ]
then
echo "File exists"
for lineage in ${selected_lineages[@]}
do
echo "Filtering for lineage $lineage"
grep -w "$lineage" adj_metadata.tsv >> filtered_metadata.tsv
done
else
echo "Adjusted metadata file does not exist."
fi
# Check for the uniqueness of the filtered_metadata.csv file, this should fetch the list of selected_lineages
cut -d$'\t' -f2 filtered_metadata.tsv | sort | uniq
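Any suggestions or comments are much appreciated.
Also, feel free to comment on improvements unrelated to the question.
Thanks in advance.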
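Answer 1
Method 1
Since the strings in your .csv are always enclosed in double quotes ("), you can include the quotes in the match. You then just need to wrap the expression in single quotes (').
Example:
asdf.csv: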
"foo","B.1.1.529"
"bar","B.1.1.529.1"
╰─$ grep '"B.1.1.529"' ./asdf.csv
"foo","B.1.1.529"
As you can see, B.1.1.529.1 is not matched in this case.
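If several lineages are needed, the same quoting idea can be applied in a single pass. The following is only a minimal sketch, assuming the file has exactly the two quoted columns shown in the question; the file names sample.csv and patterns.txt and the sample rows are made up for illustration. grep -F treats every pattern as a fixed string, so the dots need no escaping, and the closing quote keeps B.1.1.529.1 out.

```shell
#!/bin/bash
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small stand-in for the real metadata file (hypothetical data).
cat > sample.csv <<'EOF'
"acc","lineage"
"foo","B.1.1.529"
"bar","B.1.1.529.1"
"baz","P.1"
"qux","P.1.10"
EOF

# One fixed-string pattern per line: a comma, then the quoted lineage.
# The leading comma ties the match to the second column.
selected_lineages=("B.1.1.529" "P.1")
printf ',"%s"\n' "${selected_lineages[@]}" > patterns.txt

# -F: fixed strings, no regex; -f: read the patterns from a file.
filtered=$(grep -F -f patterns.txt sample.csv)
printf '%s\n' "$filtered"
```

On the sample data this prints only the foo and baz rows. If the real file had more columns, the leading comma would no longer be enough to pin the match to the lineage column.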
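Method 2
While method 1 works for your input data, it does not work for adj_metadata.tsv, because all quotes have been removed from it. You could of course modify the script to match first and then pipe the output through tr, but that would involve unnecessary work.
What you can do instead is anchor the regular expression to the end of the line with $.
Example:
adj_metadata.tsv: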
foo B.1.1.529
bar B.1.1.529.1
╰─$ grep "B.1.1.529$" adj_metadata.tsv
foo B.1.1.529
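With this method, the main change your script needs is a \$ in the right place in the grep command; the version below also escapes the dots in the lineage names: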
#!/bin/bash
# filter data based on the selected lineages (refer to variants_lineage.txt for more info) as given below.
# File with metadata
metadata_file="$HOME/thesis/SARS-CoV2-data/metadata.csv"
cat "$metadata_file" | tr -d '"' | tr ',' $'\t' > adj_metadata.tsv
# list of lineages of interest
selected_lineages=("B.1.1.7" "B.1.351" "P.1" "B.1.617.2" "B.1.1.5290" "C.37" "B.1.621" "B.1.429" "B.1.427" "CAL.20C" "P.2" "B.1.525" "P.3" "B.1.526" "B.1.617.1" )
# escape every "." as "\." in each array element so grep treats it as a literal dot
selected_lineages=("${selected_lineages[@]//./\\.}")
if [ -f "adj_metadata.tsv" ]
then
echo "File exists"
for lineage in "${selected_lineages[@]}"
do
echo "Filtering for lineage $lineage"
grep -w "$lineage\$" adj_metadata.tsv >> filtered_metadata.tsv
done
else
echo "Adjusted metadata file does not exist."
fi
# Check for the uniqueness of the filtered_metadata.csv file, this should fetch the list of selected_lineages
cut -d$'\t' -f2 filtered_metadata.tsv | sort | uniq
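As an alternative to looping over grep, an exact string comparison on the second column sidesteps regular expressions altogether. This is a different technique from the one used in the script above, sketched here with awk under the assumption that the file is tab-separated with the lineage in column 2; the file names sample.tsv and lineages.txt and the sample rows are made up for illustration.

```shell
#!/bin/bash
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for adj_metadata.tsv (tab-separated).
printf '%s\t%s\n' \
    acc lineage \
    foo B.1.1.529 \
    bar B.1.1.529.1 \
    baz B.1.351 > sample.tsv

# One lineage of interest per line.
selected_lineages=("B.1.1.529" "B.1.351")
printf '%s\n' "${selected_lineages[@]}" > lineages.txt

# First file: remember every wanted lineage as an array key.
# Second file: print rows whose second field is one of those keys.
filtered=$(awk -F'\t' 'NR == FNR { wanted[$0]; next } $2 in wanted' lineages.txt sample.tsv)
printf '%s\n' "$filtered"
```

Because the comparison is on the whole field, B.1.1.529.1 is not selected, and the file is read once instead of once per lineage.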
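Note: . normally matches any single character in a regular expression, so to search for a literal . you need to escape it with a \, like this: B\.1\.1\.529$. When typing a quick command by hand you can leave the \ out for simplicity; the pattern still matches the literal dots, it is just slightly less strict.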
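To see what the unescaped dot can cost, the following sketch compares both patterns on made-up data. The file name demo.tsv is hypothetical, and the lineage B.1.1X529 is invented purely to trigger the over-match.

```shell
#!/bin/bash
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up rows; the second lineage is not real and only serves the demo.
printf '%s\t%s\n' \
    foo B.1.1.529 \
    bar B.1.1X529 \
    baz B.1.1.529.1 > demo.tsv

# Unescaped: each "." matches any single character.
unescaped=$(grep -c 'B.1.1.529$' demo.tsv)

# Escaped: each "\." matches only a literal dot.
escaped=$(grep -c 'B\.1\.1\.529$' demo.tsv)

echo "unescaped pattern matched $unescaped line(s)"
echo "escaped pattern matched $escaped line(s)"
```

With real lineage names such a collision is unlikely, which is why leaving the backslashes out is usually harmless for quick interactive use.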