Using grep to find an exact word match containing periods


I have a huge .csv file in this format:

"acc","lineage"
"MT993865","B.1.509"
"MW483477","B.1.402"
"MW517757","B.1.2"
"MW517758","B.1.2"
"MW592770","B.1.564"
...

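That is, the first column is the accession_id, a string identifying the data sample, and the second column is the COVID variant lineage. I want to extract the accession_ids and their lineages for some specific variants of interest, e.g. Omicron, i.e. B.1.1.529. I tried searching the file with grep -w, but since . is a non-word character, this also returns the sublineages that extend Omicron, e.g. B.1.1.529.1.

For the sake of a detailed discussion, please have a look at this bash script I wrote: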

# filter data based on the selected lineages (refer to variants_lineage.txt for more info) as given below.

# File with metadata
metadata_file="$HOME/thesis/SARS-CoV2-data/metadata.csv"
cat "$metadata_file" | tr -d '"' | tr ',' $'\t' > adj_metadata.tsv

# list of lineages of interest
selected_lineages=("B.1.1.7" "B.1.351" "P.1" "B.1.617.2" "B.1.1.5290" "C.37" "B.1.621" "B.1.429" "B.1.427" "CAL.20C" "P.2" "B.1.525" "P.3" "B.1.526" "B.1.617.1" )
pattern=$(echo ${selected_lineages[*]}|tr ' ' '|')

if [ -f "adj_metadata.tsv" ]
then
  echo "File exists"
  for lineage in ${selected_lineages[@]}
    do
      echo "Filtering for lineage $lineage"
      grep -w "$lineage" adj_metadata.tsv >> filtered_metadata.tsv
    done
else
  echo "Adjusted metadata file does not exist."
fi

# Check the unique lineages in filtered_metadata.tsv; this should return the list of selected_lineages
cut -d$'\t' -f2 filtered_metadata.tsv | sort | uniq

Any suggestions/comments are much appreciated.

Also, feel free to comment on improvements unrelated to the question.

Thanks in advance.

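Answer 1

Method 1

Since the strings in the .csv are always enclosed in double quotes ", you can include the quotes in the match. You then just need to put the expression in single quotes ' so that the shell passes the double quotes on to grep unchanged.

Example: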

asdf.csv:

"foo","B.1.1.529"
"bar","B.1.1.529.1"
╰─$ grep '"B.1.1.529"' ./asdf.csv
"foo","B.1.1.529"

As you can see, B.1.1.529.1 does not match in this case.
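The idea behind Method 1 can be tried on a throwaway file. The following is only a sketch: the file name sample.csv, the temporary directory, and the sample rows are made up for illustration. It also adds the -F option, which is not part of the example above; -F makes grep treat the pattern as a fixed string, so each period matches only a literal period.

```shell
#!/bin/bash
# Build a small sample file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '"acc","lineage"' \
  '"foo","B.1.1.529"' \
  '"bar","B.1.1.529.1"' \
  '"baz","B.1.1.5290"' > "$tmpdir/sample.csv"

# -F: fixed-string match. The double quotes inside the pattern pin the
# whole field, so neither B.1.1.529.1 nor B.1.1.5290 can match.
method1_result=$(grep -F '"B.1.1.529"' "$tmpdir/sample.csv")
printf '%s\n' "$method1_result"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Only the "foo" row is printed, because it is the only one whose quoted field is exactly "B.1.1.529".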


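Method 2

While Method 1 works for your input data, it does not work for adj_metadata.tsv, because your script strips all the quotes from it. You could of course modify the script to match first and then pipe the output through tr, but that would mean unnecessary rework.

What you can do instead is anchor the regular expression to the end of the line with $.

Example:

adj_metadata.tsv: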

foo     B.1.1.529
bar     B.1.1.529.1
╰─$ grep "B.1.1.529$" adj_metadata.tsv
foo     B.1.1.529
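To see the difference the anchor makes, the two variants can be compared side by side. This is a minimal sketch on made-up data (sample.tsv and its two rows are assumptions for illustration); it counts matching lines with grep -c and writes the periods as \. so that they match literally.

```shell
#!/bin/bash
# Two tab-separated sample rows in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\t%s\n' foo B.1.1.529 bar B.1.1.529.1 > "$tmpdir/sample.tsv"

# -w alone: "." is a non-word character, so the text B.1.1.529 inside
# B.1.1.529.1 still counts as a whole word and both rows match.
word_count=$(grep -c -w 'B\.1\.1\.529' "$tmpdir/sample.tsv")

# -w plus the end-of-line anchor: the match has to end where the line ends,
# which rules out the longer sublineage.
anchored_count=$(grep -c -w 'B\.1\.1\.529$' "$tmpdir/sample.tsv")

printf 'with -w only: %s line(s)\n' "$word_count"
printf 'with -w and $: %s line(s)\n' "$anchored_count"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

The anchor only helps because the lineage is the last column of each line; if further columns followed, the pattern would have to end at the field separator instead.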

With this method, the only modification you need to make to the script is to add \$ at the right place in the grep command:

#!/bin/bash
# filter data based on the selected lineages (refer to variants_lineage.txt for more info) as given below.

# File with metadata
metadata_file="$HOME/thesis/SARS-CoV2-data/metadata.csv"
cat "$metadata_file" | tr -d '"' | tr ',' $'\t' > adj_metadata.tsv

# list of lineages of interest
selected_lineages=("B.1.1.7" "B.1.351" "P.1" "B.1.617.2" "B.1.1.5290" "C.37" "B.1.621" "B.1.429" "B.1.427" "CAL.20C" "P.2" "B.1.525" "P.3" "B.1.526" "B.1.617.1" )

if [ -f "adj_metadata.tsv" ]
then
  echo "File exists"
  for lineage in "${selected_lineages[@]}"
    do
      echo "Filtering for lineage $lineage"
      # replace all occurrences of "." with "\." so that each period matches literally
      pattern=$(printf '%s\n' "$lineage" | sed 's/\./\\./g')
      grep -w "$pattern\$" adj_metadata.tsv >> filtered_metadata.tsv
    done
else
  echo "Adjusted metadata file does not exist."
fi

# Check the unique lineages in filtered_metadata.tsv; this should return the list of selected_lineages
cut -d$'\t' -f2 filtered_metadata.tsv | sort | uniq

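Note: in a regular expression, . normally stands for any single character, so you need to escape it with a \ to search for a literal ., like this: B\.1\.1\.529$

For simplicity when typing, you can still leave out the \. The pattern then still finds the intended lines, but each . will also match any other single character.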
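How much the escaping matters depends on the data. The following sketch uses two made-up rows (the file sample.tsv and the labels s1 and s2 are hypothetical, and B.1.117 is included only to demonstrate the effect) to show a case where the unescaped pattern matches more than intended:

```shell
#!/bin/bash
# Two tab-separated sample rows in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\t%s\n' s1 B.1.1.7 s2 B.1.117 > "$tmpdir/sample.tsv"

lineage="B.1.1.7"
# Escape each "." so that it matches only a literal period.
escaped=$(printf '%s\n' "$lineage" | sed 's/\./\\./g')

# Unescaped: the third "." matches the middle "1" of B.1.117 as well.
unescaped_count=$(grep -c -w "$lineage\$" "$tmpdir/sample.tsv")

# Escaped: only the exact lineage B.1.1.7 matches.
escaped_count=$(grep -c -w "$escaped\$" "$tmpdir/sample.tsv")

printf 'pattern %s$ matches %s line(s)\n' "$lineage" "$unescaped_count"
printf 'pattern %s$ matches %s line(s)\n' "$escaped" "$escaped_count"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

If the list of lineages is long or may contain similar-looking names, escaping the periods is the safer choice.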
