Match two files and keep the blocks that contain a match

Answer 1

OK, let's try this little script I wrote:

#!/usr/bin/env bash
set -e

match1=/home/george/Documents/askubuntu/matchme/match1
match2=/home/george/Documents/askubuntu/matchme/match2

# Create the result file
touch results.txt

while read -r word
do
     if [[ "$word" = $(grep -o "$word" "$match1") ]]; then
             if [[ "$word" != $(grep -o "$word" "results.txt") ]]
             then
                     grep "$(grep "$word" "$match1" | grep -o "[[:digit:]]..$")" "$match1" >> "results.txt"
                     while read -r new
                     do                                 
                             if [[ "$new" =~ $word ]]; then
                                     # Replace the words
                                     sed -i "s/$word/$new/" "results.txt"
                             fi
                     done < <(grep -o "${word}_.*\." "$match2" | sed -e 's/\.//')
                     # Add space between results
                     echo " " >> "results.txt"
             fi
     fi
done < <(cut -d"_" -f1 "$match2")

# Remove last blank line from the results file
sed -i '$ d' results.txt

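Explanation:

match1: contains the source to be filtered
match2: contains the filter criteria
set -e: stops the script when an error occurs
The grep -o / sed pipeline that feeds the inner loop: reads the filter file and takes each name without its .pdb extension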
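
A detail worth checking in patterns like the one passed to grep -o: when a variable is followed directly by an underscore, bash reads the underscore as part of the variable name. The short sketch below uses a made-up value for word to show the difference between the unbraced and the braced form; the names unbraced and braced are only for illustration.

```shell
#!/usr/bin/env bash
word=1KBA                 # sample value, assumed for illustration

unbraced="$word_.*"       # expands the (unset) variable named word_, giving ".*"
braced="${word}_.*"       # expands word, then appends a literal underscore

printf 'unbraced: %s\nbraced:   %s\n' "$unbraced" "$braced"
```

With set -u enabled, the unbraced form would abort with an unbound-variable error instead of silently expanding to nothing.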

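Description of the process:

The filter criteria (1KBA, 1A3L, 1F94, 1A3U, 1A3V, 1A4H) are taken from the file match2 with the cut command.
The script then reads each entry of the cut output and looks for a match in the source file match1 with grep.
If a match is found in the source file, the whole block is appended to the new file results.txt.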
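
The first step can be tried on its own. This is a minimal sketch, assuming match2 holds one name per line in the form ID_SUFFIX.pdb; the two sample names are inferred from the result shown further down and are not copied from the real file.

```shell
#!/usr/bin/env bash
# Sample filter lines (assumed format); cut keeps everything before the first "_".
ids=$(printf '%s\n' '1KBA_GAL.pdb' '1F94_.pdb' | cut -d"_" -f1)

printf '%s\n' "$ids"
```

Each line of that output is what the outer while loop reads into word.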

Note: Please change the names and other parameters to suit your needs.

Result:

$ cat results.txt
3LKB_BUNMU  Bungarus multicinctus   P01398  PDB; 1KBA_GAL; X-ray; 2.30 A; A/B=22-87.
                                        PDB; 2NBT; NMR; -; A/B=22-87.

3NOJ_BUNCA  Bungarus candidus   P81782  PDB; 1F94_; X-ray; 0.97 A; A=1-63.
                                    PDB; 1IJC; NMR; -; A=1-63.

Answer 2

I would suggest using awk in paragraph mode, e.g.

awk 'NR==FNR {
       sub(/_[^_]*$/,"",$1); a[$1]++; next
     } 
     {
       for (x in a) {
         if ($0 ~ "PDB; "x) {print; break;}
       }
     }' file2 RS= file1

Ex.:

$ awk 'NR==FNR {sub(/_[^_]*$/,"",$1); a[$1]++; next} {for (x in a) {if ($0 ~ "PDB; "x) {print; break;}}}' file2 RS= file1
3LKB_BUNMU  Bungarus multicinctus   P01398  PDB; 1KBA; X-ray; 2.30 A; A/B=22-87.
                                            PDB; 2NBT; NMR; -; A/B=22-87.
3NOJ_BUNCA  Bungarus candidus   P81782  PDB; 1F94; X-ray; 0.97 A; A=1-63.
                                        PDB; 1IJC; NMR; -; A=1-63.
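
For readers unfamiliar with paragraph mode: setting RS to the empty string makes awk treat each run of text separated by one or more blank lines as a single record. The sketch below uses made-up input and sets RS with -v for the whole run. The command above instead places RS= between the two file names, so that file2 is still read line by line and only file1 is read in blocks.

```shell
#!/usr/bin/env bash
# Three blocks separated by blank lines; consecutive blank lines count as one separator.
records=$(printf 'a 1\na 2\n\nb 1\n\n\nc 1\n' | awk -v RS= '{ print NR ": " $1 }')

printf '%s\n' "$records"
```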

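If you want a blank line after each block, you can change {print; break;} to {print $0"\n"; break;} or {printf "%s\n\n", $0; break}. Note, however, that this adds a trailing blank line after the last record, where the original may not have had one. If you have GNU awk (gawk), you can avoid that by using its special variable RT, which holds the actual separator text that followed each record: {printf "%s%s", $0, RT; break;}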
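
Where gawk is not available, one portable alternative (a sketch, not part of the answer above) is to print the blank line before every block except the first, which also avoids a trailing blank line. The file contents and the counter name n are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '1KBA_GAL.pdb' '1F94_.pdb' > "$tmp/file2"
cat > "$tmp/file1" <<'EOF'
AAA  PDB; 1KBA; X-ray
     PDB; 2NBT; NMR

BBB  PDB; 9ZZZ; NMR

CCC  PDB; 1F94; X-ray
EOF

out=$(awk 'NR==FNR { sub(/_[^_]*$/, "", $1); a[$1]++; next }
     {
       for (x in a) {
         if ($0 ~ "PDB; " x) {
           if (n++) print ""   # blank line before every block except the first
           print
           break
         }
       }
     }' "$tmp/file2" RS= "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The BBB block is dropped because neither ID from file2 appears in it, and the two remaining blocks are separated by exactly one blank line.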
