Using grep to search for a list and return the matches

Question

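The first change is not to do this in a shell loop! A loop means you search the file once for every gene name, which takes far longer than necessary. Instead, use grep's -f option to give it the list of names as input: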

grep -iFxf ShortList.txt  FullList.txt > Final_List_With_Numbers

The options used are:

   -i, --ignore-case
          Ignore case distinctions in patterns and input data,
          so that characters that differ only in case match each other.

   -F, --fixed-strings
          Interpret PATTERNS as fixed strings, not regular expressions.

   -f FILE, --file=FILE
          Obtain patterns from FILE, one per line.  If this option is
          used multiple times or is combined with the -e (--regexp) option,
          search for all patterns given.  The empty file contains zero patterns,
          and therefore matches nothing.

   -x, --line-regexp
          Select only those matches that exactly match the whole line.
          For a regular expression pattern, this is like parenthesizing
          the pattern and then surrounding it with ^ and $.

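The -x is especially important, because you don't want to find LOC12345 when searching for LOC1. However, if your FullList.txt has more than just a gene name on each line, you probably want to use -w instead of -x: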

   -w, --word-regexp
          Select only those lines containing matches that form whole words.
          The test is that the matching substring must either be at the
          beginning of the line, or preceded by a non-word constituent
          character.  Similarly, it must be either at the end of the line or
          followed by a non-word constituent character.  Word-constituent
          characters are letters, digits, and the underscore.  This option
          has no effect if -x is also specified.
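
To make the difference between -x and -w concrete, here is a small sketch. The file names and contents are made up for illustration (they are not the asker's data), and it assumes a grep that supports these options, such as GNU grep:

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf 'LOC1\n' > "$tmp/short.txt"
printf 'LOC1\nLOC12345\nLOC1 42\n' > "$tmp/full.txt"

# Neither -x nor -w: LOC1 also matches inside LOC12345 (all 3 lines).
grep -iFf "$tmp/short.txt" "$tmp/full.txt" > "$tmp/plain.out"

# -x: only lines that are exactly LOC1 (1 line).
grep -iFxf "$tmp/short.txt" "$tmp/full.txt" > "$tmp/exact.out"

# -w: LOC1 as a whole word, so "LOC1 42" matches but LOC12345 does not.
grep -iFwf "$tmp/short.txt" "$tmp/full.txt" > "$tmp/word.out"

# Show the three results side by side.
for f in plain exact word; do
    echo "== $f =="
    cat "$tmp/$f.out"
done
```

If the gene name is followed by other columns (counts, coordinates and so on), -w is the variant that keeps those lines.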

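Now, the code you show should actually work. It will just be very, very slow and inefficient, and it may return wrong results if one of the names in your ShortList can be a substring of one of the names in the FullList. If you are never getting any results at all, my guess is that your ShortList.txt was created on Windows and has Windows-style line endings (\r\n). That means each i in the for i in ${LIST} loop will not be geneName but geneName\r, which does not exist in FullList.txt, so no results are found.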
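
If you suspect Windows-style line endings, something along these lines can confirm it and clean the file up. This is only a sketch: the sample files are invented here, ShortList.unix.txt is a name I picked for the cleaned copy, and tools such as dos2unix would do the same job if you have them installed:

```shell
# Hypothetical sample data: a ShortList.txt saved with \r\n line endings.
tmp=$(mktemp -d)
printf 'name1\r\nname3\r\n' > "$tmp/ShortList.txt"
printf 'name3\nname4\n' > "$tmp/FullList.txt"

# Count the lines that contain a carriage return; anything above 0
# means the file has Windows-style line endings.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$tmp/ShortList.txt")
echo "lines with CR: $crlf_lines"

# With the \r still in place, the patterns are really "name1\r" and
# "name3\r", so nothing in FullList.txt matches.
before=$(grep -icFxf "$tmp/ShortList.txt" "$tmp/FullList.txt" || true)
echo "matches before cleanup: $before"

# Strip the carriage returns into a new file and search again.
tr -d '\r' < "$tmp/ShortList.txt" > "$tmp/ShortList.unix.txt"
grep -iFxf "$tmp/ShortList.unix.txt" "$tmp/FullList.txt" > "$tmp/fixed.out"
cat "$tmp/fixed.out"
```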

If you test it on files with *nix line endings, it works as expected:

$ cat ShortList.txt 
name1
name2
name3

$ cat FullList.txt 
name3
name4

Now, running your exact code on these examples:

$ LIST=$(cat ShortList.txt); for i in ${LIST}; do 
   RESULT=$(grep -i ${i} FullList.txt);     
   echo "${RESULT}" >> Final_List_With_Numbers;
 done
$ cat Final_List_With_Numbers 



name3

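Of course, the output also contains empty lines: when no match is found, $RESULT is empty, but you are still echoing it, so all that gets printed is an empty line. That is yet another reason why using a shell loop here is a bad idea.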
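
For comparison, here is the single grep call from the top of this answer run on the same two example files. The sketch recreates them in a scratch directory so it can be run as is:

```shell
# Recreate the example files from above in a scratch directory.
tmp=$(mktemp -d)
printf 'name1\nname2\nname3\n' > "$tmp/ShortList.txt"
printf 'name3\nname4\n' > "$tmp/FullList.txt"

# One pass over FullList.txt, with every name in ShortList.txt as a pattern.
grep -iFxf "$tmp/ShortList.txt" "$tmp/FullList.txt" > "$tmp/Final_List_With_Numbers"

# Only matching lines are written, so there are no empty lines.
cat "$tmp/Final_List_With_Numbers"
```

grep only prints lines that matched, so names with no hit in FullList.txt leave nothing behind in the output file.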

Answer 1