Edit 2017/01/30

Question 1

I would use perl, something like:

perl -MFile::Find -MClone=clone -lne '
  # parse the strings.txt input, here looking for the sequences of
  # 0 or more characters (.*?) in between two " characters
  for (/"(.*?)"/g) {
    # @needle is an array of associative arrays whose keys
    # are the "strings" for each line.
    $needle[$n]{$_} = undef;
  }
  $n++;

  END{
    sub wanted {
      return unless -f; # only regular files
      my $needle_clone = clone(\@needle);
      if (open FILE, "<", $_) {
        LINE: while (<FILE>) {
          # read the file line by line
          for (my $i = 0; $i < $n; $i++) {
            for my $s (keys %{$needle_clone->[$i]}) {
              if (index($_, $s)>=0) {
                # if the string is found, we delete it from the associative
                # array.
                delete $needle_clone->[$i]{$s};
                unless (%{$needle_clone->[$i]}) {
                  # if the associative array is empty, that means we have
                  # found all the strings for that $i, that means we can
                  # stop processing, and the file matches
                  print $File::Find::name;
                  last LINE;
                }
              }
            }
          }
        }
        close FILE;
      }
    }
    find(\&wanted, ".")
  }' /path/to/strings.txt

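That means we keep the number of string searches to a minimum: once a string has been found in a file, it is not searched for again in that file.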

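Here, we process the files line by line. If the files are reasonably small, you could process each one as a whole instead, which would simplify the code a bit and might improve performance.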

Note that it expects the list file to be in the following format:

 "surveillance data" "surveillance technology" "cctv camera"
 "social media" "surveillance techniques" "enforcement agencies"
 "social control" "surveillance camera" "social security"
 "surveillance data" "security guards" "social networking"
 "surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

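That is, each line holds some number (not necessarily 3) of strings enclosed in double quotes. The quoted strings cannot themselves contain double-quote characters, and the double quotes are not part of the text being searched for. So if the list file contains: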

"A" "B"
"1" "2" "3"

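it would report the paths of all the regular files in the current directory and below that contain either

both A and B
or (not an exclusive or) all of 1, 2 and 3

anywhere in them.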
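
The matching rule described above can also be sketched in plain shell. This is only an illustration of the semantics, not a replacement for the perl script: it rescans a file once per string, so it does far more searching. The function names contains_all and matches_any_line are made up for this example, and a POSIX sh with grep and awk is assumed.

```shell
# contains_all FILE STRING...
# Succeeds if FILE contains every STRING (fixed strings, anywhere in the file).
contains_all() {
  ca_file=$1
  shift
  for ca_s in "$@"; do
    grep -qF -e "$ca_s" "$ca_file" || return 1
  done
  return 0
}

# matches_any_line FILE LIST
# Succeeds if FILE satisfies at least one line of LIST, where each line of
# LIST holds double-quoted strings such as: "A" "B"
matches_any_line() {
  ma_file=$1
  ma_list=$2
  while IFS= read -r ma_line || [ -n "$ma_line" ]; do
    # Load the quoted strings of this line into "$@": splitting on the
    # double quote makes them the even-numbered awk fields.
    set --
    while IFS= read -r ma_s; do
      [ -n "$ma_s" ] && set -- "$@" "$ma_s"
    done <<EOF
$(printf '%s\n' "$ma_line" | awk -F'"' '{ for (i = 2; i <= NF; i += 2) print $i }')
EOF
    [ "$#" -gt 0 ] && contains_all "$ma_file" "$@" && return 0
  done < "$ma_list"
  return 1
}

# Possible usage (the path to the list file is a placeholder):
#   find . -type f | while IFS= read -r f; do
#     matches_any_line "$f" /path/to/strings.txt && printf '%s\n' "$f"
#   done
```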

Question 2

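Since agrep does not seem to be available on your system, here is an alternative based on sed and awk that performs a grep-like AND operation with patterns read from a local file.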

PS: Since you use osx, I am not sure whether the awk version you have supports the usage below.

awk can emulate grep with an AND operation over multiple patterns:
awk '/pattern1/ && /pattern2/ && /pattern3/'
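
As a small self-contained illustration of that AND behaviour (the sample text and the function name demo_and are invented here), awk tests each input line against every pattern and prints only the lines where all of them match:

```shell
# demo_and: feed three sample lines to awk and keep only the lines that
# match both /surveillance data/ and /cctv camera/.
demo_and() {
  printf '%s\n' \
    'surveillance data only' \
    'cctv camera only' \
    'surveillance data from a cctv camera' |
    awk '/surveillance data/ && /cctv camera/'
}

demo_and
# prints: surveillance data from a cctv camera
```

Because awk evaluates the condition line by line, all the patterns have to occur on the same line for that line to be printed.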

So you can convert your pattern file from this:

$ cat ./tmp/d1.txt
"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

to this:

$ sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' ./tmp/d1.txt
/surveillance data/ && /surveillance technology/ && /cctv camera/
/social media/ && /surveillance techniques/ && /enforcement agencies/
/social control/ && /surveillance camera/ && /social security/
/surveillance data/ && /security guards/ && /social networking/
/surveillance mechanisms/ && /cctv surveillance/ && /contemporary surveillance/

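PS: You can redirect the output to another file by adding >anotherfile at the end, or you can use the sed -i option to make the changes in place in the same search-term pattern file.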
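
One caveat with this conversion is that every search string becomes an awk regular expression, so a string containing / or characters such as . or ( would not be matched literally. If that matters, a possible alternative (not part of the approach above) is awk's index() function, which does plain substring matching. The function name and_fixed and its two-string interface are assumptions made for this sketch.

```shell
# and_fixed A B [FILE...]
# Print the lines that contain both fixed strings A and B.
# The strings are passed through the environment rather than with -v,
# because awk -v would interpret backslash escapes in them.
and_fixed() {
  af_a=$1
  af_b=$2
  shift 2
  A="$af_a" B="$af_b" awk '
    index($0, ENVIRON["A"]) && index($0, ENVIRON["B"])
  ' "$@"
}

# Possible usage:
#   and_fixed 'surveillance data' 'cctv camera' *.txt
```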

Then you just need to feed awk the awk-formatted patterns from this pattern file:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt #d1.txt = my test pattern file

Alternatively, you can leave the original pattern file unconverted and apply sed to each of its lines on the fly, like this:

while IFS= read -r line;do 
  line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line")
  awk "$line" *.txt
done <./tmp/d1.txt

Or as a one-liner:

$ while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt

The commands above return the correct AND results for my test files, which look like this:

$ cat d2.txt
This guys over there have the required surveillance technology to do the job.
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.

$ cat d3.txt
All surveillance data are locked.
All surveillance data are locked and guarded by security guards.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Results:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt
#or while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

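Update:
The awk solution above prints the contents (the matching lines) of the matching txt files.
If you want to display the file names instead of the contents, use the following awk where necessary: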

awk "$line""{print FILENAME}" *.txt

Question 3

The problem is a bit awkward, but you can approach it like this:

while read one two three four five six
  do grep -lF "$one $two" *files* | xargs grep -lF "$three $four" | xargs grep -lF "$five $six"
done < patterns | sort -u

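This assumes that your pattern file contains exactly six words per line (three patterns of two words each). The logical AND is achieved by chaining three consecutive filters (grep). Note that this is not particularly efficient. An awk solution would probably be faster.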
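
The loop above expects bare words, whereas the list file shown in the other answers wraps each pattern in double quotes. One way to reconcile the two, as a sketch, is to strip the quotes with tr before the loop. The function name grep_chain is hypothetical, the six-words-per-line assumption still applies, and file names must not contain whitespace because they pass through xargs.

```shell
# grep_chain LIST FILE...
# LIST holds three double-quoted two-word patterns per line. Print the
# FILEs that contain all three patterns of at least one line.
grep_chain() {
  gc_list=$1
  shift
  tr -d '"' < "$gc_list" |
    while read -r one two three four five six; do
      # /dev/null is added as an extra file so that grep never falls back
      # to reading standard input when xargs receives no file names.
      grep -lF -e "$one $two" "$@" |
        xargs grep -lF -e "$three $four" /dev/null |
        xargs grep -lF -e "$five $six" /dev/null
    done | sort -u
}

# Possible usage (file names are placeholders):
#   grep_chain patterns *.txt
```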

Question 4

Here is another approach that seemed to work in my tests.

I copied the strings file data into a file named d1.txt and moved it to a separate directory (i.e. tmp), so that grep does not later match the strings file (d1.txt) itself.

Then insert a semicolon between the search terms in this strings file (d1.txt in my case) with the following command: sed -i 's/" "/";"/g' ./tmp/d1.txt

$ cat ./tmp/d1.txt
"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"
$ sed -i 's/" "/";"/g' ./tmp/d1.txt
$ cat ./tmp/d1.txt
"surveillance data";"surveillance technology";"cctv camera"
"social media";"surveillance techniques";"enforcement agencies"
"social control";"surveillance camera";"social security"
"surveillance data";"security guards";"social networking"
"surveillance mechanisms";"cctv surveillance";"contemporary surveillance"

Then remove the double quotes with the command sed 's/"//g' ./tmp/d1.txt. PS: This may not be really necessary, but I removed the double quotes for testing.

$ sed -i 's/"//g' ./tmp/d1.txt && cat ./tmp/d1.txt
surveillance data;surveillance technology;cctv camera
social media;surveillance techniques;enforcement agencies
social control;surveillance camera;social security
surveillance data;security guards;social networking
surveillance mechanisms;cctv surveillance;contemporary surveillance

Now you can grep all the files in the current directory using agrep, a program designed to provide multi-pattern grep with an AND operation.

agrep requires multiple patterns to be separated by a semicolon (;) in order to evaluate them as AND.

In my tests, I created two sample files with the following contents:

$ cat d2.txt
This guys over there have the required surveillance technology to do the job.

The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.

$ cat d3.txt
All surveillance data are locked.
All surveillance data are locked and guarded by security guards.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Running agrep in the current directory returns the correct lines (using AND) along with the file names:

$ while IFS= read -r line;do agrep "$line" *;done<./tmp/d1.txt
d2.txt: The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
d3.txt: There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)
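
The two in-place sed edits used in this answer can also be folded into a single conversion that leaves the original list file untouched. That also avoids sed -i, whose syntax differs between GNU sed and the BSD sed shipped with macOS. A minimal sketch, with the hypothetical function name to_agrep_pattern:

```shell
# to_agrep_pattern: read lines of double-quoted strings on stdin and write
# them as semicolon-separated agrep AND patterns, without the quotes.
to_agrep_pattern() {
  sed 's/" "/;/g; s/"//g'
}

# Possible usage, assuming agrep is installed and the quoted list is in
# ./tmp/d1.txt:
#   to_agrep_pattern < ./tmp/d1.txt |
#     while IFS= read -r line; do agrep "$line" *; done
```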
