Recursively search for files whose first line contains a specific combination of strings


I need to find all files whose first line contains both of the strings "StockID" and "SellPrice".

Here are some example files:

1.csv:

StockID Dept    Cat2    Cat4    Cat5    Cat6    Cat1    Cat3    Title   Notes   Active  Weight  Sizestr Colorstr    Quantity    Newprice    StockCode   DateAdded   SellPrice   PhotoQuant  PhotoStatus Description stockcontrl Agerestricted
<blank> 1   0   0   0   0   22  0   RAF Air Crew Oxygen Connector   50801   1   150 <blank> <blank> 0   0   50866   2018-09-11 05:54:03 65  5   1   <br />\r\nA wartime RAF aircrew oxygen hose connector.<br />\r\n<br />\r\nAir Ministry stamped with Ref. No. 6D/482, Mk IVA.<br />\r\n<br />\r\nBrass spring loaded top bayonet fitting for the 'walk around' oxygen bottle extension hose (see last photo).<br />\r\n<br />\r\nIn a good condition.    2   0
<blank> 1   0   0   0   0   15  0   WW2 US Airforce Type Handheld Microphone    50619   1   300 <blank> <blank> 1   0   50691   2017-12-06 09:02:11 20  9   1   <br />\r\nWW2 US Airforce Handheld Microphone type NAF 213264-6 and sprung mounting Bracket No. 213264-2.<br />\r\n<br />\r\nType RS 38-A.<br />\r\n<br />\r\nMade by Telephonics Corp.<br />\r\n<br />\r\nIn a un-issued condition.    3   0
<blank> 1   0   0   0   0   22  0   RAF Seat Type Parachute Harness <blank> 1   4500    <blank> <blank> 1   0   50367   2016-11-04 12:02:26 155 8   1   <br />\r\nPost War RAF Pilot Seat Type Parachute Harness.<br />\r\n<br />\r\nThis Irvin manufactured harness is 'new old' stock and is unissued.<br />\r\n<br />\r\nThe label states Irvin Harness type C, Mk10, date 1976.<br />\r\nIt has Irvin marked buckles and complete harness straps all in 'mint' condition.<br />\r\n<br />\r\nFully working Irvin Quick Release Box and a canopy release Irvin  'D-Ring' Handle.<br />\r\n<br />\r\nThis harness is the same style type as the WW2 pattern seat type, and with some work could be made to look like one.<br />\r\n<br />\r\nIdeal for the re-enactor or collector (Not sold for parachuting).<br />\r\n<br />\r\nTotal weight of 4500 gms.   3   0

2.csv:

id  user_id organization_id hash    name    email   date    first_name  hear_about
1   2   15  <blank> Fairley [email protected] 1129889679  John    0

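I only want to find files whose first line contains both "StockID" and "SellPrice", so in this example the only output should be ./1.csv

I have managed to get this far, but now I am stuck ;(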

where=$(find ./backup -type f)
for x in $where; do
   head -1 $x | grep -w "StockID"
done

Answer 1

A find + awk solution:

find ./backup -type f -exec \
awk 'NR == 1{ if (/StockID.*SellPrice/) print FILENAME; exit }' {} \;

If the keywords may appear in either order, replace the pattern /StockID.*SellPrice/ with /StockID/ && /SellPrice/
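A minimal sketch of the order-independent variant, run against a throwaway tree. The temporary directory and the file names a.csv, b.csv and c.csv are assumptions made purely for illustration; they do not come from the question.

```shell
# Build a scratch tree; all names below are hypothetical.
tmp=$(mktemp -d)
mkdir -p "$tmp/backup/sub"
printf 'StockID\tDept\tSellPrice\n1\t2\t3\n' > "$tmp/backup/a.csv"      # both words, question's order
printf 'SellPrice\tDept\tStockID\n1\t2\t3\n' > "$tmp/backup/sub/b.csv"  # both words, reversed order
printf 'id\tuser_id\nStockID\tSellPrice\n'   > "$tmp/backup/c.csv"      # words only on line 2

# Both words must appear on line 1, in any order; exit stops awk after line 1.
matches=$(cd "$tmp" && find ./backup -type f -exec \
  awk 'NR == 1 { if (/StockID/ && /SellPrice/) print FILENAME; exit }' {} \; | sort)

printf '%s\n' "$matches"
rm -rf "$tmp"
```

With this sample tree, a.csv and sub/b.csv are reported, while c.csv is skipped because its matching words are not on the first line.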


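If the number of files is huge, a more efficient alternative is to process a batch of files per awk invocation, so that the total number of command invocations is much smaller than the number of files: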

find ./backup -type f -exec \
awk 'FNR == 1 && /StockID.*SellPrice/{ print FILENAME }{ nextfile }' {} +

Answer 2

With GNU grep or compatible:

grep -Hrnm1 '^' ./backup | sed -n '/StockID.*SellPrice/s/:1:.*//p'

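The recursive grep prints the first line of every file in the form filename:1:line without reading the whole file (the -m1 flag should make it stop at the first match), and sed prints the filename part wherever the line part matches the pattern.

This fails for file names that themselves contain :1: or a newline, but that is a risk worth taking compared with resorting to a slow find + awk combination that runs another process for each file.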
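To make the two stages of that pipeline visible, here is a small sketch that runs them separately. It assumes GNU grep (or a compatible grep supporting -H, -r, -n and -m), and the scratch directory and file names are hypothetical.

```shell
# Scratch tree with hypothetical file names.
tmp=$(mktemp -d)
mkdir -p "$tmp/backup"
printf 'StockID\tDept\tSellPrice\n1\t2\t3\n' > "$tmp/backup/a.csv"
printf 'id\tuser_id\nStockID\tSellPrice\n'   > "$tmp/backup/b.csv"

# Stage 1: '^' matches every line and -m1 stops after the first one,
# so each file contributes exactly one "filename:1:first line" record.
first_lines=$(cd "$tmp" && grep -Hrnm1 '^' ./backup | sort)
printf '%s\n' "$first_lines"

# Stage 2: keep records matching the pattern and strip everything from ":1:" on.
matches=$(printf '%s\n' "$first_lines" | sed -n '/StockID.*SellPrice/s/:1:.*//p')
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Only a.csv survives stage 2 here, because the matching words in b.csv sit on its second line and never reach sed.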

Answer 3

To avoid both running one command per file and reading entire files, with GNU awk:

(unset -v POSIXLY_CORRECT; exec find backup/ -type f -exec gawk '
  /\<StockID\>/ && /\<SellPrice\>/ {print FILENAME}; {nextfile}' {} +)

Or with zsh:

set -o rematchpcre # where we know for sure \b is supported
for file (backup/**/*(ND.)) {
  IFS= read -r line < $file &&
   [[ $line =~ "\bStockID\b" ]] &&
   [[ $line =~ "\bSellPrice\b" ]] &&
   print -r $file
}

Or:

set -o rematchpcre
print -rl backup/**/*(D.e:'
  IFS= read -r line < $REPLY &&
   [[ $line =~ "\bStockID\b" ]] &&
   [[ $line =~ "\bSellPrice\b" ]]':)

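Or with bash, on systems whose native extended regular expressions support the \< and \> word-boundary operators (on other systems you can also try [[:<:]] / [[:>:]] or \b):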

RE1='\<StockID\>' RE2='\<SellPrice\>' find backup -type f -exec bash -c '
  for file do
    IFS= read -r line < "$file" &&
    [[ $line =~ $RE1 ]] &&
    [[ $line =~ $RE2 ]] &&
    printf "%s\n" "$file"
  done' bash {} +

Answer 4

egrep + awk:

 egrep -Hrn 'StockID|SellPrice' ./backup | awk -F ':' '$2==1{print $1}'
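Note that the alternation StockID|SellPrice selects a first line containing either string, whereas the question asks for both. The sketch below contrasts the two on a hypothetical scratch tree. It uses grep -E in place of egrep, and the "both" pattern is a suggested variation, not part of the original answer. Like the grep approach in Answer 2, it assumes file names contain no colon.

```shell
# Scratch tree with hypothetical file names.
tmp=$(mktemp -d)
mkdir -p "$tmp/backup"
printf 'StockID\tDept\tSellPrice\n' > "$tmp/backup/both.csv"
printf 'StockID\tDept\n'            > "$tmp/backup/only_one.csv"

# Alternation as in the answer: line 1 needs only one of the words.
either=$(cd "$tmp" && grep -HrnE 'StockID|SellPrice' ./backup |
  awk -F: '$2 == 1 { print $1 }' | sort)

# Variant requiring both words on the same line, in either order.
both=$(cd "$tmp" && grep -HrnE 'StockID.*SellPrice|SellPrice.*StockID' ./backup |
  awk -F: '$2 == 1 { print $1 }' | sort)

printf 'either:\n%s\nboth:\n%s\n' "$either" "$both"
rm -rf "$tmp"
```

Here the alternation reports both files, while the stricter pattern reports only both.csv.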
