Extracting text between two strings in a huge sorted text file

Answer 1

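For very large files, you can take advantage of the natural ordering of the timestamp prefixes and use the look utility to run a fast binary search for the longest common prefix of the start and end strings. You can then post-process with awk or sed to extract the lines of interest from the output of look.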

In bash:

export start='"2018-04-05 13:00:00"'
export end='"2018-04-05 13:05:00"'
# Determine the longest common prefix of start and end
# ("2018-04-05 13:0 in this example, including the leading double quote).
common_prefix=$(awk 'BEGIN {
    start = ENVIRON["start"]; end = ENVIRON["end"]
    len = length(start) > length(end) ? length(end) : length(start)
    i = 1
    while (i <= len && substr(start, i, 1) == substr(end, i, 1)) {
        ++i
    }
    print substr(start, 1, i - 1)
}' </dev/null
)
# The -b option forces look to do a binary search.
# My version of look on Ubuntu needs this flag;
# some other versions of look perform a binary search by default and do not support -b.
look -b "$common_prefix" file | awk '$0 ~ "^"ENVIRON["start"],$0 ~ "^"ENVIRON["end"]'
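
One caveat with the awk range pattern above: it only starts printing at a line that matches the exact start timestamp, and it only stops at a line that matches the exact end timestamp. If either timestamp might be missing from the file, a possible variant is to compare the leading timestamp of each line as a string instead. This is a rough sketch, not part of the original answer: the function name filter_range and the RANGE_START/RANGE_END variables are made up for illustration, and it assumes that start and end have the same length and that byte order (LC_ALL=C) matches chronological order.

```bash
# Hypothetical helper: print the lines on stdin whose leading timestamp
# lies in the closed interval [start, end]. Input must already be sorted.
#   $1 = start string, $2 = end string (assumed to have equal length)
filter_range() {
    LC_ALL=C RANGE_START="$1" RANGE_END="$2" awk '
        BEGIN { s = ENVIRON["RANGE_START"]; e = ENVIRON["RANGE_END"] }
        { key = substr($0, 1, length(s)) }  # timestamp-sized prefix of the line
        key > e  { exit }                   # past the range: stop reading early
        key >= s                            # inside the range: print the line
    '
}

# Example (assumes the variables and file from the answer above):
#   look -b "$common_prefix" file | filter_range "$start" "$end"
```

Because the comparison is on strings rather than regex matches, lines are selected even when no line carries exactly the start or end timestamp, and awk exits as soon as it sees a line beyond the end.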

Answer 2

Print the lines between "2018-04-05 13:00:00" and "2018-04-05 13:05:00":

sed -n '/2018-04-05 13:00:00/,/2018-04-05 13:05:00/p' file

or

sed -n /"2018-04-05 13:00:00"/,/"2018-04-05 13:05:00"/p file
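
On a huge file, the sed commands above keep reading to the end of the file even after the range has been printed. One possible refinement, sketched here under assumptions, is to quit as soon as the end line has been printed. The wrapper name print_range_and_quit is hypothetical, and the sketch assumes the timestamps contain no "/" or regex metacharacters and that the end string does not occur anywhere before the start line.

```bash
# Hypothetical wrapper: print the start..end range, then stop reading.
#   $1 = start pattern, $2 = end pattern, $3 = file
print_range_and_quit() {
    # First expression prints the range; second quits at the first line
    # that matches the end pattern (already printed by the first one).
    sed -n -e "/$1/,/$2/p" -e "/$2/q" "$3"
}

# Example (assumes the file from the answer above):
#   print_range_and_quit '2018-04-05 13:00:00' '2018-04-05 13:05:00' file
```

The output is the same as with the plain range, but sed does not have to scan the remainder of the file.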

Grep for the start date "2018-04-05 13:00:00" and output the next 5 lines (= 5 minutes, assuming one line per minute); -m1 stops the search after the first match.

grep -m1 -A5 '2018-04-05 13:00:00' file
