How to use unindented lines as the record separator in awk on the command line

I have a log file that looks like this:

2016-05-31 09:54:36 (16667) heritage_w?
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?i=290
  #accesses 3,435 (#welcome 415) since 03/07/2012
2016-05-31 09:54:41 (16677) heritage_w?w=
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?
  #accesses 3,436 (#welcome 416) since 03/07/2012
2016-06-01 04:07:06 (22190) heritage_w?m=MOD_IND;i=88
  From: ubunzeus
  User: user2 (wizard)
  Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?i=88
  #accesses 3,623 (#welcome 441) since 03/07/2012    
2016-06-01 04:07:38 (22255) heritage_w?m=MOD_FAM;i=28;ip=88
  From: ubunzeus
  User: user2 (wizard)
  Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?m=MOD_IND;i=88
  #accesses 3,624 (#welcome 441) since 03/07/2012

I am trying to use the unindented lines as the record separator (RS),

using code similar to the following:

$ gawk 'BEGIN{RS="^2016"}; /user1/ {print}'

I want to print only the records that contain "user1".

Currently the command prints the whole file, i.e. all of the records.

This is the expected output:

2016-05-31 09:54:36 (16667) heritage_w?
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?i=290
  #accesses 3,435 (#welcome 415) since 03/07/2012
2016-05-31 09:54:41 (16677) heritage_w?w=
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?
  #accesses 3,436 (#welcome 416) since 03/07/2012

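Clarifying the specifics of this question

I accepted John1024's answer, which let me pick out the desired records. However, I was hoping someone would eventually explain how to use a specific regular-expression feature as the record separator (RS) variable, which in this case would be the unindented line.

I took the string I was using, as described by John1024, and tried the non-whitespace regex in various combinations, but it did not work.

The lines I used that failed to filter the records correctly were: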

$ gawk 'BEGIN{RS='\n\S'}; /user1/ {print}' event.log
$ gawk 'BEGIN{RS='\S'}; /user1/ {print}' event.log
$ gawk 'BEGIN{RS="\n^\S"}; /user1/ {print}' event.log
$ gawk 'BEGIN{RS="^\S"}; /user1/ {print}' event.log

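All of the above combinations display every record. I am fairly sure that the single-quoted '^\S' uses the literal characters rather than the escape meaning. The double-quoted "^\S" gives this warning: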

gawk: cmd. line:1: warning: escape sequence `\S' treated as plain `S'
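
One way to see what the quoting does is to print the program text the shell actually hands to gawk. The sketch below uses only printf; the variable names are made up for illustration. It suggests that the inner single quotes close and reopen the shell quoting, so the shell consumes the backslashes before awk ever sees them.

```shell
# Inner single quotes end the outer quoting, so \n\S is unquoted
# and the shell strips the backslashes before awk runs.
prog_inner_single=$(printf '%s' 'BEGIN{RS='\n\S'}; /user1/ {print}')
printf '%s\n' "$prog_inner_single"    # BEGIN{RS=nS}; /user1/ {print}

# With inner double quotes the backslashes survive the shell;
# awk then processes the escapes inside its own string literal.
prog_inner_double=$(printf '%s' 'BEGIN{RS="\n\\S"}; /user1/ {print}')
printf '%s\n' "$prog_inner_double"    # BEGIN{RS="\n\\S"}; /user1/ {print}
```

In the first case awk receives RS=nS, which assigns the value of the unset variable nS, i.e. the empty string. An empty RS selects paragraph mode, and since this log has no blank lines the whole file becomes a single record that matches user1.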

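I was able to verify that "\S", anchored with ^, matches a non-whitespace character in the first column. It displays the unindented lines: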

$ egrep "^\S" event.log

Output of the above command:

2016-05-31 09:54:36 (16667) heritage_w?
2016-05-31 09:54:41 (16677) heritage_w?w=
2016-06-01 04:07:06 (22190) heritage_w?m=MOD_IND;i=88
2016-06-01 04:07:38 (22255) heritage_w?m=MOD_FAM;i=28;ip=88

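With help from the accepted answer, i.e. adding the newline and using a double backslash to get rid of the escape-sequence warning, the following filters the desired records: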

$ gawk 'BEGIN{RS="\n\\S"}; /user1/ {print}' event.log

Answer 1

Try:

$ gawk 'BEGIN{RS="\n2016"}; /user1/ {print}' input

This produces the output:

2016-05-31 09:54:36 (16667) heritage_w?
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?i=290
  #accesses 3,435 (#welcome 415) since 03/07/2012
-05-31 09:54:41 (16677) heritage_w?w=
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?
  #accesses 3,436 (#welcome 416) since 03/07/2012

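Notice that the second record is missing the initial 2016. That is, of course, because that 2016 became part of the record separator. If you want to restore that part before any record processing starts: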

gawk 'BEGIN{RS="\n2016"} NR>1{$0="2016" $0;} /user1/ {print}' input

Improvement

This version restores the removed text to the beginning of each record as needed:

gawk '{$0=substr(last,2)$0;} /user1/{print} {last=RT}' RS='\n[^[:space:]]' input

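How it works:

  • {$0=substr(last,2)$0;} prepends to $0 the text that the record separator removed. substr is used to drop the leading newline.

  • /user1/{print} prints the records we are interested in.

  • {last=RT} saves the actual record separator text so that part of it can be prepended to the next record. RT is a GNU extension that other versions of awk do not support.

  • RS='\n[^[:space:]]' sets the record separator to a newline followed by any non-space character. Using a regular expression as the record separator works with GNU awk.

Example: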

$ gawk '{$0=substr(last,2)$0;} /user1/{print} {last=RT}' RS='\n[^[:space:]]' input
2016-05-31 09:54:36 (16667) heritage_w?
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?i=290
  #accesses 3,435 (#welcome 415) since 03/07/2012
2016-05-31 09:54:41 (16677) heritage_w?w=
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?
  #accesses 3,436 (#welcome 416) since 03/07/2012

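Answer 2

Here is a slightly different strategy. We accumulate each indented line in a hold buffer. When an unindented line is read, we call a function that prints the buffer (if it contains the desired pattern) and then replaces the buffer contents with the new header line. We also need to call the function when the end of the file is reached.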

#!/usr/bin/awk -f
#   Select records from a file 
#   Each record header line is unindented and each record body line is indented
#   Written by PM 2Ring 2015.06.02

function ShowSelected()
{
    if (hold ~ /User: user1/)
        printf "%s", hold
    hold = $0 ORS
}

/^ /{hold = hold $0 ORS; next}

{ShowSelected()}

END{ShowSelected()}

Here is a one-liner version:

awk 'function S(){if(h~/User: user1/)printf "%s",h; h=$0 ORS}; /^ /{h=h $0 ORS; next}; {S()};END{S()}'
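
As a hedged variation on the same buffering idea (not the script above verbatim), the search pattern can be passed in with -v instead of being hard-coded. The names select_records, emit and pat, and the two-record sample, are made up for illustration; only POSIX awk features are assumed.

```shell
# Hypothetical two-record sample: header lines unindented, body lines indented
sample='rec1 header
  User: user1wizard (wizard)
rec2 header
  User: user2 (wizard)'

# select_records PATTERN [FILE...]: print each record whose text matches PATTERN
select_records() {
    pat=$1
    shift
    awk -v pat="$pat" '
        # Print the buffered record if it matches, then clear the buffer
        function emit() { if (hold ~ pat) printf "%s", hold; hold = "" }
        /^[^ \t]/ { emit() }        # unindented line: finish the previous record
        { hold = hold $0 ORS }      # every line joins the current record
        END { emit() }              # finish the last record
    ' "$@"
}

printf '%s\n' "$sample" | select_records 'User: user1'
```

Unlike the script above, this sketch treats a leading tab as indentation too and clears the buffer inside the function, which is a design choice rather than a requirement.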

Just for fun, here is a sed version. It uses essentially the same algorithm.

sed '/^ /!bA;H;$bA;d;:A;x;/User: user1/!d'

Here is the same thing, with comments.

#!/bin/sed -f
#   Select records from a file 
#   Each record header line is unindented and each record body line is indented
#   Written by PM 2Ring 2015.06.02

# If line doesn't start with a space, branch to the select & display routine
/^ /!bA

# Append pattern space (i.e., the current line) to the hold space
H

# If this is the last line, branch to the select & display routine
$bA

# Delete the pattern space and start the next cycle
d

# The select & display routine
:A

# Exchange the contents of the hold and pattern spaces
x

# Delete the pattern if it doesn't contain the regex /User: user1/
# if the pattern isn't deleted it will be printed
/User: user1/!d

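Here is a hybrid sed and awk approach, inspired by Thor's idea of doing some preprocessing with sed. We prefix each unindented line with a \xff character and then use that character as the awk record separator. This will not work properly if the log file itself uses the \xff character, but hopefully that is not the case. :)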

<logfile sed 's/^[^ ]/\xff&/' | awk 'BEGIN{RS="\xff";ORS=""};/User: user1/'

Answer 3

I would preprocess the file with e.g. sed. So to extract the second line of each record, do something like this:

<infile sed 's/^[^ ]/\n&/' | awk '{ print $2 }' RS= FS='\n'

Output:

  From: ip68-8-49-100.sd.sd.cox.net
  From: ip68-8-49-100.sd.sd.cox.net
  From: ubunzeus
  From: ubunzeus

Edit: how to print every record whose $3 contains user1

<infile sed '1!s/^[^ ]/\n&/' | awk '$3 ~ /user1/' RS= FS='\n'

Output:

2016-05-31 09:54:36 (16667) heritage_w?                                
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?i=290
  #accesses 3,435 (#welcome 415) since 03/07/2012
2016-05-31 09:54:41 (16677) heritage_w?w=
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?
  #accesses 3,436 (#welcome 416) since 03/07/2012
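
For readers unfamiliar with the RS= setting used above: an empty RS puts awk in paragraph mode, where blank lines separate records, and FS='\n' then makes each line of a record its own field. A minimal sketch on made-up input (the h1/h2 lines and the variable name summary are placeholders):

```shell
# Two blank-line-separated records with placeholder contents
summary=$(printf 'h1\n  a\n  b\n\nh2\n  c\n' |
    awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " NF " lines, header=" $1 }')
printf '%s\n' "$summary"
# 1: 3 lines, header=h1
# 2: 2 lines, header=h2
```

This is why $2 is the "From:" line and $3 the "User:" line once sed has put a blank line in front of every header.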

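Answer 4

IMO, the simplest way is to use sed to convert the input into paragraph-separated records (one or more blank lines between each record). In other words, skipping the first line, insert a newline before every line that does not start with whitespace (a space or a tab).

You can then tell awk to use two or more newlines as the input record separator (RS): RS='\n\n+'

By the way, there is no need to set the output record separator (ORS) to match unless you want the output in paragraphs too. You did not ask for that, so I have not included it. If that is what you want (e.g. because you want to do some further processing on the output), add -v ORS='\n\n' to the awk options.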

$ sed -e '2,$ s/^[^[:blank:]]/\n&/' ldjames.txt | 
    awk -v RS='\n\n+' '/user1/ {print}'
2016-05-31 09:54:36 (16667) heritage_w?
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?i=290
  #accesses 3,435 (#welcome 415) since 03/07/2012
2016-05-31 09:54:41 (16677) heritage_w?w=
  From: ip68-8-49-100.sd.sd.cox.net
  User: user1wizard (wizard)
  Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  Referer: http://dbase.apollo3.com/heritage_w?
  #accesses 3,436 (#welcome 416) since 03/07/2012
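
To illustrate the ORS remark, here is a small sketch on made-up input. It assumes the input already has blank lines between records, and it swaps in POSIX paragraph mode (RS set to the empty string) for RS='\n\n+' so that it does not depend on regex support in RS. The h1/h2/h3 records and the variable name paragraphs are hypothetical.

```shell
# Three placeholder records separated by blank lines; keep those mentioning user1
paragraphs=$(printf 'h1\n  User: user1\n\nh2\n  User: user2\n\nh3\n  User: user1\n' |
    awk -v RS= -v ORS='\n\n' '/user1/')
printf '%s\n' "$paragraphs"
```

With ORS='\n\n' the selected records stay separated by a blank line, so the output can be fed to another paragraph-mode awk command.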
