I'm new to Unix and have a log file that I need to analyze. Here is my sample log file:
Container:container_e182_1234
=============================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:
LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD: Reading File path: hdfs://bpaiddev/dev/data/warehouse/clean/falcon/ukc/
20/06/25 12:19:39 ERROR Exception found
java.io.Exception:Not initiated
at.apache.java.org.Exception(132)
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/25 12:20:41 WARN Warning as the node is accessed without started
LogType:stdout
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:0
Log Contents:
Container:container_e182_1234
=============================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:0
Log Contents:
LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
LogType:stdout
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:0
Log Contents:
Expected output:
stderr
Thu Jun 25 12:24:52 +0100 2020
3000
20/06/25 12:19:39 ERROR Exception found
java.io.Exception:Not initiated
at.apache.java.org.Exception(132)
20/06/25 12:20:41 WARN Warning as the node is accessed without started
The output must contain only the ERROR and WARN entries, plus the other details shown above.
Log file:
Container:container_e182_1234
=============================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:
LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD: Reading File path: hdfs://bpaiddev/dev/data/warehouse/clean/falcon/ukc/masked_data/parquet/FRAUD_CUSTOMER_INFORMATION/rcd_crt_dttm_yyyymmdd=20200523/part-0042-ed52abc2w.c000.snapp.parquet, range:0-27899, partition values :[20200523]
20/06/25 12:19:39 ERROR Exception found
java.io.Exception:Not initated
at.apache.java.org........
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/25 12:20:41 WARN Warning as the node is accessed without started
LogType:stdout
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:0
Log Contents:
Container:container_e182_1234
=============================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:0
Log Contents:
LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD: Reading File path: hdfs://bpaiddev/dev/data/warehouse/clean/falcon/ukc/masked_data/parquet/FRAUD_CUSTOMER_INFORMATION/rcd_crt_dttm_yyyymmdd=20200523/part-0042-ed52abc2w.c000.snapp.parquet, range:0-27899, partition values :[20200523]
20/06/25 12:19:34 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
LogType:stdout
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:0
Log Contents:
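How can I do this? Please help me solve this problem. Thanks!
Answer 1
You can use the following sed one-liner for this (assuming your log is in a file named file):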
sed -n 's/^.*LogType:\(stderr\)$/\1/p; s/^.*Log Upload Time :\(.*\)/\1/p; s/^.*LogLength:\(.*\)$/\1/p; s/^.*\(ERROR\|WARN\).*$/\0/p' file
You can then use redirection (>) to save its output to another file.
Split across multiple lines for readability:
sed -n -e 's/^.*LogType:\(stderr\)$/\1/p' \
-e 's/^.*Log Upload Time :\(.*\)/\1/p' \
-e 's/^.*LogLength:\(.*\)$/\1/p' \
-e 's/^.*\(ERROR\|WARN\).*$/\0/p' file
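Update
The solution above does not exclude blocks other than LogType:stderr, which is not what the OP asked for. Excluding them requires non-local information (not on the same line), which sed alone is poorly suited to handle.
The following script uses both awk and sed (the awk part was inspired by this post) to do the job: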
#!/bin/bash
file=$1
# awk collects each LogType block in "hold" and prints only the stderr blocks;
# sed then extracts the header values and the ERROR/WARN lines.
awk '{
    if ($0 ~ /LogType/) {
        if (hold ~ /LogType:stderr/) {
            print hold
        }
        hold = $0
    } else {
        hold = hold "\n" $0
    }
}
END {
    if (hold ~ /LogType:stderr/) {
        print hold
    }
}' "$file" | sed -n -e 's/^.*LogType:\(stderr\)$/\1/p' \
    -e 's/^.*Log Upload Time :\(.*\)/\1/p' \
    -e 's/^.*LogLength:\(.*\)$/\1/p' \
    -e 's/^.*\(ERROR\|WARN\).*$/\0/p'
Answer 2
I can do it with a short script. The original log is in the file logdata.
#!/bin/bash
tmpfile=$(mktemp)   # /tmp/$0.$$ breaks when $0 contains a directory path
sed -n '/stderr/,/^ *$/p' logdata > "$tmpfile"
sed -n 's/^.*LogType:\(.*\)/\1/p
s/^.*Log Upload Time :\(.*\)/\1/p
s/^.*LogLength:\(.*\)/\1/p' "$tmpfile"
grep -E "(ERROR|WARN)" "$tmpfile"
rm "$tmpfile"
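First, we extract the stderr block (from the stderr line to the next blank line) into a temporary file. Then we pull out the header fields, and finally grep for the errors and warnings. I tried to combine the last two steps using tee, but without success.
I can also do it without a temporary file: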
sed -n '/stderr/,/^ *$/p' logdata | \
sed -n 's/^.*LogType:\(.*\)/\1/p
s/^.*Log Upload Time :\(.*\)/\1/p
s/^.*LogLength:\(.*\)/\1/p
/ERROR/p
/WARN/p'
Answer 3
With awk:
awk '
/LogType:stderr/ || (p && /Log( Upload Time|Length)/) {
    p = 1                  # set flag for stderr block
    sub(/^[^:]+:/, "")     # remove everything up to and including the first `:`
    print                  # print (modified) line
}
p && / (WARN|ERROR) / {
    sub(/^[^0-9]*/, "")    # remove unknown prefix
    print
}
/LogType:stdout/ { exit }  # exit the script
' file
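The script above exits at the first LogType:stdout line, so it reports only the first container, and it drops the stack-trace lines that follow the ERROR entry in the expected output. The following is a minimal sketch of one way to cover both. It assumes that every log entry starts with a yy/mm/dd timestamp and that a line without a timestamp continues the previous entry; the function name stderr_report is made up for illustration.

```shell
#!/bin/sh
# stderr_report FILE: hypothetical helper. For every stderr block it prints
# the header values, then each ERROR/WARN entry with its continuation lines.
stderr_report() {
  awk '
    /^LogType:/ {                       # a new block starts here
      in_err = ($0 ~ /^LogType:stderr[ \t]*$/)
      in_body = 0; keep = 0
      if (in_err) print "stderr"
      next
    }
    /^Container:/ || /^=+$/ { in_err = 0; next }  # container banner ends a block
    !in_err { next }                    # ignore everything outside stderr blocks
    /^Log Contents:/ { in_body = 1; next }
    !in_body {                          # header line: keep the value after the first colon
      sub(/^[^:]*:/, ""); print; next
    }
    /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9] / {      # timestamped line starts a new entry
      keep = ($0 ~ / (ERROR|WARN) /)
    }
    keep { print }                      # entry line or its continuation
  ' "$1"
}
```

Called as stderr_report file, it still prints the three header lines for a stderr block that has no ERROR or WARN entry.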
Answer 4
Using GNU sed and its extended regular expression mode:
sed -Ee '
/LogType:stderr/,/^\s*$/!d
/Log Contents:/,/^\s*$/!{
s/^[^:]*://;b
}
/\s(ERROR|WARN)\s/!d
' logfile
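Explanation:
We divide the file into ranges (from the LogType:stderr line to the next blank line) and then subdivide each range into the part before Log Contents: and the part after it.
In the first part, we strip everything up to and including the first colon, then branch to the end of the script, which prints the line.
In the second part, we delete every line that is not an ERROR or WARN line, so only those lines are printed.
Result: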
stderr
Thu Jun 25 12:24:52 +0100 2020
3000
20/06/25 12:19:39 ERROR Exception found
20/06/25 12:20:41 WARN Warning as the node is accessed without started
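As written, the command above prints the header of every stderr block, including blocks that contain no ERROR or WARN line. If the header should appear only when the block has something to report, one option is to buffer it and flush it at the first matching line. Below is a minimal sketch of that idea in awk, assuming the same log layout; the function name report_nonempty is made up for illustration.

```shell
#!/bin/sh
# report_nonempty FILE: hypothetical helper. Prints a stderr block header
# only if the block contains at least one ERROR or WARN line.
report_nonempty() {
  awk '
    /^LogType:/ {
      in_err = ($0 ~ /^LogType:stderr[ \t]*$/)
      in_body = 0
      hdr = in_err ? "stderr" : ""      # start (or reset) the buffered header
      next
    }
    !in_err { next }
    /^Log Contents:/ { in_body = 1; next }
    !in_body {
      sub(/^[^:]*:/, "")                # keep the value after the first colon
      hdr = hdr "\n" $0
      next
    }
    / (ERROR|WARN) / {
      if (hdr != "") { print hdr; hdr = "" }   # flush the header once per block
      print
    }
  ' "$1"
}
```

Unlike the sed version, this does not depend on a blank line to end a block; the next LogType line resets the state.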
If you also need the line numbers of the error/warning messages, use the following sed command, modified from the one above:
sed -Ee '
/LogType:stderr/,/^\s*$/!d
/Log Contents:/,/^\s*$/!{
s/^[^:]*://;b
}
/\s(ERROR|WARN)\s/!d
p;=;d
' logfile |
sed -Ee '/\s(ERROR|WARN)\s/N;s/\n/ on line #/'
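For comparison, awk keeps the current input line number in NR, so the matching lines can be numbered without a second pass. This is only a sketch under the same layout assumptions; it prints just the ERROR/WARN lines of stderr blocks (no header), and the function name numbered_matches is made up for illustration.

```shell
#!/bin/sh
# numbered_matches FILE: hypothetical helper. Prints each ERROR/WARN line
# found in a stderr block, followed by its line number in FILE.
numbered_matches() {
  awk '
    /^LogType:/ {                       # track whether we are in a stderr block
      in_err = ($0 ~ /^LogType:stderr[ \t]*$/)
      in_body = 0
      next
    }
    in_err && /^Log Contents:/ { in_body = 1; next }
    in_err && in_body && / (ERROR|WARN) / {
      print $0 " on line #" NR          # NR is the line number in the input file
    }
  ' "$1"
}
```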
You can also use other tools, such as awk and perl, to do this job. Note: first remove any trailing whitespace from the blank lines.
awk '
BEGIN {
RS = "\n\n"
FS = "\nLog Contents:\n"
OFS = "\n"
ORS = OFS
spc = "[[:blank:]]"
str = "(ERROR|WARN)"
pat = spc str spc
}
/^LogType:stderr/ &&
NF == 2 {
p = $1; gsub(/(^|\n)[^:]+:/, "\n", p);sub(/./, "", p)
N = split($2, a, /\n/)
print p
for ( i=1; i<=N; i++ )
if ( a[i] ~ pat )
print a[i]
}
' logfile
perl -F'/^Log\hContents:$/m,$_,2' -00 -ne '
next if ! /\ALogType:stderr$/m;
(my $pre = $F[0])=~ s/.*?://gm;
my $post = join "\n",
grep { /\s(?:ERROR|WARN)/ }
split /\n/, $F[1];
print($pre,$post);
' logfile