我有几个 CSV 日志文件,想要用描述替换这些日志文件第 8 列中存储的状态代码
日志文件如下所示:
ip,date,time,zone,cik,accession,extention,code,size,idx,norefer,noagent,find,crawler,browser
101.xx.xxx.xx,2017-06-23,00:00:00,0.0,1238039.0,0001179110-17-009492,calc.xml,301.0,654.0,0.0,0.0,0.0,10.0,0.0,
101.xx.xxx.xx,2017-06-25,00:00:00,0.0,793347.0,0000798086-17-000026,index.htm,200.0,31791.0,1.0,0.0,0.0,9.0,0.0,
101.xx.xxx.xx,2017-06-28,00:00:00,0.0,918537.0,0001209191-17-041401,index.htm,200.0,9936.0,1.0,0.0,0.0,9.0,0.0,
我想要实现的结果如下所示:
101.xx.xxx.xx,2017-06-23,00:00:00,0.0,1238039.0,0001179110-17-009492,calc.xml,MOVED PERMANENTLY,654.0,0.0,0.0,0.0,10.0,0.0,
101.xx.xxx.xx,2017-06-25,00:00:00,0.0,793347.0,0000798086-17-000026,index.htm,OK,31791.0,1.0,0.0,0.0,9.0,0.0,
101.xx.xxx.xx,2017-06-28,00:00:00,0.0,918537.0,0001209191-17-041401,index.htm,OK,9936.0,1.0,0.0,0.0,9.0,0.0,
我的代码目前看起来像这样,但它没有像我想要的那样访问第 8 列
sed -r 's/^(([^,]*,){7})/200.0/OK/;s/206.0/PARTIAL CONTENT/;s/301.0/MOVED PERMANENTLY/;s/304.0/NOT MODIFIED/;s/400.0/BAD REQUEST/;s/403.0/FORBIDDEN/;s/404.0/NOT FOUND/;s/429.0/TOO MANY REQUESTS/;s/500.0/INTERNAL SERVER ERROR/;s/502.0/BAD GATEWAY/;s/503.0/SERVICE UNAVAILABLE/;s/504.0/GATEWAY TIMEOUT/'
如何修改我的代码以替换第 8 列中的代码?
编辑:非常笨重的解决方案,但它有效:
sed -r 'h; s/^(([^,]*,){7}).*/\1/; x; s/^(([^,]*,){7})//; s/200.0/OK/;s/206.0/PARTIAL CONTENT/;s/301.0/MOVED PERMANENTLY/;s/304.0/NOT MODIFIED/;s/400.0/BAD REQUEST/;s/403.0/FORBIDDEN/;s/404.0/NOT FOUND/;s/429.0/TOO MANY REQUESTS/;s/500.0/INTERNAL SERVER ERROR/;s/502.0/BAD GATEWAY/;s/503.0/SERVICE UNAVAILABLE/;s/504.0/GATEWAY TIMEOUT/; H; x; s/\n//'
答案1
做到这一点的最好方法是使用查找表,以及类似awk
orperl
而不是 的东西sed
。例如:
awk '
BEGIN {
FS=OFS=",";
codes[200] = "OK";
codes[206] = "PARTIAL CONTENT";
codes[301] = "MOVED PERMANENTLY";
codes[304] = "NOT MODIFIED";
codes[400] = "BAD REQUEST";
codes[403] = "FORBIDDEN";
codes[404] = "NOT FOUND";
codes[429] = "TOO MANY REQUESTS";
codes[500] = "INTERNAL SERVER ERROR";
codes[502] = "BAD GATEWAY";
codes[503] = "SERVICE UNAVAILABLE";
codes[504] = "GATEWAY TIMEOUT";
};
FNR == 1 { next }; # skip header line
{ c = $8+0; if (c in codes) { $8 = codes[c] } };
1
' log.csv
101.xx.xxx.xx,2017-06-23,00:00:00,0.0,1238039.0,0001179110-17-009492,calc.xml,MOVED PERMANENTLY,654.0,0.0,0.0,0.0,10.0,0.0,
101.xx.xxx.xx,2017-06-25,00:00:00,0.0,793347.0,0000798086-17-000026,index.htm,OK,31791.0,1.0,0.0,0.0,9.0,0.0,
101.xx.xxx.xx,2017-06-28,00:00:00,0.0,918537.0,0001209191-17-041401,index.htm,OK,9936.0,1.0,0.0,0.0,9.0,0.0,
使$8+0
awk 将第 8 个字段评估为数字,这会导致它删除不必要的.0
.我不确定为什么你的日志文件有一个 HTTP 结果代码的浮点数,但如果文件中有这样的内容,则需要对其进行处理。您可以通过使索引包含 来做到这一点.0
,我更喜欢使用整数值。
如果字段 8 中的代码编号未知,则保持不变。否则,它将被数组中相应的值替换codes
。
顺便说一句,每行末尾的分号在 awk 中是可选的,只有当一行上有多个语句时才需要它们。我把它们留在那里,这样如果你愿意的话,你可以将整个脚本压缩成一行难以阅读的长行。有些人喜欢这样做。我不知道为什么。也许是受虐狂。我认为最好将实际脚本保存到文件中并使用awk -f
或使用一行运行它#!/usr/bin/awk -f
。
另外顺便说一句,找到包含完整 HTTP 响应代码表的文件或网页并不困难。将其保存到文本文件中,将其编辑为合适的格式(numeric-code<tab>description
例如)并让 awk 脚本在任何输入文件之前读取该文件并将其存储在数组中也不困难,而不是硬编码在 BEGIN 块中对数组进行编码。该表不会经常更改,因此可能不值得打扰......但对于需要简单查找表的其他作业来说,请记住这一点。
最后,这是一个 perl 版本。这个使用的是HTTP::状态库模块来自HTTP::消息库集合,其中已包含所有 HTTP 状态代码。
$ perl -MHTTP::Status -F, -lane '
next if $. == 1; # skip header line
$msg = status_message($F[7]+0); # perl arrays start from 0, not 1
$F[7] = uc($msg) if $msg; # uc() to all-caps the msg
print join(",",@F);
close(ARGV) if eof' log.csv
101.xx.xxx.xx,2017-06-23,00:00:00,0.0,1238039.0,0001179110-17-009492,calc.xml,MOVED PERMANENTLY,654.0,0.0,0.0,0.0,10.0,0.0
101.xx.xxx.xx,2017-06-25,00:00:00,0.0,793347.0,0000798086-17-000026,index.htm,OK,31791.0,1.0,0.0,0.0,9.0,0.0
101.xx.xxx.xx,2017-06-28,00:00:00,0.0,918537.0,0001209191-17-041401,index.htm,OK,9936.0,1.0,0.0,0.0,9.0,0.0
除了不必“重新发明轮子”之外,使用 perl 而不是 awk 的其他好处之一是您可以使用文本::CSV模块用于正确的 CSV 解析器(即可以处理引号和嵌入带引号的字段中的逗号),并确保您的输出格式正确的 CSV(再次,根据需要使用引号)。
答案2
尝试这个:
sed -r 's/^(([^,]*,){7})200\.0/\1OK/;s/206.0/PARTIAL CONTENT/;s/301.0/MOVED PERMANENTLY/;s/304.0/NOT MODIFIED/;s/400.0/BAD REQUEST/;s/403.0/FORBIDDEN/;s/404.0/NOT FOUND/;s/429.0/TOO MANY REQUESTS/;s/500.0/INTERNAL SERVER ERROR/;s/502.0/BAD GATEWAY/;s/503.0/SERVICE UNAVAILABLE/;s/504.0/GATEWAY TIMEOUT/'
唯一的变化是从/200.0/
到200\.0/\1
,测试句点\.
而不是任何字符.
并使用反向引用 '\1'。想必您确信其他响应代码(206.0 以上)不会作为该行前面的文本出现(206.0 可能是 IP 地址的一部分...)?
它可以使用其他工具来完成,例如awk
更优雅的工具,但我认为这可以满足您的需求并建立在您已有的基础上。