使用 sed 替换 CSV 中特定列的内容

使用 sed 替换 CSV 中特定列的内容

我有几个 CSV 日志文件,想要用描述替换这些日志文件第 8 列中存储的状态代码

日志文件如下所示:

ip,date,time,zone,cik,accession,extention,code,size,idx,norefer,noagent,find,crawler,browser
101.xx.xxx.xx,2017-06-23,00:00:00,0.0,1238039.0,0001179110-17-009492,calc.xml,301.0,654.0,0.0,0.0,0.0,10.0,0.0,
101.xx.xxx.xx,2017-06-25,00:00:00,0.0,793347.0,0000798086-17-000026,index.htm,200.0,31791.0,1.0,0.0,0.0,9.0,0.0,
101.xx.xxx.xx,2017-06-28,00:00:00,0.0,918537.0,0001209191-17-041401,index.htm,200.0,9936.0,1.0,0.0,0.0,9.0,0.0,

我想要实现的结果如下所示:

101.xx.xxx.xx,2017-06-23,00:00:00,0.0,1238039.0,0001179110-17-009492,calc.xml,MOVED PERMANENTLY,654.0,0.0,0.0,0.0,10.0,0.0,
101.xx.xxx.xx,2017-06-25,00:00:00,0.0,793347.0,0000798086-17-000026,index.htm,OK,31791.0,1.0,0.0,0.0,9.0,0.0,
101.xx.xxx.xx,2017-06-28,00:00:00,0.0,918537.0,0001209191-17-041401,index.htm,OK,9936.0,1.0,0.0,0.0,9.0,0.0,

我的代码目前看起来像这样,但它没有像我想要的那样访问第 8 列

 sed -r 's/^(([^,]*,){7})/200.0/OK/;s/206.0/PARTIAL CONTENT/;s/301.0/MOVED PERMANENTLY/;s/304.0/NOT MODIFIED/;s/400.0/BAD REQUEST/;s/403.0/FORBIDDEN/;s/404.0/NOT FOUND/;s/429.0/TOO MANY REQUESTS/;s/500.0/INTERNAL SERVER ERROR/;s/502.0/BAD GATEWAY/;s/503.0/SERVICE UNAVAILABLE/;s/504.0/GATEWAY TIMEOUT/'

如何修改我的代码以替换第 8 列中的代码?

编辑:非常笨重的解决方案,但它有效:

 sed -r 'h; s/^(([^,]*,){7}).*/\1/; x; s/^(([^,]*,){7})//; s/200.0/OK/;s/206.0/PARTIAL CONTENT/;s/301.0/MOVED PERMANENTLY/;s/304.0/NOT MODIFIED/;s/400.0/BAD REQUEST/;s/403.0/FORBIDDEN/;s/404.0/NOT FOUND/;s/429.0/TOO MANY REQUESTS/;s/500.0/INTERNAL SERVER ERROR/;s/502.0/BAD GATEWAY/;s/503.0/SERVICE UNAVAILABLE/;s/504.0/GATEWAY TIMEOUT/; H; x; s/\n//'

答案1

做到这一点的最好方法是使用查找表,以及类似awkorperl而不是 的东西sed。例如:

awk '
  BEGIN {
    FS=OFS=",";
    codes[200] = "OK";
    codes[206] = "PARTIAL CONTENT";
    codes[301] = "MOVED PERMANENTLY";
    codes[304] = "NOT MODIFIED";
    codes[400] = "BAD REQUEST";
    codes[403] = "FORBIDDEN";
    codes[404] = "NOT FOUND";
    codes[429] = "TOO MANY REQUESTS";
    codes[500] = "INTERNAL SERVER ERROR";
    codes[502] = "BAD GATEWAY";
    codes[503] = "SERVICE UNAVAILABLE";
    codes[504] = "GATEWAY TIMEOUT";
  };
  FNR == 1 { next };   # skip header line
  { c = $8+0; if (c in codes) { $8 = codes[c] } };
  1
' log.csv
101.xx.xxx.xx,2017-06-23,00:00:00,0.0,1238039.0,0001179110-17-009492,calc.xml,MOVED PERMANENTLY,654.0,0.0,0.0,0.0,10.0,0.0,
101.xx.xxx.xx,2017-06-25,00:00:00,0.0,793347.0,0000798086-17-000026,index.htm,OK,31791.0,1.0,0.0,0.0,9.0,0.0,
101.xx.xxx.xx,2017-06-28,00:00:00,0.0,918537.0,0001209191-17-041401,index.htm,OK,9936.0,1.0,0.0,0.0,9.0,0.0,

使$8+0awk 将第 8 个字段评估为数字,这会导致它删除不必要的.0.我不确定为什么你的日志文件有一个 HTTP 结果代码的浮点数,但如果文件中有这样的内容,则需要对其进行处理。您可以通过使索引包含 来做到这一点.0,我更喜欢使用整数值。

如果字段 8 中的代码编号未知,则保持不变。否则,它将被数组中相应的值替换codes


顺便说一句,每行末尾的分号在 awk 中是可选的,只有当一行上有多个语句时才需要它们。我把它们留在那里,这样如果你愿意的话,你可以将整个脚本压缩成一行难以阅读的长行。有些人喜欢这样做。我不知道为什么。也许是受虐狂。我认为最好将实际脚本保存到文件中并使用awk -f或使用一行运行它#!/usr/bin/awk -f


另外顺便说一句,找到包含完整 HTTP 响应代码表的文件或网页并不困难。将其保存到文本文件中,将其编辑为合适的格式(numeric-code<tab>description例如)并让 awk 脚本在任何输入文件之前读取该文件并将其存储在数组中也不困难,而不是硬编码在 BEGIN 块中对数组进行编码。该表不会经常更改,因此可能不值得打扰......但对于需要简单查找表的其他作业来说,请记住这一点。


最后,这是一个 perl 版本。这个使用的是HTTP::状态库模块来自HTTP::消息库集合,其中已包含所有 HTTP 状态代码。

$ perl -MHTTP::Status -F, -lane '
  next if $. == 1;                # skip header line
  $msg = status_message($F[7]+0); # perl arrays start from 0, not 1
  $F[7] = uc($msg) if $msg;       # uc() to all-caps the msg
  print join(",",@F);
  close(ARGV) if eof' log.csv 
101.xx.xxx.xx,2017-06-23,00:00:00,0.0,1238039.0,0001179110-17-009492,calc.xml,MOVED PERMANENTLY,654.0,0.0,0.0,0.0,10.0,0.0
101.xx.xxx.xx,2017-06-25,00:00:00,0.0,793347.0,0000798086-17-000026,index.htm,OK,31791.0,1.0,0.0,0.0,9.0,0.0
101.xx.xxx.xx,2017-06-28,00:00:00,0.0,918537.0,0001209191-17-041401,index.htm,OK,9936.0,1.0,0.0,0.0,9.0,0.0

除了不必“重新发明轮子”之外,使用 perl 而不是 awk 的其他好处之一是您可以使用文本::CSV模块用于正确的 CSV 解析器(即可以处理引号和嵌入带引号的字段中的逗号),并确保您的输出格式正确的 CSV(再次,根据需要使用引号)。

答案2

尝试这个:

sed -r 's/^(([^,]*,){7})200\.0/\1OK/;s/206.0/PARTIAL CONTENT/;s/301.0/MOVED PERMANENTLY/;s/304.0/NOT MODIFIED/;s/400.0/BAD REQUEST/;s/403.0/FORBIDDEN/;s/404.0/NOT FOUND/;s/429.0/TOO MANY REQUESTS/;s/500.0/INTERNAL SERVER ERROR/;s/502.0/BAD GATEWAY/;s/503.0/SERVICE UNAVAILABLE/;s/504.0/GATEWAY TIMEOUT/'

唯一的变化是从/200.0/200\.0/\1,测试句点\.而不是任何字符.并使用反向引用 '\1'。想必您确信其他响应代码(206.0 以上)不会作为该行前面的文本出现(206.0 可能是 IP 地址的一部分...)?

它可以使用其他工具来完成,例如awk更优雅的工具,但我认为这可以满足您的需求并建立在您已有的基础上。

相关内容