Delete lines in a CSV file that do not match the required format

I have a large number of automatically generated CSV files that look like this:

1603145914502,48.12,0.085,s
1603145914815,48.12,0.020,s
1603145914941,48.12,0.019,s
1603145915404,48.12,0.031,s
1603145915612,48.12,0.033,s
1603145915899,48.12,0.019,s

Each line holds an integer, two floats, and a letter.

Some of the files were corrupted by threading problems during generation:

1603145914502,48.12,0.085,s
1603145914815,48.12,0.020,s
1603145914941,48.12,0.019,s
1603145915404,48.12,0.031,s
1603145915612,48.12,0.033,s
1603145915899,48.12,0.019,s
1603145914502,48.12,0.085,s915899,48.12,0.019,s
1603145914815,48.12,0.020,s
1603145914941,48.12,0.019,s
1603145915404,48.12,0.031,s
1603145915612,48.12,0.033,s
1603145915899,48.12,0.019,s
1459143
1603145914815,48.12,0.020,s
1603145914941,48.12,0.019,s
1603145915404,48.12,0.031,s

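Is there a way to find and delete the lines that do not match the format? It looks like awk could do this well, but I have no idea how to use it :)

If there is a way to do this, I would really appreciate it if the command were also explained, so that I can learn something from it.


Edit: to clarify the format:

int, float, float, char

There is never a space after a comma. The values can be anything that fits the format above.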

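Answer 1

Any one of these should be all you need to match the simple/basic format INT,FLOAT,FLOAT,CHAR (i.e. unsigned numbers, no exponents). Each one prints only the lines that match: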

grep -E '^[0-9]+,([0-9]+\.[0-9]+,){2}[[:alpha:]]$' file

sed -En '/^[0-9]+,([0-9]+\.[0-9]+,){2}[[:alpha:]]$/p' file

awk '/^[0-9]+,([0-9]+\.[0-9]+,){2}[[:alpha:]]$/' file
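The three commands above print the matching lines to standard output and leave the file itself unchanged. Here is a minimal sketch of applying the grep variant to every file in a directory, keeping a backup of each original. The names csv_dir and sample.csv and the .bak suffix are assumptions made for illustration, not part of the question.

```shell
# Sketch: keep only well-formed lines in every CSV file of a directory.
# csv_dir, sample.csv and the .bak suffix are hypothetical names.
csv_dir=$(mktemp -d)

# Sample data: two good lines, one merged line, one truncated line.
printf '%s\n' \
  '1603145914502,48.12,0.085,s' \
  '1603145914502,48.12,0.085,s915899,48.12,0.019,s' \
  '1459143' \
  '1603145914815,48.12,0.020,s' > "$csv_dir/sample.csv"

pattern='^[0-9]+,([0-9]+\.[0-9]+,){2}[[:alpha:]]$'

for f in "$csv_dir"/*.csv; do
  cp "$f" "$f.bak"    # keep the original as a backup
  # grep exits with status 1 when no line matches; that still leaves a
  # valid (empty) result, so only other non-zero statuses are reported.
  grep -E "$pattern" "$f.bak" > "$f" || [ $? -eq 1 ] || echo "grep failed on $f" >&2
done

cat "$csv_dir/sample.csv"
```

Filtering from the backup copy into the original name avoids reading and writing the same file in one redirection, which would truncate the input before grep reads it.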

Answer 2

Is there a way to find and delete the lines that do not match the format?

There are many ways; here is one:

$ perl -n -i.bak -e 'print if /^\d{13},\d\d\.\d\d,\d\.\d\d\d,s$/' t.dat

$ diff t.dat.bak t.dat
7d6
< 1603145914502,48.12,0.085,s915899,48.12,0.019,s
13d11
< 1459143

$ cat t.dat
1603145914502,48.12,0.085,s
1603145914815,48.12,0.020,s
1603145914941,48.12,0.019,s
1603145915404,48.12,0.031,s
1603145915612,48.12,0.033,s
1603145915899,48.12,0.019,s
1603145914815,48.12,0.020,s
1603145914941,48.12,0.019,s
1603145915404,48.12,0.031,s
1603145915612,48.12,0.033,s
1603145915899,48.12,0.019,s
1603145914815,48.12,0.020,s
1603145914941,48.12,0.019,s
1603145915404,48.12,0.031,s
$

I tend to reach for perl before awk/sed, but the same thing can be done in much the same way with awk.
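As a rough illustration of the awk route, here is a minimal sketch. It assumes standard awk, which has no counterpart to perl's -i option, so the backup is made by hand. The work_dir variable and the sample contents of t.dat are hypothetical.

```shell
# Sketch of an awk equivalent of the perl one-liner, with a manual backup.
# work_dir is a hypothetical scratch directory; t.dat holds sample data.
work_dir=$(mktemp -d)
printf '%s\n' \
  '1603145914502,48.12,0.085,s' \
  '1459143' \
  '1603145914815,48.12,0.020,s915899,48.12,0.019,s' \
  '1603145914941,48.12,0.019,s' > "$work_dir/t.dat"

# Copy first, then filter the copy back into the original name.
# An awk pattern with no action prints every line that matches it.
cp "$work_dir/t.dat" "$work_dir/t.dat.bak"
awk '/^[0-9]+,[0-9]+\.[0-9]+,[0-9]+\.[0-9]+,[a-zA-Z]$/' \
  "$work_dir/t.dat.bak" > "$work_dir/t.dat"

cat "$work_dir/t.dat"
```

The pattern is written without {n} repetition counts and without [[:alpha:]] because some older awk implementations reportedly do not support them. With a current gawk the shorter form should work as well.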


I would really appreciate it if the command were also explained, so that I can learn something from it.

Explanation

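  • -n loops over the lines of the file, but does not print them to STDOUT automatically
  • -i edits the file in place
  • -i.bak additionally keeps a backup copy with the given file extension, in case I have made a mistake!
  • -e 'script' runs the commands in script (on each line of input, because of the -n option)
  • print if ... prints the line if the conditional expression matches
  • / ... / uses this regular expression for pattern matching
  • ^ matches at the start of a line
  • \d matches one digit
  • {13} matches exactly thirteen of the preceding item
  • , matches a literal comma
  • \. matches a literal full stop (otherwise . is a wildcard metacharacter)
  • s matches a literal s
  • $ matches the end of the line (i.e. there can be no further characters on the line)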

A more flexible expression is ^\d+,\d+\.\d+,\d+\.\d+,[a-zA-Z]$

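  • + at least one of the preceding item
  • [...] one character from the specified set
  • [a-z] any lowercase ASCII letter between a and z (inclusive)
  • [[:alpha:]] any character in the POSIX alphabetic class
  • \p{Lowercase_Letter} any character with the Unicode Lowercase_Letter property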
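To show why the full stops should be escaped, here is a small sketch that runs one malformed line through two grep -E patterns that differ only in the escaping. The variable names and the sample line are made up for this example.

```shell
# A line whose first "float" has a letter where the decimal point should be.
bad_line='1603145914502,48x12,0.085,s'

# grep -c prints the number of matching lines. It exits non-zero when
# nothing matches, so "|| true" keeps scripts run with "set -e" alive.

# Unescaped dots match any character, so the bad line is accepted.
loose=$(printf '%s\n' "$bad_line" |
  grep -Ec '^[0-9]+,[0-9]+.[0-9]+,[0-9]+.[0-9]+,[a-zA-Z]$' || true)

# Escaped dots match only a literal ".", so the bad line is rejected.
strict=$(printf '%s\n' "$bad_line" |
  grep -Ec '^[0-9]+,[0-9]+\.[0-9]+,[0-9]+\.[0-9]+,[a-zA-Z]$' || true)

echo "unescaped: $loose, escaped: $strict"
```

The same distinction applies to the perl expressions, where \. is likewise needed to match a literal full stop.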

Perl regular expressions differ slightly from those used by awk/grep. GNU grep has an option for Perl-style regular expressions; see the -P option in the grep man page.

Answer 3

# expect
#          1         2
# 123456789012345678901234567
# 160314591xxxx,48.12,0.0xx,s

grep -Ex '160314591[0-9]{4},48\.12,0\.0[0-9]{2},s' < file.csv

will do a strict match. You can adjust the regular expression to make it more or less strict about what it matches.
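As a sketch of that trade-off, the example below counts matches for the strict pattern and for a looser one that fixes only the field widths. The strict_dir variable and the sample lines, including the 47.90 value, are invented for illustration.

```shell
# Sample lines: the second has different field values, the third is broken.
strict_dir=$(mktemp -d)
printf '%s\n' \
  '1603145914502,48.12,0.085,s' \
  '1603145914815,47.90,0.120,s' \
  '1459143' > "$strict_dir/file.csv"

# Strict: fixed timestamp prefix and a literal 48.12 in the second field.
# -x matches whole lines only, -c prints the number of matching lines.
strict_count=$(grep -Exc '160314591[0-9]{4},48\.12,0\.0[0-9]{2},s' \
  "$strict_dir/file.csv" || true)

# Looser: constrain only the field widths, not the values themselves.
loose_count=$(grep -Exc '[0-9]{13},[0-9]{2}\.[0-9]{2},[0-9]\.[0-9]{3},[a-z]' \
  "$strict_dir/file.csv" || true)

echo "strict: $strict_count, loose: $loose_count"
```

A stricter pattern catches more kinds of corruption but also rejects valid lines whose values drift outside the hard-coded ranges, so the right level depends on how stable the data really is.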
