Awk script to check every line in a file

I am having trouble writing an awk script that checks, and where necessary corrects, every line in a text file.

Consider this example:

$ cat employee.txt
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla",
"Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Techno
logy","6000"
"501","Ritu","Accounting","5400"

As you can see, some lines are broken at the wrong point. The file should look like this:

$ cat employee.txt
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"

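So I would like to know whether there is a way in awk to detect lines that do not follow the pattern, for example by checking the number of commas in each line, and then to join each broken line back onto the line before it.

I receive files like this with hundreds or thousands of lines, so fixing every broken line by hand is very tedious.

I am creating a control file to load the data into a table with SQLLDR, but the load fails because the text file contains broken lines. My plan is therefore to repair every line with a script.

Any ideas? The script does not have to be in awk.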

Answer 1

$ awk -F, 'FNR == 1 { nf = NF } { while (NF < nf || !/[^,]"$/) { line = $0; getline; $0 = line $0 }; print }' file
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"

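This uses awk and assumes that the first line has the correct number of fields and that no field contains an embedded comma. It further assumes that no line has too many fields, i.e. a record may contain extra newlines, but no record has been joined onto the next or previous one.

When a line with too few fields is found (or a line that does not end in a " character preceded by something other than a comma, which means the last field was split), the current line is saved in the variable line and the next line is read. The current line is then replaced by the concatenation of line and the line just read. This repeats (in case of several consecutive splits) until we end up with something that has the correct number of fields. The rebuilt line is then printed.

NF is a special awk variable that holds the number of fields in the current record (by default, a record is one line). It is updated automatically whenever $0 (the current record) is assigned to or a new record is read. The nf variable is our own; it is set from the first line and holds the "correct number of fields".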
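
For readability, here is a sketch of the same loop spread over several lines and wrapped in a shell function. The function name join_broken_lines and the sample input are my own, for illustration only. The `(getline) <= 0` check is an addition that is not in the one-liner above: it leaves the loop if the input ends in the middle of a broken record.

```shell
# Hypothetical helper: reads broken CSV on stdin, writes joined records to stdout.
join_broken_lines() {
    awk -F, '
    FNR == 1 { nf = NF }                # trust the first line for the field count
    {
        # Keep gluing on the next line while the record has too few fields
        # or does not end in a closing quote that follows a non-comma.
        while (NF < nf || !/[^,]"$/) {
            line = $0
            if ((getline) <= 0) break   # added guard: stop at end of input
            $0 = line $0                # assigning to $0 recomputes NF
        }
        print
    }'
}

# Sample input with one break after a comma and one break inside a field.
printf '%s\n' \
    '"100","Thomas","Sales","5000"' \
    '"300","Mayla",' \
    '"Technology","7000"' \
    '"500","Randy","Techno' \
    'logy","6000"' | join_broken_lines
```

With this sample the three records come out on one line each. On the real file it would be called as join_broken_lines < employee.txt.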

Answer 2

You can simply correct the text with a regular expression:

<input.csv perl -pe 's/^(.+)([^"])\n$/\1\2/g'

which gives you

"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"

Answer 3

A short awk approach:

awk -F, '{ printf "%s%s", $0, $NF ~ /^$|[^"]$/? "":ORS }' file
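  • $NF ~ /^$|[^"]$/ - checks whether the last field $NF is an empty string (^$) or ends with a character other than a double quote ([^"]$); if so, the line is printed without a trailing newline, so the next line is appended to it

Output: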

"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"
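
Before rewriting a large file it may help to see which physical lines this test treats as incomplete. The following is only a sketch that reuses the same condition to print the line number and text of each such line instead of joining it. The function name report_broken_lines and the sample input are made up for illustration.

```shell
# Hypothetical dry run: list the lines whose last field is empty
# or does not end in a double quote (the condition used above).
report_broken_lines() {
    awk -F, '$NF ~ /^$|[^"]$/ { printf "%d: %s\n", FNR, $0 }'
}

printf '%s\n' \
    '"100","Thomas","Sales","5000"' \
    '"300","Mayla",' \
    '"Technology","7000"' \
    '"500","Randy","Techno' \
    'logy","6000"' | report_broken_lines
```

For this sample it reports lines 2 and 4. Note that the condition only looks at how a line ends, so a record that is broken right after a closing quote (before the comma) would not be flagged.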

Answer 4

Another awk solution:

awk -F, 'NF==4 { print $0 }; NF!=4 { str= $0; getline; print str $0 }' employee.txt

"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"
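
The command above reads exactly one extra line per broken record, so a record split across three or more lines would still come out wrong. A possible variant, sketched here under the same assumptions (four fields, no embedded commas), accumulates lines in a buffer until the field count is reached. The names join_by_field_count, want and buf are my own, not part of the answer.

```shell
# Hypothetical helper: buffer lines until a record has the wanted field count.
join_by_field_count() {
    awk -v want=4 '
    {
        buf = buf $0                        # glue this line onto any pending fragment
        n = split(buf, parts, ",")          # number of comma-separated fields so far
        if (n >= want && buf ~ /[^,]"$/) {  # enough fields and a closed last field
            print buf
            buf = ""
        }
    }
    END { if (buf != "") print buf }        # flush an unfinished trailing fragment
    '
}

# Sample input: the second record is split across three lines.
printf '%s\n' \
    '"100","Thomas","Sales","5000"' \
    '"500",' \
    '"Randy","Techno' \
    'logy","6000"' | join_by_field_count
```

For this sample it prints the two complete records. The field count is passed with -v want=4 so it can be changed inside the function for files with a different number of columns.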
