I'm having some trouble writing an awk script that checks, and if necessary corrects, every line of a text file.
Consider this example:
$ cat employee.txt
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla",
"Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Techno
logy","6000"
"501","Ritu","Accounting","5400"
As you can see, some lines are broken in the wrong place. The file should look like this:
$ cat employee.txt
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"
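So I'd like to know whether there is a way in awk to detect when a line does not follow the pattern, for example by checking the number of commas in each line, and then to join the broken line back together.
I receive files like this with hundreds or thousands of lines, so fixing all the broken lines by hand is very tedious.
I'm creating a control file to load the data into a table with SQLLDR, but I get errors because the text file contains broken lines. So my solution is to fix every line with a script.
Any ideas? The script doesn't have to be in awk.
Answer 1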
$ awk -F, 'FNR == 1 { nf = NF } { while (NF < nf || !/[^,]"$/) { line = $0; getline; $0 = line $0 }; print }' file
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"
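This uses awk and assumes that the first line has the correct number of fields and that no field can contain an embedded comma. It further assumes that no line will have too many fields, i.e. a line may contain extra newlines, but no line has been joined with the next or previous line.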
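When a line is found with too few fields (or one that does not end with a " character, which means the last field was split), the current line is saved in the variable line and the next line is read. The current line is then updated to be the concatenation of line and the line just read. This continues (in the case of several consecutive split lines) until we end up with something that has the correct number of fields. The reconstructed line is then printed.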
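For readability, the same joining loop can be spread over several lines. The following is a minimal sketch rather than the answerer's original code: the function name fix_broken_lines is hypothetical, and the check on the return value of getline is an addition, so that a file whose last line is broken cannot keep the loop running after the input is exhausted.

```shell
# Sketch only: "fix_broken_lines" is a hypothetical helper name.
# Reads the named files, or standard input if none are given.
fix_broken_lines() {
    awk -F, '
        FNR == 1 { nf = NF }            # assume line 1 has the right field count
        {
            # Keep appending the next line while the record looks incomplete.
            while (NF < nf || !/[^,]"$/) {
                line = $0
                if ((getline) <= 0)     # added guard: stop when input runs out
                    break
                $0 = line $0            # assigning to $0 recomputes NF
            }
            print
        }
    ' "$@"
}
```

It would be called as fix_broken_lines employee.txt, with the repaired lines written to standard output.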
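NF is a special awk variable that holds the number of fields in the current record (by default a record is one line). The number is updated automatically whenever $0 (the current record) is assigned to or a new record is read. The nf variable is our own variable, set from the first line to "the correct number of fields".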
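The automatic update of NF can be seen in isolation. This is a small illustrative sketch (the variable names before and result are made up for the example): it takes the first half of a broken record, appends the missing half to $0, and prints the field count before and after.

```shell
# Feed awk one broken line, then append the missing fields to $0.
result=$(printf '%s\n' '"300","Mayla",' |
    awk -F, '{
        before = NF                              # trailing comma gives an empty 3rd field
        $0 = $0 "\"Technology\",\"7000\""        # assigning to $0 re-splits the record
        print before, NF
    }')
echo "$result"    # prints: 3 4
```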
Answer 2
You can simply fix the text with a regular expression:
<input.csv perl -pe 's/^(.+)([^"])\n$/$1$2/g'
which gives you
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"
Answer 3
A short awk approach:
awk -F, '{ printf "%s%s", $0, $NF ~ /^$|[^"]$/? "":ORS }' file
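$NF ~ /^$|[^"]$/ - checks whether the last field $NF is an empty string (^$) or ends with a character other than a double quote ([^"]$). If it does, the line is printed without a trailing newline, so the next line is appended to it.
Output: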
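To see which lines the test flags, the same condition can be applied to a few sample lines on its own. This is only an illustrative sketch: the labels "join with next line" and "complete" and the variable name result are invented for the example.

```shell
# Apply the last-field test from the one-liner to three sample lines.
result=$(printf '%s\n' '"300","Mayla",' '"500","Randy","Techno' '"100","Thomas","Sales","5000"' |
    awk -F, '{ print NR ": " ($NF ~ /^$|[^"]$/ ? "join with next line" : "complete") }')
echo "$result"
# 1: join with next line   (last field is empty)
# 2: join with next line   (last field ends in "o", not a quote)
# 3: complete              (last field ends in a double quote)
```

Note that the test looks only at the end of the line, so a line that happens to break right after a closing quote (before the comma) would be reported as complete.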
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"
Answer 4
Another awk solution:
awk -F, 'NF==4 { print $0 }; NF!=4 { str= $0; getline; print str $0 }' employee.txt
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"