如何使用 awk 来纠正和统一具有多列和多行的损坏文件?

如何使用 awk 来纠正和统一具有多列和多行的损坏文件?

我有一个 CSV 格式的多行文件,包含 5 列(字段)。我需要统一并更正损坏的第一列,其中有许多我需要统一的不同格式的代码。我的第一列代码的完整最终格式应该是00AB[0-9][0-9][0-9][0-9][0-9]其中 [0-9] 可以是任意数字如00AB21345。前四位数字IE00AB 应该始终保持原样。但之后的 5 位数字(IE[0-9][0-9][0-9][0-9][0-9]) 可以是任何数字,如果有任何 > 5 位数字,则最左边缺失的数字应替换为0。

Example  <111> --> <00AB00111> ; or <1111> --> <00AB01111>. 

举个例子,假设我有一个以下文件:

111     xx  yy  zzz ddd
1111    xx  yy  zzz ddd
11111   xx  yy  zzz ddd
A111    xx  yy  zzz ddd
A1111   xx  yy  zzz ddd
A11111  xx  yy  zzz ddd
AB111   xx  yy  zzz ddd
AB1111  xx  yy  zzz ddd
AB11111 xx  yy  zzz ddd
0A111   xx  yy  zzz ddd
0A1111  xx  yy  zzz ddd
0A11111 xx  yy  zzz ddd
0AB111  xx  yy  zzz ddd
0AB1111 xx  yy  zzz ddd
0AB11111 xx yy  zzz ddd
00A111  xx  yy  zzz ddd
00A1111 xx  yy  zzz ddd
00A11111xx  yy  zzz ddd
00AB111 xx  yy  zzz ddd
00AB1111 xx yy  zzz ddd
0AB11111 xx yy  zzz ddd
00AB12344   xx  yy  zzz ddd
00AB34527   xx  yy  zzz ddd
00AB56278   xx  yy  zzz ddd
00AB98902   xx  yy  zzz ddd

为了涵盖所有可能的情况,我编写了以下很长的 awk 脚本。粗体格式代表可以在我的文件中找到需要更正的潜在场景。
我的请求,有人知道任何 awk 脚本可以用更小的脚本解决这个问题吗?如果是的话,你能详细解释给我学习吗:)

##111 Awk -F',' '{if($0~/[0-9][0-9][0-9]/){print "001AB00"suBstr($1,1,3)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' SC3.csv > y1.csv

##1111
Awk -F',' '{if($0~/[0-9][0-9][0-9][0-9]/){print "001AB"suBstr($1,1,4)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y1.csv > y2.csv
##11111
Awk -F',' '{if($0~/[0-9][0-9][0-9][0-9][0-9]/){print "001AB" suBstr($1,1,5)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y2.csv > y3.csv
##A111
Awk -F',' '{if($0~/[A-Z][0-9][0-9][0-9]/){print "001"suBstr($1,1,1) "B00"suBstr($1,2,4)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y3.csv > y4.csv
##A1111
Awk -F',' '{if($0~/[A-Z][0-9][0-9][0-9][0-9]/){print "001"suBstr($1,1,1) "B0" suBstr($1,2,5)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y4.csv > y5.csv
##A11111
Awk -F',' '{if($0~/[A-Z][0-9][0-9][0-9[0-9][0-9]/){print "001"suBstr($1,1,1) "B" suBstr($1,2,6)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y5.csv > y6.csv
##AB111
Awk -F',' '{if($0~/[A-Z][A-Z][0-9][0-9][0-9]/){print "001"suBstr($1,1,2) "00" suBstr($1,3,5)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y6.csv > y7.csv
##AB1111
Awk -F',' '{if($0~/[A-Z][A-Z][0-9][0-9][0-9][0-9]/){print "001"suBstr($1,1,2)"0" suBstr($1,3,6)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y7.csv > y8.csv
##AB11111
Awk -F',' '{if($0~/[A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]/){print "001"suBstr($1,1,7)","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y8.csv > y9.csv
##1A111
Awk -F',' '{if($0~/[0-9][A-Z][0-9][0-9][0-9]/){print "00"suBstr($1,1,2) ",B00" suBstr($1,3,5) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y9.csv > y10.csv
##1A1111  
Awk -F',' '{if($0~/[0-9][A-Z][0-9][0-9][0-9][0-9]/){print "00"suBstr($1,1,1) "B0" suBstr($1,3,6) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y10.csv > y11.csv
##1A11111
Awk -F',' '{if($0~/[0-9][A-Z][0-9][0-9][0-9][0-9][0-9]/){print "00"suBstr($1,1,2) "B" suBstr($1,3,7) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y11.csv > y12.csv
##1AB111
Awk -F',' '{if($0~/[0-9][A-Z][A-Z][0-9][0-9][0-9]/){print "00"suBstr($1,1,1) suBstr($1,1,3)"00" suBstr($1,4,6) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y12.csv > y13.csv
##1AB1111
Awk -F',' '{if($0~/[0-9][A-Z][A-Z][0-9][0-9][0-9][0-9]/){print "00" suBstr($1,1,3) "0" suBstr($1,4,7) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y13.csv > y14.csv
##1AB11111
Awk -F',' '{if($0~/[0-9][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]/){print "00" suBstr($1,1,8) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y14.csv > y15.csv
##11A111
Awk -F',' '{if($0~/[0-9][0-9][A-Z][0-9][0-9][0-9]/){print "0" suBstr($1,1,3)"B00" suBstr($1,4,6) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y15.csv > y16.csv
##11A1111
Awk -F',' '{if($0~/[0-9][0-9][A-Z][0-9][0-9][0-9]/){print "0" suBstr($1,1,3)"B0" suBstr($1,4,7) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y16.csv > y17.csv
##11A11111
Awk -F',' '{if($0~/[0-9][0-9][A-Z][0-9] [0-9][0-9][0-9]/){print "0" suBstr($1,1,3)"B" suBstr($1,4,8) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y17.csv > y18.csv
##11AB111
Awk -F',' '{if($0~/[0-9][0-9] [A-Z][[A-Z][0-9][0-9][0-9]/){print "0" suBstr($1,1,4)"00" suBstr($1,5,7) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y18.csv > y19.csv
##11AB1111
Awk -F',' '{if($0~/[0-9][0-9] [A-Z][[A-Z][0-9][0-9][0-9][0-9]/){print "0" suBstr($1,1,4)"0" suBstr($1,5,8) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y19.csv > y20.csv
##1AB11111
Awk -F',' '{if($0~/[0-9][0-9] [A-Z][[A-Z][0-9][0-9][0-9][0-9][0-9]/){print "0" suBstr($1,5,9) ","$2","$3","$4","$5;}else{print $1","$2","$3","$4","$5;}}' y20.csv > y21.csv` 

答案1

或许:

awk 'sub("^0?0?A?B?","",$1) && $1=sprintf("00AB%05d",$1)'

删除00AB字段 1 中的所有前导片段,然后将其转换为00AB后跟用零填充的其余数字,最长长度为 5。

该表达式始终为 true,因此{ print }会触发隐式操作。总是sub正确的,因为正则表达式可以为空:有点偷偷摸摸!即使^0?0?A?B?匹配空字符串,替换也会发生,因为这是成功的匹配。

相关内容