Separating run-on text in a script

I have the following CSV input:

XiaoLi,6705462234,[email protected],NC764
NatkinPook,8044344528,[email protected],VA22345
EliziMoe,5208534566,[email protected],AZ85282
MaTa,4345667345,[email protected],TX91030
DianaCheng,5203456789,[email protected],WY4587
JacksonFive,5206564573,[email protected],AZ85483
AdiSrikanthReddy,6578904566,[email protected],WS67854

I want it to output the following:

Xiao Li 6705462234 [email protected] NC 764
Natkin Pook 8044344528 [email protected] VA 22345
Elizi Moe 5208534566 [email protected] AZ 85282
Ma Ta 4345667345 [email protected] TX 91030
Diana Cheng 5203456789 [email protected] WY 4587
Jackson Five 5206564573 [email protected] AZ 85483
Adi SrikanthReddy 6578904566 [email protected] WS 67854

(FirstName LastName PhoneNumber UserID@Email State Zip)

This is what I have so far:

 awk -F "," ' {print $1, $4, $3, $6}' data3

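I can't separate the first and last names from each other, and the state and zip code also run together. How can I split these two cases?

I'd like to use awk. Is there a way to use something like [A-Z] to split them at the capital letters?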

Answer 1

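I see that user Steeldriver's answer has already been accepted, but I'd like to offer an option that I think is shorter, simpler, and easier to read. At the very least, it shows some other features of awk (and the OP can always change his/her mind):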

awk '
  { gsub(","," ")
    $0=gensub("([[:upper:]])([[:digit:]])","\\1 \\2","g")
    $0=gensub("([[:lower:]])([[:upper:]])","\\1 \\2","g")
    print
  }' file.csv
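
Note that gensub is a GNU awk extension, so the script above needs gawk. It also rewrites every boundary on the whole line, so AdiSrikanthReddy comes out as Adi Srikanth Reddy rather than the Adi SrikanthReddy shown in the question. For awks without gensub, a rough equivalent can be sketched with a loop over match and substr. The names split_all and space_all are made up for illustration, the sample e-mail address is a placeholder, and [a-z], [A-Z], [0-9] stand in for the POSIX classes on the assumption that the data is plain ASCII:

```shell
# Sketch only: emulates the whole-line "g" substitutions without gensub.
split_all() {
  awk '
    # Insert a space after the first character of every match of re in s.
    function space_all(s, re,    out, i) {
      out = ""
      while ((i = match(s, re)) > 0) {
        out = out substr(s, 1, i) " "   # keep text up to the boundary, add a space
        s = substr(s, i + 1)            # continue scanning after the boundary
      }
      return out s
    }
    {
      gsub(",", " ")                    # commas become spaces, as in the answer
      $0 = space_all($0, "[A-Z][0-9]")  # uppercase-to-digit boundaries
      $0 = space_all($0, "[a-z][A-Z]")  # lowercase-to-uppercase boundaries
      print
    }' "$@"
}

# Hypothetical sample line (the e-mail address is a placeholder)
printf '%s\n' 'AdiSrikanthReddy,6578904566,asr@example.com,WS67854' | split_all
# prints: Adi Srikanth Reddy 6578904566 asr@example.com WS 67854
```

Because the substitutions run over the whole record, an e-mail address containing a lowercase letter followed by a capital would be split as well.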

Answer 2

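At least with gawk (GNU awk) and mawk, you can use the match function to find the index of the lowercase-to-uppercase or uppercase-to-digit transition, then use substr to cut the string there and rejoin the two pieces with a space: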

awk -F, '
  {c = match($1,/[a-z][A-Z]/)} 
  c>0 {$1 = sprintf("%s %s", substr($1,1,c), substr($1,c+1))}
  {c = match($4,/[A-Z][0-9]/)} 
  c>0 {$4 = sprintf("%s %s", substr($4,1,c), substr($4,c+1))}
  1' file.csv
Xiao Li 6705462234 [email protected] NC 764
Natkin Pook 8044344528 [email protected] VA 22345
Elizi Moe 5208534566 [email protected] AZ 85282
Ma Ta 4345667345 [email protected] TX 91030
Diana Cheng 5203456789 [email protected] WY 4587
Jackson Five 5206564573 [email protected] AZ 85483
Adi SrikanthReddy 6578904566 [email protected] WS 67854
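
To see why substr($1,1,c) and substr($1,c+1) give the two halves: match returns the 1-based position where the match starts (0 if there is none), and the match starts on the last character of the left-hand part. Assigning to $1 or $4 also makes awk rebuild the record with the output field separator, a space by default, which is why the commas disappear. A small sketch that prints the index and the resulting pieces (show_boundary is a made-up helper name, not part of the answer):

```shell
# Sketch only: show what match() reports for a few sample fields.
show_boundary() {
  awk '{
    # c is the 1-based start of the first boundary, or 0 when there is none
    c = match($0, /[a-z][A-Z]|[A-Z][0-9]/)
    if (c > 0)
      print $0 ": c=" c " -> " substr($0, 1, c) " | " substr($0, c + 1)
    else
      print $0 ": c=0 (no boundary)"
  }'
}

printf '%s\n' XiaoLi NC764 lowercase | show_boundary
# prints:
#   XiaoLi: c=4 -> Xiao | Li
#   NC764: c=2 -> NC | 764
#   lowercase: c=0 (no boundary)
```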

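If your $4 really does contain US ZIP codes, then as far as I know the format is fixed (a two-letter state code followed by the digits), so you can skip the second match and do: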

awk -F, '
  {c = match($1,/[a-z][A-Z]/)}
  c>0 {$1 = sprintf("%s %s", substr($1,1,c), substr($1,c+1))}
  {$4 = sprintf("%s %s", substr($4,1,2), substr($4,3))}
  1' file.csv
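
The fixed-width split cuts after the second character no matter what the field contains. If some rows might not follow the pattern, one possible variation is to check the field's shape first. This is only a sketch: split_state_zip is a made-up name, the sample e-mail addresses are placeholders, the guard regex is an assumption about what a valid field looks like, and only field 4 is handled here (the name is left alone):

```shell
# Sketch only: split field 4 at a fixed offset, but only when it looks
# like two capital letters followed by digits.
split_state_zip() {
  awk -F, '
    { $1 = $1 }   # force the record to be rebuilt with spaces between fields
    $4 ~ /^[A-Z][A-Z][0-9]+$/ { $4 = substr($4, 1, 2) " " substr($4, 3) }
    1' "$@"
}

printf '%s\n' 'MaTa,4345667345,mt@example.com,TX91030' | split_state_zip
# prints: MaTa 4345667345 mt@example.com TX 91030
```

Rows whose fourth field does not match the guard are printed with that field unchanged.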

If you have a regex engine that supports zero-length assertions, it gets a little neater, for example with Perl:

perl -F, -ane '
  print join " ", map { s/(?<=[[:lower:]])(?=[[:upper:]])|(?<=[[:upper:]])(?=[[:digit:]])/ /; $_ } @F
' file.csv
Xiao Li 6705462234 [email protected] NC 764
Natkin Pook 8044344528 [email protected] VA 22345
Elizi Moe 5208534566 [email protected] AZ 85282
Ma Ta 4345667345 [email protected] TX 91030
Diana Cheng 5203456789 [email protected] WY 4587
Jackson Five 5206564573 [email protected] AZ 85483
Adi SrikanthReddy 6578904566 [email protected] WS 67854
