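The problem is that one column of a CSV file contains one or more email addresses. The output needs one row per email address. How do I loop over the email addresses so that the row is repeated for each one? I assume I want to use a regular expression to find all the email addresses in the second column and then loop over that array, but how do I get all the emails into an array?
Here is a simple awk script: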
BEGIN {
FPAT = "([^,]*)|(\"[^\"]+\")"
OFS=","
}
{
name=substr($1,2,length($1)-2)
email=substr($2,2,length($2)-2)
print name, email
}
Input:
"agrippa","[email protected]"
"elvirka","[email protected]"
"Inofs","[email protected];[email protected]"
"bekbz","[email protected],[email protected]"
"njkzif","[email protected]|[email protected]"
"njycz","[email protected]:[email protected]"
"DanielEdict","[email protected]"
"JosEmbesy","[email protected] , [email protected]"
"Walterdon","[email protected] ; [email protected]"
"Kennethlob","[email protected]"
"Ninosh","[email protected]"
"Patrickbam","[email protected]"
Desired output:
agrippa,[email protected]
elvirka,[email protected]
Inofs,[email protected]
Inofs,[email protected]
bekbz,[email protected]
bekbz,[email protected]
njkzif,[email protected]
njkzif,[email protected]
njycz,[email protected]
njycz,[email protected]
DanielEdict,[email protected]
JosEmbesy,[email protected]
JosEmbesy,[email protected]
Walterdon,[email protected]
Walterdon,[email protected]
Kennethlob,[email protected]
Ninosh,[email protected]
Patrickbam,[email protected]
More about the real data: it has more than two columns. Here is the header of the real input data:
"Created","first name","last name","address1","address2","city","state","zip","country","phone Office","phone Cell","phone Home","company Name","webSite","email","NoEmail","License Type","Issued Date","License Expires"
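The output is not just two columns either. The email is not currently the last column in the output, but it can be if needed.
Another thing about the input data: it is a CSV file in which every value is quoted, except that where there is no data there are no quotes. FPAT seems to handle this well, except that every column keeps its surrounding quotes, so I use substr() to strip them.
Here is a sample of the real input: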
"9/1/2019","Can","Back","77 High Drive","","Chicago","IL","45099","USA","555-555-8521",,,"company name","http://www.yourcomapny.co.uk/","[email protected],[email protected]","","foobar","9/1/2019","9/1/2020"
Answer 1
Using GNU awk for FPAT (and, since we already need gawk, also using gensub() and \s as shorthand for [[:space:]]):
$ cat tst.awk
BEGIN {
FPAT = "([^,]*)|(\"[^\"]+\")"
OFS=","
}
{
name = gensub(/^"|"$/,"","g",$1)
n = split(gensub(/^"|"$/,"","g",$2),emails,/\s*[;,|:]\s*/)
for (i=1; i<=n; i++) {
print name, emails[i]
}
}
$
$ awk -f tst.awk file
agrippa,[email protected]
elvirka,[email protected]
Inofs,[email protected]
Inofs,[email protected]
bekbz,[email protected]
bekbz,[email protected]
njkzif,[email protected]
njkzif,[email protected]
njycz,[email protected]
njycz,[email protected]
DanielEdict,[email protected]
JosEmbesy,[email protected]
JosEmbesy,[email protected]
Walterdon,[email protected]
Walterdon,[email protected]
Kennethlob,[email protected]
Ninosh,[email protected]
Patrickbam,[email protected]
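FWIW I generally use the *sub(/^"|"$/,"",...) approach to remove possible leading/trailing double quotes from CSV fields, because it has one advantage over the substr() approach: it does not corrupt the field when the double quotes are absent.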
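To see the difference, here is a minimal sketch that runs both approaches on one quoted and one unquoted value. The sample values and the variable names viaSubstr and viaSub are made up for illustration:

```shell
# Compare the two quote-stripping approaches on a quoted and an unquoted value.
out=$(printf '%s\n' '"Chicago"' 'Chicago' | awk '
{
    viaSubstr = substr($0, 2, length($0) - 2)   # always drops the first and last character
    viaSub = $0
    gsub(/^"|"$/, "", viaSub)                   # removes a quote only where one is present
    print viaSubstr, viaSub
}')
printf '%s\n' "$out"
```

The second input line has no quotes, so substr() eats the first and last letters of the value while gsub() leaves it alone.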
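You may also want to add some error detection in case an email address is corrupt or there is a case you forgot to handle (e.g. a separator that is missing from [;,|:]):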
$ cat tst.awk
BEGIN {
FPAT = "([^,]*)|(\"[^\"]+\")"
OFS=","
}
{
name = gensub(/^"|"$/,"","g",$1)
n = split(gensub(/^"|"$/,"","g",$2),emails,/\s*[;,|:]\s*/)
for (i=1; i<=n; i++) {
email = emails[i]
if ( gsub(/@/,"&",email) != 1 ) {
printf "ERROR: too few or too many email addresses in \"%s\"\n", email | "cat>&2"
exit 1
}
print name, email
}
}
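If you really want to validate the email addresses then, FWIW, I have been using this modified version of the regexp from http://www.regular-expressions.info/email.html for the past 5 years or so with no problems that I know of (I specifically use [a-zA-Z] rather than [:alpha:] because I only want to accept the letters that are considered such in my locale; you decide what makes sense for your application):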
(email ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+\.[a-zA-Z]{2,}$/)
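As a rough illustration of how that test might be applied, the sketch below labels a few made-up example.com addresses as valid or invalid. One assumption to note: {2,} is written out as [a-zA-Z][a-zA-Z]+ so that the example does not depend on the awk in use supporting interval expressions; with gawk the original form should behave the same.

```shell
# Label each input line as valid or invalid using the regexp above.
# The addresses are placeholders, not taken from the question.
out=$(printf '%s\n' 'user@example.com' 'user@example' 'a@b@example.com' | awk '
{
    # [a-zA-Z][a-zA-Z]+ stands in for [a-zA-Z]{2,}
    ok = ($0 ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+\.[a-zA-Z][a-zA-Z]+$/)
    print $0, (ok ? "valid" : "invalid")
}')
printf '%s\n' "$out"
```

The second address fails because it has no dot followed by at least two letters, and the third because neither side of the pattern allows a second @.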
Answer 2
I am not sure I understand your parenthetical comment about 15+ and 7 columns, but for the sample given, try
awk -F, '
{gsub (/[" ]/,_) # remove double quotes and space all over
D1 = $1 # save field 1 and
sub ($1 FS, _) # remove it from line
n = split ($0, T, /[,;:\|]/) # split the residual line into array T
for (i=1; i<=n; i++) print D1, T[i] # print former $1, and each T element
}
' OFS=, file
agrippa,[email protected]
elvirka,[email protected]
Inofs,[email protected]
Inofs,[email protected]
.
.
.
Patrickbam,[email protected]
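For the wider real input described in the question (email in column 15 of 19), the same split-and-loop idea can be applied to one column while the rest of the row is repeated. The following is only a sketch under stated assumptions: splitcsv() and the variables col, f and rec are hypothetical names, the sample row is a shortened made-up one with the emails in column 4, quoted fields are assumed to contain no embedded double quotes, and fields are parsed with match() rather than FPAT so that the sketch does not depend on gawk. Output fields are written unquoted, as in the desired output in the question.

```shell
# Shortened, made-up row: 5 columns, emails in column 4, one empty unquoted column.
row='"9/1/2019","Can",,"a@example.com , b@example.com","foobar"'

out=$(printf '%s\n' "$row" | awk -v col=4 '
# Split one CSV line into f[1..n], stripping surrounding quotes; returns n.
function splitcsv(line, f,    n) {
    split("", f)                           # clear any fields left from the previous line
    n = 0
    line = line ","                        # sentinel so that every field ends in a comma
    while (line != "") {
        if (match(line, /^"[^"]*",/))      # quoted field, may contain commas
            f[++n] = substr(line, 2, RLENGTH - 3)
        else {
            match(line, /^[^,]*,/)         # unquoted or empty field
            f[++n] = substr(line, 1, RLENGTH - 1)
        }
        line = substr(line, RLENGTH + 1)
    }
    return n
}
{
    nf = splitcsv($0, f)
    ne = split(f[col], emails, /[ \t]*[;,|:][ \t]*/)
    for (i = 1; i <= ne; i++) {            # one output row per email address
        rec = ""
        for (j = 1; j <= nf; j++)
            rec = rec (j > 1 ? "," : "") (j == col ? emails[i] : f[j])
        print rec
    }
}')
printf '%s\n' "$out"
```

With the real header, col would be 15. A row whose email column is empty produces no output in this sketch, so such rows would need separate handling if they must be kept.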