How can I fix this huge email dataset?

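I have a very large dataset that is supposed to consist of email addresses. However, it contains a large number of invalid addresses that need to be removed from the file entirely.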

Here are some examples:

89 is @msn .com
[email protected]
89%@yahoo.com
89%[email protected]
89&#39:[email protected]
89'[email protected]
89'[email protected]
89&[email protected]
89+475asdjkl:[email protected]
89+475asdjkl;[email protected]
[email protected]

Is there a simple way to delete the lines containing invalid email addresses from the file?

Answer 1

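Edit: As @ivanivan pointed out, this regex can be used directly with grep, without writing any script. Note the -E option: the pattern relies on extended regex operators (+, ?, and grouping with parentheses), which grep would otherwise treat as literal characters:

grep -E "^[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?\$" my_email_list.txt >> my_valid_emails.txt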
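The quoting in that one-liner is easy to get wrong, since the pattern contains a single quote, a backtick, $ and !. One way around it, sketched here as an assumption rather than as part of the original answer, is to store the pattern in a file and pass it with grep's -f option. The file names (email.re, sample_list.txt, sample_valid.txt, sample_invalid.txt) and the sample addresses are made up for illustration.

```shell
#!/bin/bash
# Sketch: keep the regex in a file so the shell never interprets its special characters.
# All file names and sample addresses below are hypothetical.

# The quoted delimiter ('EOF') disables expansion inside the heredoc,
# so the pattern is written exactly as shown, with no backslash escaping.
cat > email.re <<'EOF'
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?$
EOF

# A small made-up list to try it on
printf '%s\n' 'alice@example.com' '89 is @msn .com' '89%@yahoo.com' 'bob:x@example.com' > sample_list.txt

# -E: extended regex syntax, -f: read the pattern from a file
grep -E -f email.re sample_list.txt > sample_valid.txt

# -v inverts the match, which gives the rejected lines for review
grep -E -v -f email.re sample_list.txt > sample_invalid.txt
```

Keeping the rejected lines in their own file makes it possible to spot-check what the pattern throws away before discarding anything.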

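A simple script can also take care of this for you. As @ilkkachu and @Mark Plotnick noted in the comments above, some of these examples are perfectly valid email addresses.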

email_validate.sh:

#!/bin/bash

# email regex check
email_valid="^[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?\$"

# read the file line by line; IFS= and -r keep each line exactly as written
while IFS= read -r line; do
    # check the line against the regex above
    if [[ $line =~ $email_valid ]]; then
        echo "$line is valid"
    else
        echo "$line is invalid"
    fi
done < my_email_list.txt

Sample output:

┌─[root@Fedora]─[~]─[03:27 pm]
└─[$]› ./email_validate.sh
89 is @msn .com is invalid
[email protected] is valid
89%@yahoo.com is valid
89%[email protected] is valid
89&#39:[email protected] is invalid
89'[email protected] is invalid
89'[email protected] is invalid
89&[email protected] is valid
89+475asdjkl:[email protected] is invalid
89+475asdjkl;[email protected] is invalid
[email protected] is valid

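If you need to delete them from the file as the script runs, you can add a sed '/$line/d' call to the if statement (as written, the single quotes stop $line from being expanded, so it would need double quotes, the -i option, and the file name to have any effect). Personally, though, I recommend moving the valid emails to a new file in case you need to refer back to the old one: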

    if [[ $line =~ $email_valid ]]; then
        echo "$line is valid"
        echo "$line" >> my_valid_emails.txt
    else
        echo "$line is invalid - deleting"
    fi

This produces something like the following:

┌─[root@Fedora]─[~]─[03:34 pm]
└─[$]› cat my_valid_emails.txt
[email protected]
89%@yahoo.com
89%[email protected]
89&[email protected]
[email protected]
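If the original file really does have to be cleaned in place, as the sed suggestion intends, a more cautious route is to filter into a temporary file, keep a backup, and only then overwrite the list. This is a minimal sketch under that assumption, not the answerer's method: the function name clean_in_place, the .bak suffix, and the sample file are all hypothetical.

```shell
#!/bin/bash
# Sketch: remove invalid lines from the list itself while keeping a backup copy.
# clean_in_place, the .bak suffix and sample_list.txt are hypothetical names.

# Same regex as in email_validate.sh
email_valid="^[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+(\.[a-z0-9!#\$%&'*+/=?^_\`{|}~-]+)*@([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?\$"

clean_in_place() {
    local file=$1 tmp line
    tmp=$(mktemp) || return 1

    # Copy only the matching lines to the temporary file.
    # "|| [[ -n $line ]]" also handles a last line that lacks a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $email_valid ]]; then
            printf '%s\n' "$line"
        fi
    done < "$file" > "$tmp"

    cp -- "$file" "$file.bak" || return 1   # keep the original for reference
    cat -- "$tmp" > "$file"                 # overwrite the list with the valid lines only
    rm -f -- "$tmp"
}

# Demo on a small made-up list
printf '%s\n' 'alice@example.com' '89 is @msn .com' '89%@yahoo.com' > sample_list.txt
clean_in_place sample_list.txt
```

Unlike calling sed once per invalid line, this reads the list a single time and never treats an address as a regex, so characters such as / or * in the data cannot break the command.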
