Removing duplicate emails (.eml)

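I have a folder with about 50,000 emails in .eml format. Many of them are duplicates, some even triplicates or quadruplicates; I estimate about 30,000 in total. I tried the Mozilla Thunderbird add-on Remove Duplicate Messages (Alternate), but it removed only a small portion (a few hundred). I then tried Windows desktop applications such as Wise duplicate finder, duplicate cleaner free, AllDup, Fast Duplicate finder and Anti-Twin, using byte-by-byte comparison (60% comparison), but none of them found the right duplicates (again only a portion was removed, this time a few thousand).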

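Below are two example emails. Their source differs slightly (and so do their file names), but they are essentially the same message: they were sent at the same time from the same email address, and the files are the same size: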

First email - message-1-34437.eml

Received: from e11mailgw02.isp.com ([212.200.12.195]) by mtain3.isp.com (Sun Java(tm) System Messaging Server 6.3-4.01 (built Aug  3 2007; 32bit)) with ESMTP id <[email protected]> for user@com; Tue, 02 Jun 2009 22:53:58 +0200 (CEST)
Received: from unknown (HELO vps.mafiascene.com) ([69.73.156.173]) by e11mailgw02.isp.com with ESMTP; Tue, 02 Jun 2009 22:53:57 +0200
Received: (qmail 24030 invoked by uid 48); Tue, 02 Jun 2009 16:53:51 -0400
Date: Tue, 02 Jun 2009 16:53:51 -0400
From: "Mafia Scene" <[email protected]>
Subject: Mafia Scene Registration Confirmation
To: <user@com>
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
Message-ID: <[email protected]>
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Au0JAFEuJUpFSZyt/2dsb2JhbACOFhEBsRIRCAMEj2iCMR4IBAwEgSAF
X-IronPort-AV: E=McAfee;i="5300,2777,5634"; a="7766158"
X-MimeOLE: Produced By Microsoft MimeOLE V14.0.8089.726
Old-X-EsetId: 4FAA1F2928B4776950AC1F7F23E634
X-EsetId: 745B6128E6F033696B5D617DE9A773
X-EsetScannerBuild: 6455


Thank you for registering with Mafia Scene!



The details you registered your account with at 4:53pm EDT Tuesday - 2nd June 2009 are as follows:

Username: username 
Password: password

To active your account you MUST visit the following link WITHIN the next 24 HOURS.

http://mafiascene.com/modules.php?name=users&action=activate&id=c284c0e0a7a7aec0772709511b2b8f3e

Regards,

The Mafia Scene Staff


__________ Information from ESET NOD32 Antivirus, version of virus signature database 4124 (20090602) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com





__________ Information from ESET NOD32 Antivirus, version of virus signature database 4801 (20100124) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com

Second email - message-1-54557.eml

Received: from e11mailgw02.com ([212.200.12.195])
 by mtain3.isp.com
 (Sun Java(tm) System Messaging Server 6.3-4.01 (built Aug  3 2007; 32bit))
 with ESMTP id <[email protected]> for
 user@com; Tue, 02 Jun 2009 22:53:58 +0200 (CEST)
Received: from unknown (HELO vps.mafiascene.com) ([69.73.156.173])
 by e11mailgw02.com with ESMTP; Tue, 02 Jun 2009 22:53:57 +0200
Received: (qmail 24030 invoked by uid 48); Tue, 02 Jun 2009 16:53:51 -0400
Date: Tue, 02 Jun 2009 16:53:51 -0400
From: Mafia Scene <[email protected]>
Subject: Mafia Scene Registration Confirmation
To: user@com
Message-id: <[email protected]>
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result:
 Au0JAFEuJUpFSZyt/2dsb2JhbACOFhEBsRIRCAMEj2iCMR4IBAwEgSAF
X-IronPort-AV: E=McAfee;i="5300,2777,5634"; a="7766158"
X-EsetId: 4FAA1F2928B4776950AC1F7F23E634


Thank you for registering with Mafia Scene!



The details you registered your account with at 4:53pm EDT Tuesday - 2nd June 2009 are as follows:

Username: username
Password: password

To active your account you MUST visit the following link WITHIN the next 24 HOURS.

http://mafiascene.com/modules.php?name=users&action=activate&id=c284c0e0a7a7aec0772709511b2b8f3e

Regards,

The Mafia Scene Staff


__________ Information from ESET NOD32 Antivirus, version of virus signature database 4124 (20090602) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com

Is there any way to detect emails like these as duplicates?

Answer 1

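The headers are quite different, and the content differs as well (the first message has a second antivirus footer that the other lacks). Common duplicate finders compare files byte by byte, so they cannot recognize these messages as duplicates.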

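You will have to come up with something of your own. For example, you could write a script that extracts the information that matters to you, flags suspected duplicates, and then applies further checks to confirm that they really are duplicates. This will probably involve some manual work.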
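As a rough illustration of the script idea, here is a minimal Python sketch that groups files by a few headers that look stable across both sample messages. The choice of key (Message-ID, Date, sender address, Subject) is an assumption drawn from those two samples, and the names header_key and find_suspects are made up for this example. Groups it reports are only suspects and still need checking.

```python
from collections import defaultdict
from email import message_from_bytes
from email.utils import parseaddr
from pathlib import Path


def header_key(raw: bytes) -> tuple:
    """Build a comparison key from headers assumed to survive re-saving."""
    msg = message_from_bytes(raw)

    def clean(name: str) -> str:
        # Header lookup is case-insensitive ("Message-ID" vs "Message-id").
        # Collapsing whitespace undoes header folding across lines.
        return " ".join(str(msg.get(name, "")).split())

    # Keep only the address, so '"Mafia Scene" <a@b>' equals 'Mafia Scene <a@b>'.
    sender = parseaddr(clean("From"))[1].lower()
    return (clean("Message-ID"), clean("Date"), sender, clean("Subject"))


def find_suspects(folder: str) -> list:
    """Return lists of file names that share the same header key."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.eml")):
        groups[header_key(path.read_bytes())].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Calling find_suspects on the folder that holds the .eml files returns one list of file names per suspected group; nothing is deleted, so each group can be reviewed before any copy is removed.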

An easier first step might be to cut off the headers and compare only what remains.
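Cutting off the headers and comparing the rest could look roughly like the Python sketch below. It hashes only the message body after normalizing whitespace. It also drops everything from the first ESET NOD32 footer onward, which is an assumption based on the two samples (one carries an extra footer and a trailing space after the username); the name body_fingerprint is made up for this example.

```python
import hashlib
import re
from email import message_from_bytes

# Assumption from the samples: antivirus footers are noise added after delivery.
FOOTER = re.compile(r"_{5,} Information from ESET NOD32 Antivirus")


def body_text(msg) -> str:
    """Concatenate the decoded payload of every non-multipart part."""
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        data = part.get_payload(decode=True) or b""
        parts.append(data.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def body_fingerprint(raw: bytes) -> str:
    """Hash the body only, ignoring headers, footers and spacing differences."""
    text = body_text(message_from_bytes(raw))
    match = FOOTER.search(text)
    if match:
        text = text[:match.start()]  # cut at the first antivirus footer
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Files whose fingerprints match have the same visible text, but identical bodies can still belong to different messages (for example, the same notification sent on different dates), so it is worth checking the Date or Message-ID headers before deleting anything.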
