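I have a large number of saved database entries that are all spam. I would like to pipe the text of each entry into spamassassin or a similar tool to get a spam-likelihood score, but without the whole machine-learning-from-a-mailbox setup, and without needing to run on a mail server. Everything I have found is heavily biased towards e-mail rather than a simple stdin > process > stdout workflow.
Something written in a scripting language would be great, but I would prefer something that works on an out-of-the-box centos machine. Any help is appreciated.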
Answer 1
Interesting that you mention spamassassin, because it has a mode that seems to be exactly what you want (in this case /tmp/spammy contains a candidate email):
[me@lory tmp]$ spamassassin < /tmp/spammy
Oct 20 11:54:47.097 [19986] warn: netset: cannot include 127.0.0.1/32 as it has already been included
From: "REDACTED" <redacted>
To: REDACTED
Subject: Pharmacy
Date: 20 Oct 2014 02:22:04 +0100
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on lory.teaparty.net
X-Spam-Flag: YES
X-Spam-Level: *********
X-Spam-Status: Yes, score=9.2 required=3.9 tests=BAYES_20,MISSING_MID,
NO_RECEIVED,NO_RELAYS,TVD_SPACE_RATIO,URIBL_BLACK,URIBL_DBL_SPAM,
URIBL_JP_SURBL,URIBL_SBL,URIBL_WS_SURBL autolearn=no version=3.3.1
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_5444E9FB.89EA3D9F"
This is a multi-part message in MIME format.
------------=_5444E9FB.89EA3D9F
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Spam detection software, running on the system "lory.teaparty.net", has
identified this incoming email as possible spam. The original message
has been attached to this so you can view it (if it isn't spam) or label
similar future email. If you have any questions, see
the administrator of that system for details.
Content preview: Good medicines special http://canadiantabletstore.com/ [...]
Content analysis details: (9.2 points, 3.9 required)
pts rule name description
---- ---------------------- --------------------------------------------------
2.5 URIBL_DBL_SPAM Contains a spam URL listed in the DBL blocklist
[URIs: canadiantabletstore.com]
1.7 URIBL_BLACK Contains an URL listed in the URIBL blacklist
[URIs: canadiantabletstore.com]
1.6 URIBL_WS_SURBL Contains an URL listed in the WS SURBL blocklist
[URIs: canadiantabletstore.com]
1.2 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist
[URIs: canadiantabletstore.com]
-0.0 NO_RELAYS Informational: message was not relayed via SMTP
1.6 URIBL_SBL Contains an URL's NS IP listed in the SBL blocklist
[URIs: canadiantabletstore.com]
-0.0 BAYES_20 BODY: Bayes spam probability is 5 to 20%
[score: 0.1750]
0.5 MISSING_MID Missing Message-Id: header
-0.0 NO_RECEIVED Informational: message has no Received headers
0.0 TVD_SPACE_RATIO TVD_SPACE_RATIO
------------=_5444E9FB.89EA3D9F
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Date: 20 Oct 2014 02:22:04 +0100
From: "REDACTED" <REDACTED>
To: REDACTED
Subject: Pharmacy
Good medicines special
http://canadiantabletstore.com/
------------=_5444E9FB.89EA3D9F--
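To apply the same stdin > process > stdout approach to database entries rather than a saved email, one option is a small wrapper script. The following is only a minimal sketch, assuming Python 3 is available and that the `spamassassin` command is on the PATH as in the transcript above. The helper names (`wrap_as_email`, `parse_status`, `score_text`) and the placeholder addresses are hypothetical, chosen for illustration. It wraps each entry in minimal headers so the input looks like an email, pipes it through `spamassassin`, and reads the verdict back from the `X-Spam-Status` header.

```python
import re
import subprocess
from email.message import EmailMessage

# Matches the header shown in the transcript, e.g.
# "X-Spam-Status: Yes, score=9.2 required=3.9 tests=..."
STATUS_RE = re.compile(
    r"^X-Spam-Status:\s*(Yes|No),\s*score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)",
    re.MULTILINE,
)


def wrap_as_email(text, subject="database entry"):
    """Wrap plain text in minimal headers (placeholder addresses are assumptions)."""
    msg = EmailMessage()
    msg["From"] = "entry@example.invalid"
    msg["To"] = "checker@example.invalid"
    msg["Subject"] = subject
    msg.set_content(text)
    return msg.as_string()


def parse_status(output):
    """Return (is_spam, score, required) from spamassassin output, or None if absent."""
    match = STATUS_RE.search(output)
    if match is None:
        return None
    return match.group(1) == "Yes", float(match.group(2)), float(match.group(3))


def score_text(text):
    """Pipe one entry through the spamassassin command and parse the result."""
    result = subprocess.run(
        ["spamassassin"],
        input=wrap_as_email(text),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_status(result.stdout)
```

Calling `score_text(entry_text)` for each row would return a tuple such as `(is_spam, score, required)`. Because the entries are not real emails, rules like MISSING_MID and NO_RECEIVED in the transcript will fire on every entry, so the scores are probably more useful for ranking entries against each other than for comparing against the `required` threshold.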