Simple command-line plain-text spam or ham classifier

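I have a large number of saved database entries that are full of spam. I'd like to pipe the text of each entry into spamassassin or a similar tool and get a spam-likelihood score, without having to do the whole machine-learning-from-a-mailbox routine or even run it on a mail server. Everything I've found seems heavily geared toward email rather than a simple stdin > process > stdout kind of problem.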

Something written in a scripting language would be nice, but I'd prefer something that works on an out-of-the-box CentOS machine. Any help is appreciated.

Answer 1

It's interesting that you mention spamassassin, because it has a mode that seems to be exactly what you want (/tmp/spammy in this case contains a candidate email):

[me@lory tmp]$ spamassassin < /tmp/spammy 
Oct 20 11:54:47.097 [19986] warn: netset: cannot include 127.0.0.1/32 as it has already been included
From: "REDACTED" <redacted>
To: REDACTED
Subject: Pharmacy
Date: 20 Oct 2014 02:22:04 +0100
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on lory.teaparty.net
X-Spam-Flag: YES
X-Spam-Level: *********
X-Spam-Status: Yes, score=9.2 required=3.9 tests=BAYES_20,MISSING_MID,
        NO_RECEIVED,NO_RELAYS,TVD_SPACE_RATIO,URIBL_BLACK,URIBL_DBL_SPAM,
        URIBL_JP_SURBL,URIBL_SBL,URIBL_WS_SURBL autolearn=no version=3.3.1
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_5444E9FB.89EA3D9F"

This is a multi-part message in MIME format.

------------=_5444E9FB.89EA3D9F
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Spam detection software, running on the system "lory.teaparty.net", has
identified this incoming email as possible spam.  The original message
has been attached to this so you can view it (if it isn't spam) or label
similar future email.  If you have any questions, see
the administrator of that system for details.

Content preview:  Good medicines special http://canadiantabletstore.com/ [...]


Content analysis details:   (9.2 points, 3.9 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 2.5 URIBL_DBL_SPAM         Contains a spam URL listed in the DBL blocklist
                            [URIs: canadiantabletstore.com]
 1.7 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                            [URIs: canadiantabletstore.com]
 1.6 URIBL_WS_SURBL         Contains an URL listed in the WS SURBL blocklist
                            [URIs: canadiantabletstore.com]
 1.2 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                            [URIs: canadiantabletstore.com]
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 1.6 URIBL_SBL              Contains an URL's NS IP listed in the SBL blocklist
                            [URIs: canadiantabletstore.com]
-0.0 BAYES_20               BODY: Bayes spam probability is 5 to 20%
                            [score: 0.1750]
 0.5 MISSING_MID            Missing Message-Id: header
-0.0 NO_RECEIVED            Informational: message has no Received headers
 0.0 TVD_SPACE_RATIO        TVD_SPACE_RATIO



------------=_5444E9FB.89EA3D9F
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Date: 20 Oct 2014 02:22:04 +0100
From: "REDACTED" <REDACTED>
To: REDACTED
Subject: Pharmacy

Good medicines special
http://canadiantabletstore.com/


------------=_5444E9FB.89EA3D9F--
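
To run this over many database entries, the same stdin-to-stdout invocation can be scripted. The following is a minimal Python sketch, not a definitive implementation. It assumes the spamassassin binary is on the PATH and that each entry is bare text, so it wraps the text in placeholder headers before piping it in. The function names (wrap_as_message, parse_status, score_text) and the example.invalid addresses are hypothetical. Because the headers are synthetic, header-based rules such as MISSING_MID in the output above may still fire and nudge the score.

```python
import re
import subprocess
from email.message import EmailMessage

# Matches a header like the one in the output above:
#   X-Spam-Status: Yes, score=9.2 required=3.9 tests=...
STATUS_RE = re.compile(
    r"^X-Spam-Status:\s*(Yes|No),\s*"
    r"score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)",
    re.MULTILINE,
)


def wrap_as_message(text, subject="(none)"):
    """Wrap a bare text entry in minimal mail headers (placeholder values)."""
    msg = EmailMessage()
    msg["From"] = "unknown@example.invalid"   # hypothetical placeholder
    msg["To"] = "unknown@example.invalid"     # hypothetical placeholder
    msg["Subject"] = subject
    msg.set_content(text)
    return msg.as_bytes()


def parse_status(output):
    """Return (is_spam, score, required) from SpamAssassin output, or None."""
    match = STATUS_RE.search(output)
    if match is None:
        return None
    return match.group(1) == "Yes", float(match.group(2)), float(match.group(3))


def score_text(text):
    """Pipe one entry through `spamassassin` on stdin and parse its verdict."""
    result = subprocess.run(
        ["spamassassin"],            # assumed to be installed and on PATH
        input=wrap_as_message(text),
        stdout=subprocess.PIPE,
        check=True,
    )
    return parse_status(result.stdout.decode("utf-8", errors="replace"))
```

Calling score_text() on the text of one entry would return a tuple of (is_spam, score, required); the actual numbers depend on the local rule set, Bayes database, and network checks.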
