SpamProbe 1.4b review

License: GPL (GNU General Public License)
Developer: Brian Burton

SpamProbe is a spam filter that operates on a different basis from rule-based filters. Instead of using pattern matching and a set of human-generated rules, SpamProbe relies on a Bayesian analysis of the frequency of words used in the spam and non-spam emails received by an individual person.

The process is completely automatic and tailors itself to the kinds of emails that each person receives.
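
To make the idea concrete, here is a minimal Python sketch of word-frequency scoring in the Bayesian style described above. It is not SpamProbe's actual code or formula (SpamProbe itself is written in C++): the class name `WordStats`, the add-one smoothing, the equal-prior assumption, and the log-odds combination are all illustrative choices made for this example.

```python
import math
from collections import Counter


class WordStats:
    """Toy per-user word statistics (hypothetical, for illustration only)."""

    def __init__(self):
        self.spam = Counter()   # number of spam messages containing each word
        self.good = Counter()   # number of good messages containing each word
        self.n_spam = 0
        self.n_good = 0

    def train(self, words, is_spam):
        # Count each word at most once per message.
        if is_spam:
            self.spam.update(set(words))
            self.n_spam += 1
        else:
            self.good.update(set(words))
            self.n_good += 1

    def word_prob(self, word):
        # Smoothed estimate of P(spam | word), assuming equal priors.
        s = (self.spam[word] + 1) / (self.n_spam + 2)
        g = (self.good[word] + 1) / (self.n_good + 2)
        return s / (s + g)

    def score(self, words):
        # Combine per-word evidence in log-odds space, then map back to 0..1.
        log_odds = 0.0
        for w in set(words):
            p = self.word_prob(w)
            log_odds += math.log(p / (1.0 - p))
        if log_odds >= 0:
            return 1.0 / (1.0 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1.0 + e)


stats = WordStats()
stats.train(["buy", "pills", "now"], is_spam=True)
stats.train(["cheap", "pills"], is_spam=True)
stats.train(["meeting", "tomorrow"], is_spam=False)
stats.train(["lunch", "now"], is_spam=False)

spam_score = stats.score(["cheap", "pills"])
good_score = stats.score(["meeting", "lunch"])
```

Words never seen in training get a neutral probability of 0.5 here, so they do not move the score in either direction. Because the counts come from one person's own mail, the same word can carry different weight for different users.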

Here are some key features of "SpamProbe":
Spam detection using Bayesian analysis of terms contained in each email. Words used often in spams but not in good email tend to indicate that a message is spam. Generally over 90% effective at detecting spam once a few hundred spams have been classified.
Automatically learns from incoming mails as they are classified. Incorporates user's feedback to tailor classification to each user's personal tastes.
Works with procmail, maildrop, or a similar tool to produce a complete server-side or client-side spam filtering system.
Written in C++ for good performance. Database access using Peter Graf's PBL ISAM library or Berkeley DB for quick startup and fast term count retrieval.
Recognition and decoding of MIME attachments in quoted-printable and base64 encoding. Automatically skips non-text attachments. MIME decoding enables SpamProbe to make decisions based on words in the emails rather than base64 gobbledygook.
Counts two-word phrases as well as single words for higher precision.
Ignores HTML tags in emails for scoring purposes unless the -h command line option is used. Many spams use HTML and few humans do, so HTML tags tend to become a powerful indicator of spam. However, in the author's opinion, scoring them also substantially increases the likelihood of false positives if someone does send a non-spam email containing HTML tags. SpamProbe does, however, pull URLs from inside HTML tags, since those tend to be spammer-specific.
Locks mboxes and databases using fcntl file locking to avoid problems when multiple emails arrive simultaneously.
Scores only the Received, Subject, To, From, and Cc headers. All other headers are ignored to make it hard for spammers to hide non-spammy words in X- headers to fool the filter. The -H command line option can be used to override this.
Supports the Content-Length: field in mbox headers. This can be disabled using the -Y option, so that only From_ lines are used to recognize new messages.
Uses MD5 hash of emails to recognize reclassification of an already classified spam to avoid distortion of the word counts if emails are reclassified. This way emails can be kept in an mbox that is repeatedly scanned by spamprobe without counting them more than once.
Provides a date-stamp-based database cleanup command to remove terms from the database if their counts never rise above a certain threshold value (normally 2).
Provides an edit-term command allowing users to directly modify the counts of individual terms, for example to force a particular term to be considered spammy or good.
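
Two of the features above, counting two-word phrases and using an MD5 hash to avoid counting the same message twice, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not SpamProbe's implementation: the names `tokens_with_phrases` and `Trainer` are hypothetical, the tokenizer is deliberately simplistic, and the real program stores its counts in a database rather than in memory.

```python
import hashlib
import re


def tokens_with_phrases(text):
    """Return single words followed by adjacent two-word phrases."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    phrases = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + phrases


class Trainer:
    """Toy term counter that skips messages it has already seen."""

    def __init__(self):
        self.seen_digests = set()   # MD5 digests of messages already counted
        self.counts = {}            # term -> number of occurrences

    def train(self, raw_message: bytes) -> bool:
        # Identify the message by the MD5 digest of its raw bytes.
        digest = hashlib.md5(raw_message).hexdigest()
        if digest in self.seen_digests:
            return False            # already counted; leave counts unchanged
        self.seen_digests.add(digest)
        text = raw_message.decode("utf-8", errors="replace")
        for term in tokens_with_phrases(text):
            self.counts[term] = self.counts.get(term, 0) + 1
        return True


trainer = Trainer()
first = trainer.train(b"Buy cheap pills")
second = trainer.train(b"Buy cheap pills")   # same message scanned again
```

In this sketch the second call returns False and leaves the counts untouched, which is the property that lets an mbox be rescanned repeatedly without inflating the word counts.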

What's New in This Release:
This release fixes a pair of bugs related to email messages with no lines in their bodies.
Headers from email messages with no bodies were not being tokenized, and it was possible for a message with invalid MIME headers and no body to lead to a crash.
