Learning Ham and Spam with Rspamd

Rspamd does a pretty good job of catching spam. It does an even better job once it has been trained with a collection of spam and ham messages.

To do so, you can feed ham and spam to Rspamd using its command-line client, rspamc.

Simply point rspamc at the directories, most likely inside the mail storage on your mail server, where you keep a collection of spam or ham messages.

For learning spam:

rspamc learn_spam /directory/to/your/spam/collection

For learning ham:

rspamc learn_ham /directory/to/your/ham/collection

In my setup I have only virtual users under /var/mail/vusers/, so the typical spam directory of a user would be /var/mail/vusers/<username>/Maildir/junk/cur/
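With a layout like this, the spam folders of all users can be fed to rspamc in one go. The following is only a minimal sketch: the function name learn_all_junk and the MAIL_ROOT and RSPAMC variables are made up for this example, and the Maildir/junk/cur path is the one from my setup, so adjust it to your own mail storage.

```shell
#!/bin/sh
# Hypothetical helper, not part of rspamd: run "rspamc learn_spam" on
# every user's junk folder below MAIL_ROOT.
# MAIL_ROOT and the Maildir/junk/cur layout are assumptions taken from
# the setup described above.
MAIL_ROOT="${MAIL_ROOT:-/var/mail/vusers}"
RSPAMC="${RSPAMC:-rspamc}"

learn_all_junk() {
    for dir in "$MAIL_ROOT"/*/Maildir/junk/cur; do
        # Skip users without a junk folder (and the unexpanded glob
        # when nothing matches at all)
        [ -d "$dir" ] || continue
        "$RSPAMC" learn_spam "$dir"
    done
}
```

Calling learn_all_junk at the end of the script, for example from a nightly cron job, then trains on whatever the users have moved to their junk folders. Setting RSPAMC=echo beforehand prints the commands instead of running them, which is handy for a dry run.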

Learning spam and ham automatically

Rspamd has a feature to learn spam and ham automatically. In order to use it, you need to activate the autolearn section (a sample is provided in /etc/rspamd/statistic.conf). A good place to do that is by creating the file /etc/rspamd/local.d/classifier-bayes.conf.

It could look like this:

autolearn {
  spam_threshold = 6.0;
  junk_threshold = 4.0;
  ham_threshold = -0.5;
  check_balance = true;
  min_balance = 0.9;
  options {
    probability_check {
      spam_min = 0.92;
      ham_max = 0.08;
    }
    logging { enabled = false; }
  }
}

The spam, junk and ham thresholds define the scores at which rspamd processes a message automatically and adds it to its Bayesian knowledge.

What spam and ham are supposed to mean is pretty clear. Junk in this section is any mail that is above the defined threshold, had its subject rewritten or a spam header added, and has been flagged positively at least twice.

When check_balance is set to true, which is the default, rspamd stops learning either ham or spam when the two are no longer balanced, i.e. it stops learning spam when it has learned far more spam messages than ham messages, and vice versa. With min_balance you set the ratio (1/min_balance) at which rspamd stops learning. min_balance defaults to 0.9. With this default it stops learning one class when that class outgrows the other by a factor of 1/0.9 = 1.11, i.e. by about 11%.
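To make the arithmetic concrete, here is a small sketch of the balance check as described above. It is only my reading of that description, not rspamd's actual implementation, which may differ in details such as how it treats a class with no learned messages yet. The function name balance_allows_learning is made up for this example.

```shell
#!/bin/sh
# Illustrative only: mirrors the description above, not rspamd's source.
# Usage: balance_allows_learning <learned_this_class> <learned_other_class> [min_balance]
# Succeeds (exit 0) while this class has not outgrown the other one
# by more than a factor of 1/min_balance.
balance_allows_learning() {
    awk -v mine="$1" -v other="$2" -v min_balance="${3:-0.9}" 'BEGIN {
        # mine / other > 1 / min_balance, rewritten without a division
        # so that a zero count for the other class needs no special case
        if (mine * min_balance > other) exit 1
        exit 0
    }'
}
```

With the default of 0.9 and 100 learned ham messages, balance_allows_learning 110 100 still succeeds, while balance_allows_learning 112 100 fails, because 112 spam messages are more than 11% ahead of the ham count.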

In the options section you can influence the logging of the autolearning, disable the balance guard or change the defaults of the probability check. The Bayesian filter should be trained with messages that are in a more or less greyish zone between ham and spam. By default, messages with a spam probability below 0.1 (ham_max) or above 0.9 (spam_min) will not be used for training; the sample above sets these limits to 0.08 and 0.92.