Slaying Spam with both Bogofilter and SpamAssassin embedded in Exim

Ads are spam. The good thing about the internet’s ads is that you can set up countermeasures.

(Disclaimer: yes, there is nothing new here, just an example setup)

I have plenty of email addresses from different providers, and some are definitely history. I could go through the websites of all of these and set up forwarding for the ones I no longer use but still want to receive mail from, just in case. Well, I would do that if I were using my mail client to fetch mail, because otherwise fetching from every account would take ages.

But, as I have a local home underclocked 🙂 server, I find it far easier and more powerful to use ESR’s fetchmail instead, downloading everything to a single account that my mail client accesses through IMAPS. I have an /etc/fetchmailrc like:

poll with proto POP3
user 'XXX' there with password 'XXX' is 'localuser' here
poll with proto IMAP
user '' there with password 'XXX' is 'localuser' here with ssl
user '' there with password 'XXZ' is 'localuser' here with ssl

Fetchmail downloads the mail and then relies on the installed SMTP server, which is Exim, to deliver it to the end user’s mailbox, accessible through IMAPS.

What’s so nifty about it? Well, mail will also be filtered for spam. As this happens on the local home server, it goes unnoticed by the end user, that is, me. We’ll use several anti-spam tools, not caring about redundancy or time consumption: DNSBLs, Bogofilter, SpamAssassin, razor2.

So, here we go. Note that Exim (exim4) in Debian runs as the user Debian-exim. localuser is the recipient end user; it belongs to the group localuser, named after itself.
We will add Debian-exim to the group localuser and create a system group dedicated to spam checking, to easily share the bayesian databases:

# addgroup --system spamslayer
# adduser Debian-exim spamslayer
# adduser Debian-exim localuser
# adduser localuser spamslayer

* Bogofilter is a bayesian spam filter. It is said to be faster and less resource-hungry than SpamAssassin’s own bayesian filter, so we will run mail through it first. It is installed with the Debian package.

Edit /etc/ as follows:


The bayes directory must be created by hand:

# mkdir /var/lib/bogofilter
# chgrp spamslayer /var/lib/bogofilter
# chmod 2777 /var/lib/bogofilter

* SpamAssassin is a powerful spam killer, at the cost of processing time. It is installed with the Debian package.

In the following site-wide config /etc/spamassassin/, I use bayesian filters, razor2, several DNSBLs and I adjust some tests according to my needs:

# Save spam messages as a message/rfc822 MIME attachment instead of
# modifying the original message (0: off, 2: use text/plain instead)
# Keep as it is because bogofilter would not learn properly otherwise,
# as it cannot distinguish report from the spam.
report_safe 0
# Set which networks or hosts are considered 'trusted' by your mail
# server (i.e. not spammers)
trusted_networks 192.168.1.
# Locales
# (I only receive mails in English or French)
ok_locales en fr
# Set the threshold at which a message is considered spam (default: 5.0)
required_score 3.3
# Use Bayesian classifier (default: 1)
# (I created the relevant directory)
use_bayes 1
bayes_file_mode 0777
bayes_path /var/lib/spamassassin-bayes/bayes
score BAYES_20 0.3
score BAYES_40 0.5
score BAYES_50 0.8
score BAYES_60 1
score BAYES_80 2
score BAYES_95 2.5
score BAYES_99 6
# Bayesian classifier auto-learning (default: 1)
# (I may change that, not sure about it)
bayes_auto_learn 1
# Set headers which may provide inappropriate cues to the Bayesian
# classifier
bayes_ignore_header X-Bogosity
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
# use razor
# (/etc/razor is the standard debian path)
use_razor2 1
razor_config /etc/razor/razor-agent.conf
score RAZOR2_CF_RANGE_51_100 3.2
# some rbl checks are already made by exim, at RCPT time, not all.
skip_rbl_checks 0
rbl_timeout 30
score RCVD_IN_SBL 15
score RCVD_IN_XBL 15
# adjust some tests scores: lower DUL test
# lower stupid test
# adjust some tests scores
score HTML_FONT_BIG 2.4
score NO_REAL_NAME 2
score SUBJ_ALL_CAPS 2.6
# increase all scores related to drugs: what do I care, duh
score DRUGS_DIET 5
score DRUGS_PAIN 5
score DRUGS_SMEAR1 5
# same goes for porn
score BEST_PORN 5
score FREE_PORN 5
score LIVE_PORN 5
score PORN_15 5
score PORN_16 5
score PORN_URL_SEX 5
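
To make the effect of these overrides concrete, here is a minimal Python sketch of how the scores of the tests that hit add up against required_score. It only models the values copied from the config above; the function names and the example hit lists are hypothetical, and real SpamAssassin scoring involves many more rules than this.

```python
# Values copied from the site-wide config above; everything else is illustrative.
REQUIRED_SCORE = 3.3
SCORES = {
    "BAYES_80": 2.0,
    "BAYES_95": 2.5,
    "BAYES_99": 6.0,
    "RAZOR2_CF_RANGE_51_100": 3.2,
    "HTML_FONT_BIG": 2.4,
    "NO_REAL_NAME": 2.0,
    "SUBJ_ALL_CAPS": 2.6,
}


def total_score(hits):
    """Sum the scores of the tests that fired for one message."""
    return sum(SCORES[name] for name in hits)


def is_spam(hits):
    """A message is flagged once its total reaches the threshold."""
    return total_score(hits) >= REQUIRED_SCORE
```

With these numbers, a BAYES_99 hit alone is enough to flag a message, while BAYES_95 or a razor2 hit alone stays just under the threshold and needs one more test to fire.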

The bayes directory must be created:

# mkdir /var/lib/spamassassin-bayes
# chown Debian-exim /var/lib/spamassassin-bayes
# chmod 0777 /var/lib/spamassassin-bayes

Obviously, this implies that razor2 must be properly installed. We install the Debian package, then set it up. Remember that it must run as the user Debian-exim, so we do:

# chown -R Debian-exim:spamslayer /etc/razor
# su Debian-exim
$ razor-admin -home=/etc/razor -register
$ razor-admin -home=/etc/razor -create
$ razor-admin -home=/etc/razor -discover

To save resources, we start SpamAssassin as a daemon (spamd), which will be called through its dedicated client (spamc). Before using the init.d script, edit /etc/default/spamassassin as follows:

# Change to one to enable spamd
# SpamAssassin uses a preforking model, so be careful! You need to
# make sure --max-children is not set to anything higher than 5,
# unless you know what you're doing.
OPTIONS="--create-prefs --max-children 5 --helper-home-dir -u Debian-exim -g spamslayer"
# Cronjob
# Set to anything but 0 to enable the cron job to automatically update
# spamassassin's rules on a nightly basis

All that being done, you’ll want to (re)start the daemon with the relevant init.d script (/etc/init.d/spamassassin restart here).

* Now we’ll tune Exim so that it calls Bogofilter first by itself and then SpamAssassin, only if necessary. We use the split configuration in /etc/exim4/conf.d/. That is Debian-specific, I think, but it does not make any difference anyway.

First we define useful transports in /etc/exim4/conf.d/transport/35_spamblock (the name 35_spamblock is arbitrary and the number does not matter here):

spamslay_bogofilter:
driver = pipe
command = /usr/sbin/exim4 -oMr spamslayed-bogofilter -bS
use_bsmtp = true
transport_filter = /usr/bin/bogofilter -l -p -e
home_directory = "/tmp"
current_directory = "/tmp"
# must use a privileged user to set $received_protocol
# on the way back in!
user = Debian-exim
group = spamslayer
log_output = true
return_fail_output = true
return_path_add = false
message_prefix =
message_suffix =
spamslay_spamd:
driver = pipe
command = /usr/sbin/exim4 -oMr spamslayed-spamd -bS
use_bsmtp = true
transport_filter = /usr/bin/spamc
home_directory = "/tmp"
current_directory = "/tmp"
# must use a privileged user to set $received_protocol
# on the way back in!
user = Debian-exim
group = spamslayer
log_output = true
return_fail_output = true
return_path_add = false
message_prefix =
message_suffix =

Second we define routers, here in /etc/exim4/conf.d/router/450_spamblock – the order matters, here it is just after 400_exim4-config_system_aliases and before 500_exim4-config_hubuser:

# spam checking
# first bogofilter
debug_print = "R: bogofilter for $local_part@$domain received with protocol $received_protocol with X-Spam-Flag=$h_X-Spam-Flag and X-Bogosity=$h_X-Bogosity"
# When to scan a message :
# - it isn't already flagged as spam
# - it has not yet been spamslayed at all
# - it isn't local ($received_protocol eq "" or local)
condition = "${if and{ {!eqi{$h_X-Spam-Flag:}{yes}} {!eq{$received_protocol}{spamslayed-bogofilter}} {!eq{$received_protocol}{spamslayed-spamd}} {!eq{$received_protocol}{local}} {!eq{$received_protocol}{}} }}"
driver = accept
transport = spamslay_bogofilter
# second spamd
debug_print = "R: spamd for $local_part@$domain received with protocol $received_protocol with X-Spam-Flag=$h_X-Spam-Flag and X-Bogosity=$h_X-Bogosity"
# When to scan a message :
# - it isn't already flagged as spam
# - it has not yet been spamslayed with SA
# - it isn't local ($received_protocol eq "" or local)
condition = "${if and { {!eqi{$h_X-Spam-Flag:}{yes}} {!match{$h_X-Bogosity:}{^Spam}} {!eq {$received_protocol}{spamslayed-spamd}} {!eq{$received_protocol}{local}} {!eq{$received_protocol}{}} }}"
driver = accept
transport = spamslay_spamd
# This route will send any mail that got here to the devnull alias, that
# should be configured in /etc/aliases to be a real link to /dev/null.
# This route should get only mails that have spam score higher than 14.
# This will affect users mails!
condition = "${if ge{$h_X-Spam-Level:}{\*\*\*\*\*\*\*\*\*\*\*\*\*\*} {1}{0} }"
driver = redirect
data = spam
file_transport = address_file
pipe_transport = address_pipe
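
The three routers above boil down to a small decision chain. The following Python sketch models it under stated assumptions: the function name and the returned labels are hypothetical, the header values are passed in as plain strings, and Exim's ge is treated as a plain string comparison, which is enough for runs of asterisks. It is a reading aid, not a replacement for testing the real configuration.

```python
SCANNED = ("spamslayed-bogofilter", "spamslayed-spamd")
LEVEL_THRESHOLD = "*" * 14  # same 14 asterisks as in the redirect router


def next_step(received_protocol, spam_flag="", bogosity="", spam_level=""):
    """Tell which of the three routers would pick the message up."""
    flagged = spam_flag.lower() == "yes"          # eqi{$h_X-Spam-Flag:}{yes}
    local = received_protocol in ("", "local")    # locally submitted mail

    # First router: not flagged, never scanned before, not local.
    if not flagged and not local and received_protocol not in SCANNED:
        return "bogofilter"

    # Second router: not flagged, Bogofilter did not say Spam,
    # not yet through spamd, not local.
    if (not flagged and not bogosity.startswith("Spam")
            and received_protocol != "spamslayed-spamd" and not local):
        return "spamd"

    # Third router: X-Spam-Level holds at least 14 asterisks.
    if spam_level >= LEVEL_THRESHOLD:
        return "redirect"

    return "next router"
```

For instance, a message arriving over esmtp goes to Bogofilter first; when it comes back with protocol spamslayed-bogofilter and an X-Bogosity that does not start with Spam, it goes on to spamd; anything Bogofilter already called Spam skips spamd entirely.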

* Next step: now that spam is flagged, it makes sense to set it apart in the Maildir that will be accessed through IMAPS. I do this with procmail. We set the umask for procmail (the IMAP server is configured the same way) to make sure Debian-exim can access stored mail: we want mode 0640, with group read access, so the umask is 666-640=026. Here’s the relevant bit of /home/localuser/.procmailrc:

* ^X-Spam-Status: Yes
* ^X-Spam-Flag: YES
* ^X-Bogosity: Spam
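
About the umask arithmetic mentioned above (666-640=026): the subtraction works here because every digit of 640 is at most 6, but what the system really does is clear the umask bits from the requested mode. A small Python sketch, with hypothetical helper names, shows both directions:

```python
def umask_for(file_mode):
    """Umask that makes newly created regular files end up with file_mode."""
    return 0o666 & ~file_mode


def resulting_mode(umask, requested=0o666):
    """Mode actually applied when a program asks for `requested`."""
    return requested & ~umask


print(oct(umask_for(0o640)))              # umask needed for 0640 files
print(oct(resulting_mode(0o026)))         # files created with that umask
print(oct(resulting_mode(0o026, 0o777)))  # directories created with it
```

Note that directories created under this umask are not group-writable either, which is fine here since Debian-exim only needs to read the stored mail.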

At the same time, we make sure Debian-exim can access mails already there (so not affected by umask):

# cd /home/localuser
# chmod 750 -Rv .Maildir
# find .Maildir -type f -exec chmod -v 0640 {} +

(PS: you may want to enforce a more restrictive policy, depending on how your server is accessed. But, anyway, Debian-exim is by nature able to tamper with the mail you receive, so it won’t make a big difference.)

* Training bayesian filters.

Now that spam ends up in a specific Maildir, both the SpamAssassin and Bogofilter bayesian filters must be trained to be effective.

We add the following in /etc/cron.d/bayes:

# trains bayesian filters
SPAMDIR="/home/localuser/.Maildir/.Poubelle.Spam/cur/ /home/localuser/.Maildir/.Poubelle.Spam/new/"
# spamd: can handle by itself bogofiltered headers
25 * * * * Debian-exim /usr/bin/sa-learn --spam $SPAMDIR
# bogofilter: not able to clean inappropriate cues from spamd, will do it
# by removing:
# - informational SpamAssassin headers
# - SpamAssassin score and decision (irrelevant)
# (-u was not set as it is discouraged perf-wise in bogofilter's manual)
28 * * * * Debian-exim for file in `find $SPAMDIR -type f`; do cat $file; done | grep -v -E "^X-Spam-(Checker|Flag|Level|Report)" | sed s/"^X-Spam-Status.*score.*required.*tests="//g | /usr/bin/bogofilter --register-spam
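
The grep and sed stages of the Bogofilter cron line can be hard to read in one go. Here is a Python sketch of the same clean-up, line by line; the function name is hypothetical and the two regular expressions are copied from the cron entry above. Like grep, it works on single lines, so folded continuation lines of a removed header are left in place.

```python
import re

# Same patterns as the grep -v -E and sed expressions in the cron line.
DROP = re.compile(r"^X-Spam-(Checker|Flag|Level|Report)")
STATUS_PREFIX = re.compile(r"^X-Spam-Status.*score.*required.*tests=")


def clean_for_bogofilter(lines):
    """Drop informational SpamAssassin headers and strip score and decision."""
    cleaned = []
    for line in lines:
        if DROP.search(line):
            continue  # grep -v: the whole line goes away
        cleaned.append(STATUS_PREFIX.sub("", line))  # sed: keep the test names
    return cleaned
```

The X-Spam-Status line is not removed entirely: only its score and verdict are cut, so the list of test names that fired remains available as tokens for Bogofilter.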

Obviously, if you want it to learn from plenty of different users, you’ll have to think of something more elaborate 🙂
Anyway, regarding plenty of users, it would probably be wise to think twice about the whole concept of sharing bayesian filters, which may not be accurate at all for very different users.

One alternative would have been to avoid meddling with Exim and to run both bogofilter and spamd via procmail. Sure, it would not have been a site-wide setup, but for a few users ~/.procmailrc can be replicated easily. But actually I enjoy messing with Exim; that’s kind of a hobby. I skipped here the part where we call DNSBLs in Exim (it works out of the box anyway). And on a production server, with SMTP wide open to the internet, it is possible to follow this approach to shut off spammers at SMTP time, which saves a lot of resources, and even to ban them.

Underclocking, going backward?

Do you remember back in the days when a Pentium III doing 600MHz was awesome? At that time, when guys at Intel were foretelling that the increase in processor clock rates would have no end, or at least none that they could possibly envision, you’d see that oath as a testimony of faith in a future of endless possibilities, gaming-wise.

Later on, Intel went as far as releasing a Pentium 4 that was a degraded version of the Pentium III. Less complex, with fewer instructions, it was able to go higher in clock rate than any Pentium III, something like 1.8GHz easily. It went on. People even bought laptops with a 2.6GHz Pentium 4. And then people started noticing: hey! it’s winter, it’s freezing cold outside but I don’t even have to turn the heating on! Or funnier: hey, why is my brand new laptop making more noise than a vacuum cleaner? And what black magic made power supplies become a noticeably costly component of a computer?

Well, that’s all about physics. And there’s not much to do about it. The faster the central processing unit runs, the more energy it burns and the hotter it gets.

AMD was smart enough to soon start shipping processors with a lower clock rate than Intel’s for the same effective performance. It was also smart enough to brand them accordingly, calling them for instance something like 3200+ to say that they would be as potent as a 3.2GHz Pentium 4, while they had a much slower clock rate.

Intel could surely not go backward too obviously and publicly acknowledge that AMD was smarter. But they could not lose the growing laptop market, where the heat issue (not to mention the impact on battery life) was too much of a problem with the Pentium 4, so they invented the Pentium M… based on the Pentium III, of course.

Considering the unavoidable antagonism between speed and energy consumption, the best idea that someone (who, I do not know) came up with was to let the operating system set the clock rate according to the current need. It comes under many different names (Cool & Quiet, whatever) and I believe it is now available on most recent processors. On Debian, you just have to install cpufrequtils, load the relevant CPU kernel module (powernow_k8 for instance on my AMD Athlon 64 X2 Dual Core workstation) and then pick a policy. Yes, you have to pick a policy, like on demand, performance, etc.
Obviously, there is a performance loss (hence the name, performance, of the policy that simply sets the clock rate to the maximum) since you are not always running as fast as possible: the operating system always needs some delay to understand that you now need full speed when it was idle just before. The purpose of the different policies is to tune this delay, this inertia; in that regard on demand is simply harsher than conservative.
The next step would be to have the operating system guess whether you’ll need full speed or not according to what you are actually doing (which software you run, etc.) and what you are about to do according to past usage (yes, logging what you usually do and making guesses).
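
To check which policy is currently active, the kernel exposes the cpufreq state as plain files under sysfs. The Python sketch below reads them; the function name is hypothetical, and it assumes the usual Linux layout (/sys/devices/system/cpu/cpu0/cpufreq with the scaling_governor and scaling_available_governors files), returning None when the directory is missing, as on a box without cpufreq support.

```python
from pathlib import Path

# Usual sysfs location for the first CPU; other CPUs are cpu1, cpu2, ...
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def read_policy(cpufreq_dir=CPUFREQ_DIR):
    """Return (current governor, available governors), or None without cpufreq."""
    cpufreq_dir = Path(cpufreq_dir)
    current = cpufreq_dir / "scaling_governor"
    available = cpufreq_dir / "scaling_available_governors"
    if not current.is_file():
        return None
    governors = available.read_text().split() if available.is_file() else []
    return current.read_text().strip(), governors
```

Calling read_policy() with no argument on a machine where the module is loaded would return something like the active policy name plus the list you can choose from; the directory parameter is only there so the function can be pointed at another CPU or at a test directory.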

So currently, on a workstation like mine, using cpufreq on demand is probably a wise choice. Most of the time, it will run slower than it could, because you do not need the full power of a recent processor to browse the web, reply to mail and whatever crap like that you may want to do. And when you’re compiling a piece of code or encoding a piece of music, then you’ll have full power. I never or rarely use GNU/Linux to play games, so inertia is not a crucial issue. However, to play games, it would surely be best to set the policy to performance, even if, once the game is started, it will likely request full power anyway (surely, you configure your games to the best resolution, anti-aliasing, etc., that your box can handle, don’t you?).

(Not to mention that, gaming-wise, graphics cards now do a big part of the job, the most important part anyway, making the CPU less important by comparison with the so-called GPU… but that’s another story)

So, now, I’ll get straight to the point. I also run a little Shuttle box as a local server. It serves files, it is up 24 hours a day and does plenty of small things. It comes with a Celeron 2.6GHz but it would surely do as well with a slower clock rate. With the idea in mind of reducing the heat of this processor as much as possible, I searched the web on the subject of underclocking. The mainboard of the Shuttle, by design, does not allow this processor to run slower than it does. There is no possibility of playing with cpufreq or the like with a Celeron, which is actually a crappy Pentium (much less L2 cache, fewer instructions, etc).

Pentium 4 2.80GHz running at 1.40GHz
I found interesting, however, the idea of buying a processor designed to run with a faster front side bus than the mainboard actually has. It relies on the fact that the processor clock rate is determined by both the clock multiplier and the front side bus (FSB).
This Shuttle’s front side bus runs at 400MHz. If I pick a processor designed for, say, an 800MHz front side bus, which is usual for Pentium 4s around 3GHz, it will run at half its rated speed.
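
The arithmetic behind this is simple enough to sketch in Python. It assumes the multiplier is locked, so the ratio between the rated clock and the rated front side bus stays constant and only the bus speed changes; the function name is hypothetical and the figures are the ones from this post.

```python
def effective_clock_mhz(rated_clock_mhz, rated_fsb_mhz, board_fsb_mhz):
    """Clock rate a multiplier-locked CPU reaches on a board with another FSB."""
    ratio = rated_clock_mhz / rated_fsb_mhz  # fixed by the locked multiplier
    return ratio * board_fsb_mhz


# Pentium 4 2.8GHz rated for an 800MHz FSB, plugged into a 400MHz FSB board.
print(effective_clock_mhz(2800, 800, 400))

# The original Celeron 2.6GHz was rated for the board's own 400MHz FSB.
print(effective_clock_mhz(2600, 400, 400))
```

Halving the bus speed halves the clock rate, which is the whole trick: the processor is never asked to do anything beyond its specification, it just runs slower and cooler.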

So I spent nine euros to get a (used, but the Celeron 2.6GHz is not brand new either) Pentium 4 2.8GHz. And now, my Shuttle runs at 1.4GHz. Processor temperature is around 35°C, and the sole fan of the box spins at around 1300RPM. Nice side effect: this processor has Intel’s Hyper-Threading (a pseudo-multiprocessor), which is definitely good for a server.

I guess I could even use cpufreq with this new processor. But I’m not sure it would let me go as far as halving the clock rate. And anyway, in this case I’m happy with a clock rate set by the hardware.

The only remaining thing to do is to undervolt it now.