SpamBayes and a plan for Spam
I’ve been working on setting up the virtual server that runs this site and Kiwi, and (of course) I’ve been tinkering with the mail system. I have Postfix running as the MTA, which hands incoming mail to Procmail, which deposits it into a series of Maildir directories. From there, I can use the wonderful Dovecot (maybe the best IMAP server out there right now) to view my mail. The only thing missing is spam protection. When I was on Dreamhost I had SpamAssassin doing the spam filtering for me. This past week I’ve gone without any sort of filter, and I’ve been hit with a barrage of junk e-mail.
I’ve done the basics, like configuring Postfix to reject senders without a valid domain. I installed and ran SpamAssassin for about a day, but SpamAssassin is huge! I mean, really, it gobbles up RAM. Normally this wouldn’t be a problem, but I’m running on a VPS with 96MB of RAM and I’ve got a bunch of services, so I can’t afford to have SpamAssassin taking up 40MB! By default it’s even worse, since SpamAssassin starts with 5 processes on Debian, each of which takes 20MB of RAM. That is roughly 100MB, more than the whole VPS has. Yikes! I did the natural thing and scaled it back as far as possible, to 2 processes, but that is still 40MB of RAM.
So what’s the solution? I’m happy to say that I’ve found something that uses fewer resources and has filtered out every piece of junk mail: SpamBayes. What is SpamBayes? It’s server-side Bayesian filtering software that you train on the mail you receive. I’m going to present a few simple directions for setting up SpamBayes here; however, I’m making a few assumptions:
- You have an MTA (mail transfer agent) such as Postfix or Exim set up
- You have Procmail configured to be your MDA (mail delivery agent)
- You’re using Maildir (these steps can be easily modified for mbox or mh)
- You’re using IMAP
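If "Bayesian filtering" sounds abstract, here is a rough Python sketch of the general idea: count which words show up in mail you have labeled spam versus good mail, then score new mail by how strongly its words lean one way or the other. This is a deliberately simplified naive Bayes toy and not SpamBayes’s actual scoring code; the `ToyBayes` class, its methods, and the sample messages are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class ToyBayes:
    """Hypothetical toy classifier; NOT how SpamBayes is implemented."""

    def __init__(self):
        # Per label: in how many messages each token has appeared.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Per label: how many messages we have trained on.
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one message as 'spam' or 'ham' (good mail)."""
        self.messages[label] += 1
        self.counts[label].update(set(tokenize(text)))

    def spam_probability(self, text):
        """Estimate the probability that text is spam, between 0 and 1."""
        n_spam = self.messages["spam"]
        n_ham = self.messages["ham"]
        # Start from the prior odds, smoothed so an untrained filter gives 0.5.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for token in set(tokenize(text)):
            # Add-one smoothing keeps unseen tokens from producing zeros.
            p_spam = (self.counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    toy = ToyBayes()
    toy.train("cheap pills buy now", "spam")
    toy.train("buy cheap watches now", "spam")
    toy.train("meeting notes attached", "ham")
    toy.train("lunch meeting tomorrow", "ham")
    print(toy.spam_probability("buy cheap pills"))
    print(toy.spam_probability("meeting tomorrow"))
```

The more labeled mail the filter sees, the better its word counts reflect your own mail, which is why training it regularly matters.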
So on with the show:
- Get SpamBayes from the website (it’s a series of Python scripts), or if you are on Debian, run:
    apt-get install spambayes
- Create a SpamBayes database by running the following in your home directory:
    sb_filter.py -n
- Create a simple configuration file, ~/.spambayesrc, which tells SpamBayes where to find your database. Here is the config file I used:
    [Storage]
    persistent_use_database = True
    persistent_storage_file = ~/.hammiedb
- Edit your .procmailrc file so that it invokes SpamBayes when e-mail is delivered. I also added some filters to file away e-mail that SpamBayes tags as spam, and e-mail that it is unsure about. Edit the path to sb_filter.py as appropriate, and note that for my setup this puts spam in a Spam folder and unsure mail into, you guessed it, Unsure:
    :0 fw:hamlock
    | /usr/bin/sb_filter.py

    :0
    * ^X-SpamBayes-Classification: spam
    .Spam/

    :0
    * ^X-SpamBayes-Classification: unsure
    .Unsure/
- Add a line to your crontab file so that every night SpamBayes learns by looking at the e-mail you’ve put in the Spam folder and the e-mail in your Inbox. Run
    crontab -e
  and add this line:
    10 0 * * * /usr/bin/sb_mboxtrain.py -g /home/mronge/Maildir/cur -s /home/mronge/Maildir/.Spam/cur
What does that line do? It says that every night, 10 minutes after midnight, cron should run sb_mboxtrain.py (despite the name, it works on mbox, Maildir, and mh), with good mail stored in the Inbox at /home/mronge/Maildir/cur and spam stored in /home/mronge/Maildir/.Spam/cur. Of course, you’ll have to adjust the paths as necessary for your own system. This way SpamBayes gets smarter every day by scanning your e-mail. You can even add other folders so that SpamBayes can train off of your archives or mailing list folders. I filter out mailing list e-mail in my .procmailrc, so I don’t train off of it.
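If you want to sanity-check how mail is being tagged before you trust the filter, here is a small Python sketch that mirrors the two Procmail recipes: it reads the X-SpamBayes-Classification header and reports which folder each message would land in. The `folder_for` and `tally` helpers are names I made up, and I’m assuming the header value starts with the classification word, which is what the Procmail patterns rely on too. It only reads mail; it never moves or changes anything.

```python
import mailbox
import sys
from collections import Counter

# Header that sb_filter.py adds and that the Procmail recipes match on.
HEADER = "X-SpamBayes-Classification"


def folder_for(msg):
    """Return the folder the Procmail recipes would pick for this message."""
    # Assumption: the value begins with "spam" or "unsure"; anything after
    # that word (such as a score) is ignored, just as the recipes ignore it.
    value = str(msg.get(HEADER, "")).strip().lower()
    if value.startswith("spam"):
        return ".Spam"
    if value.startswith("unsure"):
        return ".Unsure"
    # No header, or any other value: the mail stays in the Inbox.
    return "Inbox"


def tally(maildir_path):
    """Count, per destination folder, the messages in one Maildir."""
    # maildir_path is the directory that contains cur/, new/ and tmp/.
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        return Counter(folder_for(msg) for msg in box)
    finally:
        box.close()


if __name__ == "__main__":
    # Example: python tally.py /home/mronge/Maildir
    for folder, count in sorted(tally(sys.argv[1]).items()):
        print(folder, count)
```

Run against the Inbox, anything other than "Inbox" in the output would suggest the Procmail rules are not matching the header the way you expect.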
After that, you’re done. E-mail that is left in your Inbox will be used to train the system on what a “good” message looks like, and mail that you move to your Spam folder will be used to train it on what spam looks like. Also, make sure you check your Unsure IMAP folder, and move any mail that SpamBayes is unsure of to its proper location. Hopefully, like me, you’ll have success in filtering out spam without the performance penalty of running SpamAssassin. As a final note: I realize these directions assume quite a bit of Unix and sysadmin knowledge. If you have any trouble with the above, feel free to leave a comment and I’ll do my best to help out. Another final note: Let me know if you spot any grammatical mistakes (or if I’ve got some of the technical details screwed up).