Why is gmail so good at spam filtering?

Today I posted a message saying that gmail’s spam filters are good, and Greg mentioned that gmail sounds much better at this than the University of Washington.  That reminded me that I like blogging, so I wanted to write a short background on what I know about this topic (which isn’t an incredible amount, but it’s a fair deal and more than most people probably want to know). Before talking about why gmail is good at spam filtering, it’s worth defining a few terms that come up around junk email, or spam.

  • Spam is unsolicited junk email.
  • Ham is the email that isn’t spam. In practice it’s whatever gets through the spam filters, and not all of that is email you want; it’s just what gets through.
  • A false positive in spam filtering means a message you wanted to see gets tagged as spam and filtered out of your regular view of email.
  • A false negative is a miss in spam detection: a spam message that isn’t caught and lands in your inbox.
  • A spam filter is a system that sifts through your incoming email, applies a set of rules to estimate how likely each message is to be spam, and takes an action based on that.
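To make the two kinds of mistakes in the definitions above concrete, here is a minimal sketch that counts them for a batch of messages. The function name and the made-up sample data are my own assumptions for illustration, not anything a real filter exposes.

```python
def count_errors(messages):
    """Count filter mistakes.

    `messages` is a list of (is_actually_spam, was_filtered) boolean pairs.
    Returns (false_positives, false_negatives).
    """
    # False positive: wanted email that the filter hid from you.
    false_positives = sum(
        1 for actual, filtered in messages if filtered and not actual
    )
    # False negative: spam that the filter missed.
    false_negatives = sum(
        1 for actual, filtered in messages if actual and not filtered
    )
    return false_positives, false_negatives


# Made-up sample: one wanted message hidden, two spam messages missed.
sample = [
    (True, True),    # spam, caught
    (False, False),  # wanted, delivered
    (False, True),   # wanted, but filtered: false positive
    (True, False),   # spam, delivered: false negative
    (True, False),   # spam, delivered: false negative
]
```

A filter can usually trade one kind of error for the other by moving its threshold, which is why both numbers matter.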
The title of this post asserts that gmail is good at filtering spam, and I think most people who use it would agree.  Before switching to gmail, I maintained my own POP3 server with a private hosting company and immediately learned that doing this without some spam filtering system yields a totally unacceptable email experience.  Anyone who wants to use email without getting overwhelmed by junk (and who isn’t trying very, very hard to live “off the grid” in some sense) absolutely needs some spam filtering.
So at the time I used SpamAssassin, and it was very good. Most spam filters evaluate each email message against a set of rules, and each rule contributes a score indicating how likely the message is to be spam or ham. These scores are combined, often with Bayes’ theorem, into an aggregate likelihood that the message is spam, and the system defines a tolerance for ultimately deciding whether the message is shown.  I may be oversimplifying some details, but that’s the general approach, and I suspect something like it is at least a part of gmail’s spam filtering (if not all of it).
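The general approach described above can be sketched in a few lines. This is a simplified illustration, not how SpamAssassin or gmail actually implement it: the rule names and probabilities are invented, the rules are assumed to be independent of each other, and only rules that fire contribute evidence.

```python
import math

# Hypothetical rules: (P(rule fires | spam), P(rule fires | ham)).
# These numbers are made up for illustration.
RULES = {
    "mentions_lottery": (0.40, 0.01),
    "all_caps_subject": (0.30, 0.05),
    "known_mailing_list": (0.02, 0.30),
}


def spam_probability(fired_rules, prior_spam=0.5):
    """Combine the rules that fired using Bayes' theorem.

    Assumes the rules are independent (the "naive" Bayes assumption).
    """
    # Work in log-odds so many small factors don't underflow.
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for name in fired_rules:
        p_given_spam, p_given_ham = RULES[name]
        log_odds += math.log(p_given_spam / p_given_ham)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


def classify(fired_rules, threshold=0.9):
    """Return 'spam' when the combined probability exceeds the tolerance."""
    return "spam" if spam_probability(fired_rules) > threshold else "ham"
```

The `threshold` argument plays the role of the tolerance: raising it means fewer legitimate messages get hidden, at the cost of letting more spam through.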
So I mentioned SpamAssassin was good.  Why move away from it?  I don’t really think there is a good reason, and if I were still maintaining my own mailserver, I would almost definitely continue to use it. But I’m not, and I don’t want to. There are tons of great engineers at Google who are trying to tempt me to not care about stuff like this and to let them do that work for me, and I let them.
Now to get to the point – why is gmail’s spam filtering good and why might it be better than a lot of other systems out there?
  • When you use gmail, you agree to give google a LOT of your personal information.  And they are very good at taking semi-unstructured data (like multiple GB of email) and finding patterns in it that can be useful for building rules that simpler systems don’t have access to.
  • Your mail and contacts are one. In most personal or hosted mail systems, your address book might seem like it’s in the server, but it might not be.  It might be stored on another server that sits right next to the one your mail is on, so the spam filters might only have access to your email and not know which people in your personal address book you always want to allow to send you email.  Google and gmail definitely know this, so even if your brother sends you a message that fires 10 alarms that make it look like spam to most spam filtering systems, Google might be able to have the “contact” rule trump those other rules.
  • Google has all the other email in gmail to use to identify spam, too. Say some spammer crafts a clever message and it gets through every spam filter in existence.  Now 5,000 gmail members all see it and mark it in their inboxes as spam – you are customer number 5,001.  I don’t know that google/gmail *do* this, but they could certainly use that as a filter, too, to retroactively identify the message as spam and yank it from your inbox and push it to the spam folder.
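Here is a rough sketch of how the last two ideas in the list above could fit together. I don’t know how google actually does any of this, so everything here is an assumption: the function names, the message “fingerprint”, and the 5,000-report cutoff are all invented for illustration.

```python
from collections import Counter

# Hypothetical cutoff: how many user reports flag a message as spam.
REPORT_THRESHOLD = 5000

# Message fingerprint -> number of users who clicked "mark as spam".
spam_reports = Counter()


def report_spam(fingerprint):
    """Record one user marking this message as spam."""
    spam_reports[fingerprint] += 1


def choose_folder(sender, fingerprint, content_score, contacts, threshold=0.9):
    """Pick 'inbox' or 'spam' for one message."""
    # The "contact" rule trumps every other rule.
    if sender in contacts:
        return "inbox"
    # Enough reports from other users mark the message as spam,
    # even if its content looked fine to the filter.
    if spam_reports[fingerprint] >= REPORT_THRESHOLD:
        return "spam"
    # Otherwise fall back to the ordinary content-based score.
    return "spam" if content_score > threshold else "inbox"
```

Running `choose_folder` again on a message that is already sitting in your inbox, after the reports have piled up, is what would let the system yank it retroactively.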
To summarize: Google have tons of engineers working on this.  They’re good at aggregating data.  They have a lot of data about you to pull from beyond simply “what’s in the email” to determine whether a message is probably spam.  And they have a lot of data from other people, too, to tell whether something is spam.  For me, all of that adds up to almost never seeing spam and almost never having legitimate messages flagged as spam.

en.wikipedia.org/wiki/Bayes’_theorem

1 Comment »

  1. crowther said,

    December 17, 2011 @ 5:33 am

    Patrick, thanks for this expanded look at the issue. I think you’re right that UW’s email filters are not tied directly to my address book.
