Using Apache To Stop Bad Robots
The honest truth about bad robots
For just about as long as the commercial Internet has existed, SPAM email has been the bane of users worldwide. The harder we try to fight the spammers and keep our email addresses out of their hands, the smarter they get and the harder they fight back.

One example of people's attempts to fight back is the large number of joe@NOSPAM.email.com, NO.mary.SPAM@REMOVESPAM.mary.com, etc. email addresses you find on Usenet and web-based communities these days. Worse yet, many people hold back from contributing to online discussions for fear their email address will be harvested from mailing list archives and exploited by evil web spiders (I call them Spiderts - web spiders with a Catbert-type personality).

As one who runs (and uses!) evolt's mailing lists, keeping thousands of people's email addresses out of the tentacles of Spiderts has always been a big concern of mine. At first, it was easily remedied by using the %40 'trick'. Instead of writing archives with an easily recognizable email address (abuse@aol.com, for example), I had our mailing list software write all email addresses as abuse%40aol.com.
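As a rough illustration (not necessarily exactly how the list software wrote it), a link in the archive HTML might look something like this:

<!-- %40 is the URL-encoded form of the @ character -->
<a href="mailto:abuse%40aol.com">abuse%40aol.com</a>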
Writing them this way still gave humans a fairly easy-to-read address while maintaining the ability to click the mailto: link and have one's email client create a new message with the correct email address entered. The Spiderts wouldn't recognize abuse%40aol.com as a valid email address and therefore wouldn't harvest it.

This was a fairly good solution until its use became widespread, at which point the creators of the Spiderts tweaked their unholy creations to recognize abuse%40aol.com as a harvestable email address and siphon it as well. As if that weren't bad enough, it was also becoming apparent that the newer generations of Spiderts don't play by the rules set out for web spiders, and would disregard any "Disallow: /" entries in the robots.txt file. In fact, I've seen Spiderts that only go for what we specifically tell them not to! What's a webmaster to do?!?

Setting the trap
The first step in our war against the Spiderts is to identify them. There are many techniques to find out who the bad bots are, from manually searching your access_logs to using a maintained list and picking which ones you want to exclude. At the end of the day it's getting the robot's name - its User-Agent - that's important, not how you get it. That said, here's a method I like that targets the worst offenders.

Add a line like this to your robots.txt file:

Disallow: /email-addresses/
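In context, a minimal robots.txt carrying this trap entry might look something like the following (a sketch; the directory name is just the example used here):

# robots.txt - well-behaved spiders read this and skip the listed paths
User-agent: *
Disallow: /email-addresses/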
Here, 'email-addresses' is not a real directory. Wait a decent amount of time (a week to a month), then go through your access_log file and pick out the User-Agent strings that accessed the /email-addresses/ directory. These are the worst of the worst - those that blatantly disregard our attempts to keep them out and fill our inboxes with crap about lowering mortgage rates. An easy way to get a listing of those User-Agents that did access your fake directory (my examples are with grep and awk; win32 folks can check out the Cygwin tools), assuming a combined access_log format, is with the following command:

grep '/email-addresses' access_log | awk '{print $12}' | sort | uniq
This simply searches the access_log file for any occurrences of /email-addresses, prints the 12th column of its results (where $12 is the column of your access_log that contains the User-Agent string), then sorts and filters them down so only unique entries show. More on grep and awk can be found at the GNU software page.
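If you also want to see how often each offender hit the trap, a slight variation (a sketch, with the same assumptions about the log format) pulls the full quoted User-Agent string out with awk and counts the hits:

# split on double quotes so $6 is the whole User-Agent field, then count
grep '/email-addresses' access_log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn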
Now that we have their identities, we can put the mechanisms in place to keep these hell-spawns away from our email addresses.

Hook, line and sinker

Here are a few of the User-Agents that fell for our trap, pulled out of last month's access_log for lists.evolt.org:

Wget/1.6
EmailSiphon
EmailWolf 1.00

To learn more about these and other web spiders, check out http://www.robotstxt.org. Now that we have the names these Spiderts go by, there are a couple of ways to block them. You can use mod_rewrite as described here, but mod_rewrite can be difficult to configure and learn for many. It's also not compiled into Apache by default, which makes it slightly prohibitive.

We're going to use the environment variable features found in Apache to fight our battle, specifically the 'SetEnvIf' family of directives. This is a simple alternative to mod_rewrite, and almost everything needed is compiled into the webserver by default. In this example, we're editing the httpd.conf file, but you should be able to use it in an .htaccess file as well. The first lines we add to our config file are:
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
'SetEnvIfNoCase' simply sets an environment (SetEnv) variable called 'bad_bot' if (SetEnvIf) the 'User-Agent' string begins with Wget, EmailSiphon, or EmailWolf, regardless of case (SetEnvIfNoCase) - that's what the ^ at the start of each regular expression means. In English, anytime a browser whose name starts with 'wget', 'emailsiphon', or 'emailwolf' accesses our website, we set a variable called 'bad_bot'. We'd also want to add a line for the User-Agent string of any other Spidert we want to deny.
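As a side note, mod_setenvif also provides BrowserMatchNoCase, a shorthand that matches against the User-Agent string directly, so the same lines could likely be written this way instead:

# shorthand for SetEnvIfNoCase User-Agent
BrowserMatchNoCase "^Wget" bad_bot
BrowserMatchNoCase "^EmailSiphon" bad_bot
BrowserMatchNoCase "^EmailWolf" bad_bot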
Now we tell Apache which directories to block the Spiderts from with the <Directory> directive:

<Directory "/home/evolt/public_html/users/">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>

In English, we're denying access to the directory named in the <Directory> block if an environment variable called 'bad_bot' exists. Apache will return a standard 403 Forbidden error message, and the Spidert gets nothing! Since most of our members' email addresses are found in lists.evolt.org/archive, blocking that directory should suffice for us, but you'll probably want to adjust a couple of things to fit your needs.

There are many resources on the Web for discovering the User-Agent strings of Spiderts. The difficult part until now has been the process of actually blocking them from your server. Thankfully, Apache provides us with the ability to easily block those harbingers of SPAM from our servers and, most importantly, our online identities.
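To pull it all together, here's roughly what the same setup might look like collected in a single .htaccess file dropped into the directory you want to protect (a sketch, assuming AllowOverride permits the FileInfo and Limit overrides; the User-Agents are just the examples from above):

# flag the known Spiderts
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot

# let everyone in except anything flagged as bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot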