Spam Hammer
I think I’m starting to get a little bit of a handle on referrer spam, although I’ve had to be pretty ruthless about what gets filtered. But since my “referrer” page is not published anymore, I consider anyone trying to hit it as a spammer. It’s not perfect, but it’s better, and my CPU usage is now down to acceptable levels. There were 21148 requests for the referrer page, of which all but 2057 were rejected. The problem is that these bastards keep buying new domain names to replace the ones that are blocked.
But along the way I’ve discovered that they’re also hitting my trackback script, to the tune of 1987 hits yesterday. This is troubling, as it appears to have increased since I began blocking referrers. Unfortunately, these hits contribute to server load because EE has to validate the “token” (I use randomized trackback URLs) and then filter the content. None of the attempts from yesterday were successful, though, due to the filtering. The problem with these is that there is nothing in the access.log to filter on: the request is an HTTP POST, so the log doesn’t show what they were trying to pass. So for now I’m blocking the worst offenders by IP. It’s not likely that any legitimate user will attempt to post more than 10 trackbacks from the same IP in one day.
The following bit of UNIX command-line hackery is what I use to determine the offenders. It reports the IP of each system that has submitted 10 or more trackback requests during the previous day.
grep trackback access.log.2005-10-13 | grep -v 403 | grep -v 503 | awk '{ print $1 }' | sort | uniq -c | awk '{ if (strtonum($1)>=10) print $1,$2; }'
Here’s an example of the output:
20 212.142.33.108
11 216.56.240.71
56 217.219.39.3
108 219.144.196.226
12 219.93.174.101
21 219.93.174.102
12 219.93.174.105
13 219.93.174.109
26 63.144.59.210
59 63.144.59.211
14 64.89.16.7
10 67.50.44.156
10 82.110.130.58
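One caveat with the one-liner above: grep -v 403 throws away any line that contains "403" anywhere, including in the byte count or the IP, not just requests that actually got a 403 status. A somewhat stricter variation is sketched below. It assumes the log is in the standard combined format (client IP in field 1, request path in field 7, status in field 9), and the function name trackback_offenders and its threshold argument are made up for illustration. It also avoids strtonum(), which is a gawk extension.

```shell
# Sketch: count trackback requests per client IP, skipping rejected ones.
# Assumes combined log format: $1 = client IP, $7 = request path, $9 = status.
# trackback_offenders is a hypothetical helper name, not an existing tool.
trackback_offenders() {
    # $1 = log file, $2 = minimum number of hits to report (default 10)
    awk -v min="${2:-10}" '
        # Only look at the request path and the status field,
        # instead of matching "403" or "503" anywhere in the line.
        $7 ~ /trackback/ && $9 != 403 && $9 != 503 { hits[$1]++ }
        END {
            for (ip in hits)
                if (hits[ip] >= min)
                    print hits[ip], ip
        }
    ' "$1" | sort -k1,1nr    # worst offenders first
}
```

Something like trackback_offenders access.log.2005-10-13 10 should then print the same kind of "count IP" list, sorted with the busiest IPs at the top.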
Finding and printing the referrer spammers who leaked through the filters is a little more challenging, since some of them use a full HTML <a> tag in their referrer and some don’t. I suspect that there is some handy-dandy regular expression that would make this simpler, but I’m not a regex guru. It’s also interesting that some of them (for some reason) are using my own domain in the referrer. I suspect this is a simplistic attempt to get me to blacklist myself, but I’m not sure. Given all that, here’s an example of what I use to identify the worst referrer offenders for the previous day.
grep referrer access.log.2005-10-13 | grep -v 403 | grep -v 503 | grep -v aubreyturner | awk '{ if ($11=="\"<a"){ t=substr($12,6); print substr(t,0,index(t,">")-1)} else print substr($11,2,length($11)-2);}' | sort | uniq -c | awk '{ if(strtonum($1)>=10) print $1,$2; }'
And an example of the output:
215 -
88 http://agrino.org/uichsa/wwwboard/567.html
86 http://agrino.org/uichsa/wwwboard/568.html
86 http://agrino.org/uichsa/wwwboard/569.html
85 http://agrino.org/uichsa/wwwboard/570.html
84 http://agrino.org/uichsa/wwwboard/644.html
48 http://generic-######.splinder.com
204 http://#############.50webs.com
32 http://tinman.cs.gsu.edu/~cscjghx/csc3360/wwwboard/messages/86.html
32 http://www.horrorseek.com/horror/dreadful/wwwboard/34.html
As you can see, a lot of the requests have a blank or “-” referrer. Those are particularly troublesome in that they’re hard to block (except by IP, but that’s a losing game). I’m not sure what they intend to gain from hitting the referrer URL without any referrer. All it ends up doing is sending them a nearly-blank page (about 100 bytes of almost static content).
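On the regex question: one possible shape for that handy-dandy regular expression is sketched below. It works on referrer values that have already been pulled out of the log, one per line, and it only handles the two cases described above (a bare URL, or an <a href=...> tag with or without quotes around the URL). The function name normalize_referrer is made up for illustration, and anything the spammers do beyond those two cases is not covered.

```shell
# Sketch: reduce a referrer value to a bare URL.
# Input: one referrer value per line on stdin (already extracted from the log).
# normalize_referrer is a hypothetical helper name.
normalize_referrer() {
    # If the value starts with "<a href=", keep only the href target:
    #   [\\"]*               skips an optional quote or backslash-escaped quote
    #   ([^\\">[:space:]]+)  captures up to the closing quote, ">" or whitespace
    # Values that do not start with "<a" pass through unchanged.
    sed -E 's/^<a[[:space:]]+href=[\\"]*([^\\">[:space:]]+).*$/\1/'
}
```

The output can be fed to sort | uniq -c as before. Because both forms end up as plain URLs, the tagged and untagged hits for the same site get counted together.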
One of these days I guess I’ll glue the above commands together into a nightly job that sends me a report in email. Unless these idiots magically disappear before I get tired of doing this manually…
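For what it’s worth, the gluing part could be fairly small. The sketch below is only the mail step: it reads a report on stdin and sends it only if there is something to report. The function name mail_if_nonempty and the MAILER variable are invented for this example, and it assumes a mailx-style mail command that accepts -s for the subject.

```shell
# Sketch of the mail step for a nightly report.
# mail_if_nonempty and MAILER are hypothetical names; "mail -s SUBJECT ADDRESS"
# is the usual mailx-style interface and is assumed to be available.
mail_if_nonempty() {
    subject=$1
    recipient=$2
    body=$(cat)                     # read the whole report from stdin
    [ -n "$body" ] || return 0      # nothing over the threshold: stay quiet
    printf '%s\n' "$body" | "${MAILER:-mail}" -s "$subject" "$recipient"
}
```

A script run from cron shortly after midnight could pipe either of the pipelines above into it, for example ... | mail_if_nonempty "Trackback offenders" me@example.com (the address is a placeholder). Working out yesterday’s log file name is left out here because it depends on the date command available on the server.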
(Updated to try out word censoring for ###### and a couple of other words…)