This directory is dedicated to fighting referrer spam. It is specialized for an Apache server.

1. What is referrer spam?

If you run a site and keep statistics for it, you will probably have a "Top Referrers" chart in them. After some time, this chart may fill up with sites that you have never heard of, offering porn, drugs or herbal wonder-remedies, gambling, mortgages, etc. Of course, you will never find a link to your site on any of them. (Not to mention that their visitors will probably not be very interested in your site.)

So how come they refer to you so much? The answer lies in a little program, typically called a referrer spambot. It calls your site, requests anything (most often the base dir, as it is sure to be there), and names one of these "referrer" sites as the referrer that sent it to you. The server writes this request in its logs, your statistics program reads them, and voila! :-(

The referrer spambot will typically be installed on the home PC of an unsuspecting user (this is called "zombifying", and such a machine is a "zombie PC"). So you can easily track who is calling you, but that will often be some grandma who doesn't know what a "referrer" is.

Large numbers of such machines are typically commanded together from afar by their hidden owner. These are called botnets. In addition to referrer spamming, they are used for other nice activities - e-mail spamming, flooding websites to a halt (the so-called "distributed denial of service (DDoS) attack"), etc.

Is this your problem? Certainly. People may like to see who recommends you so hotly, and visit these sites. Thus, they feed the spammer, who is paid for increasing the visitors to these sites, and you start getting more and more spam - referrer, email, etc. In addition, these people will typically get infected by these sites with spambots, viruses and other malware (what else would you expect from people who advertise in this way?). And they may blame you for the malware.
Also, these requests eat your bandwidth. If you get one such request per second (many sites get hundreds and thousands of them per second), your server sends your web page in response each time. If the page is 20 KB large, this means 20 KB/sec wasted, out of the bandwidth you pay for. It is much better to reject these requests, and save your bandwidth (and money). I'm sure you can think of better things to waste them on. :-)

Moreover, if you "advertise" such places, Google, MSN, Yahoo and the other search engines will decrease your page rank - your site will sink further and further down in their search results. Some sites and providers may block you altogether. So you will have enough problems with referrer spam to care about it.

There is no perfect solution to this problem. Here is the best I could do. Make a better one, and I will use and recommend it.

2. How to use it.

First, take a look in the spamlists subdir. You will find there several text files with typical spam words, and a Perl script.

2.1. The spamword list files

The files list typical words and combinations found in the names of sites that are spam-referred. Ideally, this would be a list of the exact names of the sites. However, that is not practical, for two reasons:

- Every time a request is made to your server, Apache matches it against ALL "reject" descriptions. If you have a slower processor, this could be a strain, and a delay for the users. So fewer words that are common to more spammers ("porno", "shemale", "viagra", "supplement" etc) are preferred.

- Typical words are common to most such sites. For example, a lot of porn sites will have "porno", "pantyhose", etc. in their names. (A very few legitimate sites will have them, too - but if it is likely that such a site is your big referrer, and wants this to be shown, you will surely know it, and may edit the lists to work around.) By using such words, you block not only sites that have already shown up, but also ones that are appearing right now.
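
The effect of using a few common words instead of exact site names can be tried out with a small sketch. This is only an illustration in Python, not part of this directory: the word list is a hypothetical three-word sample taken from the examples above, and all the referrer addresses are made up.

```python
import re

# Hypothetical sample; the real lists live in the spamlists subdir.
SPAMWORDS = ["porno", "viagra", "supplement"]

# One combined, case-insensitive pattern built from all the words.
SPAM_RE = re.compile("|".join(SPAMWORDS), re.IGNORECASE)


def is_spam_referrer(referrer):
    """Return True if any spamword occurs anywhere in the referrer string."""
    return SPAM_RE.search(referrer) is not None


if __name__ == "__main__":
    # Made-up referrers: two spam sites that share a word, one ordinary site.
    for ref in [
        "http://cheap-viagra-shop.example/",
        "http://new-VIAGRA-deals.example/buy",
        "http://friends-blog.example/links.html",
    ]:
        print(ref, is_spam_referrer(ref))
```

A single word catches both spam sites here, including one that would not yet be on any list of exact names.
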
The words in the list files act as regular expressions. ("Huh?" This means that they may also be elaborate "masks", one of which matches a lot of things. As a payback for this, you must "escape" some characters - e.g. "." must be written as "\.", so "bigfuck.com" should be "bigfuck\.com". You can learn a lot more about regexes (regular expressions) by typing "man 7 regex" on some Unix, e.g. Linux or BSD.)

These files may contain blank lines and comments (lines that start with "#"). These are omitted when the working files are created from them.

2.2. The Perl script.

This script generates two result files from the spamword files: a single file with all spamwords, and an Apache config include file that describes which requests to reject.

The single file with all spamwords is good for use with programs that clean the spamming references from your Apache logs. One way to do this is the clean-log.sh script, which you can surely see here. For now, this spamword file is made so that clean-log.sh looks not exactly at the referrer field in the logs, but everywhere in the log line. The probability of mistakes is small, but it exists. Those who would like precise matching against the referrer field may edit the Perl script (there are instructions inside on what to comment and what to uncomment :-) ). Keep in mind, however, that this way the cleaning will be SLOW!

The Apache config file that tells the server to reject spamming requests must be included in the Apache configuration. It is best to add to the /etc/apache/httpd.conf file, near the bottom, a line saying something like this:

    Include /etc/apache/referrer-spammers.list

Then restart Apache, or tell it to reload its configuration - e.g. type as root:

    /etc/init.d/apache reload

Apache must have mod_rewrite enabled. (Most distros have it by default.) If it complains that the "RewriteEngine" keyword is unknown, this module is NOT enabled; read the Apache docs to learn how to enable it.
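
To give a rough idea of what such an include file may look like, here is a minimal sketch in Python. It is NOT the Perl script from this directory, and the real script's output may differ. It assumes that each spamword becomes one mod_rewrite condition on the referrer and that matching requests get a 403 Forbidden answer. The function names and the sample list (including "spam-site\.example") are hypothetical.

```python
def load_spamwords(text):
    """Return the patterns from a spamword list, skipping blank lines and '#' comments."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        words.append(line)
    return words


def make_rewrite_rules(words):
    """Build a mod_rewrite fragment that forbids requests whose referrer matches a word.

    Assumption: the patterns contain no spaces (a space would split the directive).
    """
    if not words:
        return ""
    lines = ["RewriteEngine On"]
    for i, word in enumerate(words):
        # Every condition except the last one is chained to the next with OR.
        flags = "[NC]" if i == len(words) - 1 else "[NC,OR]"
        lines.append("RewriteCond %{HTTP_REFERER} " + word + " " + flags)
    # "-" means no substitution; F answers with 403 Forbidden.
    lines.append("RewriteRule .* - [F]")
    return "\n".join(lines) + "\n"


# Hypothetical spamword file content, with a comment and a blank line.
SAMPLE_LIST = r"""# sample spamword list
viagra
supplement

spam-site\.example
"""

if __name__ == "__main__":
    print(make_rewrite_rules(load_spamwords(SAMPLE_LIST)), end="")
```

The NC flag makes the match case-insensitive. With a long word list, every request is checked against every condition, which is why shorter lists of common words are cheaper for the server.
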
If the Perl script does not work at all, check whether you have Perl installed. Nowadays, it is installed by default on almost every Unix system around.

The best tactics, from my experience:

2.2.1. Put a symlink to create_referrer_spamlists.pl in the /etc/cron.daily directory, or call it from within a script in this directory.

2.2.2. Wait a day, and check whether it has created the two result files. If yes, put the "Include" statement in your httpd.conf, and set up the clean_log.sh file.

2.3. The spammerlists

Another directory you will find here is called "spammerlists". In it, you will see several files that are lists of known referrer spammers (actually, of the sites that are spamvertised, that is, advertised by spam). The files are:

- sites-black.list - a list of all sites that were spamvertised.

- sites-white.list - spammers often advertise "good" sites, in order to make your fight against them harder (you can no longer trust that every link in a spam is "bad"). This is a list of "good" sites that have been spamvertised here.

- domains-black.list - spammers often present you with myriads of sites that are subdomains of the same domain; in such a case it is wise to block the entire domain. This file is a list of spammer domains.

- domains-white.list - like sites-white.list, but for domains.

- IPs-black.list - the spammer sites are myriad, but most often they are hosted on a small number of servers, run by spammers. This file lists the IPs of these servers. They are expected to contain nothing but spammers' stuff - it's safe to block them.

If you have problems, ask some experienced Apache / Linux / BSD sysadmin around.

If you have suggestions and improvements, make and distribute them proudly. (And note in this file what you have done.)

May the source be with you.

Grigor Gatchev