A buddy of mine recently set up a website and he was stressing that Google had not yet paid him a visit. Yahoo Slurp and MSNbot were stopping by regularly, however. As we continued to talk, the conversation turned to spider activity and the fact that none of the popular log analysis programs provide much detail about what spiders actually do on a website.
A quick Google search turned up RobotStats, a very powerful SQL-based solution that works pretty well once you apply several patches to the PHP code. The catch is that the modifications are documented in the RobotStats forum, in French. The author promises a new version “Real Soon Now”. Of course, my friend’s $2.99 a month hosting package does not include a SQL database, so RobotStats was out of contention.
He does have PHP, though, and after about five minutes of searching I located some PHP code that promised to shine some light on his robot activity. Sorry, I don’t recall where I found the beginnings of this code, so I can’t give credit where credit is due. I have added some functionality, as the core script did not support Yahoo Slurp or MSNbot, both a distant second in terms of search engine importance but important nonetheless. In fact, Yahoo becomes a little more important each day now that it has its own search spiders. I currently derive about 10% of my search engine referrals from Yahoo. Others get as much as 25% of their search engine traffic from Yahoo.
Search engine referrals aside, I threw together a little PHP magic to address my buddy’s problem. You are welcome to the code too.
Bot Detector Plus Requirements
To utilize these scripts you need a few things:
- a website/web host that supports PHP
- CRON or some other scheduling utility
- web pages that the search engine bots actually visit
- the MAIL command on your host
Not really a lot is needed and the few items that are required are very common on hosted accounts. Of course, you will run into some pretty insurmountable problems if you undertake this little project on GeoCities or Tripod.
Bot Detector Plus
Cool name, huh? I needed something…
The bot detection code consists of three components. First, the core detection script should be added to all web pages you want to monitor for bot activity. You may paste the actual PHP script into each page, but I recommend that you use an include statement. All that talk about modularity finally sank in, I guess.
Here is the bot detector code. I recommend saving this code to a file on your site named bot-detector-plus-inc.php or something similar.
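If you want a starting point, here is a minimal sketch of what such a detector can look like. It is a sketch under assumptions, not the original listing: the user-agent substrings (Googlebot, Slurp, msnbot), the log file names (google-bot.txt and friends), the /path/to/your/logs directory, and the timestamp at the start of each log line are all placeholders chosen for illustration.

```php
<?php
// Sketch of bot-detector-plus-inc.php. Illustrative only, not the original script.

// Assumed user-agent substrings, one per spider. Add entries to track more bots.
$botSignatures = array(
    'google' => 'Googlebot',
    'yahoo'  => 'Slurp',
    'msn'    => 'msnbot',
);

// Return the key of the first signature found in the user agent, or null.
function detect_bot($userAgent, array $signatures)
{
    foreach ($signatures as $name => $needle) {
        if (stripos($userAgent, $needle) !== false) {
            return $name;
        }
    }
    return null;
}

// Build one log line: timestamp (an addition), requested page, requesting IP.
function format_log_line($page, $ip, $timestamp)
{
    return date('Y-m-d H:i:s', $timestamp) . "\t" . $page . "\t" . $ip . "\n";
}

// Hypothetical log directory; change it to a real path on your host.
$logDir = '/path/to/your/logs';

// Only act on real web requests that carry a user agent.
if (isset($_SERVER['HTTP_USER_AGENT'])) {
    $bot = detect_bot($_SERVER['HTTP_USER_AGENT'], $botSignatures);
    if ($bot !== null) {
        $logFile = $logDir . '/' . $bot . '-bot.txt';
        // The log must already exist and be writable by the web server.
        if (is_writable($logFile)) {
            $line = format_log_line($_SERVER['REQUEST_URI'], $_SERVER['REMOTE_ADDR'], time());
            file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
        }
    }
}
```

Matching on a substring of the user agent is the simplest approach that could work. Anyone can fake a user agent, so treat the logs as a rough guide rather than proof of a visit.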
Pretty straightforward so far? You will need to modify the script to write the text files to a useful location in your website’s file system. You will also need to include the file on each page you want to monitor for spider activity like so:
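A minimal sketch of that include, assuming the detector was saved as bot-detector-plus-inc.php in the same directory as the page (adjust the path if you keep it elsewhere):

```php
<?php
// Assumed location: same directory as the page doing the including.
$detector = __DIR__ . '/bot-detector-plus-inc.php';

// Pull in the detector if it is there; a missing file should not break the page.
if (is_file($detector)) {
    include_once $detector;
}
```

Put this at the very top of the page, before any HTML, so every request is checked.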
The second component is another PHP script that mails the bot reports to you whenever you fancy. Daily seems reasonable but your mileage may vary. This script should be added to a new, essentially hidden page on your site, since it emails the log contents to you each time the page is executed and then wipes the logs clean. You should use CRON or your operating system’s equivalent to execute this page.
Here is the email script. I suggest you create a new webpage using the name email-bots.php and add this script to it. No text is necessary beyond the text affirmation provided by the script.
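As a starting point, a minimal sketch of such a mailer might look like this. Again, it is a sketch and not the original listing: the log directory, the log file names, the address you@example.com, and the subject line are placeholders, and it relies on PHP’s built-in mail() function being configured on your host.

```php
<?php
// Sketch of email-bots.php. Illustrative only, not the original script.

// Concatenate the logs into one report body, one section per search engine.
function build_report(array $logFiles)
{
    $body = '';
    foreach ($logFiles as $label => $path) {
        $contents = is_readable($path) ? (string) file_get_contents($path) : '';
        $body .= "== " . $label . " ==\n";
        $body .= ($contents === '' ? "(no visits logged)\n" : $contents) . "\n";
    }
    return $body;
}

// Truncate each log to zero length so the next report starts fresh.
function wipe_logs(array $logFiles)
{
    foreach ($logFiles as $path) {
        if (is_writable($path)) {
            file_put_contents($path, '', LOCK_EX);
        }
    }
}

// Hypothetical settings; replace all of these with your own values.
$logDir   = '/path/to/your/logs';
$to       = 'you@example.com';
$subject  = 'Bot report for example.com';
$logFiles = array(
    'Google' => $logDir . '/google-bot.txt',
    'Yahoo'  => $logDir . '/yahoo-bot.txt',
    'MSN'    => $logDir . '/msn-bot.txt',
);

// Do nothing until the log directory has been set to a real location.
if (is_dir($logDir)) {
    if (mail($to, $subject, build_report($logFiles))) {
        wipe_logs($logFiles);
        echo "Bot report sent.\n";
    } else {
        echo "Bot report could not be sent; logs left in place.\n";
    }
}
```

The logs are only wiped after mail() reports success, so a mail hiccup does not cost you a day of data. A bot visit that lands between reading and wiping a log can still be lost; for a tool this primitive that seems an acceptable trade.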
Again, pretty straightforward. You will need to change the location in the script to wherever you are storing the text files. You will also need to put your email address in the space provided and possibly change the subject of the email to reflect your website name.
Next, add a new entry to your crontab along these lines:
3 0 * * * /usr/local/bin/php /the/location/of/your/email-bots.php
This entry tells cron to run email-bots.php through the PHP engine every morning at 12:03 a.m. The path is wherever you stored the PHP page containing the code above. PHP may not live at /usr/local/bin/php on your system; typing whereis php in your UNIX shell should tell you where it is.
Finally, there are the logs themselves. Three logs are maintained: one for each search engine. These files are simple text files with one bot page request per line, including the page that was requested and the IP address of the bot requesting the page. You should create these files yourself using TOUCH (or a text editor, saving the files empty) and CHMOD 777 them (or 666, which is enough to make them readable and writable by the world).
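If you have no shell access, the same setup can be sketched as a one-off PHP script. This is a minimal sketch under assumptions: the directory and the three file names are placeholders, and it uses mode 0666 rather than 0777, since a plain log file needs no execute bit.

```php
<?php
// One-off setup sketch: create empty log files and open up their permissions.

// Create each log file (if missing) and make it world readable and writable.
// Returns the paths that were set up successfully.
function create_log_files($dir, array $names, $mode = 0666)
{
    $created = array();
    foreach ($names as $name) {
        $path = $dir . '/' . $name;
        // touch() creates an empty file or just updates the timestamp.
        if (touch($path) && chmod($path, $mode)) {
            $created[] = $path;
        }
    }
    return $created;
}

// Hypothetical log directory and file names; change them to match your scripts.
$logDir = '/path/to/your/logs';
if (is_dir($logDir)) {
    create_log_files($logDir, array('google-bot.txt', 'yahoo-bot.txt', 'msn-bot.txt'));
}
```

Run it once, check that the files exist, then delete the script so nobody else can run it.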
That’s it. Not so bad, huh? Now you have a primitive robot logging tool that will report to you every day which search engine robots visited your site and which pages they spidered. The best part is that you can add additional search engines quickly and monitor to your heart’s content.
These instructions, while detailed, do assume a certain level of Unix administration capability and knowledge. If you don’t have it, get it. If you are lazy, find a friend to help you. Note: I am not your friend. I like you but not that much.
I have plenty to keep me busy… If you just can’t figure it out and you find yourself without a friend with the requisite knowledge, you may, as a last resort, email me. I will certainly entertain offers that begin with “where do I send the money”, but I am not in the habit of doing this sort of thing and prefer to keep it that way.
By all means email me if you improve the script. I will be happy to incorporate and share the fruits of your work with everyone.
