My hosting provider offers built-in support for two web log analysis packages: Analog and Webalizer. I’ve tried both, but neither gave me very good control over the data, so I decided to switch to AWStats, which seems to be fairly popular. At first the installation steps looked kind of painful, since they seem to assume you have full control over Apache (which I don’t on shared hosting). I finally found a short write-up from another Pair user that skips the AWStats install script entirely.
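For anyone attempting the same thing, the gist of the script-free approach is to unpack AWStats somewhere in your home directory, point a config file at the provider’s raw access log, and run the update from cron. A rough sketch of what I mean, where the directory layout and the config name (cantoni) are just placeholders for whatever your own setup uses:

```
# Update the AWStats data files from the raw Apache access log
# (assumes a config file awstats.cantoni.conf next to awstats.pl,
#  with its LogFile directive pointing at the provider's log)
perl ~/awstats/cgi-bin/awstats.pl -config=cantoni -update

# Optionally render static HTML reports instead of running the CGI,
# which is handy when you can't touch the Apache config yourself
perl ~/awstats/tools/awstats_buildstaticpages.pl -config=cantoni \
  -dir=$HOME/public_html/stats \
  -awstatsprog=$HOME/awstats/cgi-bin/awstats.pl
```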
Some background: My site has recently been bombarded with thousands of requests originating from China, some of which might be from the Baidu search engine, although I’ve read of other bots that impersonate Baidu, so it’s hard to tell. A typical request fetches something like “http://www.yd136.com/click.asp?id=857”, which is clearly not a page under cantoni.org.
The result: I knew my real data would be hard to find under all those bogus requests, but I was amazed to learn that roughly 90% of my page hits were bogus (for February 2006, my logs had a total of 1.1M entries, of which only about 100,000 were valid). I’m now blocking (sometimes large) IP ranges to cut off the sources of these bogus hits, and by setting the right filter criteria in AWStats I can exclude the bad requests that slip through and get at the real data.
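The filtering itself comes down to a couple of directives in the per-site awstats config file. Here’s a sketch of the kind of thing I mean; the IP prefixes are only illustrative placeholders, not the actual ranges I’m blocking:

```
# awstats.cantoni.conf (excerpt)

# Ignore hits coming from entire source networks
# (the prefixes below are placeholders, not my real block list)
SkipHosts="REGEX[^221\.] REGEX[^222\.]"

# Ignore proxy-style requests whose "URL" is an absolute http://
# address on someone else's domain instead of a local path
SkipFiles="REGEX[^http:]"
```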
The scoop: So what did I learn from the month of February?
- Publishing a screencast drives up the bandwidth in a hurry
- All the major search engine spiders are visiting my site each day
- Of visits attributed to a search engine, about 90% come from Google, 5% from Yahoo, and everything else is under 1% each
- My PDA links page is still the most popular, drawing 5 times more visits than the next page
- The most popular search term was “Chuck Norris Facts”!
Next step: Set up my hosting provider and AWStats to also track stats for some of the smaller sites I run. I also need to dig into AWStats some more to see what adjustments I can make, including adding support for detecting mobile device user agents.
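Tracking the extra sites should mostly be a copy-and-edit job, since AWStats keeps one config file per site and you just pass a different -config name for each. A sketch, with the site names as stand-ins for my actual domains:

```
# One config per site, e.g. awstats.smallsite-one.conf and
# awstats.smallsite-two.conf, each with its own SiteDomain and
# LogFile settings, then one cron line (or loop) per site
perl ~/awstats/cgi-bin/awstats.pl -config=smallsite-one -update
perl ~/awstats/cgi-bin/awstats.pl -config=smallsite-two -update
```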