Many of our customers complain about fake Googlebots crawling their websites. They want a way to stop the fake crawlers from scraping their sites while ensuring that the real Googlebot (or Bingbot, etc.) is not blocked. Not only do these bad guys consume bandwidth, they also indulge in hot linking, comment spam and other web application attacks.
Before we look at the alternatives, here is what a fake Googlebot might look like in your server (e.g. Apache) logs:
209.321.163.xx - - [24/Jun/2014:14:32:20 -0600] "GET / HTTP/1.1" 200 31375 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The way to tell it's a fake: the visitor has identified itself as Googlebot via the User Agent field (the last quoted field), but its IP address (the first field) does not belong to Google!
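If you want to look for such entries yourself, a short script can pull the client IP and User Agent out of each log line. The following is a minimal Python sketch, assuming the logs use the common Apache "combined" format shown above; the names `LOG_PATTERN` and `googlebot_claims` are made up for this illustration. It only lists visitors that claim to be Googlebot; deciding whether each IP really belongs to Google is a separate step.

```python
import re

# Assumed layout: Apache "combined" log format, as in the sample line above.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_claims(lines):
    """Yield (ip, user_agent) for entries whose User Agent mentions Googlebot."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and "googlebot" in match.group("agent").lower():
            yield match.group("ip"), match.group("agent")
```

You could feed it an open log file, for example `for ip, agent in googlebot_claims(open("access.log")): print(ip)`, where the file name is a placeholder for your own log path.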
Following are some ways you could go about blocking such fake bots:
Method # 1: Validating request headers
Fake Googlebots can be spotted by the absence of, or deviation from, the standard Googlebot headers. The standard Googlebot headers are outlined in this Google Webmaster Central Blog post. Here is what some of them look like:
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
So, for example, if the User Agent of a request is spoofed to be Googlebot, but the other headers (like the above) are absent, misspelled or incorrect, then you could create Allow Deny rules on the Barracuda WAF to block such requests.
Allow Deny rules can be created from the WEBSITES > Allow Deny page. You could use something like the following extended match for a DENY rule:
(Header User-Agent co googlebot) && (Header Accept-Encoding neq gzip,deflate) && (Header From neq googlebot(at)googlebot.com)
If you are unfamiliar with extended matches, the above reads as follows: if the User Agent contains Googlebot, but the Accept-Encoding header is not equal to gzip,deflate and the From header is not equal to googlebot(at)googlebot.com, then take some action (deny the request in this case).
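To make the logic of the rule concrete, here is a minimal Python sketch of the same check performed outside the WAF. It is only an illustration and not Barracuda WAF syntax: the function name `is_suspect_googlebot` and the two `EXPECTED_*` constants are assumptions for this example, with values copied from the rule above (including the obfuscated From address exactly as written there).

```python
# Values copied from the example extended match; treat them as assumptions.
EXPECTED_ACCEPT_ENCODING = "gzip,deflate"
EXPECTED_FROM = "googlebot(at)googlebot.com"


def is_suspect_googlebot(headers):
    """Return True when a request claims to be Googlebot but both of the
    other headers deviate, mirroring the && conditions of the rule."""
    # Header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    claims_googlebot = "googlebot" in normalised.get("user-agent", "").lower()
    bad_encoding = normalised.get("accept-encoding", "") != EXPECTED_ACCEPT_ENCODING
    bad_from = normalised.get("from", "") != EXPECTED_FROM
    return claims_googlebot and bad_encoding and bad_from
```

A missing header is treated the same as an incorrect one, which matches the idea of denying requests whose headers are absent or misspelled.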
Method # 2: Robots.txt and Disallow sections
If the fake Googlebot adheres closely to the real Googlebot headers, then another approach is needed. Fake bots normally do not honor the robots.txt file. They deliberately try to visit pages in the Disallow section.
Thus you could create a DENY rule on the Barracuda WAF for one (or all) of the Disallowed URLs of your site. Whenever a fake bot (User Agent contains Googlebot) tries to access these URLs, it is immediately denied. You can also configure the rule to block the bot for a subsequent period of time.
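The decision behind such a rule can be sketched in a few lines of Python using the standard library's robots.txt parser. This is a hedged illustration of the idea rather than how the WAF implements it: the `ROBOTS_TXT` contents, the paths in it and the helper names `build_parser` and `should_deny` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; substitute your site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /admin/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_deny(parser, user_agent, path):
    """Deny when the visitor claims to be Googlebot yet requests a path
    that robots.txt disallows for Googlebot."""
    claims_googlebot = "googlebot" in user_agent.lower()
    return claims_googlebot and not parser.can_fetch("Googlebot", path)
```

The real Googlebot honors the Disallow section, so a "Googlebot" that requests one of those paths is a strong signal of a fake.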
Method # 3: Embedding Honeytrap links
A clever extension to Method #2: you could inject a hidden link on your landing pages that is not visible to humans. If desired, you could also Disallow it in the robots.txt file. Hidden links can look like:
<a style="display:none;" href="hidden.html">...</a>
Anyone visiting that link (irrespective of the User Agent header) is likely a bad crawler and should be denied with a DENY rule.
This is a great little trick, but requires a small modification of the source pages. If this is feasible, it should be the first line of defense.
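The honeytrap idea, combined with blocking the client for a period of time, could look roughly like the following Python sketch. It assumes the hidden link resolves to `/hidden.html` and uses a one-hour block; the class name `HoneytrapBlocker` and its parameters are invented for this example and do not correspond to any WAF setting.

```python
import time

HONEYTRAP_PATH = "/hidden.html"  # assumed URL of the hidden link
BLOCK_SECONDS = 3600             # assumed block duration: 1 hour


class HoneytrapBlocker:
    """Remember clients that touched the honeytrap and refuse them for a while."""

    def __init__(self, block_seconds=BLOCK_SECONDS, clock=time.monotonic):
        self._block_seconds = block_seconds
        self._clock = clock
        self._blocked_until = {}  # client IP -> time at which the block expires

    def allow(self, client_ip, path):
        """Return False if the request should be denied."""
        now = self._clock()
        if path == HONEYTRAP_PATH:
            # Any visit to the hidden link starts (or restarts) the block.
            self._blocked_until[client_ip] = now + self._block_seconds
            return False
        expiry = self._blocked_until.get(client_ip)
        if expiry is not None:
            if now < expiry:
                return False
            del self._blocked_until[client_ip]  # block has expired
        return True
```

Note that the User Agent is not consulted at all: as described above, anyone following the hidden link is treated as a bad crawler.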
Method # 4: Robots.txt and Crawl-delay
Several major crawlers support a Crawl-delay parameter in robots.txt, set to the number of seconds to wait between successive requests to the same server. You could use this directive, then set up a brute force policy (WEBSITES > Advanced page) to detect bots that are coming in at a rate faster than this and block them out.
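As a rough illustration of the rate check, the Python sketch below flags clients whose successive requests arrive closer together than the advertised Crawl-delay. The class name `CrawlDelayMonitor`, the 10-second delay and the tolerance of three violations are assumptions chosen for the example, not values from the WAF's brute force policy.

```python
class CrawlDelayMonitor:
    """Count requests that arrive sooner than the advertised Crawl-delay."""

    def __init__(self, crawl_delay=10.0, max_violations=3):
        self.crawl_delay = crawl_delay        # seconds, as set in robots.txt
        self.max_violations = max_violations  # tolerated early requests
        self._last_seen = {}                  # client IP -> last request time
        self._violations = {}                 # client IP -> early request count

    def record(self, client_ip, timestamp):
        """Record a request; return True once the client should be blocked."""
        last = self._last_seen.get(client_ip)
        self._last_seen[client_ip] = timestamp
        if last is not None and timestamp - last < self.crawl_delay:
            self._violations[client_ip] = self._violations.get(client_ip, 0) + 1
        return self._violations.get(client_ip, 0) >= self.max_violations
```

Allowing a few violations before blocking leaves some room for retries and clock jitter from well-behaved crawlers.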
Method # 5: Ensuring IP address is from Google
If the fake bot adheres to the headers as well as the robots.txt rules and stays away from hidden links, then the last resort is to rely on IP addresses. The real Googlebot comes from an IP address belonging to Google, like the following:
So a visitor claiming to be Googlebot but coming from an IP address outside of this range should be denied.
This approach does require some monitoring of Google’s IP ranges that are used by their crawlers. They don’t change often, but may change in a year or two.
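The range check itself is simple to express with Python's `ipaddress` module, as in the sketch below. The network in `CRAWLER_NETWORKS` is a reserved documentation range used purely as a placeholder, not a real Google range; you would replace it with the ranges you maintain for Google's crawlers. The function name `is_fake_googlebot` is also just an assumption for this example.

```python
import ipaddress

# Placeholder only: 192.0.2.0/24 is a documentation range, NOT a Google range.
# Replace with the crawler ranges you monitor and keep up to date.
CRAWLER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_fake_googlebot(client_ip, user_agent, networks=CRAWLER_NETWORKS):
    """Return True when the User Agent claims Googlebot but the client IP
    is outside every trusted crawler network."""
    if "googlebot" not in user_agent.lower():
        return False  # makes no Googlebot claim, so this check does not apply
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return True  # malformed address combined with a Googlebot claim
    return not any(address in network for network in networks)
```

Keeping the ranges in one list makes it easier to update them when Google's crawler addresses change.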
Note that in all the cases above you could either deny only the offending request or block the offending client IP for a period of time, e.g. 1 hour, 1 day, etc. However, client IPs may belong to genuine users whose machines have been infected by botnets, so caution is advised before you block them for very long periods.
For more information on the Barracuda Web Application Firewall, visit the product page here.