Everyone likes a good price comparison tool. When combined with Amazon or eBay, these tools allow you to get the best possible deals on that nice smart watch you’ve been wanting for a while. But do you know how these tools work? Or that the underlying technology can be turned against your website and used to attack you?

Most price comparison tools rely on web scraping, a technique for automatically collecting information from websites. Google and other search engine providers commonly use it to crawl websites and compile information.
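
To make the idea concrete, here is a minimal sketch of a scraper that pulls product names and prices out of a page. The sample HTML, the `name` and `price` class names, and the `scrape_prices` helper are all assumptions made up for illustration. A real tool would fetch live pages over HTTP and handle far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical product listing; a real scraper would download this over HTTP.
SAMPLE_PAGE = """
<ul>
  <li class="product"><span class="name">Smart watch</span><span class="price">$199.00</span></li>
  <li class="product"><span class="name">Fitness band</span><span class="price">$49.50</span></li>
</ul>
"""


class PriceScraper(HTMLParser):
    """Collects (name, price) pairs from spans with class 'name' or 'price'."""

    def __init__(self):
        super().__init__()
        self._field = None    # which field the next text node belongs to
        self._current = {}    # fields gathered for the product being read
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field is None:
            return  # text outside the spans we care about
        self._current[self._field] = data.strip()
        self._field = None
        if "name" in self._current and "price" in self._current:
            price = float(self._current["price"].lstrip("$"))
            self.products.append((self._current["name"], price))
            self._current = {}


def scrape_prices(html):
    scraper = PriceScraper()
    scraper.feed(html)
    return scraper.products
```

Calling `scrape_prices(SAMPLE_PAGE)` returns the two products as a list of name and price pairs, which is the raw material a comparison tool would then sort or compare across sites.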

While there are benign uses for the technique, other uses make web scraping a major threat to websites today. These range from data theft to targeted attacks. Abusive actions performed using web scraping include harvesting email addresses for spam lists, collecting competitive intelligence, plagiarizing and republishing content, publishing comparative pricing, and auction sniping. Any or all of these actions can have a negative impact on a business. For instance, a competitor may be crawling your website to gather competitive intelligence to use to their advantage. Or another website may be stealing your copyrighted content and publishing it on their own site for profit.

In some cases, web crawlers can be used to mount attacks on a website. By using an army of automated crawlers, a malicious actor can overload a web server and bring a site down. In one case identified in 2013, malicious actors tricked the GoogleBot into executing SQL injection attacks on a website!

Any public website that needs to place well in search rankings must allow search engine crawlers to index it. However, it should also be able to prevent bad actors from exploiting this need. The Barracuda Web Application Firewall (WAF) makes it easy for you to protect your website from web scraping. To tackle increasingly sophisticated web scrapers, our WAF includes multiple protection mechanisms against them.

Web scraping protection can be configured by navigating to the Websites > Web Scraping page. Here, you can configure policies to block bots and review or modify the list of allow-listed bots.

A web scraping policy is made up of multiple parts. Start by creating honeypots, which include hidden links embedded in responses and encrypted URLs in the ‘disallowed’ section of the robots.txt. When a malicious bot accesses the site, it will trip the honeypots either by following the hidden links or by navigating to the disallowed URLs in the robots.txt. These are actions that a good bot, such as the GoogleBot, will not perform.
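
The honeypot idea can be sketched in a few lines. This is only an illustration of the general technique, not how the Barracuda WAF implements it. The trap paths, the `inject_hidden_link` helper, and the `HoneypotTracker` class are hypothetical names, and the random-looking paths merely stand in for the encrypted URLs mentioned above.

```python
from dataclasses import dataclass, field

# Hypothetical trap paths: one hidden link and one robots.txt Disallow entry.
HIDDEN_LINK = "/promo-7f3a9c"
DISALLOWED_PATH = "/private-b41e/"

# robots.txt tells well-behaved crawlers to stay away from the trap path.
ROBOTS_TXT = f"User-agent: *\nDisallow: {DISALLOWED_PATH}\n"


def inject_hidden_link(html: str) -> str:
    """Insert a link that human visitors cannot see just before </body>."""
    trap = f'<a href="{HIDDEN_LINK}" style="display:none" rel="nofollow"></a>'
    return html.replace("</body>", trap + "</body>", 1)


@dataclass
class HoneypotTracker:
    """Remembers which clients have touched a trap path."""

    flagged: set[str] = field(default_factory=set)

    def observe(self, client_ip: str, path: str) -> bool:
        """Record a request; return True if the client is (now) flagged."""
        if path == HIDDEN_LINK or path.startswith(DISALLOWED_PATH):
            self.flagged.add(client_ip)
        return client_ip in self.flagged
```

A human never sees the hidden link, and a compliant crawler never requests the disallowed path, so any client for which `observe` returns `True` has behaved in a way that neither is expected to.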

Most bots are run via automated tools, often in headless browsers. To identify them, the Barracuda WAF sends a JavaScript challenge in the response. When the client fails to execute the challenge, it is marked as a bot. In addition, the Barracuda WAF can insert a crawl delay time into the robots.txt. Any bot that violates this crawl delay will be considered malicious and blocked.
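
As a rough illustration of the crawl delay check, the sketch below reads a `Crawl-delay` value from a robots.txt with Python's standard `urllib.robotparser` and flags any client whose consecutive requests arrive faster than that. The 10-second value, the `advertised_delay` helper, and the `CrawlDelayMonitor` class are assumptions for the example. How the Barracuda WAF tracks and scopes violations is not described here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a 10-second crawl delay for all user agents.
ROBOTS_TXT = "User-agent: *\nCrawl-delay: 10\n"


def advertised_delay(robots_txt: str, user_agent: str = "*") -> float:
    """Return the Crawl-delay that applies to user_agent, or 0 if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else 0.0


class CrawlDelayMonitor:
    """Flags clients whose consecutive requests are closer than the delay."""

    def __init__(self, delay_seconds: float):
        self.delay_seconds = delay_seconds
        self.last_seen: dict[str, float] = {}
        self.violators: set[str] = set()

    def observe(self, client_ip: str, timestamp: float) -> bool:
        """Record a request time in seconds; return True if the client is flagged."""
        previous = self.last_seen.get(client_ip)
        self.last_seen[client_ip] = timestamp
        if previous is not None and timestamp - previous < self.delay_seconds:
            self.violators.add(client_ip)
        return client_ip in self.violators
```

In practice a check like this would presumably be applied only to clients already identified as bots, since an ordinary browser fetches a page's images and scripts in quick succession.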

Search engine crawlers are valid bots that need to be allowed to index your site so that it appears in search engine results. They are allow-listed and validated using reverse DNS lookups on their IP addresses. This validation also helps identify impostors such as fake Googlebots.
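
A common way to perform this kind of validation is a reverse lookup followed by a confirming forward lookup. The sketch below shows that pattern with Python's standard `socket` module. The function names and the suffix list are assumptions for illustration and should be checked against each search engine's own documentation. It is not a description of the Barracuda WAF's internals.

```python
import socket

# Assumed hostname suffixes for genuine Googlebot addresses; verify against
# the search engine's published guidance before relying on them.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the primary hostname that reverse DNS reports for an IP."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list[str]:
    """Return the IPv4 addresses that a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(ip, suffixes=GOOGLEBOT_SUFFIXES,
                        reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """True only if reverse DNS lands in an allowed domain AND the
    forward lookup of that hostname maps back to the same IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(suffixes):
            return False  # reverse DNS points somewhere else entirely
        return ip in forward(hostname)  # guards against spoofed PTR records
    except OSError:
        return False  # lookup failed, so treat the client as unverified
```

The forward lookup matters because anyone who controls the reverse DNS for their own address range can make it claim a `googlebot.com` name, but they cannot make the real `googlebot.com` zone resolve back to their IP. The `reverse` and `forward` parameters exist only so the logic can be exercised with stub resolvers instead of live DNS.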

