Cloudflare has added a new option in its settings to help its customers easily block AI bots, scrapers, and crawlers with a single click.
Companies building generative AI models use web scrapers to gather content for training those models. “Google reportedly paid $60 million a year to license Reddit’s user generated content, Scarlett Johansson alleged OpenAI used her voice for their new personal assistant without her consent, and most recently, Perplexity has been accused of impersonating legitimate visitors in order to scrape content from websites. The value of original content in bulk has never been higher,” Cloudflare wrote in a blog post.
According to Cloudflare, AI bots accessed around 39% of the top one million websites that use Cloudflare. The more popular a website is, the more likely it is to be targeted by those bots.
To prevent these types of bots from visiting websites, customers can toggle on the “AI Scrapers and Crawlers” option under Security > Bots in their Cloudflare dashboard. This option is available to all Cloudflare customers, even those on the free tier.
Cloudflare identifies which bots are scraping data from sites by analyzing traffic across its network and assigning a Bot Score, a measure of the likelihood that a traffic source is a bot. Its machine learning model can also detect bots that spoof their user agent to appear as a real browser and avoid being detected as a bot.
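To make the idea concrete, here is a minimal sketch of how a score-based check might catch a scraper that hides behind a browser user agent. It is not Cloudflare's implementation: the `Request` class, the `should_block` function, the threshold of 30, and the short crawler list are all hypothetical, and the sketch assumes a score from 1 to 99 in which lower values mean the traffic is more likely automated.

```python
from dataclasses import dataclass

# Hypothetical cutoff: scores below this are treated as likely automated.
LIKELY_BOT_THRESHOLD = 30

# Illustrative, incomplete list of user-agent tokens used by declared AI crawlers.
DECLARED_AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot")


@dataclass
class Request:
    user_agent: str
    bot_score: int  # assumed range 1..99, lower = more likely a bot


def should_block(req: Request) -> bool:
    """Block declared AI crawlers, plus traffic whose score suggests automation."""
    ua = req.user_agent.lower()
    if any(token.lower() in ua for token in DECLARED_AI_CRAWLERS):
        return True
    # A spoofed browser user agent passes the check above, so the decision
    # falls back on the score rather than on what the client claims to be.
    return req.bot_score < LIKELY_BOT_THRESHOLD
```

In this sketch, a request that claims to be Chrome but carries a score of 5 is blocked, while the same user agent with a score of 90 is allowed through.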
In addition, by using globally aggregated data, Cloudflare can detect new scraping tools and behaviors so that it can continue protecting customers from new bots. According to the company, it will keep improving its detection processes as scraping techniques evolve.
“We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection. We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on,” the company wrote.