ianhenderson.org / 2025 april 3
Recently, some people have decided to operate automated web crawlers that ignore robots.txt and do not use reasonable rate limits, causing substantial resource usage on services that were not designed to handle the load. This page serves as a demonstration of a technique that could be used to protect services from these crawlers without requiring JavaScript or sending your data through third parties with dubious principles.
The key observation is that these crawlers send only GET requests. So, to block the crawler from seeing a costly-to-render page, you can render the page only in response to POSTs. Then, when linking to the page, instead of an <a href='...'> link, create a button that submits a form with the destination as its action. The links in this paragraph are POST links, by the way! Here's a normal GET link to the link-styling page if you want to see what the page looks like when it's requested via GET.
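To make the mechanics concrete, here is a minimal sketch using only Python's standard library. It isn't the code behind this page; the paths, handler class, and render_costly_page helper are hypothetical stand-ins. The idea is the same, though: the expensive page is rendered only for POST requests, and the "link" to it is a one-button form.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# A form that behaves like a link: clicking the button sends a POST to /costly.
POST_LINK = """
<form method="post" action="/costly" style="display:inline">
  <button type="submit">costly page</button>
</form>
"""

def render_costly_page():
    # Stand-in for whatever expensive rendering the real page does.
    return "<p>The expensive-to-render page.</p>"

class Handler(BaseHTTPRequestHandler):
    def _send(self, body, status=200):
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        if self.path == "/costly":
            # GET-only crawlers land here and never trigger the expensive path.
            self._send("<p>This page is only rendered in response to POST requests.</p>")
        else:
            self._send("<p>Home page.</p>" + POST_LINK)

    def do_POST(self):
        if self.path == "/costly":
            self._send(render_costly_page())
        else:
            self._send("<p>Not found.</p>", status=404)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

With a little CSS on the button (no border, no background, an underline, a pointer cursor), the form is visually indistinguishable from an ordinary link, which is presumably what the link-styling page mentioned above is about.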
Of course, this assumes that the crawlers won't start sending POST requests in the future. Sending those requests would be a pretty big escalation, however: going around the web submitting every form you see could do much more damage than just increasing web server load.
One final note: if it is possible to optimize your code or add capacity to handle the additional load, consider spending resources on that rather than on mitigation strategies like this. Making the service work faster will make it better for all users, while adding barriers will make it worse for everyone. And if you're one of the people operating these crawlers for AI training, I believe you have an obligation to release the weights from your trained models—it's difficult to justify scraping our data freely and then keeping the result private to make money for yourself.