@corbet @LWN I'm wondering whether a link that a human wouldn't click on, but an AI crawler wouldn't know any better than to follow, could be used in an nginx configuration to serve AI robots differently from humans, while excluding search crawlers from that treatment. What such a link would look like would differ from site to site. That would require thought from every site, but it would also create diversity, which would make it harder to guard against on the scraper side, so it could possibly be more effective.
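For concreteness, here is a minimal sketch of what that could look like in nginx. It assumes an external log watcher (fail2ban or a small script) keeps the per-IP list, since stock nginx doesn't remember across requests which clients fetched a given URL; /the/honeytrap/url and the file paths are placeholders, not anything from a real site. Keeping well-behaved search crawlers out of the trap is what the robots.txt piece in the follow-up below is for.

# http context: flag IPs that previously fetched the trap.
# /etc/nginx/honeytrap-ips.conf holds lines like "203.0.113.7 1;",
# appended by the log watcher and picked up on "nginx -s reload".
geo $trapped {
    default 0;
    include /etc/nginx/honeytrap-ips.conf;
}

server {
    # The trap itself: a link humans never see, e.g.
    # <a href="/the/honeytrap/url" style="display:none"></a>
    # Hits go to a dedicated log for the watcher to consume.
    location = /the/honeytrap/url {
        access_log /var/log/nginx/honeytrap.log;
        return 404;
    }

    location / {
        # Serve flagged clients differently: an error page, a
        # tarpit, or a proxy_pass to a decoy backend would all fit.
        if ($trapped) {
            return 403;
        }
        # ... normal site configuration ...
    }
}

Since the honeytrap log contains nothing but trap hits, the watcher's rule can be as simple as "take the first field of every new line and append it to the include file."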
I might be an outlier here in my feelings about whether training genai such as LLMs on publicly posted information is OK. It felt weird decades ago when I was asked for permission to put content I had posted to Usenet onto a CD (why would I care whether the bits were carried to the final reader on a phone line someone paid for or a CD someone paid for?), so it's not inconsistent, in my view, that I personally feel it's OK to use what I post publicly to train genai. (I respect that others feel differently here.)
That said, I'm beyond livid at being the target of a DDoS, and other AI engines might end up being collateral damage as I try to protect my site for use by real people.
@corbet @LWN Also, the page behind the link that a human wouldn't click on should carry <meta name="robots" content="noindex">
and there should be a robots.txt section:
User-agent: *
Disallow: /the/honeytrap/url
That way, all well-behaved robots that honor robots.txt, including search engines, would continue to work, and only the idiots who think they are above the rules would fall into the trap.
@corbet @LWN Oh no, you are right. It was such an enticing idea.
I want to be search-indexed. I care less about access from VPNs that just rent cloud IPs; much of my spam arrives that way anyway, and it's not clear that many real site users connect that way. If I can distinguish those, I might add a lot more ASN blocks.
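For the mechanics, a hedged sketch of what ASN blocking can look like in nginx, assuming the third-party ngx_http_geoip2_module and MaxMind's free GeoLite2-ASN database are available; the variable names, the database path, and the two ASNs are illustrative examples, not a recommended block list.

# http context: look up the client's ASN on each request.
geoip2 /etc/nginx/GeoLite2-ASN.mmdb {
    $geoip2_asn autonomous_system_number;
    $geoip2_asn_org autonomous_system_organization;
}

# Flag ASNs of cloud providers whose IPs are cheap to rent.
map $geoip2_asn $deny_cloud {
    default 0;
    16509   1;   # example: Amazon
    14061   1;   # example: DigitalOcean
}

server {
    if ($deny_cloud) {
        return 403;   # or route to a challenge page instead
    }
    # ...
}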
@jzb @corbet @LWN I'm saying that I'm not reflexively "all AI is evil", and I'm still beyond incensed at this abuse. I completely agree that dishonoring robots.txt, not providing a meaningful user agent, and running a continuous DDoS are signs that they are morally corrupt.
The difference between Anthropic and a script kiddie with a stolen credit card or a botnet is that the script kiddie will eventually get bored and go attack someone else, as far as I can tell.