Should you be wondering why @LWN #LWN is occasionally sluggish... since the new year, the DDoS onslaughts from AI-scraper bots have picked up considerably. Only a small fraction of our traffic is serving actual human readers at this point. At times, some bot decides to hit us from hundreds of IP addresses at once, clogging the works. They don't identify themselves as bots, and robots.txt is the only thing they *don't* read off the site.

This is beyond unsustainable. We are going to have to put time into deploying some sort of active defenses just to keep the site online. I think I'd even rather be writing about accounting systems than dealing with this crap. And it's not just us, of course; this behavior is going to wreck the net even more than it's already wrecked.

Happy new year :)

@corbet @LWN I'm wondering whether a link that a human wouldn't click on, but that an AI scraper wouldn't know any better than to follow, could be used in the nginx configuration to serve AI bots differently from humans, while keeping legitimate search crawlers out of the trap. What such a link looks like would differ from site to site. That would require thought from every site, but it would also create diversity, which would make it harder to guard against on the scraper side, so it could be more effective.
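A minimal sketch of the nginx side, using /the/honeytrap/url as a stand-in for whatever site-specific path gets picked: log anything that follows the hidden link to its own file and refuse the request.

# nginx: anything requesting the hidden URL gets logged and refused
location /the/honeytrap/url {
    access_log /var/log/nginx/honeytrap.log;
    return 403;
}

The 403 by itself doesn't stop anything, of course; the point is that the log becomes a list of addresses that provably ignored the rules, which fail2ban or a small script could feed into a deny list.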

I might be an outlier here for my feelings on whether training genai such as LLMs from publicly-posted information is OK. It felt weird decades ago when I was asked for permission to put content I posted to usenet onto a CD (why would I care whether the bits were carried to the final reader on a phone line someone paid for or a CD someone paid for?) so it's not inconsistent in my view that I would personally feel that it's OK to use what I post publicly to train genai. (I respect that others feel differently here.)

That said, I'm beyond livid at being the target of a DDoS, and other AI engines might end up being collateral damage as I try to protect my site for use by real people.

@corbet @LWN Also, the page behind the link that a human wouldn't click on should carry <meta name="robots" content="noindex">, and robots.txt should get a section:

User-agent: *
Disallow: /the/honeytrap/url

That way, all well-behaved robots that honor robots.txt, including search engines, would continue to work, and only the idiots who think they are above the rules would fall into it.
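And the hidden link itself could be as simple as this (hypothetical markup; each site would want its own variation so scrapers can't pattern-match it):

<a href="/the/honeytrap/url" style="display:none" rel="nofollow">do not follow this link</a>

No human ever sees it, a well-behaved crawler skips it because of robots.txt and the nofollow, and anything that fetches it anyway has identified itself.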

@mcdanlj @LWN What a lot of people are suggesting (Nepenthes and such) will work great against a single abusive robot. None of it will help much when tens of thousands of hosts are grabbing a few URLs each. Most of them will never step into the honeypot, and the ones that do will not be seen again regardless.

@corbet @LWN Oh no, you are right. It was such an enticing idea.

I want to be search-indexed. I care less about access from VPNs that just rent cloud IPs; much of my spam comes in that way anyway, and it's not clear that many real site users actually use them. If I could reliably distinguish those, I might add a lot more ASN blocks. 😢
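The blocks themselves would be trivial; building the list is the work. In nginx, a block per ASN is just the CIDR ranges that AS announces (the ranges below are RFC 5737 documentation addresses, placeholders for real data):

# in the server block: deny ranges announced by an abusive ASN (placeholder CIDRs)
deny 192.0.2.0/24;
deny 198.51.100.0/24;

whois or a routing registry dump can turn an AS number into the real list.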

@mcdanlj @corbet @LWN Even if you are OK with GenAI being trained on publicly available data -- those scrapers should abide by convention, and if robots.txt says "no" they should honor it. The fact that they do not speaks volumes about the ethics of the people behind a lot of these GenAI companies.

@jzb @corbet @LWN I'm saying I'm not reflexively in the "all AI is evil" camp, and I am still beyond incensed at this abuse. I completely agree that dishonoring robots.txt, not providing a meaningful user agent, and running a continuous DDoS is a sign that they are morally corrupt.

The difference between Anthropic and a script kiddie with a stolen credit card or a botnet is that the script kiddie will eventually get bored and go attack someone else, as far as I can tell.