AI training bots from OpenAI, Anthropic, Amazon, and a dozen other firms are now hitting production web servers with the same aggression as a DDoS attack, and robots.txt isn't stopping them. This guide walks through how InMotion's systems team uses ModSecurity to enforce per-bot rate limiting at the server level, without cutting off your site's…
The Problem: AI Bots That Don't Follow the Rules
robots.txt has been the de facto agreement between websites and web crawlers for decades. A directive like Crawl-delay: 10 tells compliant bots to wait 10 seconds between requests. Google gives you a way to configure crawl rate through Google Search Console. Traditional search crawlers have operated within these boundaries long enough that most sysadmins never thought much about them.
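For reference, a minimal robots.txt using these conventions might look like the following sketch (the delay value and the /private/ path are illustrative, and note that Crawl-delay is a widely used convention rather than part of the original robots.txt standard):

```txt
# Ask all compliant crawlers to wait 10 seconds between requests
User-agent: *
Crawl-delay: 10

# Example of restricting a specific AI crawler from one section
User-agent: GPTBot
Disallow: /private/
```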
LLM training crawlers are a different story.
Starting in 2024, InMotion's systems administration teams began seeing a pattern of unusually heavy traffic across shared and dedicated infrastructure. The source wasn't a single bot running wild. It was multiple bots, each operated by a different AI company, simultaneously crawling the same servers with no delay between requests and no respect for Crawl-delay directives. None of them coordinated with each other. None of them needed to. The combined load of GPTBot, ClaudeBot, Amazonbot, and their peers hitting the same server at once produces resource exhaustion that looks functionally identical to an unintentional distributed denial-of-service attack.
That surprises a lot of website owners who assume robots.txt is binding. It isn't. It's a convention, and these bots aren't observing it.
Two Options, One Clear Tradeoff
The blunt instrument is a full block via .htaccess. You can deny access by User-Agent and the bots stop hitting your server entirely. Problem solved, except it isn't: your site also disappears from AI-driven discovery systems. For businesses that want to appear in AI-generated answers or LLM-powered search features, blocking training crawlers outright carries a real long-term cost.
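As a sketch of that blunt approach (the bot names shown are examples; extend the alternation to match your logs), a User-Agent block in .htaccess on Apache with mod_rewrite might look like:

```apache
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Match any request whose User-Agent contains one of these bot names
  # ([NC] makes the match case-insensitive)
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Amazonbot) [NC]
  # [F] returns 403 Forbidden; [L] stops further rule processing
  RewriteRule .* - [F,L]
</IfModule>
```

This stops the traffic completely, which is exactly the tradeoff described above: no load, but also no visibility.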
Rate limiting is the better path. You slow the bots down to a pace your server can absorb. They still index your content. You still maintain visibility. And when a bot refuses to respect the rate limit you've set, you block that specific request rather than the bot entirely.
How ModSecurity Rate Limiting Works
ModSecurity is an open-source Web Application Firewall that operates inside Apache or Nginx, inspecting HTTP traffic in real time. It's the same tool that blocks SQL injection attempts and cross-site scripting attacks on properly hardened servers. What makes it useful here is its ability to track request frequency by User-Agent and deny requests that exceed a defined threshold.
The process works in two steps:
- Identify the incoming request by User-Agent string and increment a per-host counter.
- If that counter exceeds the allowed limit before it expires, deny the request with a 429 Too Many Requests response and set a Retry-After header.
That Retry-After header matters. It explicitly tells the bot how long to wait before its next request. A well-behaved crawler will honor it. One that doesn't gets blocked on its next attempt.
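Concretely, a second request inside the limit window would receive a response along these lines (headers abbreviated; the body text comes from the ErrorDocument directive in the rules below):

```txt
HTTP/1.1 429 Too Many Requests
Retry-After: 3
Content-Type: text/html

Too Many Requests
```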
The ModSecurity Rules
Below are the rate-limiting rules InMotion Hosting's systems team developed and currently deploys. Each rule set targets a specific bot by User-Agent and enforces a maximum of one request per 3 seconds per hostname.
GPTBot (OpenAI)
# Limit GPTBot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm GPTBot" "id:13075,phase:2,nolog,pass,setuid:%{request_headers.host},setvar:user.ratelimit_gptbot=+1,expirevar:user.ratelimit_gptbot=3"
SecRule USER:RATELIMIT_GPTBOT "@gt 1" "chain,id:13076,phase:2,deny,status:429,setenv:RATELIMITED_GPTBOT,log,msg:'RATELIMITED GPTBOT'"
SecRule REQUEST_HEADERS:User-Agent "@pm GPTBot"
Header always set Retry-After "3" env=RATELIMITED_GPTBOT
ErrorDocument 429 "Too Many Requests"
ClaudeBot (Anthropic)
# Limit ClaudeBot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm ClaudeBot" "id:13077,phase:2,nolog,pass,setuid:%{request_headers.host},setvar:user.ratelimit_claudebot=+1,expirevar:user.ratelimit_claudebot=3"
SecRule USER:RATELIMIT_CLAUDEBOT "@gt 1" "chain,id:13078,phase:2,deny,status:429,setenv:RATELIMITED_CLAUDEBOT,log,msg:'RATELIMITED CLAUDEBOT'"
SecRule REQUEST_HEADERS:User-Agent "@pm ClaudeBot"
Header always set Retry-After "3" env=RATELIMITED_CLAUDEBOT
ErrorDocument 429 "Too Many Requests"
Amazonbot
# Limit Amazonbot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm Amazonbot" "id:13079,phase:2,nolog,pass,setuid:%{request_headers.host},setvar:user.ratelimit_amazonbot=+1,expirevar:user.ratelimit_amazonbot=3"
SecRule USER:RATELIMIT_AMAZONBOT "@gt 1" "chain,id:13080,phase:2,deny,status:429,setenv:RATELIMITED_AMAZONBOT,log,msg:'RATELIMITED AMAZONBOT'"
SecRule REQUEST_HEADERS:User-Agent "@pm Amazonbot"
Header always set Retry-After "3" env=RATELIMITED_AMAZONBOT
ErrorDocument 429 "Too Many Requests"
Adapting the Rules for Other Bots
The structure is the same for every bot. To add coverage for a new crawler, copy any rule set and make two changes:
- Replace the User-Agent string (e.g., GPTBot) with the new bot's identifier.
- Assign unique id values and unique env variable names to avoid conflicts with existing rules.
The id field must be unique across your entire ModSecurity configuration. If you're adding these to an existing ruleset, check which IDs are already in use before assigning new ones. Collisions cause rules to fail silently.
For reference, a growing list of known AI crawler User-Agent strings includes Bytespider, CCBot, Google-Extended, Meta-ExternalAgent, and PerplexityBot, among others. The Dark Visitors project maintains a fairly current catalogue of known AI agent identifiers.
What Happens After You Deploy
Once these rules are active, a bot that makes two requests to the same hostname within a 3-second window receives a 429 on the second request. The Retry-After: 3 header tells it to wait before trying again.
From there, behavior splits into two categories:
Bots that respect the header slow down automatically. They continue indexing your content at a pace your server can handle. Resources are conserved, and your site stays accessible to the crawlers worth caring about.
Bots that ignore the header keep hitting the deny rule on every subsequent request until their internal retry logic kicks in or they move on. Either way, they consume a fraction of the resources they would have without rate limiting in place.
You won't fix the underlying problem of AI companies deploying aggressive crawlers without consent. But you stop absorbing the cost of their indexing operations on your hardware.
Prerequisites and Where to Apply These Rules
These rules require ModSecurity to be installed and enabled on your server. On InMotion Hosting Dedicated Servers and VPS plans, ModSecurity is available through cPanel's WHM interface under Security Center > ModSecurity. The rules can be added as custom rules through WHM or directly in your server's ModSecurity configuration directory.
If you're on a managed dedicated server, InMotion Hosting's Advanced Product Support team can assist with custom ModSecurity rule deployment. Customers with Premier Care have access to InMotion Solutions for exactly this kind of custom server configuration work.
Shared hosting environments don't support custom ModSecurity rules at the account level. If aggressive bot traffic is a problem on shared hosting, the options are limited to .htaccess blocks or upgrading to a VPS or dedicated server where you have full WAF configurability.
A Note on robots.txt
None of this replaces a well-structured robots.txt file. Keeping crawl-delay directives in place for compliant bots remains worthwhile, and explicitly listing AI crawlers you want to restrict adds a documented signal of intent, even if some bots ignore it. The ModSecurity rules handle enforcement for the ones that won't self-regulate.
robots.txt for bots that respect conventions; ModSecurity rate limiting for the ones that don't. The two layers work together.
Summary
AI training crawlers don't observe robots.txt the way traditional search bots do, and the combined load from multiple simultaneous indexing operations can degrade server performance for legitimate traffic. ModSecurity's User-Agent-based rate limiting gives you server-side control over how frequently these bots can request resources, without requiring you to block them from indexing your site entirely.
The rules are simple to deploy, extend to any bot by copying the template, and provide explicit signaling via Retry-After headers for crawlers that are capable of honoring them.
If you're seeing unexplained spikes in server load or HTTP request volume that don't correlate with real user traffic, check your access logs for AI crawler User-Agents before assuming you're dealing with something more complicated.
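As a quick triage sketch, the following shell function counts access-log lines per known AI crawler. The log path in the usage comment is a common default and may differ on your server; the bot list is a starting point, not exhaustive.

```shell
# count_ai_bots LOGFILE — print per-bot request counts from a web access log,
# matching each known AI crawler name against the logged User-Agent strings.
count_ai_bots() {
  log="$1"
  for bot in GPTBot ClaudeBot Amazonbot Bytespider CCBot PerplexityBot; do
    # grep -c prints the number of matching lines; "|| true" keeps a
    # zero-match result (grep exit status 1) from aborting under set -e
    printf '%s: %s\n' "$bot" "$(grep -c "$bot" "$log" || true)"
  done
}

# Example (path is an assumption; adjust for your server):
# count_ai_bots /var/log/apache2/access.log
```

Unexpectedly large counts for any of these names are a strong hint that the load spike is crawler traffic rather than real users.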