Split up AI filtering files (#592)

* Split up AI filtering files

Create aggressive/moderate/permissive policies to allow administrators to choose their AI/LLM stance.

Aggressive policy matches existing default in Anubis.

Removes `Google-Extended` flag from `ai-robots-txt.yaml` as it doesn't exist in requests.

Rename `ai-robots-txt.yaml` to `ai-catchall.yaml` as the file is no longer a copy of the source repo/file.

* chore: spelling
* chore: fix embeds
* chore: fix data includes
* chore: fix file name typo
* chore: Ignore READMEs in configs
* chore(lib/policy/config): go tool goimports -w

Signed-off-by: Xe Iaso <me@xeiaso.net>

---------

Signed-off-by: Xe Iaso <me@xeiaso.net>
Co-authored-by: Xe Iaso <me@xeiaso.net>
2025-06-01 13:21:18 -07:00
commit de7dbfe6d6
parent 77e0bbbce9
19 changed files with 107 additions and 18 deletions
--- a/data/bots/ai-catchall.yaml
+++ b/data/bots/ai-catchall.yaml
@@ -0,0 +1,11 @@
+# Extensive list of AI-affiliated agents based on https://github.com/ai-robots-txt/ai.robots.txt
+# Add new/undocumented agents here. Where documentation exists, consider moving to dedicated policy files.
+# Notes on various agents:
+#  - Amazonbot: Well documented, but they refuse to state which agent collects training data.
+#  - anthropic-ai/Claude-Web: Undocumented by Anthropic. Possibly deprecated or hallucinations?
+#  - Perplexity*: Well documented, but they refuse to state which agent collects training data.
+# Warning: May contain user agents that _must_ be blocked in robots.txt, or the opt-out will have no effect.
+- name: "ai-catchall"
+  user_agent_regex: >-
+    AI2Bot|Ai2Bot-Dolma|aiHitBot|Amazonbot|anthropic-ai|Brightbot 1.0|Bytespider|CCBot|Claude-Web|cohere-ai|cohere-training-data-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google-CloudVertexBot|GoogleOther|GoogleOther-Image|GoogleOther-Video|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo Bot|meta-externalagent|Meta-ExternalAgent|meta-externalfetcher|Meta-ExternalFetcher|NovaAct|omgili|omgilibot|Operator|PanguBot|Perplexity-User|PerplexityBot|PetalBot|QualifiedBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade indexer bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio-Extended|wpbot|YouBot
+  action: DENY
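
To illustrate how a rule like `ai-catchall` behaves, the sketch below assumes the `user_agent_regex` value is compiled as an ordinary unanchored Go regular expression and tested against the request's `User-Agent` header. The shortened pattern, the `isDenied` helper, and the sample header values are hypothetical and are not taken from the Anubis source.

```go
package main

import (
	"fmt"
	"regexp"
)

// catchall is an abbreviated, hypothetical subset of the ai-catchall
// alternation. The real rule lists many more agents.
var catchall = regexp.MustCompile(`Amazonbot|Bytespider|CCBot|PerplexityBot|Scrapy`)

// isDenied reports whether the User-Agent contains any listed token.
// The pattern is unanchored, so a token matches anywhere in the string.
func isDenied(userAgent string) bool {
	return catchall.MatchString(userAgent)
}

func main() {
	// Illustrative header values only.
	samples := []string{
		"Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)",
		"Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
	}
	for _, ua := range samples {
		fmt.Printf("%-5t %s\n", isDenied(ua), ua)
	}
}
```

Under that assumption, generic tokens in the full list such as `Operator` would match any header that merely contains the word, which is worth keeping in mind when adding new agents to the catch-all.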