Split up AI filtering files (#592)

* Split up AI filtering files

  Create aggressive/moderate/permissive policies to allow administrators to choose their AI/LLM stance.

  Aggressive policy matches existing default in Anubis.

  Removes `Google-Extended` flag from `ai-robots-txt.yaml` as it doesn't exist in requests.

  Rename `ai-robots-txt.yaml` to `ai-catchall.yaml` as the file is no longer a copy of the source repo/file.

* chore: spelling
* chore: fix embeds
* chore: fix data includes
* chore: fix file name typo
* chore: Ignore READMEs in configs
* chore(lib/policy/config): go tool goimports -w

  Signed-off-by: Xe Iaso <me@xeiaso.net>

---------

Signed-off-by: Xe Iaso <me@xeiaso.net>
Co-authored-by: Xe Iaso <me@xeiaso.net>
2025-06-01 13:21:18 -07:00
commit de7dbfe6d6
parent 77e0bbbce9
19 changed files with 107 additions and 18 deletions
--- a/data/crawlers/ai-search.yaml
+++ b/data/crawlers/ai-search.yaml
@@ -0,0 +1,8 @@
+# User agents that index exclusively for search for AI systems.
+# Each entry should have a positive/ALLOW entry created as well, with further documentation.
+# Exceptions:
+#  - Claude-SearchBot: No published IP allowlist
+- name: "ai-crawlers-search"
+  user_agent_regex: >-
+    OAI-SearchBot|Claude-SearchBot
+  action: DENY
--- a/data/crawlers/ai-training.yaml
+++ b/data/crawlers/ai-training.yaml
@@ -0,0 +1,8 @@
+# User agents that crawl for training AI/LLM systems
+# Each entry should have a positive/ALLOW entry created as well, with further documentation.
+# Exceptions:
+#  - ClaudeBot: No published IP allowlist
+- name: "ai-crawlers-training"
+  user_agent_regex: >-
+    GPTBot|ClaudeBot
+  action: DENY
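
To illustrate how the two rules above split crawlers by purpose, here is a minimal Python sketch, not Anubis's actual implementation. It assumes each `user_agent_regex` is applied as an unanchored search against the request's `User-Agent` header and that the first matching rule wins. The `RULES` table and the `evaluate` helper are hypothetical names introduced only for this example, and the sample user-agent strings are illustrative.

```python
import re
from typing import Optional, Tuple

# Patterns copied verbatim from the two rule files in this commit.
# RULES and evaluate() are hypothetical names used only for this sketch.
RULES = [
    ("ai-crawlers-search", re.compile(r"OAI-SearchBot|Claude-SearchBot"), "DENY"),
    ("ai-crawlers-training", re.compile(r"GPTBot|ClaudeBot"), "DENY"),
]


def evaluate(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (rule name, action) for the first matching rule, else None.

    Assumption: the regex may match anywhere in the header (unanchored).
    """
    for name, pattern, action in RULES:
        if pattern.search(user_agent):
            return name, action
    return None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
        "Mozilla/5.0 (compatible; Claude-SearchBot/1.0)",
        "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
    ]
    for ua in samples:
        print(ua, "->", evaluate(ua))
```

Note that `ClaudeBot` does not match a `Claude-SearchBot` user agent because of the hyphen, so under these assumptions the search and training rules stay disjoint and a policy can include one file without the other.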