feat: add robots2policy CLI to convert robots.txt to Anubis CEL (#657)

* feat: add robots2policy CLI utility to convert robots.txt to Anubis challenge policies
* feat: add documentation for robots2policy CLI tool
* feat: implement crawl delay handling as weight adjustment in Anubis rules
* feat: add various robots.txt and YAML configurations for user agent handling and crawl delays
* test: add comprehensive tests for robots2policy conversion and parsing
* fix: update example URL in usage instructions for robots2policy CLI
* Update metadata check-spelling run (pull_request) for json/robots2policycli
  Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>
  on-behalf-of: @check-spelling <check-spelling-bot@check-spelling.dev>
* docs: add crawl delay weight adjustment and deny user agents option to robots2policy CLI
* Update cmd/robots2policy/main.go
  Co-authored-by: Xe Iaso <me@xeiaso.net>
  Signed-off-by: Jason Cameron <jasoncameron.all@gmail.com>
* Update cmd/robots2policy/main.go
  Co-authored-by: Xe Iaso <me@xeiaso.net>
  Signed-off-by: Jason Cameron <jasoncameron.all@gmail.com>
* fix(robots2policy): use sigs.k8s.io/yaml
  Signed-off-by: Xe Iaso <me@xeiaso.net>
* feat(config): properly marshal bot policy rules
  Signed-off-by: Xe Iaso <me@xeiaso.net>
* chore(yeetfile): expose robots2policy in libexec
  Signed-off-by: Xe Iaso <me@xeiaso.net>
* fix(yeetfile): put robots2policy in $PATH
  Signed-off-by: Xe Iaso <me@xeiaso.net>
* Update metadata check-spelling run (pull_request) for json/robots2policycli
  Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>
  on-behalf-of: @check-spelling <check-spelling-bot@check-spelling.dev>
* style: reorder imports
* refactor: use preexisting structs in config
* fix: correct flag check in main function
* fix: reorder fields in AnubisRule struct for better alignment
* style: improve alignment of struct fields in AnubisRule and OGTagCache
* Update metadata check-spelling run (pull_request) for json/robots2policycli
  Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>
  on-behalf-of: @check-spelling <check-spelling-bot@check-spelling.dev>
* fix: add validation for generated Anubis rules from robots.txt
* feat: add batch processing for robots.txt files to generate Anubis CEL policies
* fix: improve usage message and error handling for input file requirement
* refactor: update AnubisRule structure to use ExpressionOrList for improved expression handling
* refactor: reorganize policy definitions in YAML files for consistency and clarity
* fix: correct indentation in blacklist and complex YAML files for consistency
* test: enhance output comparison in robots2policy tests for YAML and JSON formats
* Revert "fix: improve usage message and error handling for input file requirement"
  This reverts commit ddcde1f2a326545d3ef2ec32e5e03f55f4f931a8.
* fix: improve usage message and error handling in robots2policy
  Signed-off-by: Jason Cameron <git@jasoncameron.dev>

---------

Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>
Signed-off-by: Jason Cameron <jasoncameron.all@gmail.com>
Signed-off-by: Xe Iaso <me@xeiaso.net>
Signed-off-by: Jason Cameron <git@jasoncameron.dev>
Co-authored-by: Xe Iaso <me@xeiaso.net>
2025-06-14 23:41:00 -04:00
commit e0781e4560
parent 7a195f1595
28 changed files with 1302 additions and 27 deletions
--- a/docs/docs/admin/robots2policy.mdx
+++ b/docs/docs/admin/robots2policy.mdx
@@ -0,0 +1,84 @@
+---
+title: robots2policy CLI Tool
+sidebar_position: 50
+---
+
+The `robots2policy` tool converts robots.txt files into Anubis challenge policies. It reads robots.txt rules and generates equivalent CEL expressions for path matching and user-agent filtering.
+
+## Installation
+
+Install directly with Go:
+
+```bash
+go install github.com/TecharoHQ/anubis/cmd/robots2policy@latest
+```
+## Usage
+
+Basic conversion from a URL:
+
+```bash
+robots2policy -input https://www.example.com/robots.txt
+```
+
+Convert local file to YAML:
+
+```bash
+robots2policy -input robots.txt -output policy.yaml
+```
+
+Convert with a custom action and JSON output:
+
+```bash
+robots2policy -input robots.txt -action DENY -format json
+```
+
+## Options
+
+| Flag                  | Description                                                        | Default             |
+|-----------------------|--------------------------------------------------------------------|---------------------|
+| `-input`              | robots.txt file path or URL (use `-` for stdin)                    | *required*          |
+| `-output`             | Output file (use `-` for stdout)                                   | stdout              |
+| `-format`             | Output format: `yaml` or `json`                                    | `yaml`              |
+| `-action`             | Action for disallowed paths: `ALLOW`, `DENY`, `CHALLENGE`, `WEIGH` | `CHALLENGE`         |
+| `-name`               | Policy name prefix                                                 | `robots-txt-policy` |
+| `-crawl-delay-weight` | Weight adjustment for crawl-delay rules                            | `3`                 |
+| `-deny-user-agents`   | Action for blacklisted user agents                                 | `DENY`              |
+
+## Example
+
+Input robots.txt:
+```txt
+User-agent: *
+Disallow: /admin/
+Disallow: /private
+
+User-agent: BadBot
+Disallow: /
+```
+
+Generated policy:
+```yaml
+- name: robots-txt-policy-disallow-1
+  action: CHALLENGE
+  expression:
+    single: path.startsWith("/admin/")
+- name: robots-txt-policy-disallow-2
+  action: CHALLENGE
+  expression:
+    single: path.startsWith("/private")
+- name: robots-txt-policy-blacklist-3
+  action: DENY
+  expression:
+    single: userAgent.contains("BadBot")
+```
+
+## Using the Generated Policy
+
+Save the output and import it in your main policy file:
+
+```yaml
+import:
+  - path: "./robots-policy.yaml"
+```
+
+The tool automatically handles wildcard patterns, user-agent-specific rules, and blacklisted bots.
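
For readers of this diff, the mapping shown in the documentation example (each `Disallow` path becomes a `path.startsWith(...)` rule, and a named agent with `Disallow: /` becomes a `userAgent.contains(...)` rule) can be sketched in Go roughly as follows. This is a minimal illustration and not the code in `cmd/robots2policy/main.go`: the `rule` type and `convert` function are hypothetical names, wildcard patterns and crawl delays are not handled, and combining the two expressions with `&&` for a named agent with a partial path is an assumption rather than confirmed tool behavior.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// rule is a hypothetical, simplified stand-in for one generated policy entry.
type rule struct {
	Name, Action, Expression string
}

// convert scans robots.txt text line by line and emits one rule per
// non-empty Disallow directive, numbering rules in the order they appear.
func convert(robots, prefix, action, denyAction string) []rule {
	var rules []rule
	agent := "*" // agent of the group currently being read
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "#") {
			continue // skip comment lines
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue // blank or malformed line
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)
		switch key {
		case "user-agent":
			agent = value
		case "disallow":
			if value == "" {
				continue // an empty Disallow permits everything
			}
			n := len(rules) + 1
			pathExpr := "path.startsWith(" + strconv.Quote(value) + ")"
			uaExpr := "userAgent.contains(" + strconv.Quote(agent) + ")"
			switch {
			case agent == "*":
				// Applies to every client: match on the path alone.
				rules = append(rules, rule{fmt.Sprintf("%s-disallow-%d", prefix, n), action, pathExpr})
			case value == "/":
				// A named agent barred from the whole site: match on the agent alone.
				rules = append(rules, rule{fmt.Sprintf("%s-blacklist-%d", prefix, n), denyAction, uaExpr})
			default:
				// Assumption: a named agent with a partial path needs both conditions.
				rules = append(rules, rule{fmt.Sprintf("%s-disallow-%d", prefix, n), action, uaExpr + " && " + pathExpr})
			}
		}
	}
	return rules
}

func main() {
	const robots = "User-agent: *\nDisallow: /admin/\nDisallow: /private\n\nUser-agent: BadBot\nDisallow: /\n"
	for _, r := range convert(robots, "robots-txt-policy", "CHALLENGE", "DENY") {
		fmt.Printf("%s | %s | %s\n", r.Name, r.Action, r.Expression)
	}
}
```

Run against the robots.txt from the documentation example, the sketch yields three rules whose names, actions, and expressions line up with the generated policy shown in the diff.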