lib: move config to yaml (#307)

* lib: move config to yaml Signed-off-by: Xe Iaso <me@xeiaso.net> * web: run go generate Signed-off-by: Xe Iaso <me@xeiaso.net> * Add Haiku to known instances (#304) Signed-off-by: Asmodeus <46908100+AsmodeumX@users.noreply.github.com> * Add headers bot rule (#300) * Closes #291: add headers support to bot policy rules * Fix config validator * update docs for JSON -> YAML Signed-off-by: Xe Iaso <me@xeiaso.net> * docs: document http header based actions Signed-off-by: Xe Iaso <me@xeiaso.net> * lib: add missing test Signed-off-by: Xe Iaso <me@xeiaso.net> * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Xe Iaso <me@xeiaso.net> --------- Signed-off-by: Xe Iaso <me@xeiaso.net> Signed-off-by: Asmodeus <46908100+AsmodeumX@users.noreply.github.com> Co-authored-by: Asmodeus <46908100+AsmodeumX@users.noreply.github.com> Co-authored-by: Neur0toxine <pashok9825@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-20 20:09:27 -04:00 · 2025-04-20 20:09:27 -04:00 · d40b5cfdab
commit d40b5cfdab
parent 022eb59ff3
22 changed files with 854 additions and 19 deletions
--- a/docs/docs/admin/policies.mdx
+++ b/docs/docs/admin/policies.mdx
@ -0,0 +1,241 @@
+---
+title: Policy Definitions
+---
+
+import Tabs from "@theme/Tabs";
+import TabItem from "@theme/TabItem";
+
+Out of the box, Anubis is pretty heavy-handed. It will aggressively challenge everything that might be a browser (usually indicated by having `Mozilla` in its user agent). However, some bots are smart enough to get past the challenge. Some things that look like bots may actually be fine (IE: RSS readers). Some resources need to be visible no matter what. Some resources and remotes are fine to begin with.
+
+Bot policies let you customize the rules that Anubis uses to allow, deny, or challenge incoming requests. Currently you can set policies by the following matches:
+
+- Request path
+- User agent string
+- HTTP request header values
+
+As of version v1.17.0 or later, configuration can be written in either JSON or YAML.
+
+Here's an example rule that denies [Amazonbot](https://developer.amazon.com/en/amazonbot):
+
+<Tabs>
+<TabItem value="json" label="JSON" default>
+
+```json
+{
+  "name": "amazonbot",
+  "user_agent_regex": "Amazonbot",
+  "action": "DENY"
+}
+```
+
+</TabItem>
+<TabItem value="yaml" label="YAML">
+
+```yaml
+- name: amazonbot
+  user_agent_regex: Amazonbot
+  action: DENY
+```
+
+</TabItem>
+</Tabs>
+
+When this rule is evaluated, Anubis will check the `User-Agent` string of the request. If it contains `Amazonbot`, Anubis will send an error page to the user saying that access is denied, but in such a way that makes scrapers think they have correctly loaded the webpage.
+
+Right now the only kinds of policies you can write are bot policies. Other forms of policies will be added in the future.
+
+Here is a minimal policy file that will protect against most scraper bots:
+
+<Tabs>
+<TabItem value="json" label="JSON" default>
+
+```json
+{
+  "bots": [
+    {
+      "name": "cloudflare-workers",
+      "headers_regex": {
+        "CF-Worker": ".*"
+      },
+      "action": "DENY"
+    },
+    {
+      "name": "well-known",
+      "path_regex": "^/.well-known/.*$",
+      "action": "ALLOW"
+    },
+    {
+      "name": "favicon",
+      "path_regex": "^/favicon.ico$",
+      "action": "ALLOW"
+    },
+    {
+      "name": "robots-txt",
+      "path_regex": "^/robots.txt$",
+      "action": "ALLOW"
+    },
+    {
+      "name": "generic-browser",
+      "user_agent_regex": "Mozilla",
+      "action": "CHALLENGE"
+    }
+  ]
+}
+```
+
+</TabItem>
+<TabItem value="yaml" label="YAML">
+
+```yaml
+bots:
+  - name: cloudflare-workers
+    headers_regex:
+      CF-Worker: .*
+    action: DENY
+  - name: well-known
+    path_regex: ^/.well-known/.*$
+    action: ALLOW
+  - name: favicon
+    path_regex: ^/favicon.ico$
+    action: ALLOW
+  - name: robots-txt
+    path_regex: ^/robots.txt$
+    action: ALLOW
+  - name: generic-browser
+    user_agent_regex: Mozilla
+    action: CHALLENGE
+```
+
+</TabItem>
+</Tabs>
+
+This allows requests to [`/.well-known`](https://en.wikipedia.org/wiki/Well-known_URI), `/favicon.ico`, `/robots.txt`, and challenges any request that has the word `Mozilla` in its User-Agent string. The [default policy file](https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.json) is a bit more cohesive, but this should be more than enough for most users.
+
+If no rules match the request, it is allowed through.
+
+## Writing your own rules
+
+There are three actions that can be returned from a rule:
+
+| Action      | Effects                                                                           |
+| :---------- | :-------------------------------------------------------------------------------- |
+| `ALLOW`     | Bypass all further checks and send the request to the backend.                    |
+| `DENY`      | Deny the request and send back an error message that scrapers think is a success. |
+| `CHALLENGE` | Show a challenge page and/or validate that clients have passed a challenge.       |
+
+Name your rules in lower case using kebab-case. Rule names will be exposed in Prometheus metrics.
+
+### Challenge configuration
+
+Rules can also have their own challenge settings. These are customized using the `"challenge"` key. For example, here is a rule that makes challenges artificially hard for connections with the substring "bot" in their user agent:
+
+<Tabs>
+<TabItem value="json" label="JSON" default>
+
+```json
+{
+  "name": "generic-bot-catchall",
+  "user_agent_regex": "(?i:bot|crawler)",
+  "action": "CHALLENGE",
+  "challenge": {
+    "difficulty": 16,
+    "report_as": 4,
+    "algorithm": "slow"
+  }
+}
+```
+
+</TabItem>
+<TabItem value="yaml" label="YAML">
+
+```yaml
+# Punish any bot with "bot" in the user-agent string
+- name: generic-bot-catchall
+  user_agent_regex: (?i:bot|crawler)
+  action: CHALLENGE
+  challenge:
+    difficulty: 16 # impossible
+    report_as: 4 # lie to the operator
+    algorithm: slow # intentionally waste CPU cycles and time
+```
+
+</TabItem>
+</Tabs>
+
+Challenges can be configured with these settings:
+
+| Key          | Example  | Description                                                                                                                                                                                    |
+| :----------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `difficulty` | `4`      | The challenge difficulty (number of leading zeros) for proof-of-work. See [Why does Anubis use Proof-of-Work?](/docs/design/why-proof-of-work) for more details.                               |
+| `report_as`  | `4`      | What difficulty the UI should report to the user. Useful for messing with industrial-scale scraping efforts.                                                                                   |
+| `algorithm`  | `"fast"` | The algorithm used on the client to run proof-of-work calculations. This must be set to `"fast"` or `"slow"`. See [Proof-of-Work Algorithm Selection](./algorithm-selection) for more details. |
+
+### Remote IP based filtering
+
+The `remote_addresses` field of a Bot rule allows you to set the IP range that this ruleset applies to.
+
+For example, you can allow a search engine to connect if and only if its IP address matches the ones they published:
+
+<Tabs>
+<TabItem value="json" label="JSON" default>
+
+```json
+{
+  "name": "qwantbot",
+  "user_agent_regex": "\\+https\\:\\/\\/help\\.qwant\\.com/bot/",
+  "action": "ALLOW",
+  "remote_addresses": ["91.242.162.0/24"]
+}
+```
+
+</TabItem>
+<TabItem value="yaml" label="YAML">
+
+```yaml
+- name: qwantbot
+  user_agent_regex: \+https\://help\.qwant\.com/bot/
+  action: ALLOW
+  # https://help.qwant.com/wp-content/uploads/sites/2/2025/01/qwantbot.json
+  remote_addresses: ["91.242.162.0/24"]
+```
+
+</TabItem>
+</Tabs>
+
+This also works at an IP range level without any other checks:
+
+<Tabs>
+<TabItem value="json" label="JSON" default>
+
+```json
+{
+  "name": "internal-network",
+  "action": "ALLOW",
+  "remote_addresses": ["100.64.0.0/10"]
+}
+```
+
+</TabItem>
+<TabItem value="yaml" label="YAML">
+
+```yaml
+name: internal-network
+action: ALLOW
+remote_addresses:
+  - 100.64.0.0/10
+```
+
+</TabItem>
+</Tabs>
+
+## Risk calculation for downstream services
+
+In case your service needs it for risk calculation reasons, Anubis exposes information about the rules that any requests match using a few headers:
+
+| Header            | Explanation                                          | Example          |
+| :---------------- | :--------------------------------------------------- | :--------------- |
+| `X-Anubis-Rule`   | The name of the rule that was matched                | `bot/lightpanda` |
+| `X-Anubis-Action` | The action that Anubis took in response to that rule | `CHALLENGE`      |
+| `X-Anubis-Status` | The status and how strict Anubis was in its checks   | `PASS-FULL`      |
+
+Policy rules are matched using [Go's standard library regular expressions package](https://pkg.go.dev/regexp). You can mess around with the syntax at [regex101.com](https://regex101.com), make sure to select the Golang option.