lib: move config to yaml (#307)

* lib: move config to yaml

Signed-off-by: Xe Iaso <me@xeiaso.net>

* web: run go generate

Signed-off-by: Xe Iaso <me@xeiaso.net>

* Add Haiku to known instances (#304)

Signed-off-by: Asmodeus <46908100+AsmodeumX@users.noreply.github.com>

* Add headers bot rule (#300)

* Closes #291: add headers support to bot policy rules

* Fix config validator

* update docs for JSON -> YAML

Signed-off-by: Xe Iaso <me@xeiaso.net>

* docs: document http header based actions

Signed-off-by: Xe Iaso <me@xeiaso.net>

* lib: add missing test

Signed-off-by: Xe Iaso <me@xeiaso.net>

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xe Iaso <me@xeiaso.net>

---------

Signed-off-by: Xe Iaso <me@xeiaso.net>
Signed-off-by: Asmodeus <46908100+AsmodeumX@users.noreply.github.com>
Co-authored-by: Asmodeus <46908100+AsmodeumX@users.noreply.github.com>
Co-authored-by: Neur0toxine <pashok9825@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This commit is contained in:
Xe Iaso 2025-04-20 20:09:27 -04:00 committed by GitHub
parent 022eb59ff3
commit d40b5cfdab
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
22 changed files with 854 additions and 19 deletions

View file

@ -12,13 +12,13 @@ services:
METRICS_BIND: ":9090"
SERVE_ROBOTS_TXT: "true"
TARGET: "http://nginx"
POLICY_FNAME: "/data/cfg/botPolicy.json"
POLICY_FNAME: "/data/cfg/botPolicy.yaml"
OG_PASSTHROUGH: "true"
OG_EXPIRY_TIME: "24h"
ports:
- 8080:8080
volumes:
- "./botPolicy.json:/data/cfg/botPolicy.json:ro"
- "./botPolicy.yaml:/data/cfg/botPolicy.yaml:ro"
nginx:
image: nginx
volumes:

View file

@ -62,7 +62,7 @@ Anubis uses these environment variables for configuration:
| `METRICS_BIND_NETWORK` | `tcp` | The address family that the Anubis metrics server listens on. See `BIND_NETWORK` for more information. |
| `OG_EXPIRY_TIME` | `24h` | The expiration time for the Open Graph tag cache. |
| `OG_PASSTHROUGH` | `false` | If set to `true`, Anubis will enable Open Graph tag passthrough. |
| `POLICY_FNAME` | unset | The file containing [bot policy configuration](./policies.md). See the bot policy documentation for more details. If unset, the default bot policy configuration is used. |
| `POLICY_FNAME` | unset | The file containing [bot policy configuration](./policies.mdx). See the bot policy documentation for more details. If unset, the default bot policy configuration is used. |
| `SERVE_ROBOTS_TXT` | `false` | If set `true`, Anubis will serve a default `robots.txt` file that disallows all known AI scrapers by name and then additionally disallows every scraper. This is useful if facts and circumstances make it difficult to change the underlying service to serve such a `robots.txt` file. |
| `SOCKET_MODE` | `0770` | _Only used when at least one of the `*_BIND_NETWORK` variables are set to `unix`._ The socket mode (permissions) for Unix domain sockets. |
| `TARGET` | `http://localhost:3923` | The URL of the service that Anubis should forward valid requests to. Supports Unix domain sockets, set this to a URI like so: `unix:///path/to/socket.sock`. |

View file

@ -86,20 +86,20 @@ Once it's installed, make a copy of the default configuration file `/etc/anubis/
sudo cp /etc/anubis/default.env /etc/anubis/gitea.env
```
Copy the default bot policies file to `/etc/anubis/gitea.botPolicies.json`:
Copy the default bot policies file to `/etc/anubis/gitea.botPolicies.yaml`:
<Tabs>
<TabItem value="debrpm" label="Debian or Red Hat" default>
```text
sudo cp /usr/share/doc/anubis/botPolicies.json /etc/anubis/gitea.botPolicies.json
sudo cp /usr/share/doc/anubis/botPolicies.yaml /etc/anubis/gitea.botPolicies.yaml
```
</TabItem>
<TabItem value="tarball" label="Tarball">
```text
sudo cp ./doc/botPolicies.json /etc/anubis/gitea.botPolicies.json
sudo cp ./doc/botPolicies.yaml /etc/anubis/gitea.botPolicies.yaml
```
</TabItem>
@ -114,7 +114,7 @@ BIND_NETWORK=tcp
DIFFICULTY=4
METRICS_BIND=[::1]:8240
METRICS_BIND_NETWORK=tcp
POLICY_FNAME=/etc/anubis/gitea.botPolicies.json
POLICY_FNAME=/etc/anubis/gitea.botPolicies.yaml
TARGET=http://localhost:3000
```

View file

@ -2,15 +2,24 @@
title: Policy Definitions
---
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
Out of the box, Anubis is pretty heavy-handed. It will aggressively challenge everything that might be a browser (usually indicated by having `Mozilla` in its user agent). However, some bots are smart enough to get past the challenge. Some things that look like bots may actually be fine (IE: RSS readers). Some resources need to be visible no matter what. Some resources and remotes are fine to begin with.
Bot policies let you customize the rules that Anubis uses to allow, deny, or challenge incoming requests. Currently you can set policies by the following matches:
- Request path
- User agent string
- HTTP request header values
As of version v1.17.0 or later, configuration can be written in either JSON or YAML.
Here's an example rule that denies [Amazonbot](https://developer.amazon.com/en/amazonbot):
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"name": "amazonbot",
@ -19,15 +28,37 @@ Here's an example rule that denies [Amazonbot](https://developer.amazon.com/en/a
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
- name: amazonbot
user_agent_regex: Amazonbot
action: DENY
```
</TabItem>
</Tabs>
When this rule is evaluated, Anubis will check the `User-Agent` string of the request. If it contains `Amazonbot`, Anubis will send an error page to the user saying that access is denied, but in such a way that makes scrapers think they have correctly loaded the webpage.
Right now the only kinds of policies you can write are bot policies. Other forms of policies will be added in the future.
Here is a minimal policy file that will protect against most scraper bots:
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"bots": [
{
"name": "cloudflare-workers",
"headers_regex": {
"CF-Worker": ".*"
},
"action": "DENY"
},
{
"name": "well-known",
"path_regex": "^/.well-known/.*$",
@ -52,6 +83,32 @@ Here is a minimal policy file that will protect against most scraper bots:
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
bots:
- name: cloudflare-workers
headers_regex:
CF-Worker: .*
action: DENY
- name: well-known
path_regex: ^/.well-known/.*$
action: ALLOW
- name: favicon
path_regex: ^/favicon.ico$
action: ALLOW
- name: robots-txt
path_regex: ^/robots.txt$
action: ALLOW
- name: generic-browser
user_agent_regex: Mozilla
action: CHALLENGE
```
</TabItem>
</Tabs>
This allows requests to [`/.well-known`](https://en.wikipedia.org/wiki/Well-known_URI), `/favicon.ico`, `/robots.txt`, and challenges any request that has the word `Mozilla` in its User-Agent string. The [default policy file](https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.json) is a bit more cohesive, but this should be more than enough for most users.
If no rules match the request, it is allowed through.
@ -72,6 +129,9 @@ Name your rules in lower case using kebab-case. Rule names will be exposed in Pr
Rules can also have their own challenge settings. These are customized using the `"challenge"` key. For example, here is a rule that makes challenges artificially hard for connections with the substring "bot" in their user agent:
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"name": "generic-bot-catchall",
@ -85,6 +145,23 @@ Rules can also have their own challenge settings. These are customized using the
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
# Punish any bot with "bot" in the user-agent string
- name: generic-bot-catchall
user_agent_regex: (?i:bot|crawler)
action: CHALLENGE
challenge:
difficulty: 16 # impossible
report_as: 4 # lie to the operator
algorithm: slow # intentionally waste CPU cycles and time
```
</TabItem>
</Tabs>
Challenges can be configured with these settings:
| Key | Example | Description |
@ -99,6 +176,9 @@ The `remote_addresses` field of a Bot rule allows you to set the IP range that t
For example, you can allow a search engine to connect if and only if its IP address matches the ones they published:
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"name": "qwantbot",
@ -108,8 +188,25 @@ For example, you can allow a search engine to connect if and only if its IP addr
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
- name: qwantbot
user_agent_regex: \+https\://help\.qwant\.com/bot/
action: ALLOW
# https://help.qwant.com/wp-content/uploads/sites/2/2025/01/qwantbot.json
remote_addresses: ["91.242.162.0/24"]
```
</TabItem>
</Tabs>
This also works at an IP range level without any other checks:
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"name": "internal-network",
@ -118,6 +215,19 @@ This also works at an IP range level without any other checks:
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
name: internal-network
action: ALLOW
remote_addresses:
- 100.64.0.0/10
```
</TabItem>
</Tabs>
## Risk calculation for downstream services
In case your service needs it for risk calculation reasons, Anubis exposes information about the rules that any requests match using a few headers: