docs: remove JSON examples from policy file docs (#945)

* docs: remove JSON examples from policy file docs

Signed-off-by: Xe Iaso <me@xeiaso.net>

* fix(lib): remove mentions of botPolicies.json in the tests

Signed-off-by: Xe Iaso <me@xeiaso.net>

* docs: update link to challenge methods

Signed-off-by: Xe Iaso <me@xeiaso.net>

* docs: unbreak links to the challenges category

Signed-off-by: Xe Iaso <me@xeiaso.net>

---------

Signed-off-by: Xe Iaso <me@xeiaso.net>
This commit is contained in:
Xe Iaso 2025-08-03 14:09:26 -04:00 committed by GitHub
parent 2d8e942377
commit 7c80c23e90
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
16 changed files with 86 additions and 301 deletions

View file

@ -7,6 +7,10 @@ import TabItem from "@theme/TabItem";
Out of the box, Anubis is pretty heavy-handed. It will aggressively challenge everything that might be a browser (usually indicated by having `Mozilla` in its user agent). However, some bots are smart enough to get past the challenge. Some things that look like bots may actually be fine (IE: RSS readers). Some resources need to be visible no matter what. Some resources and remotes are fine to begin with.
Anubis lets you customize its configuration with a Policy File. This is a YAML document that spells out what actions Anubis should take when evaluating requests. The [default configuration](https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.yaml) explains everything, but this page contains an overview of everything you can do with it.
## Bot Policies
Bot policies let you customize the rules that Anubis uses to allow, deny, or challenge incoming requests. Currently you can set policies by the following matches:
- Request path
@ -18,75 +22,18 @@ As of version v1.17.0 or later, configuration can be written in either JSON or Y
Here's an example rule that denies [Amazonbot](https://developer.amazon.com/en/amazonbot):
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"name": "amazonbot",
"user_agent_regex": "Amazonbot",
"action": "DENY"
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
- name: amazonbot
user_agent_regex: Amazonbot
action: DENY
```
</TabItem>
</Tabs>
When this rule is evaluated, Anubis will check the `User-Agent` string of the request. If it contains `Amazonbot`, Anubis will send an error page to the user saying that access is denied, but in such a way that makes scrapers think they have correctly loaded the webpage.
Right now the only kinds of policies you can write are bot policies. Other forms of policies will be added in the future.
Here is a minimal policy file that will protect against most scraper bots:
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"bots": [
{
"name": "cloudflare-workers",
"headers_regex": {
"CF-Worker": ".*"
},
"action": "DENY"
},
{
"name": "well-known",
"path_regex": "^/.well-known/.*$",
"action": "ALLOW"
},
{
"name": "favicon",
"path_regex": "^/favicon.ico$",
"action": "ALLOW"
},
{
"name": "robots-txt",
"path_regex": "^/robots.txt$",
"action": "ALLOW"
},
{
"name": "generic-browser",
"user_agent_regex": "Mozilla",
"action": "CHALLENGE"
}
]
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
bots:
- name: cloudflare-workers
@ -107,22 +54,20 @@ bots:
action: CHALLENGE
```
</TabItem>
</Tabs>
This allows requests to [`/.well-known`](https://en.wikipedia.org/wiki/Well-known_URI), `/favicon.ico`, `/robots.txt`, and challenges any request that has the word `Mozilla` in its User-Agent string. The [default policy file](https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.json) is a bit more cohesive, but this should be more than enough for most users.
If no rules match the request, it is allowed through. For more details on this default behavior and its implications, see [Default allow behavior](./default-allow-behavior.mdx).
## Writing your own rules
### Writing your own rules
There are three actions that can be returned from a rule:
There are four actions that can be returned from a rule:
| Action | Effects |
| :---------- | :-------------------------------------------------------------------------------- |
| `ALLOW` | Bypass all further checks and send the request to the backend. |
| `DENY` | Deny the request and send back an error message that scrapers think is a success. |
| `CHALLENGE` | Show a challenge page and/or validate that clients have passed a challenge. |
| Action | Effects |
| :---------- | :---------------------------------------------------------------------------------------------------------------------------------- |
| `ALLOW` | Bypass all further checks and send the request to the backend. |
| `DENY` | Deny the request and send back an error message that scrapers think is a success. |
| `CHALLENGE` | Show a challenge page and/or validate that clients have passed a challenge. |
| `WEIGH` | Change the [request weight](#request-weight) for this request. See the [request weight](#request-weight) docs for more information. |
Name your rules in lower case using kebab-case. Rule names will be exposed in Prometheus metrics.
@ -130,27 +75,6 @@ Name your rules in lower case using kebab-case. Rule names will be exposed in Pr
Rules can also have their own challenge settings. These are customized using the `"challenge"` key. For example, here is a rule that makes challenges artificially hard for connections with the substring "bot" in their user agent:
<Tabs>
<TabItem value="json" label="JSON" default>
This rule has been known to have a high false positive rate in testing. Please use this with care.
```json
{
"name": "generic-bot-catchall",
"user_agent_regex": "(?i:bot|crawler)",
"action": "CHALLENGE",
"challenge": {
"difficulty": 16,
"report_as": 4,
"algorithm": "slow"
}
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
This rule has been known to have a high false positive rate in testing. Please use this with care.
```yaml
@ -164,16 +88,13 @@ This rule has been known to have a high false positive rate in testing. Please u
algorithm: slow # intentionally waste CPU cycles and time
```
</TabItem>
</Tabs>
Challenges can be configured with these settings:
| Key | Example | Description |
| :----------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `difficulty` | `4` | The challenge difficulty (number of leading zeros) for proof-of-work. See [Why does Anubis use Proof-of-Work?](/docs/design/why-proof-of-work) for more details. |
| `report_as` | `4` | What difficulty the UI should report to the user. Useful for messing with industrial-scale scraping efforts. |
| `algorithm` | `"fast"` | The algorithm used on the client to run proof-of-work calculations. This must be set to `"fast"` or `"slow"`. See [Proof-of-Work Algorithm Selection](./algorithm-selection) for more details. |
| Key | Example | Description |
| :----------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `difficulty` | `4` | The challenge difficulty (number of leading zeros) for proof-of-work. See [Why does Anubis use Proof-of-Work?](/docs/design/why-proof-of-work) for more details. |
| `report_as` | `4` | What difficulty the UI should report to the user. Useful for messing with industrial-scale scraping efforts. |
| `algorithm` | `"fast"` | The challenge method to use. See [the list of challenge methods](./configuration/challenges/) for more information. |
### Remote IP based filtering
@ -181,21 +102,6 @@ The `remote_addresses` field of a Bot rule allows you to set the IP range that t
For example, you can allow a search engine to connect if and only if its IP address matches the ones they published:
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"name": "qwantbot",
"user_agent_regex": "\\+https\\:\\/\\/help\\.qwant\\.com/bot/",
"action": "ALLOW",
"remote_addresses": ["91.242.162.0/24"]
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
- name: qwantbot
user_agent_regex: \+https\://help\.qwant\.com/bot/
@ -204,25 +110,8 @@ For example, you can allow a search engine to connect if and only if its IP addr
remote_addresses: ["91.242.162.0/24"]
```
</TabItem>
</Tabs>
This also works at an IP range level without any other checks:
<Tabs>
<TabItem value="json" label="JSON" default>
```json
{
"name": "internal-network",
"action": "ALLOW",
"remote_addresses": ["100.64.0.0/10"]
}
```
</TabItem>
<TabItem value="yaml" label="YAML">
```yaml
name: internal-network
action: ALLOW
@ -230,9 +119,6 @@ remote_addresses:
- 100.64.0.0/10
```
</TabItem>
</Tabs>
## Imprint / Impressum support
Anubis has support for showing imprint / impressum information. This is defined in the `impressum` block of your configuration. See [Imprint / Impressum configuration](./configuration/impressum.mdx) for more information.