# Pushly Crawler

### User agent

All crawler requests use the following user agent:

```
PushlyCrawler/1.0 (+https://documentation.pushly.com/faq/crawler)
```

We recommend allowing this user agent in any bot protection, WAF, or CDN rules you maintain for your site.

### IP addresses

The Pushly crawler originates traffic from the following IP addresses:

```
52.0.101.14
34.213.44.134
```

If you require IP-based allowlisting, you can allow these IPs in your firewall or WAF configuration.

If your organization requires stricter controls, we recommend allowing one of the following:

* The user agent `PushlyCrawler/1.0`
* The IPs `52.0.101.14` and `34.213.44.134`
* Both, for maximum reliability
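
As a rough illustration of the "user agent or IP" matching described above, the following Python sketch classifies a request as Pushly crawler traffic. The helper name `is_pushly_crawler` and its parameters are hypothetical; only the user agent token and the two IP addresses come from this page.

```python
import ipaddress

# Values taken from this page.
PUSHLY_UA_TOKEN = "PushlyCrawler/1.0"
PUSHLY_IPS = {
    ipaddress.ip_address("52.0.101.14"),
    ipaddress.ip_address("34.213.44.134"),
}


def is_pushly_crawler(user_agent: str, remote_ip: str) -> bool:
    """Return True if either the user agent or the source IP matches.

    Hypothetical helper for illustration; adapt it to however your
    application or middleware exposes request metadata.
    """
    ua_match = PUSHLY_UA_TOKEN in (user_agent or "")
    try:
        ip_match = ipaddress.ip_address(remote_ip) in PUSHLY_IPS
    except ValueError:
        # Malformed address: treat it as not matching.
        ip_match = False
    return ua_match or ip_match
```

Matching on either signal mirrors the "both" option in the list: a request is recognized even if one of the two signals is altered by an intermediary such as a proxy.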

***

### What the crawler does

The Pushly crawler:

* Fetches HTML content from your site to power content suggestions inside the Pushly platform
* Only accesses pages that are relevant to your Pushly configuration and in-product features
* Does not index your content for public search engines
* Does not perform ad blocking or attempt to bypass your users’ ad blockers
* Runs as a server-side integration, not from end-user browsers

If you ever want to stop the crawler completely, you can block either the user agent `PushlyCrawler/1.0` or the IPs listed above.

***

### robots.txt behavior

Because this crawler exists to provide first-party, in-platform functionality for your own authenticated users (your internal staff using Pushly), it does not treat `robots.txt` as an access-control mechanism.

* Changes to `robots.txt` will not stop the crawler
* To block or allow the crawler, use user-agent or IP-based rules instead

This design avoids situations where a broad `Disallow: /` robots rule intended for public search engines accidentally breaks your internal Pushly workflows.
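
To make the scenario concrete, the sketch below uses Python's standard `urllib.robotparser` to show how a robots-compliant client would interpret a blanket `Disallow: /` rule. The helper name `robots_would_block` and the example URL are illustrative assumptions. This is not Pushly's implementation; it only demonstrates the check that the Pushly crawler does not perform.

```python
from urllib.robotparser import RobotFileParser

# A broad rule of the kind typically aimed at public search engines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def robots_would_block(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant client would skip this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# A compliant crawler would refuse to fetch this page. The Pushly crawler
# does not consult robots.txt, so the rule has no effect on it.
blocked = robots_would_block(
    ROBOTS_TXT, "PushlyCrawler/1.0", "https://example.com/article"
)
```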

***

### Allowlisting examples

The following examples show how you can allow the Pushly crawler in common infrastructure setups. You should adapt these patterns to your specific configuration and security policies.

#### Cloudflare (WAF custom rule)

You can create a WAF rule that allows the Pushly crawler based on user agent and/or IP.

Example rule expression:

```
(http.user_agent contains "PushlyCrawler/1.0")
or
(ip.src in {52.0.101.14 34.213.44.134})
```

Actions you might configure:

* Allow or Skip (bypass) security features such as JS challenges or bot checks for matching requests
* Apply the rule only to specific hostnames or paths if needed

#### Akamai (WAF / security configuration)

In Akamai, you can create a match condition for the Pushly crawler:

* Match on Header: `User-Agent` contains `PushlyCrawler/1.0`
* Optionally also match on source IP being in:
  * `52.0.101.14`
  * `34.213.44.134`

Then:

* Assign a lower-threat or allow action to this traffic
* Exclude it from aggressive bot or JS challenge rules

Refer to your Akamai property configuration UI for the exact steps, as naming and structure may vary between setups.

#### Fastly (VCL example)

In Fastly, you can use custom VCL to identify and allow the Pushly crawler.

Example snippet:

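```vcl
# Declare the ACL at the top level of your custom VCL (outside any subroutine)
acl pushly_crawler {
  "52.0.101.14";
  "34.213.44.134";
}

# Place this check inside vcl_recv
if (req.http.User-Agent ~ "PushlyCrawler/1\.0" || client.ip ~ pushly_crawler) {
  # Bypass rate limiting / bot protection logic here
  # For example:
  return (pass);
}
```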

You can also limit this to specific services or hostnames if desired.

#### Nginx (example configuration)

You can use Nginx to allow the Pushly crawler via user agent and/or IP.

Example:

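```nginx
# The map, geo, and limit_req_zone directives belong in the http context.

# Flag requests whose user agent identifies the Pushly crawler
map $http_user_agent $is_pushly_ua {
    default                  0;
    "~PushlyCrawler/1\.0"    1;
}

# Flag requests that come from the Pushly crawler IPs
geo $is_pushly_ip {
    default          0;
    52.0.101.14      1;
    34.213.44.134    1;
}

# Example: skip rate limiting for the crawler.
# Requests with an empty key are not counted by limit_req.
map "$is_pushly_ua$is_pushly_ip" $limit_key {
    "00"       $binary_remote_addr;
    default    "";
}

# Illustrative zone name, size, and rate; use your own values
limit_req_zone $limit_key zone=per_ip:10m rate=10r/s;

server {
    listen 80;
    server_name example.com;

    # Optionally, only on specific locations
    location / {
        limit_req zone=per_ip burst=20;

        # Your normal config
        try_files $uri $uri/ /index.html;
    }
}
```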

Adjust this config to match your existing security and routing setup.

***

### Troubleshooting

If Pushly reports that it cannot fetch your content or you see 403 responses being returned to the crawler:

1. Verify that your WAF or bot protection is not blocking:
   * User agent: `PushlyCrawler/1.0`
   * IPs: `52.0.101.14`, `34.213.44.134`
2. Check for rules that:
   * Enforce JS challenges
   * Block “unknown bots”
   * Challenge or block based on reputation of the IPs above
3. Add an explicit allow/skip rule for:
   * The user agent string, and/or
   * The IP addresses
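
One way to check step 1 is to send a request that carries the crawler's user agent and inspect the status code. The sketch below does this with Python's standard library. The function names and the example URL are hypothetical. Because the request originates from your machine rather than the Pushly IPs, it only exercises user-agent rules, not IP-based ones.

```python
import urllib.error
import urllib.request

# User agent string as documented on this page.
PUSHLY_USER_AGENT = "PushlyCrawler/1.0 (+https://documentation.pushly.com/faq/crawler)"


def build_probe_request(url: str) -> urllib.request.Request:
    """Build a GET request that identifies itself as the Pushly crawler."""
    return urllib.request.Request(url, headers={"User-Agent": PUSHLY_USER_AGENT})


def probe_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the site sends back for the probe."""
    try:
        with urllib.request.urlopen(build_probe_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is what we want.
        return err.code
```

For example, calling `probe_status("https://example.com/")` with one of the failing URLs and getting `403` back suggests that a user-agent rule is rejecting the crawler.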

If issues persist, please contact Pushly support and include:

* Example URLs that are failing
* Timestamps (with timezone)
* Any relevant WAF or firewall logs showing blocked requests from the crawler IPs or UA
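
When gathering logs for support, a small filter can pull out the relevant entries. The sketch below assumes access logs in the "combined" format commonly used by Nginx and Apache. The function name, the regular expression, and the default set of blocked status codes are illustrative assumptions and will need adjusting for WAF- or CDN-specific log formats.

```python
import re

# Values taken from this page.
PUSHLY_UA_TOKEN = "PushlyCrawler/1.0"
PUSHLY_IPS = {"52.0.101.14", "34.213.44.134"}

# Assumed "combined" log format:
# IP - user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def blocked_crawler_lines(lines, blocked_statuses=(403,)):
    """Yield parsed entries for crawler requests that got a blocked status."""
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # Skip lines that do not fit the assumed format.
        from_crawler = m["ip"] in PUSHLY_IPS or PUSHLY_UA_TOKEN in m["ua"]
        if from_crawler and int(m["status"]) in blocked_statuses:
            yield m.groupdict()
```

Each yielded entry includes the `time` and `request` fields, which correspond to the timestamps and example URLs requested above.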


