Set up the crawler
Open Data Sources
Go to AI Training → Data Sources.
Add a Website source
Click Add source → Website. Paste the URL you want to crawl (e.g.
https://help.yourcompany.com).
Set scope
Configure the crawl before it runs. Defaults work for most teams — adjust only when you have a reason.
| Field | Default | What it does |
|---|---|---|
| URL | — | Starting URL. The crawler follows links within the same domain. |
| Display name | Extracted from domain | Label shown in Data Sources. |
| Page limit | 100 | Maximum pages per crawl (1–5000). Set higher for large sites. |
| Include paths | empty | Whitelist regex. If set, only matching URLs are crawled. Use for locale or section scoping (e.g. */en-us/*). |
| Exclude paths | empty | Blacklist regex. Crawl skips matching URLs (e.g. /admin/*). Binary file extensions are always excluded. |
| Crawl interval (hours) | 168 (7 days) | Auto re-crawl cadence. Minimum 24. Set to null for manual-only. |
| Auto-start crawl | true | Runs the first crawl immediately on save. |
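The fields above can be pictured as a single configuration object. A minimal sketch as a Python dict — the field names are assumptions based on the table, not the product's actual API schema:

```python
# Hypothetical datasource configuration mirroring the fields above.
# Names like "page_limit" and "auto_start" are illustrative assumptions.
datasource = {
    "url": "https://help.yourcompany.com",   # starting URL; crawler stays on this domain
    "display_name": "help.yourcompany.com",  # label shown in Data Sources
    "page_limit": 100,                       # 1-5000; raise for large sites
    "include_paths": [],                     # empty = everything in scope
    "exclude_paths": [],                     # matching URLs are skipped
    "crawl_interval_hours": 168,             # minimum 24; None = manual-only re-crawls
    "auto_start": True,                      # run the first crawl immediately on save
}

# Sanity check the cadence rule stated in the table.
assert datasource["crawl_interval_hours"] is None or datasource["crawl_interval_hours"] >= 24
```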
Watch the first crawl
The datasource row shows live counters: total pages, new, updated, removed, unchanged. First crawls on a large help center can take 10–30 minutes.
Verify in AI Instructions
Open AI Training → AI Instructions. Crawled pages appear under the datasource name. Spot-check a known article.
Scoping with include/exclude paths
The single most common setup mistake: letting the crawler index your marketing funnel, blog, or admin paths alongside your help center. Use include/exclude paths deliberately.
Restrict to a single language
Scenario: Your help center has /en-us/, /fr-fr/, and /de-de/, and the AI only replies in English.
Set Include paths to */en-us/*. Non-English pages are skipped. Change the value any time; the next crawl respects the new scope.
Existing non-English pages from earlier crawls stay indexed. Remove them by navigating into the datasource and bulk-deleting them from the Pages view.
Skip sections that aren't help content
Scenario: help.acme.com also serves /blog/, /careers/, and /pricing/. None of those belong in the AI index.
Set Exclude paths to multiple regexes: /blog/*, /careers/*, /pricing/*. The crawl drops matching URLs before extraction.
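The exclude rule is the mirror image: a URL matching any exclude pattern is dropped. A sketch under the same assumption (glob-style matching against the full URL, with a leading * added to cover scheme and host):

```python
from fnmatch import fnmatchcase

def is_excluded(url: str, exclude_paths: list[str]) -> bool:
    # Dropped if the URL matches ANY exclude pattern.
    return any(fnmatchcase(url, pattern) for pattern in exclude_paths)

exclude = ["*/blog/*", "*/careers/*", "*/pricing/*"]
print(is_excluded("https://help.acme.com/blog/launch-post", exclude))   # True
print(is_excluded("https://help.acme.com/articles/refunds", exclude))   # False
```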
Crawl only one category
Scenario: You want only the billing category indexed, not the whole help center.
Set Include paths to the category root: */billing/*. Combine with a higher page_limit if the category is large.
Page-level control
After the first crawl completes, open the datasource detail view. You get per-page controls:
- Exclude page — the page stays in the list but won't re-sync. Use for one-off stale content.
- Re-include page — undo the above.
- Delete page — remove from the AI index. Optionally also deletes the underlying training item.
- Resync page — force a fresh fetch + re-index for one URL.
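The four controls above imply a small per-page state model. A hedged sketch of that model — the class and field names are illustrative assumptions, not the product's data model:

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    excluded: bool = False   # excluded pages stay listed but are skipped on re-sync
    deleted: bool = False    # deleted pages are removed from the AI index

    def exclude(self) -> None:    # "Exclude page"
        self.excluded = True

    def reinclude(self) -> None:  # "Re-include page"
        self.excluded = False

    def should_sync(self) -> bool:
        # Only live, included pages are picked up on the next crawl.
        return not self.excluded and not self.deleted

page = Page("https://help.acme.com/en-us/refunds")
page.exclude()
print(page.should_sync())  # False
page.reinclude()
print(page.should_sync())  # True
```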
Limits
| Limit | Value |
|---|---|
| Streams synced | Markdown extraction of main content (navigation/chrome excluded automatically) |
| Sync mode | Incremental — SHA-256 content hash per page; unchanged pages skipped |
| Sync cadence | crawl_interval_hours (default 168 / 7 days, min 24, nullable for manual-only) |
| Credential type | None — public URLs only |
| Multi-domain | One datasource per domain. Add a second datasource for a different domain. |
| Max pages per crawl | 5000 (default 100) |
| JavaScript rendering | Firecrawl renders most SPAs, but client-only sites with heavy auth or lazy-loaded content may under-capture. Spot-check after the first crawl. |
| Auth-gated content | Not supported |
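The incremental sync mode above hinges on a per-page content hash. A minimal sketch of the skip logic using hashlib — the store shape and function names are illustrative, not the actual implementation:

```python
import hashlib

def content_hash(markdown: str) -> str:
    # SHA-256 over the extracted markdown identifies a page's content.
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def needs_reindex(url: str, markdown: str, store: dict[str, str]) -> bool:
    """Compare the fresh hash against the stored one; skip when unchanged."""
    new_hash = content_hash(markdown)
    if store.get(url) == new_hash:
        return False          # unchanged: page is skipped
    store[url] = new_hash     # new or updated: re-index and remember the hash
    return True

store: dict[str, str] = {}
print(needs_reindex("https://help.acme.com/a", "# Refunds", store))  # True (new page)
print(needs_reindex("https://help.acme.com/a", "# Refunds", store))  # False (unchanged)
```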
Related Documentation
Website overview
What the crawler does and when to use it.
Troubleshooting
Stuck crawls, missing pages, dedup issues.
Crawl API
Full REST API for datasources, crawls, pages.
Connect a knowledge source
All sources compared.