Configure the crawler at AI Training → Data Sources in the OpenCX dashboard.

Set up the crawler

1. Open Data Sources

2. Add a Website source

Click Add source → Website. Paste the URL you want to crawl (e.g. https://help.yourcompany.com).
3. Set scope

Configure the crawl before it runs. Defaults work for most teams — adjust only when you have a reason.
Field | Default | What it does
URL | — | Starting URL. The crawler follows links within the same domain.
Display name | Extracted from domain | Label shown in Data Sources.
Page limit | 100 | Maximum pages per crawl (1–5000). Set higher for large sites.
Include paths | empty | Whitelist regex. If set, only matching URLs are crawled. Use for locale or section scoping (e.g. */en-us/*).
Exclude paths | empty | Blacklist regex. Crawl skips matching URLs (e.g. /admin/*). Binary file extensions are always excluded.
Crawl interval (hours) | 168 (7 days) | Auto re-crawl cadence. Minimum 24. Set to null for manual-only.
Auto-start crawl | true | Runs the first crawl immediately on save.
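The settings above can be sketched as a config builder that enforces the documented bounds. The field names (page_limit, crawl_interval_hours, auto_start_crawl, and so on) are illustrative snake_case renderings of the dashboard labels, not the actual API schema:

```python
# Defaults from the settings table. Names are illustrative, not the real schema.
DEFAULTS = {
    "page_limit": 100,            # 1-5000
    "include_paths": [],          # empty = crawl everything in-domain
    "exclude_paths": [],
    "crawl_interval_hours": 168,  # weekly; None = manual-only
    "auto_start_crawl": True,
}

def make_crawler_config(url, **overrides):
    """Build a config dict and validate the documented limits."""
    config = {
        "url": url,
        # Display name defaults to the domain extracted from the URL.
        "display_name": url.split("//")[-1].split("/")[0],
        **DEFAULTS,
    }
    config.update(overrides)
    if not 1 <= config["page_limit"] <= 5000:
        raise ValueError("page_limit must be between 1 and 5000")
    interval = config["crawl_interval_hours"]
    if interval is not None and interval < 24:
        raise ValueError("crawl_interval_hours minimum is 24")
    return config

config = make_crawler_config("https://help.yourcompany.com", page_limit=500)
```

Passing `crawl_interval_hours=None` models the manual-only mode from the table.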
4. Watch the first crawl

The datasource row shows live counters: total pages, new, updated, removed, unchanged. First crawls on a large help center can take 10–30 minutes.
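The four counters can be reproduced by comparing two URL-to-hash snapshots. This is an illustrative sketch of the counter semantics, not the server-side implementation:

```python
def crawl_counters(previous, current):
    """Classify pages the way the datasource row reports them.

    previous and current map URL -> content hash for two consecutive crawls.
    """
    new = [u for u in current if u not in previous]
    removed = [u for u in previous if u not in current]
    updated = [u for u in current if u in previous and current[u] != previous[u]]
    unchanged = [u for u in current if u in previous and current[u] == previous[u]]
    return {"new": len(new), "updated": len(updated),
            "removed": len(removed), "unchanged": len(unchanged)}
```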
5. Verify in AI Instructions

Open AI Training → AI Instructions. Crawled pages appear under the datasource name. Spot-check a known article.

Scoping with include/exclude paths

The single most common setup mistake: letting the crawler index your marketing funnel, blog, or admin paths alongside your help center. Use include/exclude paths deliberately.
Scenario: Your help center has /en-us/, /fr-fr/, and /de-de/, and the AI only replies in English. Set Include paths to */en-us/*. Non-English pages are skipped. Change the value at any time; the next crawl respects the new scope.
Existing non-English pages from earlier crawls stay indexed. Remove them by navigating into the datasource and bulk-deleting them from the Pages view.
Scenario: help.acme.com also serves /blog/, /careers/, and /pricing/. None of those belong in the AI index. Set Exclude paths to multiple regexes: /blog/*, /careers/*, /pricing/*. The crawl drops matches before extraction.
Scenario: You want only the billing category indexed, not the whole help center. Set Include paths to the category root: */billing/*. Combine with a higher Page limit if the category is large.
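Although the settings table calls these fields regexes, the examples above use shell-style wildcards (*/en-us/*). A sketch of the whitelist-then-blacklist logic assuming glob semantics; the crawler's exact matching rules may differ:

```python
from fnmatch import fnmatch

def in_scope(url, include_paths=(), exclude_paths=()):
    """Apply Include paths (whitelist), then Exclude paths (blacklist).

    Patterns use shell-style wildcards, matching the examples in this guide.
    """
    # If any include patterns are set, the URL must match at least one.
    if include_paths and not any(fnmatch(url, p) for p in include_paths):
        return False
    # A match against any exclude pattern drops the URL.
    if any(fnmatch(url, p) for p in exclude_paths):
        return False
    return True
```

With Include paths set to */en-us/*, an /fr-fr/ URL fails the whitelist check and is skipped before extraction.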

Page-level control

After the first crawl completes, open the datasource detail view. You get per-page controls:
  • Exclude page — the page stays in the list but won’t re-sync. Use for one-off stale content.
  • Re-include page — undo the above.
  • Delete page — remove from the AI index. Optionally also deletes the underlying training item.
  • Resync page — force a fresh fetch + re-index for one URL.
Bulk select multiple pages before triggering any of these.

Limits

Limit | Value
Streams synced | Markdown extraction of main content (navigation/chrome excluded automatically)
Sync mode | Incremental — SHA-256 content hash per page; unchanged pages skipped
Sync cadence | crawl_interval_hours (default 168 / 7 days, min 24, nullable for manual-only)
Credential type | None — public URLs only
Multi-domain | One datasource per domain. Add a second datasource for a different domain.
Max pages per crawl | 5000 (default 100)
JavaScript rendering | Firecrawl renders most SPAs, but client-only sites with heavy auth or lazy-loaded content may under-capture. Spot-check after the first crawl.
Auth-gated content | Not supported
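The incremental sync mode in the table can be sketched with the standard library: hash the extracted Markdown and only re-index when the hash changes. Illustrative of the documented behavior, not the actual implementation:

```python
import hashlib

def content_hash(markdown):
    """SHA-256 hex digest of a page's extracted Markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def needs_reindex(stored_hash, markdown):
    """Incremental sync: skip the page when its content hash is unchanged."""
    return content_hash(markdown) != stored_hash
```

Any edit to the page, however small, changes the digest and triggers a re-index on the next crawl.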

Website overview

What the crawler does and when to use it.

Troubleshooting

Stuck crawls, missing pages, dedup issues.

Crawl API

Full REST API for datasources, crawls, pages.

Connect a knowledge source

All sources compared.