Configure the crawler at AI Training → Data Sources in the OpenCX dashboard.

Set up the crawler

1. Open Data Sources

2. Add a Website source

Click Add source → Website. Paste the URL you want to crawl (e.g. https://help.yourcompany.com).
3. Set scope

Configure the crawl before it runs. Defaults work for most teams — adjust only when you have a reason.
Field | Default | What it does
URL | — | Starting URL. The crawler follows links within the same domain.
Display name | Extracted from domain | Label shown in Data Sources.
Page limit | 100 | Maximum pages per crawl (1–5000). Set higher for large sites.
Include paths | empty | Whitelist regex. If set, only matching URLs are crawled. Use for locale or section scoping (e.g. */en-us/*).
Exclude paths | empty | Blacklist regex. Crawl skips matching URLs (e.g. /admin/*). Binary file extensions are always excluded.
Crawl interval (hours) | 168 (7 days) | Auto re-crawl cadence. Minimum 24. Set to null for manual-only.
Auto-start crawl | true | Runs the first crawl immediately on save.
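The settings above can be sketched as a config builder that enforces the documented bounds. The field names (page_limit, crawl_interval_hours, auto_start_crawl, and so on) are illustrative snake_case renderings of the dashboard labels, not the actual API schema:

```python
# Defaults from the settings table. Names are illustrative, not the real schema.
DEFAULTS = {
    "page_limit": 100,            # 1-5000
    "include_paths": [],          # empty = crawl everything in-domain
    "exclude_paths": [],
    "crawl_interval_hours": 168,  # weekly; None = manual-only
    "auto_start_crawl": True,
}

def make_crawler_config(url, **overrides):
    """Build a config dict and validate the documented limits."""
    config = {
        "url": url,
        # Display name defaults to the domain extracted from the URL.
        "display_name": url.split("//")[-1].split("/")[0],
        **DEFAULTS,
    }
    config.update(overrides)
    if not 1 <= config["page_limit"] <= 5000:
        raise ValueError("page_limit must be between 1 and 5000")
    interval = config["crawl_interval_hours"]
    if interval is not None and interval < 24:
        raise ValueError("crawl_interval_hours minimum is 24")
    return config

config = make_crawler_config("https://help.yourcompany.com", page_limit=500)
```

Passing `crawl_interval_hours=None` models the manual-only mode from the table.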
4. Watch the first crawl

The datasource row shows live counters: total pages, new, updated, removed, unchanged. First crawls on a large help center can take 10–30 minutes.
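The four counters can be reproduced by comparing two URL-to-hash snapshots. This is an illustrative sketch of the counter semantics, not the server-side implementation:

```python
def crawl_counters(previous, current):
    """Classify pages the way the datasource row reports them.

    previous and current map URL -> content hash for two consecutive crawls.
    """
    new = [u for u in current if u not in previous]
    removed = [u for u in previous if u not in current]
    updated = [u for u in current if u in previous and current[u] != previous[u]]
    unchanged = [u for u in current if u in previous and current[u] == previous[u]]
    return {"new": len(new), "updated": len(updated),
            "removed": len(removed), "unchanged": len(unchanged)}
```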
5. Verify in AI Instructions

Open AI Training → AI Instructions. Crawled pages appear under the datasource name. Spot-check a known article.

Scoping with include/exclude paths

The single most common setup mistake: letting the crawler index your marketing funnel, blog, or admin paths alongside your help center. Use include/exclude paths deliberately.
Scenario: Your help center has /en-us/, /fr-fr/, and /de-de/, and the AI only replies in English. Set Include paths to */en-us/*. Non-English pages are skipped. Change the value at any time; the next crawl respects the new scope.
Existing non-English pages from earlier crawls stay indexed. Remove them by navigating into the datasource and bulk-deleting them from the Pages view.
Scenario: help.acme.com also serves /blog/, /careers/, and /pricing/. None of those belong in the AI index. Set Exclude paths to multiple regexes: /blog/*, /careers/*, /pricing/*. The crawl drops matches before extraction.
Scenario: You want only the billing category indexed, not the whole help center. Set Include paths to the category root: */billing/*. Combine with a higher Page limit if the category is large.
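Although the settings table calls these fields regexes, the examples above use shell-style wildcards (*/en-us/*). A sketch of the whitelist-then-blacklist logic assuming glob semantics; the crawler's exact matching rules may differ:

```python
from fnmatch import fnmatch

def in_scope(url, include_paths=(), exclude_paths=()):
    """Apply Include paths (whitelist), then Exclude paths (blacklist).

    Patterns use shell-style wildcards, matching the examples in this guide.
    """
    # If any include patterns are set, the URL must match at least one.
    if include_paths and not any(fnmatch(url, p) for p in include_paths):
        return False
    # A match against any exclude pattern drops the URL.
    if any(fnmatch(url, p) for p in exclude_paths):
        return False
    return True
```

With Include paths set to */en-us/*, an /fr-fr/ URL fails the whitelist check and is skipped before extraction.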

Page-level control

After the first crawl completes, open the datasource detail view. You get per-page controls:
  • Exclude page — the page stays in the list but won’t re-sync. Use for one-off stale content.
  • Re-include page — undo the above.
  • Delete page — remove from the AI index. Optionally also deletes the underlying training item.
  • Resync page — force a fresh fetch + re-index for one URL.
Bulk select multiple pages before triggering any of these.

Limits

Limit | Value
Streams synced | Markdown extraction of main content (navigation/chrome excluded automatically)
Sync mode | Incremental — SHA-256 content hash per page; unchanged pages skipped
Sync cadence | crawl_interval_hours (default 168 / 7 days, min 24, nullable for manual-only)
Credential type | None — public URLs only
Multi-domain | One datasource per domain. Add a second datasource for a different domain.
Max pages per crawl | 5000 (default 100)
JavaScript rendering | Firecrawl renders most SPAs, but client-only sites with heavy auth or lazy-loaded content may under-capture. Spot-check after the first crawl.
Auth-gated content | Not supported
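The incremental sync mode in the table can be sketched with the standard library: hash the extracted Markdown and only re-index when the hash changes. Illustrative of the documented behavior, not the actual implementation:

```python
import hashlib

def content_hash(markdown):
    """SHA-256 hex digest of a page's extracted Markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def needs_reindex(stored_hash, markdown):
    """Incremental sync: skip the page when its content hash is unchanged."""
    return content_hash(markdown) != stored_hash
```

Any edit to the page, however small, changes the digest and triggers a re-index on the next crawl.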

Website overview

What the crawler does and when to use it.

Troubleshooting

Stuck crawls, missing pages, dedup issues.

Crawl API

Full REST API for datasources, crawls, pages.

Connect a knowledge source

All sources compared.