Start at the datasource row in Data Sources: it shows current status, last crawl timestamps, and an error message if the latest crawl failed.
For most issues, open the datasource detail view — the Crawl history section and the per-page list are where the signal is. Don’t delete and re-create the datasource before checking those.

Common scenarios

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Crawl has been “In Progress” for hours with no counters moving | Firecrawl pipeline stall or lost webhook delivery | Open the datasource, Cancel crawl, then Start crawl again. If it recurs, file a ticket with the datasource ID; the back-end crawl jobs are traceable. |
| AI cites non-English pages even after adding an English-only filter | Pages indexed before the filter change are still in the index | Open the datasource → Pages view → filter by language or URL pattern → bulk Delete. The next crawl respects the new filter going forward. |
| AI answers from blog/funnel pages that shouldn’t be in the index | `exclude_paths` is empty or too narrow | Add the section patterns to Exclude paths (e.g. `/blog/*`, `/pricing/*`), then bulk-delete existing indexed pages that match from the Pages view. |
| Some articles are missing after the crawl | `page_limit` hit (default 100), paths blocked by `include_paths`, or pages require JavaScript rendering for content | Raise `page_limit` and remove over-narrow `include_paths`. For JS-heavy pages, spot-check with a single page’s Resync; if it comes back empty, the crawler cannot render the page and it needs a different source. |
| AI surfaces the same article twice | Both the crawler and a direct integration (Zendesk/Intercom/Front) are indexing the same content | Pick one and disconnect the other. The direct integration is almost always the better choice for Help Center content; the crawler stays for other domains. |
| Crawl completes but no pages synced | The entry URL returned no links, or every link was blocked by `include_paths` / `exclude_paths` | Open the URL in a browser and confirm it links out to the content you expect. Review `include_paths`; overly specific patterns filter out everything. |
| None of the above | | Contact support with the datasource ID, the approximate time of the issue, and a sample URL that is missing or wrong. |
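The exclude-path examples above (`/blog/*`, `/pricing/*`) read as glob-style patterns. A minimal sketch of selecting already-indexed pages for bulk deletion, assuming glob semantics via Python's `fnmatch` (the product's actual matcher may differ, e.g. in trailing-slash or regex handling, and `pages_to_delete` is a hypothetical helper, not a product API):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def pages_to_delete(urls, exclude_patterns):
    """Return URLs whose path matches any glob-style exclude pattern.

    Hypothetical helper for illustration only; the real matcher
    may treat patterns as regexes or normalize paths differently.
    """
    matched = []
    for url in urls:
        path = urlparse(url).path
        if any(fnmatch(path, pat) for pat in exclude_patterns):
            matched.append(url)
    return matched

urls = [
    "https://example.com/blog/launch-post",
    "https://example.com/pricing/teams",
    "https://example.com/help/getting-started",
]
print(pages_to_delete(urls, ["/blog/*", "/pricing/*"]))
```

Running the check locally against an exported page list is a quick way to confirm a pattern catches the right sections before bulk-deleting.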

Limits at a glance

| Limit | Value |
| --- | --- |
| Page limit per crawl | 15,000 (default 100) |
| Minimum crawl interval | 24 hours |
| Default crawl interval | 168 hours (7 days) |
| Binary file types | Always excluded |
| Auth-gated URLs | Not supported |
| JavaScript rendering | Best effort via Firecrawl; not guaranteed for heavy SPAs |
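The numeric limits above can be checked before submitting a crawl. A sketch of a local pre-flight validation, using the documented values; the function name and parameters are hypothetical, not part of the product's API:

```python
MAX_PAGE_LIMIT = 15_000        # hard cap per crawl
DEFAULT_PAGE_LIMIT = 100       # applied when page_limit is unset
MIN_INTERVAL_HOURS = 24        # minimum crawl interval
DEFAULT_INTERVAL_HOURS = 168   # default crawl interval (7 days)

def validate_crawl_config(page_limit=DEFAULT_PAGE_LIMIT,
                          interval_hours=DEFAULT_INTERVAL_HOURS):
    """Return a list of problems with the requested crawl settings."""
    errors = []
    if not 1 <= page_limit <= MAX_PAGE_LIMIT:
        errors.append(f"page_limit must be 1..{MAX_PAGE_LIMIT}, got {page_limit}")
    if interval_hours < MIN_INTERVAL_HOURS:
        errors.append(f"interval_hours must be >= {MIN_INTERVAL_HOURS}, "
                      f"got {interval_hours}")
    return errors

print(validate_crawl_config())                                # ok: []
print(validate_crawl_config(page_limit=20_000, interval_hours=6))
```

Catching an out-of-range `page_limit` locally avoids a crawl that silently stops at the cap, which presents as the "some articles are missing" symptom above.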

Related articles

- Connect a website: re-check the URL, limits, and include/exclude paths.
- Website overview: when to use the crawler vs a direct integration.
- Crawl API: programmatic exclude/include/resync.
- Connect a knowledge source: switch to a direct integration if one fits.