Common scenarios
| Symptom | Likely cause | Fix |
|---|---|---|
| Crawl has been “In Progress” for hours, no counters moving | Firecrawl pipeline stall or webhook delivery lost | Open the datasource → Cancel crawl, then Start crawl again. If it recurs, file a ticket with the datasource ID — the back-end crawl jobs are traceable. |
| AI cites non-English pages even after adding an English-only filter | Old pages from before the filter change are still indexed | Open the datasource → Pages view → filter by language or URL pattern → bulk Delete. The next crawl respects the new filter going forward. |
| AI answers from blog/funnel pages that shouldn’t be in the index | exclude_paths is empty or too narrow | Add the section’s path pattern to Exclude paths (e.g. /blog/*, /pricing/*). Then bulk-delete the existing indexed pages that match from the Pages view. |
| Some articles are missing after the crawl | page_limit hit (default 100), paths blocked by include_paths, or pages require JS rendering for content | Raise page_limit. Remove over-narrow include_paths. For JS-heavy pages, spot-check by hitting a single page’s Resync — if it comes back empty, the page isn’t renderable by the crawler and needs a different source. |
| AI surfaces the same article twice | Both the crawler and a direct integration (Zendesk/Intercom/Front) are indexing the same content | Pick one. Disconnect the other. The direct integration is almost always the better choice for Help Center content — the crawler stays for other domains. |
| Crawl completes but no pages synced | The entry URL returned no links, or every link was blocked by include_paths / exclude_paths | Open the URL in a browser and confirm the page links out to the content you expect. Review include_paths; overly specific patterns filter out everything. |
| None of the above | — | Contact support with the datasource ID, approximate time of the issue, and a sample URL that’s missing/wrong. |
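Several of the fixes above come down to how include/exclude paths filter crawled URLs. The following is a minimal sketch of that filtering logic, not the product's actual implementation: the helper name, the glob-style matching, and the "exclude wins over include" rule are all assumptions for illustration.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_index(url: str, include_paths: list[str], exclude_paths: list[str]) -> bool:
    """Hypothetical filter: decide whether a crawled URL gets indexed.

    Assumed rules: exclude patterns always win, and an empty include
    list means "include everything".
    """
    path = urlparse(url).path
    # Any exclude match drops the page outright.
    if any(fnmatch(path, pat) for pat in exclude_paths):
        return False
    # With include patterns set, the page must match at least one.
    if include_paths and not any(fnmatch(path, pat) for pat in include_paths):
        return False
    return True

# Blog pages are skipped; help-center articles pass.
print(should_index("https://example.com/blog/launch", [], ["/blog/*", "/pricing/*"]))  # False
print(should_index("https://example.com/help/setup", ["/help/*"], ["/blog/*"]))        # True
```

Note the failure mode from the table: an include list like `["/help/en/articles/exact-slug"]` makes every other URL fail the include check, which is how an over-narrow include_paths produces a crawl with zero synced pages.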
Limits at a glance
| Limit | Value |
|---|---|
| Page limit per crawl | 1–5000 (default 100) |
| Minimum crawl interval | 24 hours |
| Default crawl interval | 168 hours (7 days) |
| Binary file types | Always excluded |
| Auth-gated URLs | Not supported |
| JavaScript rendering | Best effort via Firecrawl — not guaranteed for heavy SPAs |
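If you set crawl options programmatically, the numeric limits above can be checked client-side before submitting. A sketch under assumed field names (`page_limit`, `interval_hours` are illustrative, not the real API schema):

```python
def validate_crawl_config(page_limit: int = 100, interval_hours: int = 168) -> None:
    """Raise ValueError for values outside the documented limits.

    Defaults mirror the table: 100 pages per crawl, weekly interval.
    Field names are assumptions for illustration only.
    """
    if not 1 <= page_limit <= 5000:
        raise ValueError(f"page_limit must be 1-5000, got {page_limit}")
    if interval_hours < 24:
        raise ValueError(f"crawl interval must be at least 24 hours, got {interval_hours}")

validate_crawl_config()  # defaults pass silently
```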
Related documentation
- **Connect a website**: re-check the URL, limits, and include/exclude paths.
- **Website overview**: when to use the crawler vs a direct integration.
- **Crawl API**: programmatic exclude/include/resync.
- **Connect a knowledge source**: switch to a direct integration if one fits.