# Crawl

> The Crawl API allows you to programmatically crawl and index websites into your knowledge base.

Once a website is crawled and indexed, your AI agents can reference content from your website,
documentation, or any other web-based resource when responding to customer inquiries.

## Overview

Website crawling enables:

* **Automated Content Indexing** - Automatically extract and index content from websites
* **Knowledge Base Integration** - Crawled content is added directly to your knowledge base
* **Real-time Status Tracking** - Monitor crawl progress and completion status
* **Flexible Configuration** - Control include/exclude paths, page limits, and crawl intervals
* **Page Management** - Exclude, include, delete, or resync individual pages

## How It Works

1. **Create a Datasource** - Provide a website URL and configuration options
2. **Crawl Starts Automatically** - By default, a crawl begins immediately after creation
3. **Monitor Progress** - Check crawl status and track page processing
4. **Manage Pages** - Review crawled pages, exclude irrelevant ones, or resync outdated content
5. **Scheduled Recrawls** - Datasources automatically recrawl on a configurable interval

## Crawl Job Statuses

* **`pending`** - Crawl job created, waiting to start
* **`scraping`** - Crawl is actively running and extracting content
* **`completed`** - Crawl finished successfully, content has been indexed
* **`failed`** - Crawl encountered an error and could not complete
* **`cancelled`** - Crawl was manually cancelled before completion
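
The statuses above split naturally into active ones (`pending`, `scraping`) and terminal ones (`completed`, `failed`, `cancelled`), which is what a polling loop needs to know. The following is a minimal sketch of such a loop, not an official client. `fetch_status` is a hypothetical callable that you supply; it is assumed to call the Get Crawl Job endpoint and return the job's status string. The polling interval and timeout defaults are illustrative only.

```python
import time
from typing import Callable

# Status names come from the list above.
ACTIVE_STATUSES = {"pending", "scraping"}
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_crawl(
    fetch_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the crawl job reaches a terminal status, then return it.

    `fetch_status` is a hypothetical, caller-supplied function that returns
    the current status string of one crawl job.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if status not in ACTIVE_STATUSES:
            # Fail loudly instead of polling forever on an unknown value.
            raise ValueError(f"unexpected crawl status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"crawl still {status!r} after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)
```

Passing the status lookup in as a function keeps the loop independent of any particular HTTP library, and the injectable `sleep` makes it easy to exercise in tests without waiting.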

## Page Sync Statuses

* **`synced`** - Page content is indexed in the knowledge base
* **`pending`** - Page is waiting to be synced
* **`error`** - Page failed to sync
* **`excluded`** - Page is excluded from syncing
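
After a crawl finishes, it can help to tally pages by sync status and pull out the ones that failed. The sketch below assumes each page is a mapping with `url` and `sync_status` keys. Those field names are hypothetical and chosen for illustration; check the List Pages response for the actual shape.

```python
from collections import Counter
from typing import Dict, List, Mapping, Sequence

# Status names come from the list above.
PAGE_SYNC_STATUSES = ("synced", "pending", "error", "excluded")


def summarize_pages(pages: Sequence[Mapping[str, str]]) -> Dict[str, int]:
    """Count pages per sync status, reporting zero for unused statuses."""
    counts = Counter(page["sync_status"] for page in pages)
    unknown = set(counts) - set(PAGE_SYNC_STATUSES)
    if unknown:
        raise ValueError(f"unexpected sync statuses: {sorted(unknown)}")
    return {status: counts.get(status, 0) for status in PAGE_SYNC_STATUSES}


def failed_page_urls(pages: Sequence[Mapping[str, str]]) -> List[str]:
    """Return the URLs of pages whose sync status is `error`."""
    return [page["url"] for page in pages if page["sync_status"] == "error"]
```

The URLs returned by `failed_page_urls` are the natural candidates to review, exclude, or resync.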

<Warning>
  Crawling large websites can take significant time and resources. Use include/exclude
  paths to focus on relevant content and set appropriate page limits.
</Warning>
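
To preview how include/exclude paths and a page limit narrow a crawl, you can run a candidate URL list through a local filter before creating the datasource. The sketch below is illustrative only: it assumes simple path-prefix matching, which may differ from how the crawler actually interprets paths (for example, it might use glob patterns). The function and parameter names are hypothetical and are not part of the API.

```python
from typing import Iterable, List, Sequence
from urllib.parse import urlparse


def select_urls(
    urls: Iterable[str],
    include_paths: Sequence[str],
    exclude_paths: Sequence[str],
    max_pages: int,
) -> List[str]:
    """Keep URLs that pass the path filters, up to `max_pages` results.

    Assumption: a path matches when it starts with the given prefix.
    An empty `include_paths` means every path is included.
    """
    selected: List[str] = []
    for url in urls:
        if len(selected) >= max_pages:
            break
        path = urlparse(url).path or "/"
        if include_paths and not any(path.startswith(p) for p in include_paths):
            continue
        if any(path.startswith(p) for p in exclude_paths):
            continue
        selected.append(url)
    return selected
```

For example, including `/docs`, excluding `/docs/internal`, and limiting to two pages would keep only the first two public documentation URLs in the list.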

## Available Endpoints

### Datasource Management

<CardGroup>
  <Card title="Create Datasource" icon="plus" href="./create-datasource">
    Create a new website datasource and start crawling
  </Card>

  <Card title="List Datasources" icon="list" href="./list-datasources">
    List all website datasources for your organization
  </Card>

  <Card title="Get Datasource" icon="eye" href="./get-datasource">
    Get datasource details with page stats
  </Card>

  <Card title="Update Datasource" icon="pen" href="./update-datasource">
    Update datasource configuration
  </Card>

  <Card title="Delete Datasource" icon="trash" href="./delete-datasource">
    Delete a website datasource
  </Card>
</CardGroup>

### Crawl Operations

<CardGroup>
  <Card title="Start Crawl" icon="play" href="./start-crawl">
    Start a new crawl for a datasource
  </Card>

  <Card title="Cancel Crawl" icon="stop" href="./cancel-crawl">
    Cancel an active crawl
  </Card>

  <Card title="List Crawl Jobs" icon="clock-rotate-left" href="./list-crawl-jobs">
    View crawl history for a datasource
  </Card>

  <Card title="Get Crawl Job" icon="magnifying-glass" href="./get-crawl-job">
    Check the status of a specific crawl job
  </Card>
</CardGroup>

### Page Management

<CardGroup>
  <Card title="List Pages" icon="file-lines" href="./list-pages">
    List crawled pages with filtering options
  </Card>

  <Card title="Exclude Pages" icon="eye-slash" href="./exclude-pages">
    Exclude pages from future syncs
  </Card>

  <Card title="Include Pages" icon="eye" href="./include-pages">
    Re-include previously excluded pages
  </Card>
</CardGroup>
