Creating Crawlers

Step-by-step guide to setting up web crawlers for your website

Setting up a web crawler is the first step to building your AI assistant's knowledge base. This guide walks you through creating and configuring your first crawler.

Before You Start

What You'll Need:

  • Your website's main URL (e.g., yourcompany.com; no need for https://)
  • Access to your SiteAssist project
  • Time for the initial crawl (roughly 1 minute per ~300 pages)

Permissions Required:

  • Your website must be publicly accessible
  • No password protection on pages you want crawled
  • Standard robots.txt compliance (SiteAssist respects crawling rules; see the example below)
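
As a quick illustration (standard robots.txt semantics, not SiteAssist-specific syntax), a file like the sketch below controls what any compliant crawler may fetch; paths under a Disallow rule are skipped and never reach your knowledge base:

```
# https://yourcompany.com/robots.txt (illustrative example)
User-agent: *       # applies to all compliant crawlers, SiteAssist included
Disallow: /admin/   # these paths are skipped and never indexed
Disallow: /internal/
Allow: /            # everything else may be crawled
Sitemap: https://yourcompany.com/sitemap.xml
```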

Step-by-Step Guide

1. Access Web Crawlers

Navigate to your project and click on Web Crawlers in the sidebar.

2. Create New Crawler

Click the Add Crawler button to open the crawler creation modal.

3. Basic Configuration

Fill in the essential crawler settings:

Screenshot showing the Create Crawler modal with all fields

Name

Give your crawler a descriptive name for easy identification.

Good examples:

  • "Main Company Website"
  • "Help Documentation"
  • "Product Catalog"
  • "Marketing Site"

Avoid:

  • Generic names like "Crawler 1"
  • Special characters or symbols

Start URL

Enter your website's main URL where crawling should begin.

Format Requirements:

  • The https:// prefix is added to the input field automatically
  • Enter just your domain, without the protocol
  • The start URL can point to a specific section of your site

Examples:

  • yourcompany.com
  • docs.yourcompany.com
  • yourcompany.com/products

Smart Discovery: The crawler automatically checks for sitemaps at /sitemap.xml and /sitemap-index.xml to discover pages more efficiently.
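
If you want to confirm those sitemap locations exist before your first crawl, a minimal check from your own machine might look like this (standard-library Python; yourcompany.com is a placeholder for your domain):

```python
# Check whether your site exposes a sitemap at the locations the
# crawler probes. Standard library only; no SiteAssist API involved.
import urllib.request

DOMAIN = "yourcompany.com"  # replace with your own domain

for path in ("/sitemap.xml", "/sitemap-index.xml"):
    url = f"https://{DOMAIN}{path}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"{url} -> HTTP {resp.status}")
    except Exception as exc:
        print(f"{url} -> not reachable ({exc})")
```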

Preset Selection

Choose the preset that best matches your website type:

Website

  • Best for: Business websites, marketing sites, general content
  • Optimized for: Customer support, general inquiries
  • Recommended for most users

Documentation

  • Best for: Help centers, technical docs, API documentation
  • Optimized for: Detailed explanations, technical support
  • Maintains content hierarchy and structure

Blog

  • Best for: News sites, blogs, content marketing sites
  • Optimized for: Article content, publication information
  • Handles date-based content organization

E-commerce (Coming Soon)

  • Best for: Online stores, product catalogs
  • Will optimize for: Product information, pricing, descriptions

Not sure which preset to choose? Start with "Website"; it works well for most business sites, and you can always create additional crawlers with different presets later.

4. Start Crawling

Once you've configured the basic settings:

  1. Review your configuration
  2. Click Create & Start Indexing
  3. The crawler will immediately start working

What Happens Next

Immediate Actions

  • Crawler status changes to "Crawling"
  • Initial page discovery begins
  • Progress indicators show crawling activity
  • Crawler logs begin streaming live entries

During Crawling (typically 1-5+ minutes)

  • Pages are discovered and queued
  • Content is extracted and processed
  • Progress logs show real-time activity
  • You can monitor status in the crawler dashboard

Processing Times:

  • ~300 pages: ~1 minute
  • ~600 pages: ~5 minutes
  • Larger sites: Time scales with page count (see the rough estimator below)
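
As a rough sketch based on the figures above (which are approximate and not strictly linear), you can ballpark an initial crawl like this:

```python
# Ballpark the initial crawl time from the documented rough rate of
# ~300 pages per minute. Treat the result as a lower bound: the
# figures above suggest larger crawls slow down per page.
def estimate_crawl_minutes(page_count: int, pages_per_minute: float = 300.0) -> float:
    return page_count / pages_per_minute

for pages in (300, 600, 3000):
    print(f"{pages:>5} pages -> at least ~{estimate_crawl_minutes(pages):.0f} min")
```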

After Completion

  • Status changes to "Completed"
  • Indexed content count is displayed
  • Your AI assistant can now use this knowledge
  • Automatic scheduling begins (if configured)

Managing Your Crawler

Once created, each crawler has a dedicated page with three tabs for complete management:

Overview Tab

  • Real-time crawling status and progress
  • Total pages discovered and indexed
  • Crawling logs and activity history
  • Performance metrics and statistics

Live crawler logs when crawling

Configuration Tab

  • Modify advanced crawler settings
  • Update start URL if needed
  • Adjust URL exclusion rules (illustrative patterns below)
  • Change crawling frequency and scheduling
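
The exact pattern syntax isn't covered on this page; purely as an illustration of the idea, exclusion rules are typically path patterns along these lines (hypothetical examples, not confirmed SiteAssist syntax):

```
/admin/*        # skip the whole admin area
/cart/*         # skip transactional pages
*?sessionid=*   # skip URLs with session parameters
/blog/tag/*     # skip thin tag-listing pages
```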

Settings Tab

  • View crawler ID and technical details
  • Access crawler management options
  • Delete crawler (permanent action)

Monitoring Your New Crawler

While your crawler runs, you can:

View Progress

  • See pages being processed in real-time
  • Monitor crawling speed and efficiency
  • Check for any errors or issues in the logs

Test Integration

  • Try asking your AI assistant questions
  • Verify responses use your website content
  • Check accuracy and relevance

Common First-Time Issues

Crawler Stuck or Slow

  • Cause: Large website or slow server response
  • Solution: Be patient; crawling can take time for large sites
  • When to worry: If no progress after 30 minutes

Low Page Count

  • Cause: Limited internal linking or restricted access
  • Solution: Check your website's internal link structure
  • Alternative: Use your sitemap URL as the start URL (a quick way to check what your sitemap lists is sketched below)
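
To see how many pages your sitemap actually lists and compare that with the crawler's page count, a small standard-library sketch like this works, assuming a standard sitemaps.org-format sitemap:

```python
# Count the URLs listed in a standard sitemap to sanity-check the
# crawler's page count. Replace the URL with your own sitemap.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://yourcompany.com/sitemap.xml"

with urllib.request.urlopen(SITEMAP_URL, timeout=10) as resp:
    tree = ET.parse(resp)

# <loc> elements live in the sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in tree.findall(".//sm:loc", ns)]
print(f"{len(urls)} URLs listed in {SITEMAP_URL}")
```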

Missing Important Pages

  • Cause: Pages aren't reachable through internal links, or are excluded by URL rules
  • Solution: Link to them from crawled pages, list them in your sitemap, or review your exclusion rules in the Configuration tab

Permission Errors

  • Cause: Password-protected or restricted content
  • Solution: Ensure pages are publicly accessible

Next Steps

Once your crawler is running successfully:

  1. Configure advanced settings for more control
  2. Test your AI assistant with questions about your website content
  3. Consider additional crawlers for different sections or content types

Need Help?

Having trouble with crawler setup? Revisit the common first-time issues above. Want advanced features? Explore the advanced crawler settings in the Configuration tab.