Web Crawlers Overview
Learn how SiteAssist crawlers work and why they're essential for your AI assistant
What are Web Crawlers?
Web crawlers are the foundation of your AI assistant's knowledge. Think of them as intelligent robots that systematically explore your website, reading and understanding your content so your AI can provide accurate, helpful responses to visitors.
How Crawlers Work
SiteAssist crawlers follow a systematic process:
- Sitemap Discovery: First checks for sitemaps at /sitemap.xml and /sitemap-index.xml
- URL Discovery: Starting from your provided URL, the crawler discovers all linked pages
- Content Extraction: Each page's text content is extracted and cleaned
- Processing: Content is processed and structured for optimal AI understanding
- Indexing: Information is stored in your knowledge base for instant retrieval
- Updates: Regular re-crawling keeps your AI's knowledge current
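To make the flow concrete, here is a minimal sketch of that loop in TypeScript. It is an illustration of the steps above, not SiteAssist's actual implementation: discovery is reduced to following same-origin links, extraction to stripping tags, and the knowledge base to an in-memory array.

```typescript
// Minimal illustration of the crawl loop described above (not
// SiteAssist's implementation): breadth-first URL discovery, naive
// content extraction, and an in-memory stand-in for the knowledge base.

type IndexedPage = { url: string; text: string };

async function crawlSite(startUrl: string, maxPages = 50): Promise<IndexedPage[]> {
  const origin = new URL(startUrl).origin;
  const queue: string[] = [startUrl];
  const seen = new Set<string>(queue);
  const index: IndexedPage[] = [];

  while (queue.length > 0 && index.length < maxPages) {
    const url = queue.shift()!;
    const res = await fetch(url);
    if (!res.ok) continue;
    const html = await res.text();

    // Content extraction: drop scripts, styles, and tags, then
    // collapse whitespace so only readable text remains.
    const text = html
      .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim();

    // Processing and indexing: store the cleaned text for retrieval.
    index.push({ url, text });

    // URL discovery: queue same-origin links found on this page.
    for (const [, href] of html.matchAll(/href="([^"#]+)"/g)) {
      try {
        const next = new URL(href, url).toString();
        if (next.startsWith(origin) && !seen.has(next)) {
          seen.add(next);
          queue.push(next);
        }
      } catch {
        // Ignore links that do not resolve to valid URLs.
      }
    }
  }
  return index;
}
```

Re-running a routine like crawlSite on a schedule corresponds to the Updates step: each run rediscovers pages, so new and changed content is picked up automatically.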
Crawling Performance
SiteAssist crawlers are optimized for speed and efficiency:
- ~300 pages: Approximately 1 minute
- ~600 pages: Approximately 5 minutes
- Larger sites: Processing time scales proportionally
The crawler intelligently prioritizes sitemaps when available for faster, more comprehensive discovery.
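The sitemap-first behaviour can be sketched the same way: try the well-known sitemap locations first, expand any sitemap index into its child sitemaps, and only fall back to link-by-link crawling when nothing is found. The two paths come from the list above; the fallback logic and function names are assumptions made for illustration.

```typescript
// Illustrative sitemap-first discovery (the logic is an assumption,
// not SiteAssist's internals): read /sitemap.xml or /sitemap-index.xml,
// expand sitemap indexes, and fall back to link crawling otherwise.

async function readSitemap(url: string): Promise<string[]> {
  const res = await fetch(url);
  if (!res.ok) return [];
  const xml = await res.text();
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);

  // A sitemap index lists further sitemaps instead of pages; expand each.
  if (xml.includes("<sitemapindex")) {
    const nested = await Promise.all(locs.map(readSitemap));
    return nested.flat();
  }
  return locs;
}

async function discoverUrls(siteUrl: string): Promise<string[]> {
  const origin = new URL(siteUrl).origin;

  // Prefer sitemaps: a single fetch can reveal every listed page,
  // which is why sitemap-backed sites crawl faster.
  for (const path of ["/sitemap.xml", "/sitemap-index.xml"]) {
    const urls = await readSitemap(origin + path);
    if (urls.length > 0) return urls;
  }

  // No sitemap found: start from the given URL and follow links
  // (as in the crawl loop sketched earlier).
  return [siteUrl];
}
```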
Crawler Types & Presets
Different websites have different structures and content types. SiteAssist offers optimized presets:
Website (General)
Perfect for most business websites, marketing sites, and general content.
- Focuses on main content areas
- Filters out navigation and promotional content
- Optimized for customer support scenarios
Documentation
Specialized for technical documentation, help centers, and knowledge bases.
- Prioritizes structured content and hierarchies
- Extracts code examples and technical details
- Maintains logical content relationships
Blog
Optimized for news sites, blogs, and article-heavy content.
- Focuses on article content and metadata
- Extracts publication dates and author information
- Handles content archives effectively
E-commerce (Coming Soon)
Designed for online stores and product catalogs.
- Will extract product information and descriptions
- Will handle dynamic pricing and inventory content
- Will support product categorization
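One way to picture the difference between presets is as extraction profiles: which parts of a page count as content, which parts are boilerplate, and which metadata is worth keeping. The shape below is purely illustrative and is not SiteAssist's configuration schema; the selector and field names are assumptions.

```typescript
// Purely illustrative extraction profiles for the presets above.
// This is NOT SiteAssist's configuration schema; all names are made up.

type CrawlerPreset = "website" | "documentation" | "blog";

interface ExtractionProfile {
  contentSelectors: string[]; // where the main content usually lives
  excludeSelectors: string[]; // boilerplate to strip before indexing
  captureMetadata: string[];  // extra fields worth keeping per page
}

const presetProfiles: Record<CrawlerPreset, ExtractionProfile> = {
  website: {
    contentSelectors: ["main", "article", "[role='main']"],
    excludeSelectors: ["nav", "header", "footer", ".promo"],
    captureMetadata: ["title", "description"],
  },
  documentation: {
    contentSelectors: ["main", ".docs-content"],
    excludeSelectors: ["nav", ".sidebar"],
    captureMetadata: ["title", "breadcrumbs", "codeBlocks"],
  },
  blog: {
    contentSelectors: ["article"],
    excludeSelectors: ["nav", ".related-posts", ".comments"],
    captureMetadata: ["title", "author", "publishedAt"],
  },
};
```

An e-commerce profile would extend the same idea with product-specific fields such as price, availability, and category once that preset ships.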
Benefits of Smart Crawling
Comprehensive Coverage
- Automatically discovers all publicly accessible pages
- No manual URL management required
- Finds content you might have forgotten about
Always Up-to-Date
- Scheduled re-crawling keeps information current
- Detects new pages and content changes
- Removes outdated information automatically
Intelligent Processing
- Filters out irrelevant content (ads, navigation, footers)
- Focuses on valuable information for customer support
- Optimizes content for AI understanding
Scalable Solution
- Handles websites of any size
- Efficient crawling doesn't impact site performance
- Works with all major CMS platforms
What Gets Crawled
Included Content:
- Main page content and articles
- Product descriptions and documentation
- FAQ sections and help content
- About pages and company information
- Blog posts and news articles
Filtered Out:
- Navigation menus and headers
- Advertisements and promotional banners
- Cookie notices and legal disclaimers
- Duplicate or boilerplate content
- Private or password-protected areas
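In practice, the include/exclude split comes down to removing boilerplate elements before text extraction. Below is a minimal sketch using the cheerio HTML parser and the usual HTML landmarks; SiteAssist's real filtering rules are more extensive.

```typescript
// Minimal boilerplate filter (a sketch, not SiteAssist's actual rules),
// using the cheerio HTML parser: remove the kinds of elements listed
// under "Filtered Out", then extract the remaining text.
import * as cheerio from "cheerio";

export function extractMainContent(html: string): string {
  const $ = cheerio.load(html);

  // Navigation, headers/footers, promos, and cookie notices are noise
  // for a support assistant, so drop them before reading the text.
  $("nav, header, footer, aside, script, style").remove();
  $("[class*='cookie'], [class*='promo'], [class*='banner']").remove();

  // Prefer an explicit main-content landmark; fall back to the body.
  const root = $("main").length > 0 ? $("main") : $("body");
  return root.text().replace(/\s+/g, " ").trim();
}
```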
Getting Started
Ready to set up your first crawler? The process is simple:
- Create a new crawler with your website URL
- Configure settings for your specific needs
- Monitor the crawling process and results
- Test your AI assistant with the newly indexed content
Quick Start Tip: Most users can get excellent results with just a website URL and the "Website" preset. Advanced configuration is available when you need more control.
Need Help?
Crawling not working as expected? Check our troubleshooting guide or contact our support team at support@siteassist.io.