Web Crawlers Overview
Learn how SiteAssist crawlers work and why they're essential for your AI assistant
What are Web Crawlers?
Web crawlers are the foundation of your AI assistant's knowledge. Think of them as intelligent robots that systematically explore your website, reading and understanding your content so your AI can provide accurate, helpful responses to visitors.
How Crawlers Work
SiteAssist crawlers follow a systematic process:
- Sitemap Discovery: First checks for sitemaps at /sitemap.xml and /sitemap-index.xml
- URL Discovery: Starting from your provided URL, the crawler discovers all linked pages
- Content Extraction: Each page's text content is extracted and cleaned
- Processing: Content is processed and structured for optimal AI understanding
- Indexing: Information is stored in your knowledge base for instant retrieval
- Updates: Regular re-crawling keeps your AI's knowledge current
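To make the flow concrete, here is a minimal sketch of that loop in TypeScript. It is an illustration of the steps above, not SiteAssist's actual implementation: discovery is reduced to following same-origin links, extraction to stripping tags, and the knowledge base to an in-memory array.

```typescript
// Minimal illustration of the crawl loop described above (not
// SiteAssist's implementation): breadth-first URL discovery, naive
// content extraction, and an in-memory stand-in for the knowledge base.

type IndexedPage = { url: string; text: string };

async function crawlSite(startUrl: string, maxPages = 50): Promise<IndexedPage[]> {
  const origin = new URL(startUrl).origin;
  const queue: string[] = [startUrl];
  const seen = new Set<string>(queue);
  const index: IndexedPage[] = [];

  while (queue.length > 0 && index.length < maxPages) {
    const url = queue.shift()!;
    const res = await fetch(url);
    if (!res.ok) continue;
    const html = await res.text();

    // Content extraction: drop scripts, styles, and tags, then
    // collapse whitespace so only readable text remains.
    const text = html
      .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim();

    // Processing and indexing: store the cleaned text for retrieval.
    index.push({ url, text });

    // URL discovery: queue same-origin links found on this page.
    for (const [, href] of html.matchAll(/href="([^"#]+)"/g)) {
      try {
        const next = new URL(href, url).toString();
        if (next.startsWith(origin) && !seen.has(next)) {
          seen.add(next);
          queue.push(next);
        }
      } catch {
        // Ignore links that do not resolve to valid URLs.
      }
    }
  }
  return index;
}
```

Re-running a routine like crawlSite on a schedule corresponds to the Updates step: each run rediscovers pages, so new and changed content is picked up automatically.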
Crawling Performance
SiteAssist crawlers are optimized for speed and efficiency:
- ~300 pages: Approximately 1 minute
- ~600 pages: Approximately 5 minutes
- Larger sites: Processing time scales proportionally
The crawler intelligently prioritizes sitemaps when available for faster, more comprehensive discovery.
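The sitemap-first behaviour can be sketched the same way: try the well-known sitemap locations first, expand any sitemap index into its child sitemaps, and only fall back to link-by-link crawling when nothing is found. The two paths come from the list above; the fallback logic and function names are assumptions made for illustration.

```typescript
// Illustrative sitemap-first discovery (the logic is an assumption,
// not SiteAssist's internals): read /sitemap.xml or /sitemap-index.xml,
// expand sitemap indexes, and fall back to link crawling otherwise.

async function readSitemap(url: string): Promise<string[]> {
  const res = await fetch(url);
  if (!res.ok) return [];
  const xml = await res.text();
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);

  // A sitemap index lists further sitemaps instead of pages; expand each.
  if (xml.includes("<sitemapindex")) {
    const nested = await Promise.all(locs.map(readSitemap));
    return nested.flat();
  }
  return locs;
}

async function discoverUrls(siteUrl: string): Promise<string[]> {
  const origin = new URL(siteUrl).origin;

  // Prefer sitemaps: a single fetch can reveal every listed page,
  // which is why sitemap-backed sites crawl faster.
  for (const path of ["/sitemap.xml", "/sitemap-index.xml"]) {
    const urls = await readSitemap(origin + path);
    if (urls.length > 0) return urls;
  }

  // No sitemap found: start from the given URL and follow links
  // (as in the crawl loop sketched earlier).
  return [siteUrl];
}
```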
Crawler Types & Presets
Different websites have different structures and content types. SiteAssist offers optimized presets:
Website (General)
Perfect for most business websites, marketing sites, and general content.
- Focuses on main content areas
- Filters out navigation and promotional content
- Optimized for customer support scenarios
Documentation
Specialized for technical documentation, help centers, and knowledge bases.
- Prioritizes structured content and hierarchies
- Extracts code examples and technical details
- Maintains logical content relationships
Blog
Optimized for news sites, blogs, and article-heavy content.
- Focuses on article content and metadata
- Extracts publication dates and author information
- Handles content archives effectively
E-commerce (Coming Soon)
Designed for online stores and product catalogs.
- Will extract product information and descriptions
- Will handle dynamic pricing and inventory content
- Will support product categorization
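One way to picture the difference between presets is as extraction profiles: which parts of a page count as content, which parts are boilerplate, and which metadata is worth keeping. The shape below is purely illustrative and is not SiteAssist's configuration schema; the selector and field names are assumptions.

```typescript
// Purely illustrative extraction profiles for the presets above.
// This is NOT SiteAssist's configuration schema; all names are made up.

type CrawlerPreset = "website" | "documentation" | "blog";

interface ExtractionProfile {
  contentSelectors: string[]; // where the main content usually lives
  excludeSelectors: string[]; // boilerplate to strip before indexing
  captureMetadata: string[];  // extra fields worth keeping per page
}

const presetProfiles: Record<CrawlerPreset, ExtractionProfile> = {
  website: {
    contentSelectors: ["main", "article", "[role='main']"],
    excludeSelectors: ["nav", "header", "footer", ".promo"],
    captureMetadata: ["title", "description"],
  },
  documentation: {
    contentSelectors: ["main", ".docs-content"],
    excludeSelectors: ["nav", ".sidebar"],
    captureMetadata: ["title", "breadcrumbs", "codeBlocks"],
  },
  blog: {
    contentSelectors: ["article"],
    excludeSelectors: ["nav", ".related-posts", ".comments"],
    captureMetadata: ["title", "author", "publishedAt"],
  },
};
```

An e-commerce profile would extend the same idea with product-specific fields such as price, availability, and category once that preset ships.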
Benefits of Smart Crawling
Comprehensive Coverage
- Automatically discovers all publicly accessible pages
- No manual URL management required
- Finds content you might have forgotten about
Always Up-to-Date
- Scheduled re-crawling keeps information current
- Detects new pages and content changes
- Removes outdated information automatically
Intelligent Processing
- Filters out irrelevant content (ads, navigation, footers)
- Focuses on valuable information for customer support
- Optimizes content for AI understanding
Scalable Solution
- Handles websites of any size
- Efficient crawling doesn't impact site performance
- Works with all major CMS platforms
What Gets Crawled
Included Content:
- Main page content and articles
- Product descriptions and documentation
- FAQ sections and help content
- About pages and company information
- Blog posts and news articles
Filtered Out:
- Navigation menus and headers
- Advertisements and promotional banners
- Cookie notices and legal disclaimers
- Duplicate or boilerplate content
- Private or password-protected areas
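In practice, the include/exclude split comes down to removing boilerplate elements before text extraction. Below is a minimal sketch using the cheerio HTML parser and the usual HTML landmarks; SiteAssist's real filtering rules are more extensive.

```typescript
// Minimal boilerplate filter (a sketch, not SiteAssist's actual rules),
// using the cheerio HTML parser: remove the kinds of elements listed
// under "Filtered Out", then extract the remaining text.
import * as cheerio from "cheerio";

export function extractMainContent(html: string): string {
  const $ = cheerio.load(html);

  // Navigation, headers/footers, promos, and cookie notices are noise
  // for a support assistant, so drop them before reading the text.
  $("nav, header, footer, aside, script, style").remove();
  $("[class*='cookie'], [class*='promo'], [class*='banner']").remove();

  // Prefer an explicit main-content landmark; fall back to the body.
  const root = $("main").length > 0 ? $("main") : $("body");
  return root.text().replace(/\s+/g, " ").trim();
}
```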
Getting Started
Ready to set up your first crawler? The process is simple:
- Create a new crawler with your website URL
- Configure settings for your specific needs
- Monitor the crawling process and results
- Test your AI assistant with the newly indexed content
Quick Start Tip: Most users can get excellent results with just a website URL and the "Website" preset. Advanced configuration is available when you need more control.
Need Help?
Crawling not working as expected? Check our troubleshooting guide or contact our support team at support@siteassist.io.