Crawler Configuration
Advanced settings and configuration options for fine-tuning your web crawlers
While most users get excellent results with basic crawler setup, SiteAssist offers advanced configuration options for precise control over how your content is crawled and indexed.
Accessing Configuration
To configure your crawler:
- Navigate to Web Crawlers in your project sidebar
- Click on your existing crawler to open its management page
- Switch to the Configuration tab
Configuration Options
Start URL
What it is: The starting point where your crawler begins its discovery process.
How it works:
- Crawler starts at this URL and discovers linked pages
- Acts as the entry point for content discovery
- Must be publicly accessible
Configuration:
- Update anytime from the Configuration tab
- Helpful when your site structure changes
- Can point to specific sections (e.g., /help, /docs)
Examples:
yourcompany.com → Crawls entire site
yourcompany.com/help → Starts from help section
docs.yourcompany.com → Crawls documentation subdomain
Best Practices:
- Use your main domain for comprehensive coverage
- Use specific sections for focused crawling
- Ensure the URL contains links to important pages
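To make the discovery behavior concrete, here is a minimal sketch of breadth-first link discovery from a start URL. It is an illustration only, not SiteAssist's actual crawler; the URLs, page limit, and libraries used are placeholders of our choosing.

```python
# Sketch only: start at one URL, follow links, stay on the same host.
# The page count is roughly bounded by max_pages.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def discover(start_url: str, max_pages: int = 50) -> set[str]:
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        try:
            resp = requests.get(page, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(resp.text, "html.parser")
        for link in soup.find_all("a", href=True):
            url = urljoin(page, link["href"]).split("#")[0]
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return seen

# e.g. discover("https://yourcompany.com/help") only reaches pages
# linked (directly or indirectly) from the help section.
```

This is why the start URL matters: pages that are never linked from it (or from pages it leads to) are never discovered.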
Crawl Frequency
What it is: How often SiteAssist automatically re-crawls your website to keep the AI's knowledge up-to-date.
Why it matters:
- Keeps your AI assistant current with website changes
- Ensures new content is automatically indexed
- Removes outdated information
Frequency Options
Every 6 Hours
- Best for: Frequently updated sites (news, blogs, e-commerce)
- Use case: Sites with daily content changes
- Consideration: Higher resource usage
Every 12 Hours
- Best for: Regular content updates
- Use case: Business sites with weekly updates
- Consideration: Balanced approach
Every Day (Recommended)
- Best for: Most business websites
- Use case: Sites with occasional updates
- Consideration: Optimal balance of freshness and efficiency
Every Week
- Best for: Stable websites with infrequent changes
- Use case: Marketing sites, documentation sites
- Consideration: Lower resource usage
Every Month
- Best for: Static websites or rarely updated content
- Use case: Company info sites, established documentation
- Consideration: Minimal resource usage
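Conceptually, each frequency option is just a re-crawl interval. The sketch below shows one way to map the options to the next scheduled crawl time; the interval values and function are our illustration, not SiteAssist's scheduler.

```python
# Illustration only: map each frequency option to a re-crawl interval
# and compute when the next crawl would run.
from datetime import datetime, timedelta

INTERVALS = {
    "every_6_hours": timedelta(hours=6),
    "every_12_hours": timedelta(hours=12),
    "every_day": timedelta(days=1),   # recommended default
    "every_week": timedelta(weeks=1),
    "every_month": timedelta(days=30),
}

def next_crawl(last_crawl: datetime, frequency: str) -> datetime:
    return last_crawl + INTERVALS[frequency]

print(next_crawl(datetime(2024, 1, 1, 8, 0), "every_day"))  # 2024-01-02 08:00:00
```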
Exclusion Rules
What it is: A powerful system to prevent the crawler from indexing specific URLs or URL patterns.
Why use exclusion rules:
- Skip irrelevant pages (admin areas, login pages)
- Avoid duplicate content
- Exclude outdated or private sections
- Improve crawling efficiency
Rule Types
URL Starts With
- Matches URLs that begin with the specified text
- Perfect for excluding entire sections
https://yourcompany.com/admin/ → Excludes all admin pages
https://yourcompany.com/private/ → Excludes private sections
https://yourcompany.com/temp/ → Excludes temporary pages
URL Ends With
- Matches URLs that end with the specified text
- Great for excluding file types or specific page patterns
.pdf → Excludes all PDF files
/login → Excludes login pages
.xml → Excludes XML filesURL Contains
- Matches URLs that contain the specified text anywhere
- Useful for excluding pages with specific keywords
/archive → Excludes pages with "archive" in URL
?print= → Excludes print versions
/old → Excludes old content sections
URL Exact Match
- Matches the exact URL only
- Precise control for specific pages
https://yoursite.com/contact → Excludes only the contact page
https://yoursite.com/privacy → Excludes only privacy policy
https://yoursite.com/terms → Excludes only terms of service
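All four rule types boil down to simple string checks against a page's URL. The sketch below shows how such rules could be evaluated; it is an example, not SiteAssist's implementation, and the sample rules reuse the patterns above.

```python
# Sketch of the four exclusion rule types as plain string checks.
# Not SiteAssist's code; rule values are the examples from this page.
def is_excluded(url: str, rules: list[tuple[str, str]]) -> bool:
    for rule_type, pattern in rules:
        if rule_type == "starts_with" and url.startswith(pattern):
            return True
        if rule_type == "ends_with" and url.endswith(pattern):
            return True
        if rule_type == "contains" and pattern in url:
            return True
        if rule_type == "exact" and url == pattern:
            return True
    return False

rules = [
    ("starts_with", "https://yourcompany.com/admin/"),
    ("ends_with", ".pdf"),
    ("contains", "?print="),
    ("exact", "https://yoursite.com/privacy"),
]
print(is_excluded("https://yourcompany.com/admin/users", rules))   # True
print(is_excluded("https://yourcompany.com/help/billing", rules))  # False
```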
Adding Exclusion Rules
- In the Configuration tab, find the Exclusion Rules section
- Click Add new Rule
- Choose your rule type (starts with, ends with, contains, exact match)
- Enter the URL pattern to exclude
- Click Save
Max URLs
What it is: A safety limit that stops crawling after reaching a specified number of pages.
Why it exists:
- Cost Control: Prevents unexpected crawling costs on large sites
- Resource Management: Ensures efficient use of crawling resources
- Site Protection: Avoids overwhelming your website server
- Quality Focus: Encourages focusing on important content
Setting Max URLs
Small Business Sites (1-100 pages):
- Set limit: 200-500 URLs
- Provides buffer for growth
Medium Sites (100-1000 pages):
- Set limit: 1,500-2,000 URLs
- Accommodates comprehensive crawling
Large Sites (1000+ pages):
- Set limit: 3,000-5,000+ URLs
- Consider using exclusion rules to focus on important sections
What Happens at the Limit
When the crawler reaches your Max URLs setting:
- Crawling stops immediately
- Already discovered URLs are prioritized by importance
- Status shows "Completed (Max URLs reached)"
- You can increase the limit and re-run the crawler
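In other words, Max URLs acts as a hard stop on the crawl loop. The toy sketch below shows the idea; `fetch_links` is a hypothetical helper that returns the links found on a page, and this is not SiteAssist's code.

```python
# Toy illustration of a Max URLs cap: discovery stops once the number of
# known pages reaches the limit, and the crawl status reflects that.
def crawl_with_cap(start_url, fetch_links, max_urls=500):
    seen, queue = [start_url], [start_url]
    while queue and len(seen) < max_urls:
        page = queue.pop(0)
        for url in fetch_links(page):
            if url not in seen and len(seen) < max_urls:
                seen.append(url)
                queue.append(url)
    status = "Completed (Max URLs reached)" if len(seen) >= max_urls else "Completed"
    return seen, status
```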
Configuration Best Practices
Start Small, Expand Gradually
- Begin with basic settings
- Monitor initial crawl results
- Add exclusion rules as needed
- Adjust frequency based on content update patterns
Use Exclusion Rules Strategically
- Focus on excluding truly irrelevant content
- Don't over-exclude; your AI might miss useful information
- Test exclusions with small changes first
Monitor Performance
- Check crawl logs for errors
- Verify important pages are being crawled
- Adjust Max URLs if hitting limits frequently
Regular Review
- Review configuration monthly
- Update exclusion rules as site structure changes
- Adjust frequency based on actual content update patterns
Advanced Configuration Examples
E-commerce Site Example
Start URL: yourstore.com
Exclusion Rules:
- URL Contains: /cart
- URL Contains: /checkout
- URL Contains: /account
- URL Starts With: /admin
- URL Ends With: .jpg
Max URLs: 2000
Frequency: Every day
Documentation Site Example
Start URL: docs.yourcompany.com
Exclusion Rules:
- URL Starts With: /api-v1
- URL Contains: /deprecated
- URL Ends With: .pdf
Max URLs: 1000
Frequency: Every week
Business Website Example
Start URL: yourcompany.com
Exclusion Rules:
- URL Exact Match: yourcompany.com/privacy
- URL Starts With: /admin
- URL Contains: ?print=
Max URLs: 500
Frequency: Every day
Testing Your Configuration
After updating your crawler configuration:
- Run a Test Crawl to see immediate results
- Check the Logs in the Overview tab for any issues
- Review Indexed Content to ensure quality
- Test Your AI Assistant with questions about your content
Troubleshooting Configuration Issues
Crawler Missing Important Pages:
- Check if exclusion rules are too broad
- Verify start URL includes links to important sections
- Consider increasing Max URLs limit
Crawler Indexing Irrelevant Content:
- Add more specific exclusion rules
- Use "URL Contains" rules for broad exclusions
- Review and refine existing rules
Slow Crawling Performance:
- Reduce crawl frequency if not needed
- Add exclusion rules for large file types
- Lower Max URLs temporarily
Need Help?
Configuration can be complex for unique website structures. Our team is here to help:
- Email: support@siteassist.io
- We can help with: Custom exclusion rules, optimal frequency settings, troubleshooting crawl issues
Pro Tip: Start with conservative settings and gradually optimize. It's easier to expand crawling than to clean up over-crawled content!