Crawler Configuration

While most users get excellent results with basic crawler setup, SiteAssist offers advanced configuration options for precise control over how your content is crawled and indexed.

Accessing Configuration

To configure your crawler:

  1. Navigate to Web Crawlers in your project sidebar
  2. Click on your existing crawler to open its management page
  3. Switch to the Configuration tab

Configuration Options

Start URL

What it is: The starting point where your crawler begins its discovery process.

How it works:

  • The crawler starts at this URL and follows links to discover additional pages (see the sketch below)
  • Acts as the entry point for content discovery
  • Must be publicly accessible
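
To make the discovery step concrete, here is a minimal sketch of how a link-following crawler works in general. It is illustrative only, not SiteAssist's actual implementation, and fetch_links is a hypothetical helper that returns the absolute URLs linked from a page.

from collections import deque
from urllib.parse import urlparse

def discover(start_url, fetch_links):
    """Breadth-first discovery: start at start_url and follow same-host links.

    Assumes full URLs including the scheme, e.g. https://yourcompany.com/help.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        yield url
        for link in fetch_links(url):  # hypothetical helper: links found on the page
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)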

Configuration:

  • Update anytime from the Configuration tab
  • Helpful when your site structure changes
  • Can point to specific sections (e.g., /help, /docs)

Examples:

yourcompany.com          → Crawls entire site
yourcompany.com/help     → Starts from help section
docs.yourcompany.com     → Crawls documentation subdomain

Best Practices:

  • Use your main domain for comprehensive coverage
  • Use specific sections for focused crawling
  • Ensure the URL contains links to important pages

Crawl Frequency

What it is: How often SiteAssist automatically re-crawls your website to keep the AI's knowledge up-to-date.

Why it matters:

  • Keeps your AI assistant current with website changes
  • Ensures new content is automatically indexed
  • Removes outdated information

Frequency Options

Every 6 Hours

  • Best for: Frequently updated sites (news, blogs, e-commerce)
  • Use case: Sites with daily content changes
  • Consideration: Higher resource usage

Every 12 Hours

  • Best for: Regular content updates
  • Use case: Business sites with weekly updates
  • Consideration: Balanced approach

Every Day (Recommended)

  • Best for: Most business websites
  • Use case: Sites with occasional updates
  • Consideration: Optimal balance of freshness and efficiency

Every Week

  • Best for: Stable websites with infrequent changes
  • Use case: Marketing sites, documentation sites
  • Consideration: Lower resource usage

Every Month

  • Best for: Static websites or rarely updated content
  • Use case: Company info sites, established documentation
  • Consideration: Minimal resource usage
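
To put the resource-usage considerations above in perspective, here is a rough back-of-the-envelope comparison of how many re-crawls each option triggers over a 30-day month (illustrative arithmetic only):

HOURS_PER_MONTH = 30 * 24  # assume a 30-day month

interval_hours = {
    "Every 6 hours": 6,
    "Every 12 hours": 12,
    "Every day": 24,
    "Every week": 24 * 7,
    "Every month": 24 * 30,
}

for option, hours in interval_hours.items():
    print(f"{option}: ~{HOURS_PER_MONTH / hours:.0f} crawls per month")

# Every 6 hours:  ~120 crawls per month
# Every 12 hours: ~60 crawls per month
# Every day:      ~30 crawls per month
# Every week:     ~4 crawls per month
# Every month:    ~1 crawl per month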

Exclusion Rules

What it is: A powerful system to prevent the crawler from indexing specific URLs or URL patterns.

Why use exclusion rules:

  • Skip irrelevant pages (admin areas, login pages)
  • Avoid duplicate content
  • Exclude outdated or private sections
  • Improve crawling efficiency

Rule Types

URL Starts With

  • Matches URLs that begin with the specified text
  • Perfect for excluding entire sections
https://yourcompany.com/admin/          → Excludes all admin pages
https://yourcompany.com/private/        → Excludes private sections
https://yourcompany.com/temp/           → Excludes temporary pages

URL Ends With

  • Matches URLs that end with the specified text
  • Great for excluding file types or specific page patterns
.pdf            → Excludes all PDF files
/login          → Excludes login pages
.xml            → Excludes XML files

URL Contains

  • Matches URLs that contain the specified text anywhere
  • Useful for excluding pages with specific keywords
/archive        → Excludes pages with "archive" in URL
?print=         → Excludes print versions
/old            → Excludes old content sections

URL Exact Match

  • Matches the exact URL only
  • Precise control for specific pages
https://yoursite.com/contact        → Excludes only the contact page
https://yoursite.com/privacy        → Excludes only privacy policy
https://yoursite.com/terms          → Excludes only terms of service
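
The four rule types above boil down to simple string comparisons against the page URL. The following sketch shows one way to model them for understanding; it is a simplified illustration, not SiteAssist's actual matching code, and the example URLs reuse the patterns listed above.

def is_excluded(url: str, rule_type: str, pattern: str) -> bool:
    """Return True if the URL matches an exclusion rule of the given type."""
    if rule_type == "starts_with":
        return url.startswith(pattern)
    if rule_type == "ends_with":
        return url.endswith(pattern)
    if rule_type == "contains":
        return pattern in url
    if rule_type == "exact_match":
        return url == pattern
    raise ValueError(f"unknown rule type: {rule_type}")

# Patterns taken from the examples above:
is_excluded("https://yourcompany.com/admin/users", "starts_with", "https://yourcompany.com/admin/")  # True
is_excluded("https://yourcompany.com/guide.pdf", "ends_with", ".pdf")                                # True
is_excluded("https://yourcompany.com/blog/archive/2019", "contains", "/archive")                     # True
is_excluded("https://yoursite.com/contact", "exact_match", "https://yoursite.com/contact")           # True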

Adding Exclusion Rules

  1. In the Configuration tab, find the Exclusion Rules section
  2. Click Add new Rule
  3. Choose your rule type (starts with, ends with, contains, exact match)
  4. Enter the URL pattern to exclude
  5. Click Save

Max URLs

What it is: A safety limit that stops crawling after reaching a specified number of pages.

Why it exists:

  • Cost Control: Prevents unexpected crawling costs on large sites
  • Resource Management: Ensures efficient use of crawling resources
  • Site Protection: Avoids overwhelming your website server
  • Quality Focus: Encourages focusing on important content

Setting Max URLs

Small Business Sites (1-100 pages):

  • Set limit: 200-500 URLs
  • Provides buffer for growth

Medium Sites (100-1000 pages):

  • Set limit: 1,500-2,000 URLs
  • Accommodates comprehensive crawling

Large Sites (1000+ pages):

  • Set limit: 3,000-5,000+ URLs
  • Consider using exclusion rules to focus on important sections

What Happens at the Limit

When the crawler reaches your Max URLs setting:

  1. Crawling stops immediately
  2. Already discovered URLs are prioritized by importance
  3. Status shows "Completed (Max URLs reached)"
  4. You can increase the limit and re-run the crawler
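
As a rough mental model of the behavior described above, a crawl loop that honors the limit might look like the sketch below. This is a simplified illustration, not SiteAssist's implementation, and the path-depth scoring used for "importance" is a hypothetical stand-in.

def crawl_with_limit(discovered_urls, crawl_page, max_urls):
    """Crawl higher-priority URLs first and stop once the Max URLs cap is hit."""
    # Hypothetical importance score: URLs closer to the site root come first.
    prioritized = sorted(discovered_urls, key=lambda url: url.count("/"))
    crawled = []
    for url in prioritized:
        if len(crawled) >= max_urls:
            break  # crawling stops as soon as the limit is reached
        crawled.append(crawl_page(url))  # crawl_page: hypothetical fetch-and-index helper
    return crawled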

Configuration Best Practices

Start Small, Expand Gradually

  1. Begin with basic settings
  2. Monitor initial crawl results
  3. Add exclusion rules as needed
  4. Adjust frequency based on content update patterns

Use Exclusion Rules Strategically

  • Focus on excluding truly irrelevant content
  • Don't over-exclude, or your AI might miss useful information
  • Test exclusions with small changes first

Monitor Performance

  • Check crawl logs for errors
  • Verify important pages are being crawled
  • Adjust Max URLs if hitting limits frequently

Regular Review

  • Review configuration monthly
  • Update exclusion rules as site structure changes
  • Adjust frequency based on actual content update patterns

Advanced Configuration Examples

E-commerce Site Example

Start URL: yourstore.com
Exclusion Rules:
  - URL Contains: /cart
  - URL Contains: /checkout
  - URL Contains: /account
  - URL Starts With: https://yourstore.com/admin
  - URL Ends With: .jpg
Max URLs: 2000
Frequency: Every day

Documentation Site Example

Start URL: docs.yourcompany.com
Exclusion Rules:
  - URL Starts With: https://docs.yourcompany.com/api-v1
  - URL Contains: /deprecated
  - URL Ends With: .pdf
Max URLs: 1000
Frequency: Every week

Business Website Example

Start URL: yourcompany.com
Exclusion Rules:
  - URL Exact Match: https://yourcompany.com/privacy
  - URL Starts With: https://yourcompany.com/admin
  - URL Contains: ?print=
Max URLs: 500
Frequency: Every day

Testing Your Configuration

After updating your crawler configuration:

  1. Run a Test Crawl to see immediate results
  2. Check the Logs in the Overview tab for any issues
  3. Review Indexed Content to ensure quality
  4. Test Your AI Assistant with questions about your content

Troubleshooting Configuration Issues

Crawler Missing Important Pages:

  • Check if exclusion rules are too broad
  • Verify start URL includes links to important sections
  • Consider increasing Max URLs limit

Crawler Indexing Irrelevant Content:

  • Add more specific exclusion rules
  • Use "URL Contains" rules for broad exclusions
  • Review and refine existing rules

Slow Crawling Performance:

  • Reduce crawl frequency if not needed
  • Add exclusion rules for large file types
  • Lower Max URLs temporarily

Need Help?

Configuration can be complex for unique website structures. Our team is here to help:

  • Email: support@siteassist.io
  • We can help with: Custom exclusion rules, optimal frequency settings, troubleshooting crawl issues

Pro Tip: Start with conservative settings and gradually optimize. It's easier to expand crawling than to clean up over-crawled content!