Crawler Configuration
Advanced settings and configuration options for fine-tuning your web crawlers
While most users get excellent results with basic crawler setup, SiteAssist offers advanced configuration options for precise control over how your content is crawled and indexed.
Accessing Configuration
To configure your crawler:
- Navigate to Web Crawlers in your project sidebar
- Click on your existing crawler to open its management page
- Switch to the Configuration tab
Configuration Options
Start URL
What it is: The starting point where your crawler begins its discovery process.
How it works:
- Crawler starts at this URL and discovers linked pages
- Acts as the entry point for content discovery
- Must be publicly accessible
Configuration:
- Update anytime from the Configuration tab
- Helpful when your site structure changes
- Can point to specific sections (e.g., /help, /docs)
Examples:
yourcompany.com → Crawls entire site
yourcompany.com/help → Starts from help section
docs.yourcompany.com → Crawls documentation subdomain
Best Practices:
- Use your main domain for comprehensive coverage
- Use specific sections for focused crawling
- Ensure the URL contains links to important pages
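To make the discovery behavior concrete, here is a minimal sketch of breadth-first link discovery from a start URL. It is an illustration only, not SiteAssist's actual crawler; the URLs, page limit, and libraries used are placeholders of our choosing.

```python
# Sketch only: start at one URL, follow links, stay on the same host.
# The page count is roughly bounded by max_pages.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def discover(start_url: str, max_pages: int = 50) -> set[str]:
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        try:
            resp = requests.get(page, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(resp.text, "html.parser")
        for link in soup.find_all("a", href=True):
            url = urljoin(page, link["href"]).split("#")[0]
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return seen

# e.g. discover("https://yourcompany.com/help") only reaches pages
# linked (directly or indirectly) from the help section.
```

This is why the start URL matters: pages that are never linked from it (or from pages it leads to) are never discovered.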
Crawl Frequency
What it is: How often SiteAssist automatically re-crawls your website to keep the AI's knowledge up-to-date.
Why it matters:
- Keeps your AI assistant current with website changes
- Ensures new content is automatically indexed
- Removes outdated information
Frequency Options
Every 6 Hours
- Best for: Frequently updated sites (news, blogs, e-commerce)
- Use case: Sites with daily content changes
- Consideration: Higher resource usage
Every 12 Hours
- Best for: Regular content updates
- Use case: Business sites with weekly updates
- Consideration: Balanced approach
Every Day (Recommended)
- Best for: Most business websites
- Use case: Sites with occasional updates
- Consideration: Optimal balance of freshness and efficiency
Every Week
- Best for: Stable websites with infrequent changes
- Use case: Marketing sites, documentation sites
- Consideration: Lower resource usage
Every Month
- Best for: Static websites or rarely updated content
- Use case: Company info sites, established documentation
- Consideration: Minimal resource usage
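Conceptually, each frequency option is just a re-crawl interval. The sketch below shows one way to map the options to the next scheduled crawl time; the interval values and function are our illustration, not SiteAssist's scheduler.

```python
# Illustration only: map each frequency option to a re-crawl interval
# and compute when the next crawl would run.
from datetime import datetime, timedelta

INTERVALS = {
    "every_6_hours": timedelta(hours=6),
    "every_12_hours": timedelta(hours=12),
    "every_day": timedelta(days=1),   # recommended default
    "every_week": timedelta(weeks=1),
    "every_month": timedelta(days=30),
}

def next_crawl(last_crawl: datetime, frequency: str) -> datetime:
    return last_crawl + INTERVALS[frequency]

print(next_crawl(datetime(2024, 1, 1, 8, 0), "every_day"))  # 2024-01-02 08:00:00
```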
Exclusion Rules
What it is: A powerful system to prevent the crawler from indexing specific URLs or URL patterns.
Why use exclusion rules:
- Skip irrelevant pages (admin areas, login pages)
- Avoid duplicate content
- Exclude outdated or private sections
- Improve crawling efficiency
Rule Types
URL Starts With
- Matches URLs that begin with the specified text
- Perfect for excluding entire sections
https://yourcompany.com/admin/ → Excludes all admin pages
https://yourcompany.com/private/ → Excludes private sections
https://yourcompany.com/temp/ → Excludes temporary pages
URL Ends With
- Matches URLs that end with the specified text
- Great for excluding file types or specific page patterns
.pdf → Excludes all PDF files
/login → Excludes login pages
.xml → Excludes XML filesURL Contains
- Matches URLs that contain the specified text anywhere
- Useful for excluding pages with specific keywords
/archive → Excludes pages with "archive" in URL
?print= → Excludes print versions
/old → Excludes old content sections
URL Exact Match
- Matches the exact URL only
- Precise control for specific pages
https://yoursite.com/contact → Excludes only the contact page
https://yoursite.com/privacy → Excludes only privacy policy
https://yoursite.com/terms → Excludes only terms of service
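All four rule types boil down to simple string checks against a page's URL. The sketch below shows how such rules could be evaluated; it is an example, not SiteAssist's implementation, and the sample rules reuse the patterns above.

```python
# Sketch of the four exclusion rule types as plain string checks.
# Not SiteAssist's code; rule values are the examples from this page.
def is_excluded(url: str, rules: list[tuple[str, str]]) -> bool:
    for rule_type, pattern in rules:
        if rule_type == "starts_with" and url.startswith(pattern):
            return True
        if rule_type == "ends_with" and url.endswith(pattern):
            return True
        if rule_type == "contains" and pattern in url:
            return True
        if rule_type == "exact" and url == pattern:
            return True
    return False

rules = [
    ("starts_with", "https://yourcompany.com/admin/"),
    ("ends_with", ".pdf"),
    ("contains", "?print="),
    ("exact", "https://yoursite.com/privacy"),
]
print(is_excluded("https://yourcompany.com/admin/users", rules))   # True
print(is_excluded("https://yourcompany.com/help/billing", rules))  # False
```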
Adding Exclusion Rules
- In the Configuration tab, find the Exclusion Rules section
- Click Add new Rule
- Choose your rule type (starts with, ends with, contains, exact match)
- Enter the URL pattern to exclude
- Click Save
Max URLs
What it is: A safety limit that stops crawling after reaching a specified number of pages.
Why it exists:
- Cost Control: Prevents unexpected crawling costs on large sites
- Resource Management: Ensures efficient use of crawling resources
- Site Protection: Avoids overwhelming your website server
- Quality Focus: Encourages focusing on important content
Setting Max URLs
Small Business Sites (1-100 pages):
- Set limit: 200-500 URLs
- Provides buffer for growth
Medium Sites (100-1000 pages):
- Set limit: 1,500-2,000 URLs
- Accommodates comprehensive crawling
Large Sites (1000+ pages):
- Set limit: 3,000-5,000+ URLs
- Consider using exclusion rules to focus on important sections
What Happens at the Limit
When the crawler reaches your Max URLs setting:
- Crawling stops immediately
- Already discovered URLs are prioritized by importance
- Status shows "Completed (Max URLs reached)"
- You can increase the limit and re-run the crawler
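In other words, Max URLs acts as a hard stop on the crawl loop. The toy sketch below shows the idea; `fetch_links` is a hypothetical helper that returns the links found on a page, and this is not SiteAssist's code.

```python
# Toy illustration of a Max URLs cap: discovery stops once the number of
# known pages reaches the limit, and the crawl status reflects that.
def crawl_with_cap(start_url, fetch_links, max_urls=500):
    seen, queue = [start_url], [start_url]
    while queue and len(seen) < max_urls:
        page = queue.pop(0)
        for url in fetch_links(page):
            if url not in seen and len(seen) < max_urls:
                seen.append(url)
                queue.append(url)
    status = "Completed (Max URLs reached)" if len(seen) >= max_urls else "Completed"
    return seen, status
```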
Configuration Best Practices
Start Small, Expand Gradually
- Begin with basic settings
- Monitor initial crawl results
- Add exclusion rules as needed
- Adjust frequency based on content update patterns
Use Exclusion Rules Strategically
- Focus on excluding truly irrelevant content
- Don't over-exclude; your AI might miss useful information
- Test exclusions with small changes first
Monitor Performance
- Check crawl logs for errors
- Verify important pages are being crawled
- Adjust Max URLs if hitting limits frequently
Regular Review
- Review configuration monthly
- Update exclusion rules as site structure changes
- Adjust frequency based on actual content update patterns
Advanced Configuration Examples
E-commerce Site Example
Start URL: yourstore.com
Exclusion Rules:
- URL Contains: /cart
- URL Contains: /checkout
- URL Contains: /account
- URL Starts With: /admin
- URL Ends With: .jpg
Max URLs: 2000
Frequency: Every day
Documentation Site Example
Start URL: docs.yourcompany.com
Exclusion Rules:
- URL Starts With: /api-v1
- URL Contains: /deprecated
- URL Ends With: .pdf
Max URLs: 1000
Frequency: Every week
Business Website Example
Start URL: yourcompany.com
Exclusion Rules:
- URL Exact Match: yourcompany.com/privacy
- URL Starts With: /admin
- URL Contains: ?print=
Max URLs: 500
Frequency: Every day
Testing Your Configuration
After updating your crawler configuration:
- Run a Test Crawl to see immediate results
- Check the Logs in the Overview tab for any issues
- Review Indexed Content to ensure quality
- Test Your AI Assistant with questions about your content
Troubleshooting Configuration Issues
Crawler Missing Important Pages:
- Check if exclusion rules are too broad
- Verify start URL includes links to important sections
- Consider increasing Max URLs limit
Crawler Indexing Irrelevant Content:
- Add more specific exclusion rules
- Use "URL Contains" rules for broad exclusions
- Review and refine existing rules
Slow Crawling Performance:
- Reduce crawl frequency if not needed
- Add exclusion rules for large file types
- Lower Max URLs temporarily
Need Help?
Configuration can be complex for unique website structures. Our team is here to help:
- Email: support@siteassist.io
- We can help with: Custom exclusion rules, optimal frequency settings, troubleshooting crawl issues
Pro Tip: Start with conservative settings and gradually optimize. It's easier to expand crawling than to clean up over-crawled content!