Troubleshooting Web Crawlers

Having trouble with your web crawler? This guide covers the most common issues and their solutions to get your crawler working smoothly.

Common Setup Issues

🚫 Invalid URL Format Error

Problem: Entering a URL with the https:// prefix causes a validation error.

Why this happens:

  • The Start URL field automatically adds https:// prefix
  • Adding https:// manually creates a double prefix (https://https://yourcompany.com)
  • System rejects malformed URLs

Solution:

❌ Wrong: https://yourcompany.com
✅ Correct: yourcompany.com

❌ Wrong: http://yourcompany.com
✅ Correct: yourcompany.com

❌ Wrong: https://docs.yourcompany.com/help
✅ Correct: docs.yourcompany.com/help

Quick Fix:

  1. Remove https:// or http:// from your URL
  2. Enter only the domain and path
  3. The system will automatically add the secure protocol (a normalization sketch follows below)
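
If you are cleaning up a batch of URLs before entering them, here is a minimal normalization sketch in Python (standard library only; nothing in it is SiteAssist-specific):

from urllib.parse import urlparse

def strip_scheme(url: str) -> str:
    # Parse with a scheme-relative fallback so bare domains still work
    parsed = urlparse(url if "//" in url else "//" + url)
    # Keep the domain, path, and query; drop http:// or https://
    return parsed.netloc + parsed.path + (("?" + parsed.query) if parsed.query else "")

print(strip_scheme("https://docs.yourcompany.com/help"))  # docs.yourcompany.com/help
print(strip_scheme("yourcompany.com"))                    # yourcompany.com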

🔄 Single Page Application (SPA) Crawling Issues

Problem: Crawler finds very few pages on React, Vue, or Angular websites.

Why this happens:

  • SPAs render content with JavaScript after page load
  • Crawlers see the initial HTML without JavaScript-rendered content
  • Most navigation happens client-side, not through traditional links

Symptoms:

  • Only 1-2 pages discovered when you expect many more
  • Missing content from dynamic sections
  • Crawler stops quickly with a low page count (the sketch below shows a quick way to confirm this)
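
To confirm the diagnosis, fetch a page the way a crawler sees it, as raw HTML with no JavaScript executed. A minimal sketch using only the Python standard library (the URL is a placeholder):

import urllib.request
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = 0
    def handle_starttag(self, tag, attrs):
        # Count anchor tags carrying an href, i.e., links a crawler can follow
        if tag == "a" and dict(attrs).get("href"):
            self.links += 1

url = "https://yourcompany.com"  # placeholder: your SPA's address
html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
counter = LinkCounter()
counter.feed(html)
print(f"{len(html)} characters of HTML, {counter.links} followable links before JavaScript runs")

If a content-rich site shows only a handful of links here, the crawler has little to work with, and the options below apply.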

Solutions:

Option 1: Use Sitemap (Recommended)

  • Ensure your SPA generates a sitemap.xml
  • Include all routes in your sitemap
  • SiteAssist will use the sitemap for comprehensive crawling (you can check what the sitemap exposes with the sketch below)
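
To see exactly what your sitemap exposes, a quick check using the standard sitemap protocol and the Python standard library (the sitemap URL is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
with urllib.request.urlopen("https://yourcompany.com/sitemap.xml", timeout=10) as resp:
    root = ET.fromstring(resp.read())
# <loc> elements hold the listed URLs (this also matches nested sitemap indexes)
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
print(f"{len(urls)} URLs listed")
for u in urls[:10]:
    print(u)

If routes you care about are missing from this list, add them to the sitemap before re-crawling.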

Option 2: Server-Side Rendering (SSR)

  • Implement Next.js, Nuxt.js, or a similar SSR framework
  • Ensures content is available when crawler visits
  • Best long-term solution for SEO and crawling

Option 3: Prerendering

  • Use tools like Prerender.io or Netlify's prerendering
  • Generates static HTML for crawler visits
  • Good compromise for existing SPAs

Option 4: Manual URL Management

  • Use exclusion rules strategically
  • Focus crawler on content-heavy sections
  • Consider multiple targeted crawlers

Crawling Performance Issues

⏱️ Crawler Running Too Long

Problem: Crawler seems stuck or runs much longer than expected.

Possible Causes & Solutions:

Large Website

  • Cause: Site has thousands of pages
  • Solution: Set an appropriate Max URLs limit (start with 1,000-2,000)
  • Prevention: Use exclusion rules to focus on important content

Slow Server Response

  • Cause: Your website responds slowly to requests
  • Solution: Check your website's performance and hosting
  • Workaround: Allow the crawl more time, or run it during off-peak hours

Infinite URL Loops

  • Cause: Dynamic URLs creating endless variations (e.g., ?page=, ?sort=, and ?filter= values combining into thousands of near-duplicate URLs)
  • Solution: Add exclusion rules for the offending URL parameters, for example (a sketch for spotting them follows these examples):
URL Contains: ?page=
URL Contains: ?filter=
URL Contains: ?sort=
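
If you can export the URLs the crawler visited, this sketch counts which query parameters appear most often, the usual culprits behind loops (the URL list is illustrative; substitute your crawl log):

from urllib.parse import urlparse, parse_qs
from collections import Counter

crawled = [
    "https://yourcompany.com/products?page=1",
    "https://yourcompany.com/products?page=2&sort=price",
    "https://yourcompany.com/products?filter=red",
]  # replace with URLs from your crawl log

param_counts = Counter()
for url in crawled:
    for param in parse_qs(urlparse(url).query):
        param_counts[param] += 1

# Parameters that dominate the list are good candidates for exclusion rules
for param, count in param_counts.most_common():
    print(f"{param}: appears in {count} URLs")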

Deep Link Structure

  • Cause: Very deep navigation hierarchies
  • Solution: Start crawling from specific sections rather than the site root

📊 Very Low Page Discovery

Problem: Crawler finds far fewer pages than expected.

Common Causes & Fixes:

Poor Internal Linking

  • Cause: Pages aren't linked from main navigation
  • Solution: Improve site navigation or use sitemap crawling
  • Check: Ensure important pages are linked from your homepage

Broken Internal Links

  • Cause: Links pointing to non-existent pages
  • Solution: Fix broken links in your website navigation
  • Tool: Use a link checker to find broken links (a minimal sketch follows below)
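
A minimal single-page link check using only the Python standard library (the start URL is a placeholder; dedicated tools also follow links recursively):

import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

START = "https://yourcompany.com"  # placeholder

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(urljoin(START, href))  # resolve relative links

page = urllib.request.urlopen(START, timeout=10).read().decode("utf-8", "replace")
collector = LinkCollector()
collector.feed(page)

for link in sorted(set(collector.hrefs)):
    if not link.startswith("http"):
        continue  # skip mailto:, tel:, and javascript: links
    try:
        status = urllib.request.urlopen(link, timeout=10).status
    except urllib.error.HTTPError as e:
        status = e.code
    except OSError:
        status = "unreachable"
    if status != 200:
        print(status, link)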

JavaScript-Heavy Navigation

  • Cause: Navigation menu built entirely with JavaScript
  • Solution: Add HTML fallback navigation or use sitemap
  • Best Practice: Ensure critical pages have HTML links

Robots.txt Restrictions

  • Cause: Your robots.txt file blocks crawling
  • Solution: Check and adjust robots.txt permissions
  • Test: Use Google Search Console, or the sketch below, to test robot access
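
Python's standard library can run a similar test locally. A sketch (domain and paths are placeholders; "*" checks rules that apply to any crawler without a more specific user-agent entry):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://yourcompany.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt
for path in ["/", "/docs/", "/blog/"]:  # placeholder paths
    url = "https://yourcompany.com" + path
    print(path, "allowed" if rp.can_fetch("*", url) else "blocked")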

Content Quality Issues

🗑️ Crawler Indexing Irrelevant Content

Problem: AI assistant knows about admin pages, shopping carts, or other irrelevant content.

Solution Strategy:

Add Strategic Exclusion Rules

Common exclusions for business sites:
URL Starts With: https://yourcompany.com/admin
URL Starts With: https://yourcompany.com/wp-admin
URL Contains: /cart
URL Contains: /checkout
URL Contains: /login
URL Contains: /register
URL Ends With: .pdf
URL Contains: ?print=

E-commerce Specific Exclusions

URL Contains: /cart
URL Contains: /checkout
URL Contains: /account
URL Contains: /wishlist
URL Starts With: https://yourcompany.com/customer
URL Contains: ?variant=

Review and Refine

  1. Run an initial crawl to see what gets indexed
  2. Identify unwanted content in the logs
  3. Add specific exclusion rules
  4. Re-run the crawler to test improvements (the sketch below lets you test rules offline first)
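
Before re-running, you can check candidate rules against sample URLs offline. A sketch, assuming the three rule types match URL strings literally (SiteAssist's exact matching semantics are an assumption here):

def excluded(url, rules):
    # rules are (rule_type, value) pairs mirroring the dashboard options
    for rule_type, value in rules:
        if rule_type == "starts_with" and url.startswith(value):
            return True
        if rule_type == "contains" and value in url:
            return True
        if rule_type == "ends_with" and url.endswith(value):
            return True
    return False

rules = [("contains", "/cart"), ("ends_with", ".pdf"),
         ("starts_with", "https://yourcompany.com/admin")]
for url in ["https://yourcompany.com/pricing",
            "https://yourcompany.com/cart/view",
            "https://yourcompany.com/guide.pdf"]:
    print("exclude" if excluded(url, rules) else "keep", url)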

❌ Missing Important Pages

Problem: Key pages aren't being crawled and indexed.

Diagnostic Steps:

Check Start URL

  • Ensure the start URL links, directly or through its navigation, to the missing pages
  • Try starting from a section closer to the missing content

Review Exclusion Rules

  • Check if exclusion rules are too broad
  • Temporarily disable rules to test
  • Refine overly aggressive exclusions

Verify Page Accessibility

  • Ensure pages are publicly accessible (no login required)
  • Check that pages return a 200 HTTP status (the sketch below automates this and flags login redirects)
  • Verify pages aren't blocked by robots.txt
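
A sketch covering the first two checks, HTTP status and login redirects, for a few specific pages (URLs are placeholders; robots.txt testing is covered earlier in this guide):

import urllib.error
import urllib.request

pages = [
    "https://yourcompany.com/pricing",
    "https://yourcompany.com/docs/setup",
]  # placeholders: the pages that aren't being indexed

for url in pages:
    try:
        resp = urllib.request.urlopen(url, timeout=10)
        final = resp.geturl()  # where any redirects ended up
        if "login" in final and final != url:
            print(f"{url} -> redirected to {final} (likely behind a login)")
        else:
            print(f"{url} -> HTTP {resp.status}")
    except urllib.error.HTTPError as e:
        print(f"{url} -> HTTP {e.code}")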

Check Max URLs Limit

  • Increase the limit if the crawler stops before reaching important pages
  • Monitor crawl progress to see where it stops

Technical Issues

🚨 Crawler Fails to Start

Problem: Crawler won't begin crawling at all.

Troubleshooting Checklist:

URL Accessibility

Test your URL:

  1. Open the start URL in a browser
  2. Verify the page loads completely
  3. Check for password protection
  4. Ensure the SSL certificate is valid (a scripted check follows below)
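
Steps 1-3 are quickest in a browser; step 4 can be scripted. A minimal certificate check in Python (the hostname is a placeholder):

import socket
import ssl

host = "yourcompany.com"  # placeholder
try:
    # The TLS handshake below fails if the certificate is invalid or expired
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(socket.create_connection((host, 443), timeout=10),
                         server_hostname=host) as s:
        print("Certificate OK, expires:", s.getpeercert()["notAfter"])
except (ssl.SSLError, OSError) as e:
    print("Connection or certificate problem:", e)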

Domain Configuration

  • Verify domain is spelled correctly
  • Check for typos in subdomain names
  • Ensure DNS is properly configured

Server Issues

  • Check if your website is currently down
  • Verify hosting provider isn't blocking crawlers
  • Test at different times if the server is overloaded

⚠️ Partial Crawl Results

Problem: Crawler stops unexpectedly with partial results.

Common Causes:

Hit Max URLs Limit

  • Check: Crawler status shows "Max URLs reached"
  • Solution: Increase Max URLs limit or add exclusion rules

Server Rate Limiting

  • Cause: Your server blocks rapid requests
  • Solution: Contact support for crawler rate adjustment
  • Prevention: Ensure hosting can handle crawler traffic

Website Structure Changes

  • Cause: Site navigation changed during crawl
  • Solution: Re-run crawler after changes settle
  • Prevention: Schedule crawls during maintenance windows

Website-Specific Solutions

WordPress Sites

Common Issues:

  • WP admin pages getting crawled
  • Plugin pages creating noise
  • Duplicate content from categories/tags

Recommended Exclusions:

URL Starts With: https://yourcompany.com/wp-admin
URL Starts With: https://yourcompany.com/wp-login
URL Contains: /author/
URL Contains: /tag/
URL Contains: /date/
URL Contains: ?p=

Shopify Stores

Common Issues:

  • Product variants creating duplicate URLs
  • Customer account pages
  • Checkout process pages

Recommended Exclusions:

URL Contains: /account
URL Contains: /cart
URL Contains: /checkout
URL Contains: ?variant=
URL Starts With: https://yourcompany.com/admin

Documentation Sites

Common Issues:

  • Multiple versions creating confusion
  • API references indexed in an unhelpful format
  • Download files being indexed

Recommended Exclusions:

URL Contains: /v1/
URL Contains: /deprecated
URL Ends With: .pdf
URL Ends With: .zip
URL Contains: /download

Getting Help

When to Contact Support

Reach out to us at support@siteassist.io if:

  • Crawler consistently fails after trying solutions above
  • Your website has unique architecture needs
  • You need help setting up complex exclusion rules
  • Crawling works but AI responses are poor quality
  • You're seeing unexpected behavior not covered here

Information to Include

When contacting support, please provide:

  1. Your website URL
  2. Crawler configuration details
  3. Description of the problem
  4. Screenshots of crawler status/logs
  5. Expected vs. actual results

Quick Diagnostic Steps

Before contacting support, try:

  1. Test in Private Browser: Rule out caching issues
  2. Check Crawler Logs: Look for specific error messages
  3. Try Different Start URL: Test with a simpler starting point
  4. Disable All Exclusions: See if rules are causing issues
  5. Increase Max URLs: Rule out limit-related problems

Prevention Tips

Set Yourself Up for Success

Website Preparation:

  • Ensure clear navigation structure
  • Generate and maintain sitemap.xml
  • Use descriptive, crawlable URLs
  • Minimize JavaScript-dependent navigation

Crawler Configuration:

  • Start with conservative settings
  • Test with small Max URLs first
  • Add exclusions gradually
  • Monitor results after changes

Regular Maintenance:

  • Review crawler performance monthly
  • Update exclusions as site changes
  • Adjust crawl frequency based on content updates
  • Test AI responses periodically

Remember: Most crawling issues have simple solutions. Start with the basics (URL format, exclusions, limits) before diving into complex configurations.