Troubleshooting Web Crawlers

Having trouble with your web crawler? This guide covers the most common issues and their solutions to get your crawler working smoothly.

Common Setup Issues

🚫 Invalid URL Format Error

Problem: Entering a URL with the https:// prefix causes a validation error.

Why this happens:

  • The Start URL field automatically adds https:// prefix
  • Adding https:// manually creates a double prefix (https://https://yoursite.com)
  • System rejects malformed URLs

Solution:

❌ Wrong: https://yourcompany.com
✅ Correct: yourcompany.com

❌ Wrong: http://yourcompany.com
✅ Correct: yourcompany.com

❌ Wrong: https://docs.yourcompany.com/help
✅ Correct: docs.yourcompany.com/help

Quick Fix:

  1. Remove https:// or http:// from your URL
  2. Enter only the domain and path
  3. The system will automatically add the secure protocol
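
If you keep a list of start URLs in a script or spreadsheet, you can strip the protocol before pasting them in. Below is a minimal Python sketch of that cleanup using the placeholder domains from the examples above; SiteAssist's own validation isn't shown here, so treat it as an illustration of the "domain and path only" rule, not the product's logic.

```python
# Strip an http:// or https:// prefix so only the domain and path remain.
# Illustrative only; SiteAssist's own validation may differ in detail.
from urllib.parse import urlparse

def strip_scheme(raw_url: str) -> str:
    raw = raw_url.strip()
    parsed = urlparse(raw if "://" in raw else "https://" + raw)
    cleaned = parsed.netloc + parsed.path   # query strings are ignored in this sketch
    return cleaned.rstrip("/")

print(strip_scheme("https://yourcompany.com"))           # yourcompany.com
print(strip_scheme("http://docs.yourcompany.com/help"))  # docs.yourcompany.com/help
```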

🔄 Single Page Application (SPA) Crawling Issues

Problem: Crawler finds very few pages on React, Vue, or Angular websites.

Why this happens:

  • SPAs render content with JavaScript after page load
  • Crawlers see the initial HTML without JavaScript-rendered content
  • Most navigation happens client-side, not through traditional links

Symptoms:

  • Only 1-2 pages discovered when you expect many more
  • Missing content from dynamic sections
  • Crawler stops quickly with low page count

Solutions:

Option 1: Use Sitemap (Recommended)

  • Ensure your SPA generates a sitemap.xml
  • Include all routes in your sitemap
  • SiteAssist will use the sitemap for comprehensive crawling
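
If your framework doesn't generate a sitemap for you, a small build step can write one from your route list. Here is a rough, standard-library-only Python sketch; the domain and routes are placeholders, and most SPA frameworks have sitemap plugins that do this more robustly.

```python
# Minimal sitemap.xml generator for a fixed list of SPA routes.
# BASE and ROUTES are placeholders; adapt this to your build process.
from xml.sax.saxutils import escape

BASE = "https://yourcompany.com"
ROUTES = ["/", "/pricing", "/docs/getting-started", "/docs/faq"]

entries = "\n".join(
    f"  <url><loc>{escape(BASE + route)}</loc></url>" for route in ROUTES
)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>\n"
)

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(sitemap)
```

Serve the generated file from your site root (for example yourcompany.com/sitemap.xml) and reference it from robots.txt so crawlers can discover it.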

Option 2: Server-Side Rendering (SSR)

  • Implement Next.js, Nuxt.js, or similar SSR framework
  • Ensures content is available when crawler visits
  • Best long-term solution for SEO and crawling

Option 3: Prerendering

  • Use tools like Prerender.io or Netlify's prerendering
  • Generates static HTML for crawler visits
  • Good compromise for existing SPAs

Option 4: Manual URL Management

  • Use exclusion rules strategically
  • Focus crawler on content-heavy sections
  • Consider multiple targeted crawlers

Crawling Performance Issues

⏱️ Crawler Running Too Long

Problem: Crawler seems stuck or runs much longer than expected.

Possible Causes & Solutions:

Large Website

  • Cause: Site has thousands of pages
  • Solution: Set appropriate Max URLs limit (start with 1,000-2,000)
  • Prevention: Use exclusion rules to focus on important content

Slow Server Response

  • Cause: Your website responds slowly to requests
  • Solution: Check your website's performance and hosting
  • Workaround: Allow extra time for the crawl to finish, or run it during off-peak hours
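
For a quick sense of how slowly your pages respond, you can time a few representative URLs before starting a long crawl. A simple standard-library sketch (placeholder URLs):

```python
# Time a few representative pages. Consistently slow responses (several
# seconds each) mean a long crawl is expected rather than a crawler problem.
import time
from urllib.request import Request, urlopen

for url in ["https://yourcompany.com/", "https://yourcompany.com/docs/"]:
    start = time.perf_counter()
    urlopen(Request(url, headers={"User-Agent": "timing-check"}), timeout=30).read()
    print(f"{time.perf_counter() - start:5.2f}s  {url}")
```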

Infinite URL Loops

  • Cause: Dynamic URLs creating endless variations
  • Solution: Add exclusion rules for URL parameters
URL Contains: ?page=
URL Contains: ?filter=
URL Contains: ?sort=
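
These rules matter because every parameter combination looks like a new URL even though it points at the same underlying page. The sketch below illustrates that duplication by canonicalizing URLs (dropping the query string); it is an illustration of the problem, not how SiteAssist deduplicates internally.

```python
# Show how query parameters multiply into "new" URLs for the same page.
from urllib.parse import urlsplit, urlunsplit

urls = [
    "https://yourcompany.com/products?page=2",
    "https://yourcompany.com/products?page=3&sort=price",
    "https://yourcompany.com/products?filter=blue&page=2",
]

def canonical(url: str) -> str:
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

print({canonical(u) for u in urls})
# {'https://yourcompany.com/products'}  three crawl targets, one real page
```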

Deep Link Structure

  • Cause: Very deep navigation hierarchies
  • Solution: Start crawling from specific sections rather than root

📊 Very Low Page Discovery

Problem: Crawler finds far fewer pages than expected.

Common Causes & Fixes:

Poor Internal Linking

  • Cause: Pages aren't linked from main navigation
  • Solution: Improve site navigation or use sitemap crawling
  • Check: Ensure important pages are linked from your homepage

Broken Internal Links

  • Cause: Links pointing to non-existent pages
  • Solution: Fix broken links in your website navigation
  • Tool: Use website crawling tools to identify broken links
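
If you don't have a dedicated link checker handy, a rough standard-library pass over a single page can surface obvious breakage. The sketch below fetches one page (placeholder URL), collects its internal links, and reports anything that doesn't return HTTP 200; a proper crawling tool is still the better option for whole-site checks.

```python
# Basic broken-link check for a single page, standard library only.
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlsplit
from urllib.request import Request, urlopen

PAGE = "https://yourcompany.com/"  # placeholder page to check

class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href> tags."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.add(urljoin(PAGE, href))

html = urlopen(Request(PAGE, headers={"User-Agent": "link-check"}), timeout=15).read()
collector = LinkCollector()
collector.feed(html.decode("utf-8", errors="replace"))

site = urlsplit(PAGE).netloc
for link in sorted(collector.links):
    if urlsplit(link).netloc != site:
        continue  # only check links on the same site
    try:
        status = urlopen(Request(link, headers={"User-Agent": "link-check"}), timeout=15).status
    except HTTPError as err:
        status = err.code
    except URLError as err:
        status = f"error: {err.reason}"
    if status != 200:
        print(status, link)
```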

JavaScript-Heavy Navigation

  • Cause: Navigation menu built entirely with JavaScript
  • Solution: Add HTML fallback navigation or use sitemap
  • Best Practice: Ensure critical pages have HTML links

Robots.txt Restrictions

  • Cause: Your robots.txt file blocks crawling
  • Solution: Check and adjust robots.txt permissions
  • Test: Use Google Search Console to test robot access
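
You can also check robots.txt locally with Python's built-in parser. The sketch below tests the generic rules (user agent "*"); a rule targeted at a specific crawler's user agent would need that agent name instead, and the URLs are placeholders.

```python
# Check which URLs the site's robots.txt allows, using the standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://yourcompany.com/robots.txt")
rp.read()

for url in ["https://yourcompany.com/", "https://yourcompany.com/docs/getting-started"]:
    allowed = rp.can_fetch("*", url)   # "*" tests the rules that apply to all bots
    print("allowed" if allowed else "blocked", url)
```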

Content Quality Issues

🗑️ Crawler Indexing Irrelevant Content

Problem: AI assistant knows about admin pages, shopping carts, or other irrelevant content.

Solution Strategy:

Add Strategic Exclusion Rules

Common exclusions for business sites:
- URL Starts With: https://yourcompany.com/admin
- URL Starts With: https://yourcompany.com/wp-admin
- URL Contains: /cart
- URL Contains: /checkout
- URL Contains: /login
- URL Contains: /register
- URL Ends With: .pdf
- URL Contains: ?print=
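
As a rough mental model, the three rule types behave like prefix, substring, and suffix tests on the full URL. The sketch below mimics that behavior to show which URLs a list like the one above would remove; the exact matching SiteAssist applies may differ, so verify against a real crawl.

```python
# Rough illustration of prefix / substring / suffix exclusion rules.
# SiteAssist's exact matching may differ; this only conveys the intent.
RULES = [
    ("starts_with", "https://yourcompany.com/admin"),
    ("contains", "/cart"),
    ("contains", "/checkout"),
    ("ends_with", ".pdf"),
]

def excluded(url: str) -> bool:
    for kind, pattern in RULES:
        if kind == "starts_with" and url.startswith(pattern):
            return True
        if kind == "contains" and pattern in url:
            return True
        if kind == "ends_with" and url.endswith(pattern):
            return True
    return False

urls = [
    "https://yourcompany.com/pricing",
    "https://yourcompany.com/admin/users",
    "https://yourcompany.com/shop/cart",
    "https://yourcompany.com/guides/setup.pdf",
]
print([u for u in urls if not excluded(u)])
# ['https://yourcompany.com/pricing']
```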

E-commerce Specific Exclusions

- URL Contains: /cart
- URL Contains: /checkout
- URL Contains: /account
- URL Contains: /wishlist
- URL Starts With: https://yourcompany.com/customer
- URL Contains: ?variant=

Review and Refine

  1. Run initial crawl to see what gets indexed
  2. Identify unwanted content in the logs
  3. Add specific exclusion rules
  4. Re-run crawler to test improvements

❌ Missing Important Pages

Problem: Key pages aren't being crawled and indexed.

Diagnostic Steps:

Check Start URL

  • Ensure the start URL links to the missing pages, directly or through intermediate pages
  • Try starting from a section closer to the missing content

Review Exclusion Rules

  • Check if exclusion rules are too broad
  • Temporarily disable rules to test
  • Refine overly aggressive exclusions

Verify Page Accessibility

  • Ensure pages are publicly accessible (no login required)
  • Check that pages return 200 HTTP status
  • Verify pages aren't blocked by robots.txt

Check Max URLs Limit

  • Increase limit if crawler stops before reaching important pages
  • Monitor crawl progress to see where it stops

Technical Issues

🚨 Crawler Fails to Start

Problem: Crawler won't begin crawling at all.

Troubleshooting Checklist:

URL Accessibility

Test your URL:

  1. Open start URL in browser
  2. Verify page loads completely
  3. Check for password protection
  4. Ensure SSL certificate is valid
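
For step 4, you can inspect the certificate directly instead of relying on a browser padlock. A short Python sketch (placeholder hostname); a failed handshake here usually means an expired, self-signed, or otherwise misconfigured certificate.

```python
# Verify the TLS certificate and report its expiry date, standard library only.
import socket
import ssl
import time

hostname = "yourcompany.com"  # placeholder

context = ssl.create_default_context()   # verifies the chain and hostname
with socket.create_connection((hostname, 443), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()

days_left = (ssl.cert_time_to_seconds(cert["notAfter"]) - time.time()) / 86400
print(f"certificate OK, expires {cert['notAfter']} ({days_left:.0f} days left)")
```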

Domain Configuration

  • Verify domain is spelled correctly
  • Check for typos in subdomain names
  • Ensure DNS is properly configured

Server Issues

  • Check if your website is currently down
  • Verify hosting provider isn't blocking crawlers
  • Test at different times if the server is overloaded

⚠️ Partial Crawl Results

Problem: Crawler stops unexpectedly with partial results.

Common Causes:

Hit Max URLs Limit

  • Check: Crawler status shows "Max URLs reached"
  • Solution: Increase Max URLs limit or add exclusion rules

Server Rate Limiting

  • Cause: Your server blocks rapid requests
  • Solution: Contact support for crawler rate adjustment
  • Prevention: Ensure hosting can handle crawler traffic

Website Structure Changes

  • Cause: Site navigation changed during crawl
  • Solution: Re-run crawler after changes settle
  • Prevention: Schedule crawls during maintenance windows

Website-Specific Solutions

WordPress Sites

Common Issues:

  • WP admin pages getting crawled
  • Plugin pages creating noise
  • Duplicate content from categories/tags

Recommended Exclusions:

URL Starts With: https://yourcompany.com/wp-admin
URL Starts With: https://yourcompany.com/wp-login
URL Contains: /author/
URL Contains: /tag/
URL Contains: /date/
URL Contains: ?p=

Shopify Stores

Common Issues:

  • Product variants creating duplicate URLs
  • Customer account pages
  • Checkout process pages

Recommended Exclusions:

URL Contains: /account
URL Contains: /cart
URL Contains: /checkout
URL Contains: ?variant=
URL Starts With: https://yourcompany.com/admin

Documentation Sites

Common Issues:

  • Multiple versions creating confusion
  • API references in wrong format
  • Download files being indexed

Recommended Exclusions:

URL Contains: /v1/
URL Contains: /deprecated
URL Ends With: .pdf
URL Ends With: .zip
URL Contains: /download

Getting Help

When to Contact Support

Reach out to us at support@siteassist.io if:

  • Crawler consistently fails after trying solutions above
  • Your website has unique architecture needs
  • You need help setting up complex exclusion rules
  • Crawling works but AI responses are poor quality
  • You're seeing unexpected behavior not covered here

Information to Include

When contacting support, please provide:

  1. Your website URL
  2. Crawler configuration details
  3. Description of the problem
  4. Screenshots of crawler status/logs
  5. Expected vs. actual results

Quick Diagnostic Steps

Before contacting support, try:

  1. Test in Private Browser: Rule out caching issues
  2. Check Crawler Logs: Look for specific error messages
  3. Try Different Start URL: Test with a simpler starting point
  4. Disable All Exclusions: See if rules are causing issues
  5. Increase Max URLs: Rule out limit-related problems

Prevention Tips

Set Yourself Up for Success

Website Preparation:

  • Ensure clear navigation structure
  • Generate and maintain sitemap.xml
  • Use descriptive, crawlable URLs
  • Minimize JavaScript-dependent navigation

Crawler Configuration:

  • Start with conservative settings
  • Test with small Max URLs first
  • Add exclusions gradually
  • Monitor results after changes

Regular Maintenance:

  • Review crawler performance monthly
  • Update exclusions as site changes
  • Adjust frequency based on content updates
  • Test AI responses periodically

Remember: Most crawling issues have simple solutions. Start with the basics (URL format, exclusions, limits) before diving into complex configurations.