Troubleshooting Web Crawlers

Having trouble with your web crawler? This guide covers the most common issues and their solutions to get your crawler working smoothly.

Common Setup Issues

🚫 Invalid URL Format Error

Problem: Entering a URL with the https:// prefix causes a validation error.

Why this happens:

  • The Start URL field automatically adds https:// prefix
  • Adding https:// manually creates a double prefix (https://https://yourcompany.com)
  • System rejects malformed URLs

Solution:

❌ Wrong: https://yourcompany.com
✅ Correct: yourcompany.com

❌ Wrong: http://yourcompany.com
✅ Correct: yourcompany.com

❌ Wrong: https://docs.yourcompany.com/help
✅ Correct: docs.yourcompany.com/help

Quick Fix:

  1. Remove https:// or http:// from your URL
  2. Enter only the domain and path
  3. The system will automatically add the secure protocol (a normalization sketch follows below)
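
If you are cleaning up a batch of URLs before entering them, here is a minimal normalization sketch in Python (standard library only; nothing in it is SiteAssist-specific):

from urllib.parse import urlparse

def strip_scheme(url: str) -> str:
    # Parse with a scheme-relative fallback so bare domains still work
    parsed = urlparse(url if "//" in url else "//" + url)
    # Keep the domain, path, and query; drop http:// or https://
    return parsed.netloc + parsed.path + (("?" + parsed.query) if parsed.query else "")

print(strip_scheme("https://docs.yourcompany.com/help"))  # docs.yourcompany.com/help
print(strip_scheme("yourcompany.com"))                    # yourcompany.com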

🔄 Single Page Application (SPA) Crawling Issues

Problem: Crawler finds very few pages on React, Vue, or Angular websites.

Why this happens:

  • SPAs render content with JavaScript after page load
  • Crawlers see the initial HTML without JavaScript-rendered content
  • Most navigation happens client-side, not through traditional links

Symptoms:

  • Only 1-2 pages discovered when you expect many more
  • Missing content from dynamic sections
  • Crawler stops quickly with a low page count (the sketch below shows a quick way to confirm this)
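
To confirm the diagnosis, fetch a page the way a crawler sees it, as raw HTML with no JavaScript executed. A minimal sketch using only the Python standard library (the URL is a placeholder):

import urllib.request
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = 0
    def handle_starttag(self, tag, attrs):
        # Count anchor tags carrying an href, i.e., links a crawler can follow
        if tag == "a" and dict(attrs).get("href"):
            self.links += 1

url = "https://yourcompany.com"  # placeholder: your SPA's address
html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
counter = LinkCounter()
counter.feed(html)
print(f"{len(html)} characters of HTML, {counter.links} followable links before JavaScript runs")

If a content-rich site shows only a handful of links here, the crawler has little to work with, and the options below apply.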

Solutions:

Option 1: Use Sitemap (Recommended)

  • Ensure your SPA generates a sitemap.xml
  • Include all routes in your sitemap
  • SiteAssist will use the sitemap for comprehensive crawling (you can check what the sitemap exposes with the sketch below)
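
To see exactly what your sitemap exposes, a quick check using the standard sitemap protocol and the Python standard library (the sitemap URL is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
with urllib.request.urlopen("https://yourcompany.com/sitemap.xml", timeout=10) as resp:
    root = ET.fromstring(resp.read())
# <loc> elements hold the listed URLs (this also matches nested sitemap indexes)
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
print(f"{len(urls)} URLs listed")
for u in urls[:10]:
    print(u)

If routes you care about are missing from this list, add them to the sitemap before re-crawling.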

Option 2: Server-Side Rendering (SSR)

  • Implement Next.js, Nuxt.js, or a similar SSR framework
  • Ensures content is available when crawler visits
  • Best long-term solution for SEO and crawling

Option 3: Prerendering

  • Use tools like Prerender.io or Netlify's prerendering
  • Generates static HTML for crawler visits
  • Good compromise for existing SPAs

Option 4: Manual URL Management

  • Use exclusion rules strategically
  • Focus crawler on content-heavy sections
  • Consider multiple targeted crawlers

Crawling Performance Issues

⏱️ Crawler Running Too Long

Problem: Crawler seems stuck or runs much longer than expected.

Possible Causes & Solutions:

Large Website

  • Cause: Site has thousands of pages
  • Solution: Set an appropriate Max URLs limit (start with 1,000-2,000)
  • Prevention: Use exclusion rules to focus on important content

Slow Server Response

  • Cause: Your website responds slowly to requests
  • Solution: Check your website's performance and hosting
  • Workaround: Allow the crawl more time, or run it during off-peak hours

Infinite URL Loops

  • Cause: Dynamic URLs creating endless variations (e.g., ?page=, ?sort=, and ?filter= values combining into thousands of near-duplicate URLs)
  • Solution: Add exclusion rules for the offending URL parameters, for example (a sketch for spotting them follows these examples):
URL Contains: ?page=
URL Contains: ?filter=
URL Contains: ?sort=
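
If you can export the URLs the crawler visited, this sketch counts which query parameters appear most often, the usual culprits behind loops (the URL list is illustrative; substitute your crawl log):

from urllib.parse import urlparse, parse_qs
from collections import Counter

crawled = [
    "https://yourcompany.com/products?page=1",
    "https://yourcompany.com/products?page=2&sort=price",
    "https://yourcompany.com/products?filter=red",
]  # replace with URLs from your crawl log

param_counts = Counter()
for url in crawled:
    for param in parse_qs(urlparse(url).query):
        param_counts[param] += 1

# Parameters that dominate the list are good candidates for exclusion rules
for param, count in param_counts.most_common():
    print(f"{param}: appears in {count} URLs")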

Deep Link Structure

  • Cause: Very deep navigation hierarchies
  • Solution: Start crawling from specific sections rather than the site root

📊 Very Low Page Discovery

Problem: Crawler finds far fewer pages than expected.

Common Causes & Fixes:

Poor Internal Linking

  • Cause: Pages aren't linked from main navigation
  • Solution: Improve site navigation or use sitemap crawling
  • Check: Ensure important pages are linked from your homepage

Broken Internal Links

  • Cause: Links pointing to non-existent pages
  • Solution: Fix broken links in your website navigation
  • Tool: Use a link checker to find broken links (a minimal sketch follows below)
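
A minimal single-page link check using only the Python standard library (the start URL is a placeholder; dedicated tools also follow links recursively):

import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

START = "https://yourcompany.com"  # placeholder

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(urljoin(START, href))  # resolve relative links

page = urllib.request.urlopen(START, timeout=10).read().decode("utf-8", "replace")
collector = LinkCollector()
collector.feed(page)

for link in sorted(set(collector.hrefs)):
    if not link.startswith("http"):
        continue  # skip mailto:, tel:, and javascript: links
    try:
        status = urllib.request.urlopen(link, timeout=10).status
    except urllib.error.HTTPError as e:
        status = e.code
    except OSError:
        status = "unreachable"
    if status != 200:
        print(status, link)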

JavaScript-Heavy Navigation

  • Cause: Navigation menu built entirely with JavaScript
  • Solution: Add HTML fallback navigation or use sitemap
  • Best Practice: Ensure critical pages have HTML links

Robots.txt Restrictions

  • Cause: Your robots.txt file blocks crawling
  • Solution: Check and adjust robots.txt permissions
  • Test: Use Google Search Console, or the sketch below, to test robot access
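
Python's standard library can run a similar test locally. A sketch (domain and paths are placeholders; "*" checks rules that apply to any crawler without a more specific user-agent entry):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://yourcompany.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt
for path in ["/", "/docs/", "/blog/"]:  # placeholder paths
    url = "https://yourcompany.com" + path
    print(path, "allowed" if rp.can_fetch("*", url) else "blocked")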

Content Quality Issues

🗑️ Crawler Indexing Irrelevant Content

Problem: AI assistant knows about admin pages, shopping carts, or other irrelevant content.

Solution Strategy:

Add Strategic Exclusion Rules

Common exclusions for business sites:
URL Starts With: https://yourcompany.com/admin
URL Starts With: https://yourcompany.com/wp-admin
URL Contains: /cart
URL Contains: /checkout
URL Contains: /login
URL Contains: /register
URL Ends With: .pdf
URL Contains: ?print=

E-commerce Specific Exclusions

URL Contains: /cart
URL Contains: /checkout
URL Contains: /account
URL Contains: /wishlist
URL Starts With: https://yourcompany.com/customer
URL Contains: ?variant=

Review and Refine

  1. Run an initial crawl to see what gets indexed
  2. Identify unwanted content in the logs
  3. Add specific exclusion rules
  4. Re-run the crawler to test improvements (the sketch below lets you test rules offline first)
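
Before re-running, you can check candidate rules against sample URLs offline. A sketch, assuming the three rule types match URL strings literally (SiteAssist's exact matching semantics are an assumption here):

def excluded(url, rules):
    # rules are (rule_type, value) pairs mirroring the dashboard options
    for rule_type, value in rules:
        if rule_type == "starts_with" and url.startswith(value):
            return True
        if rule_type == "contains" and value in url:
            return True
        if rule_type == "ends_with" and url.endswith(value):
            return True
    return False

rules = [("contains", "/cart"), ("ends_with", ".pdf"),
         ("starts_with", "https://yourcompany.com/admin")]
for url in ["https://yourcompany.com/pricing",
            "https://yourcompany.com/cart/view",
            "https://yourcompany.com/guide.pdf"]:
    print("exclude" if excluded(url, rules) else "keep", url)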

❌ Missing Important Pages

Problem: Key pages aren't being crawled and indexed.

Diagnostic Steps:

Check Start URL

  • Ensure the start URL links, directly or through its navigation, to the missing pages
  • Try starting from a section closer to the missing content

Review Exclusion Rules

  • Check if exclusion rules are too broad
  • Temporarily disable rules to test
  • Refine overly aggressive exclusions

Verify Page Accessibility

  • Ensure pages are publicly accessible (no login required)
  • Check that pages return a 200 HTTP status (the sketch below automates this and flags login redirects)
  • Verify pages aren't blocked by robots.txt
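
A sketch covering the first two checks, HTTP status and login redirects, for a few specific pages (URLs are placeholders; robots.txt testing is covered earlier in this guide):

import urllib.error
import urllib.request

pages = [
    "https://yourcompany.com/pricing",
    "https://yourcompany.com/docs/setup",
]  # placeholders: the pages that aren't being indexed

for url in pages:
    try:
        resp = urllib.request.urlopen(url, timeout=10)
        final = resp.geturl()  # where any redirects ended up
        if "login" in final and final != url:
            print(f"{url} -> redirected to {final} (likely behind a login)")
        else:
            print(f"{url} -> HTTP {resp.status}")
    except urllib.error.HTTPError as e:
        print(f"{url} -> HTTP {e.code}")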

Check Max URLs Limit

  • Increase the limit if the crawler stops before reaching important pages
  • Monitor crawl progress to see where it stops

Technical Issues

🚨 Crawler Fails to Start

Problem: Crawler won't begin crawling at all.

Troubleshooting Checklist:

URL Accessibility

Test your URL:

  1. Open the start URL in a browser
  2. Verify the page loads completely
  3. Check for password protection
  4. Ensure the SSL certificate is valid (a scripted check follows below)
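
Steps 1-3 are quickest in a browser; step 4 can be scripted. A minimal certificate check in Python (the hostname is a placeholder):

import socket
import ssl

host = "yourcompany.com"  # placeholder
try:
    # The TLS handshake below fails if the certificate is invalid or expired
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(socket.create_connection((host, 443), timeout=10),
                         server_hostname=host) as s:
        print("Certificate OK, expires:", s.getpeercert()["notAfter"])
except (ssl.SSLError, OSError) as e:
    print("Connection or certificate problem:", e)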

Domain Configuration

  • Verify domain is spelled correctly
  • Check for typos in subdomain names
  • Ensure DNS is properly configured

Server Issues

  • Check if your website is currently down
  • Verify hosting provider isn't blocking crawlers
  • Test at different times if the server is overloaded

⚠️ Partial Crawl Results

Problem: Crawler stops unexpectedly with partial results.

Common Causes:

Hit Max URLs Limit

  • Check: Crawler status shows "Max URLs reached"
  • Solution: Increase Max URLs limit or add exclusion rules

Server Rate Limiting

  • Cause: Your server blocks rapid requests
  • Solution: Contact support for crawler rate adjustment
  • Prevention: Ensure hosting can handle crawler traffic

Website Structure Changes

  • Cause: Site navigation changed during crawl
  • Solution: Re-run crawler after changes settle
  • Prevention: Schedule crawls during maintenance windows

Website-Specific Solutions

WordPress Sites

Common Issues:

  • WP admin pages getting crawled
  • Plugin pages creating noise
  • Duplicate content from categories/tags

Recommended Exclusions:

URL Starts With: https://yourcompany.com/wp-admin
URL Starts With: https://yourcompany.com/wp-login
URL Contains: /author/
URL Contains: /tag/
URL Contains: /date/
URL Contains: ?p=

Shopify Stores

Common Issues:

  • Product variants creating duplicate URLs
  • Customer account pages
  • Checkout process pages

Recommended Exclusions:

URL Contains: /account
URL Contains: /cart
URL Contains: /checkout
URL Contains: ?variant=
URL Starts With: https://yourcompany.com/admin

Documentation Sites

Common Issues:

  • Multiple versions creating confusion
  • API references indexed in an unhelpful format
  • Download files being indexed

Recommended Exclusions:

URL Contains: /v1/
URL Contains: /deprecated
URL Ends With: .pdf
URL Ends With: .zip
URL Contains: /download

Getting Help

When to Contact Support

Reach out to us at support@siteassist.io if:

  • Crawler consistently fails after trying solutions above
  • Your website has unique architecture needs
  • You need help setting up complex exclusion rules
  • Crawling works but AI responses are poor quality
  • You're seeing unexpected behavior not covered here

Information to Include

When contacting support, please provide:

  1. Your website URL
  2. Crawler configuration details
  3. Description of the problem
  4. Screenshots of crawler status/logs
  5. Expected vs. actual results

Quick Diagnostic Steps

Before contacting support, try:

  1. Test in Private Browser: Rule out caching issues
  2. Check Crawler Logs: Look for specific error messages
  3. Try Different Start URL: Test with a simpler starting point
  4. Disable All Exclusions: See if rules are causing issues
  5. Increase Max URLs: Rule out limit-related problems

Prevention Tips

Set Yourself Up for Success

Website Preparation:

  • Ensure clear navigation structure
  • Generate and maintain sitemap.xml
  • Use descriptive, crawlable URLs
  • Minimize JavaScript-dependent navigation

Crawler Configuration:

  • Start with conservative settings
  • Test with small Max URLs first
  • Add exclusions gradually
  • Monitor results after changes

Regular Maintenance:

  • Review crawler performance monthly
  • Update exclusions as site changes
  • Adjust crawl frequency based on content updates
  • Test AI responses periodically

Remember: Most crawling issues have simple solutions. Start with the basics (URL format, exclusions, limits) before diving into complex configurations.