Troubleshooting Crawlers
Solutions to common web crawler issues
Having trouble with your web crawler? This guide covers the most common issues and their solutions to get your crawler working smoothly.
Common Setup Issues
🚫 Invalid URL Format Error
Problem: Users enter URLs with the https:// prefix, causing validation errors.
Why this happens:
- The Start URL field automatically adds the https:// prefix
- Adding https:// manually creates a double prefix (https://https://yoursite.com)
- The system rejects the malformed URL
Solution:
❌ Wrong: https://yourcompany.com
✅ Correct: yourcompany.com
❌ Wrong: http://yourcompany.com
✅ Correct: yourcompany.com
❌ Wrong: https://docs.yourcompany.com/help
✅ Correct: docs.yourcompany.com/help
Quick Fix:
- Remove https:// or http:// from your URL
- Enter only the domain and path
- The system will automatically add the secure protocol
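If you are copying URLs from a spreadsheet or browser bar, a tiny helper like the one below can strip the protocol for you before you paste into the Start URL field. This is a minimal sketch using only the Python standard library; `normalize_start_url` is a hypothetical helper, not part of SiteAssist.

```python
from urllib.parse import urlparse

def normalize_start_url(raw: str) -> str:
    """Strip the protocol so only the domain and path remain."""
    raw = raw.strip()
    # urlparse needs a scheme to split host and path reliably
    if not raw.startswith(("http://", "https://")):
        raw = "https://" + raw
    parsed = urlparse(raw)
    return parsed.netloc + parsed.path.rstrip("/")

print(normalize_start_url("https://docs.yourcompany.com/help/"))  # docs.yourcompany.com/help
```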
🔄 Single Page Application (SPA) Crawling Issues
Problem: Crawler finds very few pages on React, Vue, or Angular websites.
Why this happens:
- SPAs render content with JavaScript after page load
- Crawlers see the initial HTML without JavaScript-rendered content
- Most navigation happens client-side, not through traditional links
Symptoms:
- Only 1-2 pages discovered when you expect many more
- Missing content from dynamic sections
- Crawler stops quickly with low page count
Solutions:
Option 1: Use Sitemap (Recommended)
- Ensure your SPA generates a sitemap.xml
- Include all routes in your sitemap
- SiteAssist will use the sitemap for comprehensive crawling
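Before pointing the crawler at your sitemap, it is worth confirming the file exists and actually lists all of your routes. A rough check is sketched below; it assumes the sitemap lives at the conventional /sitemap.xml path on your domain (a placeholder here).

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://yourcompany.com/sitemap.xml"  # placeholder: adjust to your site

with urllib.request.urlopen(SITEMAP_URL) as resp:
    root = ET.fromstring(resp.read())

# Sitemap URLs live in <loc> elements under the sitemaps.org namespace
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(f"{len(urls)} URLs listed in the sitemap")
```

If the count is far below the number of routes your SPA serves, the sitemap is the first thing to fix.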
Option 2: Server-Side Rendering (SSR)
- Implement Next.js, Nuxt.js, or similar SSR framework
- Ensures content is available when crawler visits
- Best long-term solution for SEO and crawling
Option 3: Prerendering
- Use tools like Prerender.io or Netlify's prerendering
- Generates static HTML for crawler visits
- Good compromise for existing SPAs
Option 4: Manual URL Management
- Use exclusion rules strategically
- Focus crawler on content-heavy sections
- Consider multiple targeted crawlers
Crawling Performance Issues
⏱️ Crawler Running Too Long
Problem: Crawler seems stuck or runs much longer than expected.
Possible Causes & Solutions:
Large Website
- Cause: Site has thousands of pages
- Solution: Set appropriate Max URLs limit (start with 1,000-2,000)
- Prevention: Use exclusion rules to focus on important content
Slow Server Response
- Cause: Your website responds slowly to requests
- Solution: Check your website's performance and hosting
- Workaround: Allow extra time for the crawl to finish, or run it during off-peak hours
Infinite URL Loops
- Cause: Dynamic URLs creating endless variations
- Solution: Add exclusion rules for URL parameters (a quick way to confirm a parameter loop is sketched after these examples)
URL Contains: ?page=
URL Contains: ?filter=
URL Contains: ?sort=
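One way to confirm that parameters are multiplying your URLs is to strip query strings from the discovered URLs and compare the counts. The sketch below assumes you have a list of crawled URLs, for example copied from the crawler logs; the example URLs are placeholders.

```python
from urllib.parse import urlsplit

crawled = [
    "https://yourcompany.com/products?page=1",
    "https://yourcompany.com/products?page=2&sort=price",
    "https://yourcompany.com/products?filter=red",
    "https://yourcompany.com/about",
]  # placeholder: paste the discovered URLs from your crawl logs

# Drop the query string so parameter variations collapse onto one page
unique_pages = {urlsplit(u)._replace(query="").geturl() for u in crawled}
print(f"{len(crawled)} URLs crawled, {len(unique_pages)} unique pages without parameters")
```

A large gap between the two numbers means parameter exclusion rules will cut the crawl time significantly.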
Deep Link Structure
- Cause: Very deep navigation hierarchies
- Solution: Start crawling from specific sections rather than root
📊 Very Low Page Discovery
Problem: Crawler finds far fewer pages than expected.
Common Causes & Fixes:
Poor Internal Linking
- Cause: Pages aren't linked from main navigation
- Solution: Improve site navigation or use sitemap crawling
- Check: Ensure important pages are linked from your homepage
Broken Internal Links
- Cause: Links pointing to non-existent pages
- Solution: Fix broken links in your website navigation
- Tool: Use website crawling tools to identify broken links
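As a quick stand-in for a dedicated link checker, a short script can fetch one page and report links that don't return 200. This is a rough sketch using the third-party requests and beautifulsoup4 packages; it only checks a single page, not the whole site, and the start URL is a placeholder.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START = "https://yourcompany.com"  # placeholder: page whose links you want to verify

html = requests.get(START, timeout=10).text
links = {urljoin(START, a["href"])
         for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)}

for link in sorted(links):
    if not link.startswith("http"):
        continue  # skip mailto:, tel:, javascript: links
    # Some servers reject HEAD requests; switch to requests.get if results look wrong
    status = requests.head(link, allow_redirects=True, timeout=10).status_code
    if status != 200:
        print(f"{status}  {link}")
```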
JavaScript-Heavy Navigation
- Cause: Navigation menu built entirely with JavaScript
- Solution: Add HTML fallback navigation or use sitemap
- Best Practice: Ensure critical pages have HTML links
Robots.txt Restrictions
- Cause: Your robots.txt file blocks crawling
- Solution: Check and adjust robots.txt permissions
- Test: Use Google Search Console to test robot access
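You can also check locally whether a given URL is allowed by your robots.txt; Python's standard library ships a parser for this. The sketch below uses a generic "*" user agent, which may differ from the crawler's actual user-agent string, and the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://yourcompany.com/robots.txt")  # placeholder domain
rp.read()

# "*" stands in for the crawler's user agent; substitute the real one if you know it
for url in ["https://yourcompany.com/", "https://yourcompany.com/docs/getting-started"]:
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "blocked")
```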
Content Quality Issues
🗑️ Crawler Indexing Irrelevant Content
Problem: AI assistant knows about admin pages, shopping carts, or other irrelevant content.
Solution Strategy:
Add Strategic Exclusion Rules
Common exclusions for business sites (how these rules match URLs is sketched after the list):
- URL Starts With: https://yourcompany.com/admin
- URL Starts With: https://yourcompany.com/wp-admin
- URL Contains: /cart
- URL Contains: /checkout
- URL Contains: /login
- URL Contains: /register
- URL Ends With: .pdf
- URL Contains: ?print=
E-commerce Specific Exclusions
- URL Contains: /cart
- URL Contains: /checkout
- URL Contains: /account
- URL Contains: /wishlist
- URL Starts With: https://yourcompany.com/customer
- URL Contains: ?variant=
Review and Refine
- Run initial crawl to see what gets indexed
- Identify unwanted content in the logs
- Add specific exclusion rules
- Re-run crawler to test improvements
❌ Missing Important Pages
Problem: Key pages aren't being crawled and indexed.
Diagnostic Steps:
Check Start URL
- Ensure start URL links to or navigates to missing pages
- Try starting from a section closer to the missing content
Review Exclusion Rules
- Check if exclusion rules are too broad
- Temporarily disable rules to test
- Refine overly aggressive exclusions
Verify Page Accessibility
- Ensure pages are publicly accessible (no login required)
- Check that pages return 200 HTTP status
- Verify pages aren't blocked by robots.txt
Check Max URLs Limit
- Increase limit if crawler stops before reaching important pages
- Monitor crawl progress to see where it stops
Technical Issues
🚨 Crawler Fails to Start
Problem: Crawler won't begin crawling at all.
Troubleshooting Checklist:
URL Accessibility
Test your URL:
1. Open start URL in browser
2. Verify page loads completely
3. Check for password protection
4. Ensure SSL certificate is valid
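The first three checks are easiest in a browser; certificate validity can also be confirmed with a short script. The sketch below uses only the standard library, and the hostname is a placeholder.

```python
import socket
import ssl
from datetime import datetime, timezone

HOST = "yourcompany.com"  # placeholder start-URL hostname

ctx = ssl.create_default_context()
# The handshake itself raises ssl.SSLCertVerificationError if the certificate is invalid
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
print(f"Certificate for {HOST} is valid until {expires:%Y-%m-%d}")
```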
Domain Configuration
- Verify domain is spelled correctly
- Check for typos in subdomain names
- Ensure DNS is properly configured
Server Issues
- Check if your website is currently down
- Verify hosting provider isn't blocking crawlers
- Test during different times if server is overloaded
⚠️ Partial Crawl Results
Problem: Crawler stops unexpectedly with partial results.
Common Causes:
Hit Max URLs Limit
- Check: Crawler status shows "Max URLs reached"
- Solution: Increase Max URLs limit or add exclusion rules
Server Rate Limiting
- Cause: Your server blocks rapid requests
- Solution: Contact support for crawler rate adjustment
- Prevention: Ensure hosting can handle crawler traffic
Website Structure Changes
- Cause: Site navigation changed during crawl
- Solution: Re-run crawler after changes settle
- Prevention: Schedule crawls during maintenance windows
Website-Specific Solutions
WordPress Sites
Common Issues:
- WP admin pages getting crawled
- Plugin pages creating noise
- Duplicate content from categories/tags
Recommended Exclusions:
URL Starts With: https://yourcompany.com/wp-admin
URL Starts With: https://yourcompany.com/wp-login
URL Contains: /author/
URL Contains: /tag/
URL Contains: /date/
URL Contains: ?p=
Shopify Stores
Common Issues:
- Product variants creating duplicate URLs
- Customer account pages
- Checkout process pages
Recommended Exclusions:
URL Contains: /account
URL Contains: /cart
URL Contains: /checkout
URL Contains: ?variant=
URL Starts With: https://yourcompany.com/admin
Documentation Sites
Common Issues:
- Multiple versions creating confusion
- API references in wrong format
- Download files being indexed
Recommended Exclusions:
URL Contains: /v1/
URL Contains: /deprecated
URL Ends With: .pdf
URL Ends With: .zip
URL Contains: /download
Getting Help
When to Contact Support
Reach out to us at support@siteassist.io if:
- Crawler consistently fails after trying solutions above
- Your website has unique architecture needs
- You need help setting up complex exclusion rules
- Crawling works but AI responses are poor quality
- You're seeing unexpected behavior not covered here
Information to Include
When contacting support, please provide:
- Your website URL
- Crawler configuration details
- Description of the problem
- Screenshots of crawler status/logs
- Expected vs. actual results
Quick Diagnostic Steps
Before contacting support, try:
- Test in Private Browser: Rule out caching issues
- Check Crawler Logs: Look for specific error messages
- Try Different Start URL: Test with a simpler starting point
- Disable All Exclusions: See if rules are causing issues
- Increase Max URLs: Rule out limit-related problems
Prevention Tips
Set Yourself Up for Success
Website Preparation:
- Ensure clear navigation structure
- Generate and maintain sitemap.xml (a minimal generator is sketched after this list)
- Use descriptive, crawlable URLs
- Minimize JavaScript-dependent navigation
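If your site doesn't produce a sitemap automatically, even a hand-generated one is better than none. A minimal generator might look like the following; the domain and route list are placeholders you would replace with your own.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

DOMAIN = "https://yourcompany.com"  # placeholder
ROUTES = ["/", "/pricing", "/docs", "/docs/getting-started"]  # placeholder routes

urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for route in ROUTES:
    # Each route becomes a <url><loc>...</loc></url> entry
    SubElement(SubElement(urlset, "url"), "loc").text = DOMAIN + route

with open("sitemap.xml", "wb") as f:
    f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n' + tostring(urlset))
```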
Crawler Configuration:
- Start with conservative settings
- Test with small Max URLs first
- Add exclusions gradually
- Monitor results after changes
Regular Maintenance:
- Review crawler performance monthly
- Update exclusions as site changes
- Adjust frequency based on content updates
- Test AI responses periodically
Remember: Most crawling issues have simple solutions. Start with the basics (URL format, exclusions, limits) before diving into complex configurations.