🗺️ XML Sitemap Generator - Complete Implementation

Project Overview

A production-ready Go API for generating XML sitemaps with real-time progress tracking. Built with concurrent crawling, SSE streaming, and comprehensive client metadata tracking.

Key Features Implemented

1. Backend-Generated UUID System

  • Server generates unique UUID for each crawl request
  • UUID used for SSE stream connection and file download
  • Enables true multi-user support with isolated streams
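The ID generation step can be sketched with only the standard library (a minimal sketch; the project may well use a package such as github.com/google/uuid instead):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID returns a random RFC 4122 version-4 UUID string,
// built from crypto/rand without third-party dependencies.
func newUUID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, _ := newUUID()
	fmt.Println(id)
}
```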

2. Server-Sent Events (SSE) Streaming

  • Real-time progress updates via /stream/{uuid}
  • Event types: connected, started, progress, complete, error
  • Non-blocking concurrent stream management
  • Automatic cleanup after completion
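On the wire, each of the event types above is one SSE frame: an `event:` line, a `data:` line with a JSON payload, and a blank-line terminator. A small formatting helper (illustrative; the real handler writes these frames through an `http.Flusher`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// formatSSE renders one Server-Sent Events frame for the given
// event name (connected, started, progress, complete, error) and
// JSON-encodable payload.
func formatSSE(event string, payload any) (string, error) {
	data, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("event: %s\ndata: %s\n\n", event, data), nil
}

func main() {
	frame, _ := formatSSE("progress", map[string]int{"pages_found": 12})
	fmt.Print(frame)
}
```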

3. Concurrent Web Crawler

  • Goroutine-based parallel crawling
  • Configurable concurrency limit (default: 5 parallel requests)
  • Depth-limited crawling (1-5 levels)
  • Same-domain restriction with URL normalization
  • Duplicate detection and prevention

4. Client Metadata Tracking

Automatically captured and stored in SQLite:

  • IP Address (with X-Forwarded-For support)
  • User-Agent string
  • Browser name & version (Chrome, Firefox, Safari, Edge, Opera)
  • Operating System (Windows, macOS, Linux, Android, iOS)
  • Device Type (Desktop, Mobile, Tablet)
  • Session ID (cookie-based persistence)
  • All cookies (JSON-encoded)
  • HTTP Referrer

5. RESTful API Endpoints

POST   /generate-sitemap-xml  → Start crawl, returns UUID
GET    /stream/{uuid}          → SSE progress stream
GET    /download/{uuid}        → Download XML sitemap
GET    /sites                  → List all sitemaps
GET    /sites/{id}             → Get specific site
DELETE /sites/{id}             → Delete sitemap
GET    /health                 → Health check
GET    /                       → Serve frontend HTML

6. Beautiful Frontend UI

  • Responsive gradient design
  • Real-time progress visualization
  • Live connection status indicator
  • Crawl statistics (pages found, depth, time)
  • Activity log with color-coded entries
  • Site management (view, download, delete)
  • Auto-protocol addition for URLs

🏗️ Architecture

┌─────────────┐
│   Browser   │
│  (Frontend) │
└──────┬──────┘
       │ POST /generate-sitemap-xml
       ↓
┌──────────────────────────────────┐
│   Go HTTP Server (Chi Router)   │
│                                  │
│  ┌────────────────────────────┐ │
│  │   Handler (handler.go)     │ │
│  │   - Generate UUID          │ │
│  │   - Extract metadata       │ │
│  │   - Create DB record       │ │
│  │   - Spawn crawler          │ │
│  │   - Return UUID immediately│ │
│  └─────────────┬──────────────┘ │
└────────────────┼────────────────┘
                 │
       ┌─────────┴─────────┐
       │                   │
       ↓                   ↓
┌──────────────┐   ┌───────────────┐
│ StreamManager│   │    Crawler    │
│              │   │               │
│ UUID → Chan  │   │  Goroutines   │
│ Map storage  │←──│  Concurrent   │
│              │   │  HTTP requests│
└──────┬───────┘   └───────┬───────┘
       │                   │
       │ SSE Events        │ Save pages
       ↓                   ↓
┌──────────────────────────────────┐
│         SQLite Database          │
│  - sites (with metadata)         │
│  - pages (discovered URLs)       │
│  - sessions (tracking)           │
└──────────────────────────────────┘

📂 File Structure

sitemap-api/
├── main.go                    # HTTP server setup, routes
├── go.mod                     # Go module dependencies
├── go.sum                     # Dependency checksums
│
├── handlers/
│   └── handler.go             # All HTTP handlers
│       - GenerateSitemapXML   # POST endpoint
│       - StreamSSE            # SSE streaming
│       - DownloadSitemap      # XML generation
│       - GetSites/GetSite     # CRUD operations
│       - DeleteSite           # Cleanup
│       - StreamManager        # Concurrent stream management
│
├── crawler/
│   └── crawler.go             # Web crawler implementation
│       - Crawl()              # Main crawl logic
│       - crawlURL()           # Recursive URL processing
│       - extractLinks()       # HTML parsing
│       - normalizeURL()       # URL canonicalization
│       - isSameDomain()       # Domain checking
│       - calculatePriority()  # Sitemap priority
│
├── database/
│   └── db.go                  # SQLite operations
│       - NewDB()              # Initialize DB
│       - createTables()       # Schema creation
│       - CreateSite()         # Insert site record
│       - GetSiteByUUID()      # Retrieve by UUID
│       - UpdateSiteStatus()   # Mark complete
│       - AddPage()            # Save discovered page
│       - GetPagesBySiteID()   # Retrieve all pages
│       - DeleteSite()         # Cascade delete
│
├── models/
│   └── site.go                # Data structures
│       - Site                 # Site record
│       - Page                 # Page record
│       - Event                # SSE event
│       - ProgressData         # Progress payload
│       - CompleteData         # Completion payload
│       - ErrorData            # Error payload
│
├── static/
│   └── index.html             # Frontend application
│       - SitemapGenerator     # Main class
│       - generateSitemap()    # Initiate crawl
│       - connectToStream()    # SSE connection
│       - updateProgress()     # Live updates
│       - downloadSitemap()    # File download
│       - displaySites()       # Results listing
│
├── README.md                  # Full documentation
├── QUICKSTART.md              # Quick start guide
├── Makefile                   # Build automation
├── Dockerfile                 # Container setup
├── run.sh                     # Startup script
├── .gitignore                 # Git exclusions
└── .env.example               # Environment template

🔄 Request Flow

1. Generate Sitemap Request

User fills form → POST /generate-sitemap-xml
                      ↓
            Server generates UUID
                      ↓
        Extract IP, UA, cookies, session
                      ↓
           Save to database (status: processing)
                      ↓
      Create SSE channel in StreamManager
                      ↓
    Spawn goroutine for crawler (non-blocking)
                      ↓
        Return UUID immediately to frontend

2. SSE Stream Connection

Frontend receives UUID → GET /stream/{uuid}
                             ↓
              StreamManager finds channel
                             ↓
               Send "connected" event
                             ↓
        Crawler sends events to channel
                             ↓
           Handler forwards to browser
                             ↓
        Frontend updates UI in real-time

3. Crawler Operation

Start from root URL → Fetch HTML
                         ↓
         Parse <a> tags for links
                         ↓
        Check: same domain? not visited?
                         ↓
    Save page to database (URL, depth, priority)
                         ↓
     Send "progress" event via channel
                         ↓
        Spawn goroutines for child URLs
                         ↓
    Repeat until max depth reached
                         ↓
        Send "complete" event
                         ↓
    Close channel, cleanup resources
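The "not visited?" check in the flow above needs to be safe across goroutines and to treat trivially different URLs as one page. A sketch of a mutex-guarded visited set with a plausible normalization (fragment and trailing-slash stripping; the exact rules in crawler/crawler.go may differ):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// visitedSet records normalized URLs; markNew returns true
// exactly once per canonical URL, even under concurrent use.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

// normalize drops the fragment and any trailing slash so that
// /about, /about/, and /about#team count as the same page.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

func (v *visitedSet) markNew(raw string) bool {
	n, err := normalize(raw)
	if err != nil {
		return false
	}
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[n] {
		return false
	}
	v.seen[n] = true
	return true
}

func main() {
	v := &visitedSet{seen: map[string]bool{}}
	fmt.Println(v.markNew("https://example.com/about/"))     // true
	fmt.Println(v.markNew("https://example.com/about#team")) // false: duplicate after normalization
}
```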

4. Download Request

User clicks download → GET /download/{uuid}
                           ↓
             Lookup site by UUID
                           ↓
        Fetch all pages from database
                           ↓
         Generate XML sitemap
                           ↓
     Set Content-Disposition header
                           ↓
        Stream XML to browser

🔐 Security Considerations

Implemented

  • Same-domain restriction (no external crawling)
  • Max depth limit (prevents infinite loops)
  • HTTP timeout per request (10 seconds)
  • Duplicate URL prevention
  • SQLite prepared statements (SQL injection safe)
  • CORS middleware included

Not Yet Implemented (recommended for production)

  • Rate limiting per IP
  • Authentication/API keys
  • Input validation & sanitization
  • Request size limits
  • robots.txt respect
  • User-Agent identification
  • HTTPS enforcement
  • Firewall rules
🚀 Performance Optimization

Current

  • Concurrent goroutines (5 parallel requests default)
  • Non-blocking SSE streams
  • Efficient channel-based communication
  • In-memory visited URL tracking
  • Database connection pooling

Possible Improvements

  • Redis for distributed crawling
  • Worker pool pattern
  • Content caching
  • Incremental sitemap updates
  • Compression for large sitemaps
  • Database indexing optimization

📊 Database Schema

sites table

- id (PK)              - Auto-increment
- uuid (UNIQUE)        - Server-generated UUID
- domain               - Extracted from URL
- url                  - Full starting URL
- max_depth            - Crawl depth limit
- page_count           - Total pages found
- status               - processing/completed/failed
- ip_address           - Client IP
- user_agent           - Full UA string
- browser              - Parsed browser name
- browser_version      - Version number
- os                   - Operating system
- device_type          - Desktop/Mobile/Tablet
- session_id           - Cookie-based session
- cookies              - JSON of all cookies
- referrer             - HTTP Referer header
- created_at           - Timestamp
- completed_at         - Completion timestamp
- last_crawled         - Last activity

pages table

- id (PK)              - Auto-increment
- site_id (FK)         - References sites(id)
- url                  - Page URL (UNIQUE)
- depth                - Crawl depth level
- last_modified        - Discovery time
- priority             - Sitemap priority (0.0-1.0)
- change_freq          - monthly/weekly/daily/etc
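One plausible depth-based priority scheme (an assumption for illustration, not necessarily the exact formula in crawler/crawler.go's calculatePriority): the root page gets 1.0 and each level below loses 0.2, floored at 0.1.

```go
package main

import "fmt"

// calculatePriority maps crawl depth to a sitemap priority in
// the 0.1-1.0 range: root = 1.0, minus 0.2 per level, never
// below 0.1. (Illustrative scheme, not the project's exact one.)
func calculatePriority(depth int) float64 {
	p := 1.0 - 0.2*float64(depth)
	if p < 0.1 {
		p = 0.1
	}
	return p
}

func main() {
	for d := 0; d <= 5; d++ {
		fmt.Printf("depth %d -> %.1f\n", d, calculatePriority(d))
	}
}
```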

sessions table

- id (PK)              - Auto-increment
- session_id (UNIQUE)  - Session UUID
- uuid (FK)            - References sites(uuid)
- ip_address           - Client IP
- created_at           - First seen
- last_activity        - Last request

🧪 Testing

Manual Testing

# Terminal 1: Start server
./run.sh

# Terminal 2: Test API
curl -X POST http://localhost:8080/generate-sitemap-xml \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","max_depth":2}'

# Terminal 3: Watch SSE stream
curl -N http://localhost:8080/stream/{uuid}

Browser Testing

  1. Open multiple tabs to http://localhost:8080
  2. Start different crawls simultaneously
  3. Verify independent progress tracking
  4. Check database for metadata

Database Verification

sqlite3 sitemap.db "SELECT * FROM sites ORDER BY created_at DESC LIMIT 5;"
sqlite3 sitemap.db "SELECT COUNT(*) FROM pages WHERE site_id = 1;"

📦 Deployment Options

Option 1: Binary

go build -o sitemap-api
./sitemap-api

Option 2: Docker

docker build -t sitemap-api .
docker run -p 8080:8080 sitemap-api

Option 3: Systemd Service

[Unit]
Description=Sitemap Generator API
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/sitemap-api
ExecStart=/opt/sitemap-api/sitemap-api
Restart=always

[Install]
WantedBy=multi-user.target

🔧 Configuration

Environment Variables

export PORT=8080              # Server port
export DB_PATH=sitemap.db     # Database file

Code Constants

// crawler/crawler.go
const maxConcurrent = 5       // Parallel requests
const httpTimeout = 10        // Seconds

// handlers/handler.go
const channelBuffer = 100     // SSE event buffer

📝 XML Sitemap Format

Generated sitemaps follow the standard:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

🎯 Success Criteria

All requirements met:

  • Go backend with excellent performance
  • Endpoint: /generate-sitemap-xml with UUID response
  • Endpoint: /stream/{uuid} for SSE
  • Endpoint: /download/{uuid} for XML
  • Multi-user concurrent support
  • Client metadata tracking (IP, browser, cookies, session)
  • SQLite storage
  • Root route / serves HTML
  • Real-time progress updates
  • Clean, maintainable code structure

📚 Next Steps

To extend this project:

  1. Add user authentication (JWT tokens)
  2. Implement rate limiting (go-rate package)
  3. Add robots.txt parsing (robotstxt.go package)
  4. Support sitemap index for large sites
  5. Add scheduling/cron jobs for recurring crawls
  6. Implement incremental updates
  7. Add webhook notifications
  8. Create admin dashboard
  9. Export to other formats (JSON, CSV)
  10. Add analytics and usage stats

Ready to use! Just run ./run.sh or make run to get started.