🗺️ XML Sitemap Generator - Complete Implementation

Project Overview

A production-ready Go API for generating XML sitemaps with real-time progress tracking. Built with concurrent crawling, SSE streaming, and comprehensive client metadata tracking.

Key Features Implemented

1. Backend-Generated UUID System

  • Server generates unique UUID for each crawl request
  • UUID used for SSE stream connection and file download
  • Enables true multi-user support with isolated streams
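The ID generation step can be sketched with only the standard library (a minimal sketch; the project may well use a package such as github.com/google/uuid instead):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID returns a random RFC 4122 version-4 UUID string,
// built from crypto/rand without third-party dependencies.
func newUUID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, _ := newUUID()
	fmt.Println(id)
}
```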

2. Server-Sent Events (SSE) Streaming

  • Real-time progress updates via /stream/{uuid}
  • Event types: connected, started, progress, complete, error
  • Non-blocking concurrent stream management
  • Automatic cleanup after completion
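On the wire, each of the event types above is one SSE frame: an `event:` line, a `data:` line with a JSON payload, and a blank-line terminator. A small formatting helper (illustrative; the real handler writes these frames through an `http.Flusher`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// formatSSE renders one Server-Sent Events frame for the given
// event name (connected, started, progress, complete, error) and
// JSON-encodable payload.
func formatSSE(event string, payload any) (string, error) {
	data, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("event: %s\ndata: %s\n\n", event, data), nil
}

func main() {
	frame, _ := formatSSE("progress", map[string]int{"pages_found": 12})
	fmt.Print(frame)
}
```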

3. Concurrent Web Crawler

  • Goroutine-based parallel crawling
  • Configurable concurrency limit (default: 5 parallel requests)
  • Depth-limited crawling (1-5 levels)
  • Same-domain restriction with URL normalization
  • Duplicate detection and prevention

4. Client Metadata Tracking

Automatically captured and stored in SQLite:

  • IP Address (with X-Forwarded-For support)
  • User-Agent string
  • Browser name & version (Chrome, Firefox, Safari, Edge, Opera)
  • Operating System (Windows, macOS, Linux, Android, iOS)
  • Device Type (Desktop, Mobile, Tablet)
  • Session ID (cookie-based persistence)
  • All cookies (JSON-encoded)
  • HTTP Referrer

5. RESTful API Endpoints

POST   /generate-sitemap-xml  → Start crawl, returns UUID
GET    /stream/{uuid}          → SSE progress stream
GET    /download/{uuid}        → Download XML sitemap
GET    /sites                  → List all sitemaps
GET    /sites/{id}             → Get specific site
DELETE /sites/{id}             → Delete sitemap
GET    /health                 → Health check
GET    /                       → Serve frontend HTML

6. Beautiful Frontend UI

  • Responsive gradient design
  • Real-time progress visualization
  • Live connection status indicator
  • Crawl statistics (pages found, depth, time)
  • Activity log with color-coded entries
  • Site management (view, download, delete)
  • Auto-protocol addition for URLs

🏗️ Architecture

┌─────────────┐
│   Browser   │
│  (Frontend) │
└──────┬──────┘
       │ POST /generate-sitemap-xml
       ↓
┌──────────────────────────────────┐
│   Go HTTP Server (Chi Router)   │
│                                  │
│  ┌────────────────────────────┐ │
│  │   Handler (handler.go)     │ │
│  │   - Generate UUID          │ │
│  │   - Extract metadata       │ │
│  │   - Create DB record       │ │
│  │   - Spawn crawler          │ │
│  │   - Return UUID immediately│ │
│  └─────────────┬──────────────┘ │
└────────────────┼────────────────┘
                 │
       ┌─────────┴─────────┐
       │                   │
       ↓                   ↓
┌──────────────┐   ┌───────────────┐
│ StreamManager│   │    Crawler    │
│              │   │               │
│ UUID → Chan  │   │  Goroutines   │
│ Map storage  │←──│  Concurrent   │
│              │   │  HTTP requests│
└──────┬───────┘   └───────┬───────┘
       │                   │
       │ SSE Events        │ Save pages
       ↓                   ↓
┌──────────────────────────────────┐
│         SQLite Database          │
│  - sites (with metadata)         │
│  - pages (discovered URLs)       │
│  - sessions (tracking)           │
└──────────────────────────────────┘

📂 File Structure

sitemap-api/
├── main.go                    # HTTP server setup, routes
├── go.mod                     # Go module dependencies
├── go.sum                     # Dependency checksums
│
├── handlers/
│   └── handler.go             # All HTTP handlers
│       - GenerateSitemapXML   # POST endpoint
│       - StreamSSE            # SSE streaming
│       - DownloadSitemap      # XML generation
│       - GetSites/GetSite     # CRUD operations
│       - DeleteSite           # Cleanup
│       - StreamManager        # Concurrent stream management
│
├── crawler/
│   └── crawler.go             # Web crawler implementation
│       - Crawl()              # Main crawl logic
│       - crawlURL()           # Recursive URL processing
│       - extractLinks()       # HTML parsing
│       - normalizeURL()       # URL canonicalization
│       - isSameDomain()       # Domain checking
│       - calculatePriority()  # Sitemap priority
│
├── database/
│   └── db.go                  # SQLite operations
│       - NewDB()              # Initialize DB
│       - createTables()       # Schema creation
│       - CreateSite()         # Insert site record
│       - GetSiteByUUID()      # Retrieve by UUID
│       - UpdateSiteStatus()   # Mark complete
│       - AddPage()            # Save discovered page
│       - GetPagesBySiteID()   # Retrieve all pages
│       - DeleteSite()         # Cascade delete
│
├── models/
│   └── site.go                # Data structures
│       - Site                 # Site record
│       - Page                 # Page record
│       - Event                # SSE event
│       - ProgressData         # Progress payload
│       - CompleteData         # Completion payload
│       - ErrorData            # Error payload
│
├── static/
│   └── index.html             # Frontend application
│       - SitemapGenerator     # Main class
│       - generateSitemap()    # Initiate crawl
│       - connectToStream()    # SSE connection
│       - updateProgress()     # Live updates
│       - downloadSitemap()    # File download
│       - displaySites()       # Results listing
│
├── README.md                  # Full documentation
├── QUICKSTART.md              # Quick start guide
├── Makefile                   # Build automation
├── Dockerfile                 # Container setup
├── run.sh                     # Startup script
├── .gitignore                 # Git exclusions
└── .env.example               # Environment template

🔄 Request Flow

1. Generate Sitemap Request

User fills form → POST /generate-sitemap-xml
                      ↓
            Server generates UUID
                      ↓
        Extract IP, UA, cookies, session
                      ↓
           Save to database (status: processing)
                      ↓
      Create SSE channel in StreamManager
                      ↓
    Spawn goroutine for crawler (non-blocking)
                      ↓
        Return UUID immediately to frontend

2. SSE Stream Connection

Frontend receives UUID → GET /stream/{uuid}
                             ↓
              StreamManager finds channel
                             ↓
               Send "connected" event
                             ↓
        Crawler sends events to channel
                             ↓
           Handler forwards to browser
                             ↓
        Frontend updates UI in real-time

3. Crawler Operation

Start from root URL → Fetch HTML
                         ↓
         Parse <a> tags for links
                         ↓
        Check: same domain? not visited?
                         ↓
    Save page to database (URL, depth, priority)
                         ↓
     Send "progress" event via channel
                         ↓
        Spawn goroutines for child URLs
                         ↓
    Repeat until max depth reached
                         ↓
        Send "complete" event
                         ↓
    Close channel, cleanup resources
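The "not visited?" check in the flow above needs to be safe across goroutines and to treat trivially different URLs as one page. A sketch of a mutex-guarded visited set with a plausible normalization (fragment and trailing-slash stripping; the exact rules in crawler/crawler.go may differ):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// visitedSet records normalized URLs; markNew returns true
// exactly once per canonical URL, even under concurrent use.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

// normalize drops the fragment and any trailing slash so that
// /about, /about/, and /about#team count as the same page.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

func (v *visitedSet) markNew(raw string) bool {
	n, err := normalize(raw)
	if err != nil {
		return false
	}
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[n] {
		return false
	}
	v.seen[n] = true
	return true
}

func main() {
	v := &visitedSet{seen: map[string]bool{}}
	fmt.Println(v.markNew("https://example.com/about/"))     // true
	fmt.Println(v.markNew("https://example.com/about#team")) // false: duplicate after normalization
}
```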

4. Download Request

User clicks download → GET /download/{uuid}
                           ↓
             Lookup site by UUID
                           ↓
        Fetch all pages from database
                           ↓
         Generate XML sitemap
                           ↓
     Set Content-Disposition header
                           ↓
        Stream XML to browser

🔐 Security Considerations

Implemented

  • Same-domain restriction (no external crawling)
  • Max depth limit (prevents infinite loops)
  • HTTP timeout per request (10 seconds)
  • Duplicate URL prevention
  • SQLite prepared statements (SQL injection safe)
  • CORS middleware included

Not Yet Implemented (recommended for production)

  • Rate limiting per IP
  • Authentication/API keys
  • Input validation & sanitization
  • Request size limits
  • robots.txt respect
  • User-Agent identification
  • HTTPS enforcement
  • Firewall rules
🚀 Performance Optimization

Current

  • Concurrent goroutines (5 parallel requests default)
  • Non-blocking SSE streams
  • Efficient channel-based communication
  • In-memory visited URL tracking
  • Database connection pooling

Possible Improvements

  • Redis for distributed crawling
  • Worker pool pattern
  • Content caching
  • Incremental sitemap updates
  • Compression for large sitemaps
  • Database indexing optimization

📊 Database Schema

sites table

- id (PK)              - Auto-increment
- uuid (UNIQUE)        - Server-generated UUID
- domain               - Extracted from URL
- url                  - Full starting URL
- max_depth            - Crawl depth limit
- page_count           - Total pages found
- status               - processing/completed/failed
- ip_address           - Client IP
- user_agent           - Full UA string
- browser              - Parsed browser name
- browser_version      - Version number
- os                   - Operating system
- device_type          - Desktop/Mobile/Tablet
- session_id           - Cookie-based session
- cookies              - JSON of all cookies
- referrer             - HTTP Referer header
- created_at           - Timestamp
- completed_at         - Completion timestamp
- last_crawled         - Last activity

pages table

- id (PK)              - Auto-increment
- site_id (FK)         - References sites(id)
- url                  - Page URL (UNIQUE)
- depth                - Crawl depth level
- last_modified        - Discovery time
- priority             - Sitemap priority (0.0-1.0)
- change_freq          - monthly/weekly/daily/etc
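One plausible depth-based priority scheme (an assumption for illustration, not necessarily the exact formula in crawler/crawler.go's calculatePriority): the root page gets 1.0 and each level below loses 0.2, floored at 0.1.

```go
package main

import "fmt"

// calculatePriority maps crawl depth to a sitemap priority in
// the 0.1-1.0 range: root = 1.0, minus 0.2 per level, never
// below 0.1. (Illustrative scheme, not the project's exact one.)
func calculatePriority(depth int) float64 {
	p := 1.0 - 0.2*float64(depth)
	if p < 0.1 {
		p = 0.1
	}
	return p
}

func main() {
	for d := 0; d <= 5; d++ {
		fmt.Printf("depth %d -> %.1f\n", d, calculatePriority(d))
	}
}
```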

sessions table

- id (PK)              - Auto-increment
- session_id (UNIQUE)  - Session UUID
- uuid (FK)            - References sites(uuid)
- ip_address           - Client IP
- created_at           - First seen
- last_activity        - Last request

🧪 Testing

Manual Testing

# Terminal 1: Start server
./run.sh

# Terminal 2: Test API
curl -X POST http://localhost:8080/generate-sitemap-xml \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","max_depth":2}'

# Terminal 3: Watch SSE stream
curl -N http://localhost:8080/stream/{uuid}

Browser Testing

  1. Open multiple tabs to http://localhost:8080
  2. Start different crawls simultaneously
  3. Verify independent progress tracking
  4. Check database for metadata

Database Verification

sqlite3 sitemap.db "SELECT * FROM sites ORDER BY created_at DESC LIMIT 5;"
sqlite3 sitemap.db "SELECT COUNT(*) FROM pages WHERE site_id = 1;"

📦 Deployment Options

Option 1: Binary

go build -o sitemap-api
./sitemap-api

Option 2: Docker

docker build -t sitemap-api .
docker run -p 8080:8080 sitemap-api

Option 3: Systemd Service

[Unit]
Description=Sitemap Generator API
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/sitemap-api
ExecStart=/opt/sitemap-api/sitemap-api
Restart=always

[Install]
WantedBy=multi-user.target

🔧 Configuration

Environment Variables

export PORT=8080              # Server port
export DB_PATH=sitemap.db     # Database file

Code Constants

// crawler/crawler.go
const maxConcurrent = 5       // Parallel requests
const httpTimeout = 10        // Seconds

// handlers/handler.go
const channelBuffer = 100     // SSE event buffer

📝 XML Sitemap Format

Generated sitemaps follow the standard:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

🎯 Success Criteria

All requirements met:

  • Go backend with excellent performance
  • Endpoint: /generate-sitemap-xml with UUID response
  • Endpoint: /stream/{uuid} for SSE
  • Endpoint: /download/{uuid} for XML
  • Multi-user concurrent support
  • Client metadata tracking (IP, browser, cookies, session)
  • SQLite storage
  • Root route / serves HTML
  • Real-time progress updates
  • Clean, maintainable code structure

📚 Next Steps

To extend this project:

  1. Add user authentication (JWT tokens)
  2. Implement rate limiting (go-rate package)
  3. Add robots.txt parsing (robotstxt.go package)
  4. Support sitemap index for large sites
  5. Add scheduling/cron jobs for recurring crawls
  6. Implement incremental updates
  7. Add webhook notifications
  8. Create admin dashboard
  9. Export to other formats (JSON, CSV)
  10. Add analytics and usage stats

Ready to use! Just run ./run.sh or make run to get started.