🗺️ XML Sitemap Generator - Complete Implementation
Project Overview
A production-ready Go API for generating XML sitemaps with real-time progress tracking. Built with concurrent crawling, SSE streaming, and comprehensive client metadata tracking.
✨ Key Features Implemented
1. Backend-Generated UUID System
- Server generates unique UUID for each crawl request
- UUID used for SSE stream connection and file download
- Enables true multi-user support with isolated streams
2. Server-Sent Events (SSE) Streaming
- Real-time progress updates via /stream/{uuid}
- Event types: connected, started, progress, complete, error
- Non-blocking concurrent stream management
- Automatic cleanup after completion
3. Concurrent Web Crawler
- Goroutine-based parallel crawling
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited crawling (1-5 levels)
- Same-domain restriction with URL normalization
- Duplicate detection and prevention
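The concurrency limit and duplicate detection can be combined with a buffered channel used as a counting semaphore. A sketch under assumed names (`crawlAll` and the injected `fetch` are illustrative; the real crawler issues HTTP GETs):

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll visits every URL with at most maxConcurrent goroutines in flight,
// skipping duplicates, and returns the number of unique URLs crawled.
func crawlAll(urls []string, maxConcurrent int, fetch func(string)) int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		visited = map[string]bool{}
		sem     = make(chan struct{}, maxConcurrent) // counting semaphore
		count   int
	)
	for _, u := range urls {
		mu.Lock()
		if visited[u] {
			mu.Unlock()
			continue // duplicate detection
		}
		visited[u] = true
		count++
		mu.Unlock()

		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks once maxConcurrent are in flight
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			fetch(u)
		}(u)
	}
	wg.Wait()
	return count
}

func main() {
	urls := []string{"/a", "/b", "/a", "/c"} // "/a" repeated on purpose
	n := crawlAll(urls, 2, func(u string) {})
	fmt.Println(n) // 3 unique URLs crawled
}
```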
4. Client Metadata Tracking
Automatically captured and stored in SQLite:
- IP Address (with X-Forwarded-For support)
- User-Agent string
- Browser name & version (Chrome, Firefox, Safari, Edge, Opera)
- Operating System (Windows, macOS, Linux, Android, iOS)
- Device Type (Desktop, Mobile, Tablet)
- Session ID (cookie-based persistence)
- All cookies (JSON-encoded)
- HTTP referrer (Referer header)
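Browser detection and proxy-aware IP extraction can be done with plain string checks on the request headers. A simplified sketch (function names are illustrative; real UA parsing has more cases):

```go
package main

import (
	"fmt"
	"strings"
)

// detectBrowser classifies a User-Agent string. The order of checks matters:
// Edge and Opera UAs also contain "Chrome", and Chrome UAs contain "Safari".
func detectBrowser(ua string) string {
	switch {
	case strings.Contains(ua, "Edg/"):
		return "Edge"
	case strings.Contains(ua, "OPR/"):
		return "Opera"
	case strings.Contains(ua, "Chrome/"):
		return "Chrome"
	case strings.Contains(ua, "Firefox/"):
		return "Firefox"
	case strings.Contains(ua, "Safari/"):
		return "Safari"
	default:
		return "Unknown"
	}
}

// clientIP prefers X-Forwarded-For (set by proxies) over the socket address.
func clientIP(xff, remoteAddr string) string {
	if xff != "" {
		// X-Forwarded-For may hold a comma-separated list; the first entry
		// is the original client.
		return strings.TrimSpace(strings.Split(xff, ",")[0])
	}
	return remoteAddr
}

func main() {
	fmt.Println(detectBrowser("Mozilla/5.0 ... Chrome/120.0 Safari/537.36")) // Chrome
	fmt.Println(clientIP("203.0.113.7, 10.0.0.1", "10.0.0.2:5555"))          // 203.0.113.7
}
```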
5. RESTful API Endpoints
POST /generate-sitemap-xml → Start crawl, returns UUID
GET /stream/{uuid} → SSE progress stream
GET /download/{uuid} → Download XML sitemap
GET /sites → List all sitemaps
GET /sites/{id} → Get specific site
DELETE /sites/{id} → Delete sitemap
GET /health → Health check
GET / → Serve frontend HTML
6. Beautiful Frontend UI
- Responsive gradient design
- Real-time progress visualization
- Live connection status indicator
- Crawl statistics (pages found, depth, time)
- Activity log with color-coded entries
- Site management (view, download, delete)
- Automatic protocol (https://) prefixing for bare URLs
🏗️ Architecture
┌─────────────┐
│ Browser │
│ (Frontend) │
└──────┬──────┘
│ POST /generate-sitemap-xml
↓
┌──────────────────────────────────┐
│ Go HTTP Server (Chi Router) │
│ │
│ ┌────────────────────────────┐ │
│ │ Handler (handler.go) │ │
│ │ - Generate UUID │ │
│ │ - Extract metadata │ │
│ │ - Create DB record │ │
│ │ - Spawn crawler │ │
│ │ - Return UUID immediately│ │
│ └─────────────┬──────────────┘ │
└────────────────┼────────────────┘
│
┌─────────┴─────────┐
│ │
↓ ↓
┌──────────────┐ ┌───────────────┐
│ StreamManager│ │ Crawler │
│ │ │ │
│ UUID → Chan │ │ Goroutines │
│ Map storage │←──│ Concurrent │
│ │ │ HTTP requests│
└──────┬───────┘ └───────┬───────┘
│ │
│ SSE Events │ Save pages
↓ ↓
┌──────────────────────────────────┐
│ SQLite Database │
│ - sites (with metadata) │
│ - pages (discovered URLs) │
│ - sessions (tracking) │
└──────────────────────────────────┘
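The StreamManager in the middle of the diagram is essentially a mutex-guarded map from UUID to event channel, so each connected client receives only its own crawl's events. A minimal sketch of the idea (method names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// StreamManager maps crawl UUIDs to per-client event channels.
type StreamManager struct {
	mu      sync.RWMutex
	streams map[string]chan string
}

func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan string)}
}

// Register creates the channel for a new crawl.
func (m *StreamManager) Register(uuid string) chan string {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan string, 100) // buffered so the crawler never blocks on a slow client
	m.streams[uuid] = ch
	return ch
}

// Get looks up the channel for an SSE connection.
func (m *StreamManager) Get(uuid string) (chan string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// Remove closes the channel and frees the slot once a crawl finishes.
func (m *StreamManager) Remove(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}

func main() {
	mgr := NewStreamManager()
	ch := mgr.Register("abc-123")
	ch <- "progress"
	fmt.Println(<-ch) // progress
	mgr.Remove("abc-123")
	_, ok := mgr.Get("abc-123")
	fmt.Println(ok) // false
}
```

The RWMutex lets many concurrent SSE readers look up channels without serializing on writers, which is what makes the isolated multi-user streams cheap.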
📂 File Structure
sitemap-api/
├── main.go # HTTP server setup, routes
├── go.mod # Go module dependencies
├── go.sum # Dependency checksums
│
├── handlers/
│ └── handler.go # All HTTP handlers
│ - GenerateSitemapXML # POST endpoint
│ - StreamSSE # SSE streaming
│ - DownloadSitemap # XML generation
│ - GetSites/GetSite # CRUD operations
│ - DeleteSite # Cleanup
│ - StreamManager # Concurrent stream management
│
├── crawler/
│ └── crawler.go # Web crawler implementation
│ - Crawl() # Main crawl logic
│ - crawlURL() # Recursive URL processing
│ - extractLinks() # HTML parsing
│ - normalizeURL() # URL canonicalization
│ - isSameDomain() # Domain checking
│ - calculatePriority() # Sitemap priority
│
├── database/
│ └── db.go # SQLite operations
│ - NewDB() # Initialize DB
│ - createTables() # Schema creation
│ - CreateSite() # Insert site record
│ - GetSiteByUUID() # Retrieve by UUID
│ - UpdateSiteStatus() # Mark complete
│ - AddPage() # Save discovered page
│ - GetPagesBySiteID() # Retrieve all pages
│ - DeleteSite() # Cascade delete
│
├── models/
│ └── site.go # Data structures
│ - Site # Site record
│ - Page # Page record
│ - Event # SSE event
│ - ProgressData # Progress payload
│ - CompleteData # Completion payload
│ - ErrorData # Error payload
│
├── static/
│ └── index.html # Frontend application
│ - SitemapGenerator # Main class
│ - generateSitemap() # Initiate crawl
│ - connectToStream() # SSE connection
│ - updateProgress() # Live updates
│ - downloadSitemap() # File download
│ - displaySites() # Results listing
│
├── README.md # Full documentation
├── QUICKSTART.md # Quick start guide
├── Makefile # Build automation
├── Dockerfile # Container setup
├── run.sh # Startup script
├── .gitignore # Git exclusions
└── .env.example # Environment template
🔄 Request Flow
1. Generate Sitemap Request
User fills form → POST /generate-sitemap-xml
↓
Server generates UUID
↓
Extract IP, UA, cookies, session
↓
Save to database (status: processing)
↓
Create SSE channel in StreamManager
↓
Spawn goroutine for crawler (non-blocking)
↓
Return UUID immediately to frontend
2. SSE Stream Connection
Frontend receives UUID → GET /stream/{uuid}
↓
StreamManager finds channel
↓
Send "connected" event
↓
Crawler sends events to channel
↓
Handler forwards to browser
↓
Frontend updates UI in real-time
3. Crawler Operation
Start from root URL → Fetch HTML
↓
Parse <a> tags for links
↓
Check: same domain? not visited?
↓
Save page to database (URL, depth, priority)
↓
Send "progress" event via channel
↓
Spawn goroutines for child URLs
↓
Repeat until max depth reached
↓
Send "complete" event
↓
Close channel, cleanup resources
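The "same domain? not visited?" check depends on consistent URL canonicalization; otherwise `https://Example.com/about/` and `https://example.com/about#team` would be crawled as three different pages. A sketch of the two helpers (names match the file-structure listing above, bodies are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL lowercases the host, drops fragments, and trims trailing
// slashes so the visited-set treats equivalent URLs as duplicates.
// Path case is kept, since paths are case-sensitive.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = "" // #section anchors point to the same page
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// isSameDomain keeps the crawl on the starting host.
func isSameDomain(root, candidate string) bool {
	ru, err1 := url.Parse(root)
	cu, err2 := url.Parse(candidate)
	if err1 != nil || err2 != nil {
		return false
	}
	return strings.EqualFold(ru.Host, cu.Host)
}

func main() {
	n, _ := normalizeURL("https://Example.com/About/#team")
	fmt.Println(n) // https://example.com/About
	fmt.Println(isSameDomain("https://example.com", "https://example.com/contact")) // true
	fmt.Println(isSameDomain("https://example.com", "https://other.com/page"))      // false
}
```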
4. Download Request
User clicks download → GET /download/{uuid}
↓
Lookup site by UUID
↓
Fetch all pages from database
↓
Generate XML sitemap
↓
Set Content-Disposition header
↓
Stream XML to browser
🔐 Security Considerations
Implemented
- ✅ Same-domain restriction (no external crawling)
- ✅ Max depth limit (prevents infinite loops)
- ✅ HTTP timeout per request (10 seconds)
- ✅ Duplicate URL prevention
- ✅ SQLite prepared statements (SQL injection safe)
- ✅ CORS middleware included
Recommended for Production
- Rate limiting per IP
- Authentication/API keys
- Input validation & sanitization
- Request size limits
- robots.txt respect
- User-Agent identification
- HTTPS enforcement
- Firewall rules
🚀 Performance Optimization
Current
- Concurrent goroutines (5 parallel requests default)
- Non-blocking SSE streams
- Efficient channel-based communication
- In-memory visited URL tracking
- Database connection pooling
Possible Improvements
- Redis for distributed crawling
- Worker pool pattern
- Content caching
- Incremental sitemap updates
- Compression for large sitemaps
- Database indexing optimization
📊 Database Schema
sites table
- id (PK) - Auto-increment
- uuid (UNIQUE) - Server-generated UUID
- domain - Extracted from URL
- url - Full starting URL
- max_depth - Crawl depth limit
- page_count - Total pages found
- status - processing/completed/failed
- ip_address - Client IP
- user_agent - Full UA string
- browser - Parsed browser name
- browser_version - Version number
- os - Operating system
- device_type - Desktop/Mobile/Tablet
- session_id - Cookie-based session
- cookies - JSON of all cookies
- referrer - HTTP Referer header
- created_at - Timestamp
- completed_at - Completion timestamp
- last_crawled - Last activity
pages table
- id (PK) - Auto-increment
- site_id (FK) - References sites(id)
- url - Page URL (UNIQUE)
- depth - Crawl depth level
- last_modified - Discovery time
- priority - Sitemap priority (0.0-1.0)
- change_freq - Change frequency hint (daily/weekly/monthly, etc.)
sessions table
- id (PK) - Auto-increment
- session_id (UNIQUE) - Session UUID
- uuid (FK) - References sites(uuid)
- ip_address - Client IP
- created_at - First seen
- last_activity - Last request
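The `priority` column in the pages table can be derived from crawl depth. This is one plausible mapping, not necessarily the project's exact formula: the root page gets 1.0 and each level below it drops by 0.2, floored at 0.1.

```go
package main

import "fmt"

// calculatePriority maps crawl depth to a 0.0-1.0 sitemap priority:
// root = 1.0, minus 0.2 per level, never below 0.1.
func calculatePriority(depth int) float64 {
	p := 1.0 - 0.2*float64(depth)
	if p < 0.1 {
		p = 0.1
	}
	return p
}

func main() {
	for d := 0; d <= 5; d++ {
		fmt.Printf("depth %d -> priority %.1f\n", d, calculatePriority(d))
	}
}
```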
🧪 Testing
Manual Testing
# Terminal 1: Start server
./run.sh
# Terminal 2: Test API
curl -X POST http://localhost:8080/generate-sitemap-xml \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com","max_depth":2}'
# Terminal 3: Watch SSE stream
curl -N http://localhost:8080/stream/{uuid}
Browser Testing
- Open multiple tabs to http://localhost:8080
- Start different crawls simultaneously
- Verify independent progress tracking
- Check database for metadata
Database Verification
sqlite3 sitemap.db "SELECT * FROM sites ORDER BY created_at DESC LIMIT 5;"
sqlite3 sitemap.db "SELECT COUNT(*) FROM pages WHERE site_id = 1;"
📦 Deployment Options
Option 1: Binary
go build -o sitemap-api
./sitemap-api
Option 2: Docker
docker build -t sitemap-api .
docker run -p 8080:8080 sitemap-api
Option 3: Systemd Service
[Unit]
Description=Sitemap Generator API
After=network.target
[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/sitemap-api
ExecStart=/opt/sitemap-api/sitemap-api
Restart=always
[Install]
WantedBy=multi-user.target
🔧 Configuration
Environment Variables
export PORT=8080 # Server port
export DB_PATH=sitemap.db # Database file
Code Constants
// crawler/crawler.go
const maxConcurrent = 5 // Parallel requests
const httpTimeout = 10 // Seconds
// handlers/handler.go
const channelBuffer = 100 // SSE event buffer
📝 XML Sitemap Format
Generated sitemaps follow the standard:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2024-02-05</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about</loc>
<lastmod>2024-02-05</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
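A sitemap in this format falls out of `encoding/xml` with two structs. A sketch of the `/download/{uuid}` generation step (the real handler fills the slice from the pages table; `Priority` is kept as a string so "1.0" renders exactly as in the sample, where a float64 would print "1"):

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// URL and URLSet mirror the sitemap protocol elements shown above.
type URL struct {
	Loc        string `xml:"loc"`
	LastMod    string `xml:"lastmod"`
	ChangeFreq string `xml:"changefreq"`
	Priority   string `xml:"priority"`
}

type URLSet struct {
	XMLName xml.Name `xml:"urlset"`
	Xmlns   string   `xml:"xmlns,attr"`
	URLs    []URL    `xml:"url"`
}

// renderSitemap marshals the page list into the standard sitemap document.
func renderSitemap(urls []URL) (string, error) {
	out, err := xml.MarshalIndent(URLSet{
		Xmlns: "http://www.sitemaps.org/schemas/sitemap/0.9",
		URLs:  urls,
	}, "", "  ")
	if err != nil {
		return "", err
	}
	return xml.Header + string(out) + "\n", nil
}

func main() {
	s, _ := renderSitemap([]URL{
		{Loc: "https://example.com/", LastMod: "2024-02-05", ChangeFreq: "monthly", Priority: "1.0"},
		{Loc: "https://example.com/about", LastMod: "2024-02-05", ChangeFreq: "monthly", Priority: "0.8"},
	})
	fmt.Print(s)
}
```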
🎯 Success Criteria
All requirements met:
- ✅ Go backend with excellent performance
- ✅ Endpoint: /generate-sitemap-xml with UUID response
- ✅ Endpoint: /stream/{uuid} for SSE
- ✅ Endpoint: /download/{uuid} for XML
- ✅ Multi-user concurrent support
- ✅ Client metadata tracking (IP, browser, cookies, session)
- ✅ SQLite storage
- ✅ Root route / serves HTML
- ✅ Real-time progress updates
- ✅ Clean, maintainable code structure
📚 Next Steps
To extend this project:
- Add user authentication (JWT tokens)
- Implement rate limiting (go-rate package)
- Add robots.txt parsing (robotstxt.go package)
- Support sitemap index for large sites
- Add scheduling/cron jobs for recurring crawls
- Implement incremental updates
- Add webhook notifications
- Create admin dashboard
- Export to other formats (JSON, CSV)
- Add analytics and usage stats
Ready to use! Just run ./run.sh or make run to get started.