# 🗺️ XML Sitemap Generator - Complete Implementation

## Project Overview

A production-ready Go API for generating XML sitemaps with real-time progress tracking. Built with concurrent crawling, SSE streaming, and comprehensive client metadata tracking.

## ✨ Key Features Implemented

### 1. **Backend-Generated UUID System**

- Server generates unique UUID for each crawl request
- UUID used for SSE stream connection and file download
- Enables true multi-user support with isolated streams

### 2. **Server-Sent Events (SSE) Streaming**

- Real-time progress updates via `/stream/{uuid}`
- Event types: `connected`, `started`, `progress`, `complete`, `error`
- Non-blocking concurrent stream management
- Automatic cleanup after completion

### 3. **Concurrent Web Crawler**

- Goroutine-based parallel crawling
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited crawling (1-5 levels)
- Same-domain restriction with URL normalization
- Duplicate detection and prevention

### 4. **Client Metadata Tracking**

Automatically captured and stored in SQLite:

- IP Address (with X-Forwarded-For support)
- User-Agent string
- Browser name & version (Chrome, Firefox, Safari, Edge, Opera)
- Operating System (Windows, macOS, Linux, Android, iOS)
- Device Type (Desktop, Mobile, Tablet)
- Session ID (cookie-based persistence)
- All cookies (JSON-encoded)
- HTTP Referrer

### 5. **RESTful API Endpoints**

```
POST   /generate-sitemap-xml  → Start crawl, returns UUID
GET    /stream/{uuid}         → SSE progress stream
GET    /download/{uuid}       → Download XML sitemap
GET    /sites                 → List all sitemaps
GET    /sites/{id}            → Get specific site
DELETE /sites/{id}            → Delete sitemap
GET    /health                → Health check
GET    /                      → Serve frontend HTML
```

### 6. **Beautiful Frontend UI**

- Responsive gradient design
- Real-time progress visualization
- Live connection status indicator
- Crawl statistics (pages found, depth, time)
- Activity log with color-coded entries
- Site management (view, download, delete)
- Auto-protocol addition for URLs

## 🏗️ Architecture

```
┌─────────────┐
│   Browser   │
│  (Frontend) │
└──────┬──────┘
       │ POST /generate-sitemap-xml
       ↓
┌──────────────────────────────────┐
│   Go HTTP Server (Chi Router)    │
│                                  │
│  ┌────────────────────────────┐  │
│  │  Handler (handler.go)      │  │
│  │  - Generate UUID           │  │
│  │  - Extract metadata        │  │
│  │  - Create DB record        │  │
│  │  - Spawn crawler           │  │
│  │  - Return UUID immediately │  │
│  └─────────────┬──────────────┘  │
└────────────────┼─────────────────┘
                 │
       ┌─────────┴─────────┐
       │                   │
       ↓                   ↓
┌──────────────┐   ┌───────────────┐
│ StreamManager│   │   Crawler     │
│              │   │               │
│ UUID → Chan  │   │ Goroutines    │
│ Map storage  │←──│ Concurrent    │
│              │   │ HTTP requests │
└──────┬───────┘   └───────┬───────┘
       │                   │
       │ SSE Events        │ Save pages
       ↓                   ↓
┌──────────────────────────────────┐
│         SQLite Database          │
│  - sites (with metadata)         │
│  - pages (discovered URLs)       │
│  - sessions (tracking)           │
└──────────────────────────────────┘
```

## 📂 File Structure

```
sitemap-api/
├── main.go                      # HTTP server setup, routes
├── go.mod                       # Go module dependencies
├── go.sum                       # Dependency checksums
│
├── handlers/
│   └── handler.go               # All HTTP handlers
│       - GenerateSitemapXML     # POST endpoint
│       - StreamSSE              # SSE streaming
│       - DownloadSitemap        # XML generation
│       - GetSites/GetSite       # CRUD operations
│       - DeleteSite             # Cleanup
│       - StreamManager          # Concurrent stream management
│
├── crawler/
│   └── crawler.go               # Web crawler implementation
│       - Crawl()                # Main crawl logic
│       - crawlURL()             # Recursive URL processing
│       - extractLinks()         # HTML parsing
│       - normalizeURL()         # URL canonicalization
│       - isSameDomain()         # Domain checking
│       - calculatePriority()    # Sitemap priority
│
├── database/
│   └── db.go                    # SQLite operations
│       - NewDB()                # Initialize DB
│       - createTables()         # Schema creation
│       - CreateSite()           # Insert site record
│       - GetSiteByUUID()        # Retrieve by UUID
│       - UpdateSiteStatus()     # Mark complete
│       - AddPage()              # Save discovered page
│       - GetPagesBySiteID()     # Retrieve all pages
│       - DeleteSite()           # Cascade delete
│
├── models/
│   └── site.go                  # Data structures
│       - Site                   # Site record
│       - Page                   # Page record
│       - Event                  # SSE event
│       - ProgressData           # Progress payload
│       - CompleteData           # Completion payload
│       - ErrorData              # Error payload
│
├── static/
│   └── index.html               # Frontend application
│       - SitemapGenerator       # Main class
│       - generateSitemap()      # Initiate crawl
│       - connectToStream()      # SSE connection
│       - updateProgress()       # Live updates
│       - downloadSitemap()      # File download
│       - displaySites()         # Results listing
│
├── README.md                    # Full documentation
├── QUICKSTART.md                # Quick start guide
├── Makefile                     # Build automation
├── Dockerfile                   # Container setup
├── run.sh                       # Startup script
├── .gitignore                   # Git exclusions
└── .env.example                 # Environment template
```

## 🔄 Request Flow

### 1. Generate Sitemap Request

```
User fills form → POST /generate-sitemap-xml
    ↓
Server generates UUID
    ↓
Extract IP, UA, cookies, session
    ↓
Save to database (status: processing)
    ↓
Create SSE channel in StreamManager
    ↓
Spawn goroutine for crawler (non-blocking)
    ↓
Return UUID immediately to frontend
```

### 2. SSE Stream Connection

```
Frontend receives UUID → GET /stream/{uuid}
    ↓
StreamManager finds channel
    ↓
Send "connected" event
    ↓
Crawler sends events to channel
    ↓
Handler forwards to browser
    ↓
Frontend updates UI in real-time
```

### 3. Crawler Operation

```
Start from root URL → Fetch HTML
    ↓
Parse <a> tags for links
    ↓
Check: same domain? not visited?
    ↓
Save page to database (URL, depth, priority)
    ↓
Send "progress" event via channel
    ↓
Spawn goroutines for child URLs
    ↓
Repeat until max depth reached
    ↓
Send "complete" event
    ↓
Close channel, cleanup resources
```

### 4. Download Request

```
User clicks download → GET /download/{uuid}
    ↓
Lookup site by UUID
    ↓
Fetch all pages from database
    ↓
Generate XML sitemap
    ↓
Set Content-Disposition header
    ↓
Stream XML to browser
```

## 🔐 Security Considerations

### Implemented

- ✅ Same-domain restriction (no external crawling)
- ✅ Max depth limit (prevents infinite loops)
- ✅ HTTP timeout per request (10 seconds)
- ✅ Duplicate URL prevention
- ✅ SQLite prepared statements (SQL injection safe)
- ✅ CORS middleware included

### Recommended for Production

- [ ] Rate limiting per IP
- [ ] Authentication/API keys
- [ ] Input validation & sanitization
- [ ] Request size limits
- [ ] robots.txt respect
- [ ] User-Agent identification
- [ ] HTTPS enforcement
- [ ] Firewall rules

## 🚀 Performance Optimization

### Current

- Concurrent goroutines (5 parallel requests default)
- Non-blocking SSE streams
- Efficient channel-based communication
- In-memory visited URL tracking
- Database connection pooling

### Possible Improvements

- Redis for distributed crawling
- Worker pool pattern
- Content caching
- Incremental sitemap updates
- Compression for large sitemaps
- Database indexing optimization

## 📊 Database Schema

### sites table

```sql
- id (PK)          - Auto-increment
- uuid (UNIQUE)    - Server-generated UUID
- domain           - Extracted from URL
- url              - Full starting URL
- max_depth        - Crawl depth limit
- page_count       - Total pages found
- status           - processing/completed/failed
- ip_address       - Client IP
- user_agent       - Full UA string
- browser          - Parsed browser name
- browser_version  - Version number
- os               - Operating system
- device_type      - Desktop/Mobile/Tablet
- session_id       - Cookie-based session
- cookies          - JSON of all cookies
- referrer         - HTTP Referer header
- created_at       - Timestamp
- completed_at     - Completion timestamp
- last_crawled     - Last activity
```

### pages table

```sql
- id (PK)          - Auto-increment
- site_id (FK)     - References sites(id)
- url              - Page URL (UNIQUE)
- depth            - Crawl depth level
- last_modified    - Discovery time
- priority         - Sitemap priority (0.0-1.0)
- change_freq      - monthly/weekly/daily/etc
```

### sessions table

```sql
- id (PK)              - Auto-increment
- session_id (UNIQUE)  - Session UUID
- uuid (FK)            - References sites(uuid)
- ip_address           - Client IP
- created_at           - First seen
- last_activity        - Last request
```

## 🧪 Testing

### Manual Testing

```bash
# Terminal 1: Start server
./run.sh

# Terminal 2: Test API
curl -X POST http://localhost:8080/generate-sitemap-xml \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","max_depth":2}'

# Terminal 3: Watch SSE stream
curl -N http://localhost:8080/stream/{uuid}
```

### Browser Testing

1. Open multiple tabs to http://localhost:8080
2. Start different crawls simultaneously
3. Verify independent progress tracking
4. Check database for metadata

### Database Verification

```bash
sqlite3 sitemap.db "SELECT * FROM sites ORDER BY created_at DESC LIMIT 5;"
sqlite3 sitemap.db "SELECT COUNT(*) FROM pages WHERE site_id = 1;"
```

## 📦 Deployment Options

### Option 1: Binary

```bash
go build -o sitemap-api
./sitemap-api
```

### Option 2: Docker

```bash
docker build -t sitemap-api .
docker run -p 8080:8080 sitemap-api
```

### Option 3: Systemd Service

```ini
[Unit]
Description=Sitemap Generator API
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/sitemap-api
ExecStart=/opt/sitemap-api/sitemap-api
Restart=always

[Install]
WantedBy=multi-user.target
```

## 🔧 Configuration

### Environment Variables

```bash
export PORT=8080            # Server port
export DB_PATH=sitemap.db   # Database file
```

### Code Constants

```go
// crawler/crawler.go
const maxConcurrent = 5 // Parallel requests
const httpTimeout = 10  // Seconds

// handlers/handler.go
const channelBuffer = 100 // SSE event buffer
```

## 📝 XML Sitemap Format

Generated sitemaps follow the standard:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

## 🎯 Success Criteria

All requirements met:

- ✅ Go backend with excellent performance
- ✅ Endpoint: `/generate-sitemap-xml` with UUID response
- ✅ Endpoint: `/stream/{uuid}` for SSE
- ✅ Endpoint: `/download/{uuid}` for XML
- ✅ Multi-user concurrent support
- ✅ Client metadata tracking (IP, browser, cookies, session)
- ✅ SQLite storage
- ✅ Root route `/` serves HTML
- ✅ Real-time progress updates
- ✅ Clean, maintainable code structure

## 📚 Next Steps

To extend this project:

1. Add user authentication (JWT tokens)
2. Implement rate limiting (go-rate package)
3. Add robots.txt parsing (robotstxt.go package)
4. Support sitemap index for large sites
5. Add scheduling/cron jobs for recurring crawls
6. Implement incremental updates
7. Add webhook notifications
8. Create admin dashboard
9. Export to other formats (JSON, CSV)
10. Add analytics and usage stats

---

**Ready to use! Just run `./run.sh` or `make run` to get started.**
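
For readers who want a concrete picture of the `StreamManager` (the UUID → channel map named in the architecture), here is a minimal sketch in Go. It is not the code in `handlers/handler.go`: the `Event` fields and the `NewStreamManager`, `Create`, `Get`, and `Close` names are assumptions made for illustration, and the real `models.Event` may look different. It only shows one plausible way to keep per-crawl streams isolated behind a mutex-guarded map.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical SSE payload; the project's models.Event may differ.
type Event struct {
	Type string // e.g. "started", "progress", "complete"
	Data string
}

// StreamManager maps crawl UUIDs to buffered event channels.
// (Illustrative only; method names are assumptions.)
type StreamManager struct {
	mu      sync.RWMutex
	streams map[string]chan Event
}

// NewStreamManager returns an empty manager.
func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan Event)}
}

// Create registers a buffered channel for uuid and returns it.
func (m *StreamManager) Create(uuid string, buffer int) chan Event {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan Event, buffer)
	m.streams[uuid] = ch
	return ch
}

// Get looks up the channel registered for uuid.
func (m *StreamManager) Get(uuid string) (chan Event, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// Close removes the stream and closes its channel so readers stop ranging.
func (m *StreamManager) Close(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}

func main() {
	sm := NewStreamManager()
	ch := sm.Create("demo-uuid", 100) // 100 mirrors the channelBuffer constant

	// Stand-in for the crawler goroutine: emit events, then clean up.
	go func() {
		ch <- Event{Type: "started", Data: "https://example.com"}
		ch <- Event{Type: "progress", Data: "1 page found"}
		ch <- Event{Type: "complete", Data: "done"}
		sm.Close("demo-uuid")
	}()

	// Stand-in for the SSE handler: forward events in SSE wire format.
	for ev := range ch {
		fmt.Printf("event: %s\ndata: %s\n\n", ev.Type, ev.Data)
	}
}
```

Closing the channel from the producer side lets the handler's `range` loop end on its own, which is one way to get the "automatic cleanup after completion" behavior. In this sketch the producer must not send after calling `Close`, since sending on a closed channel panics.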