# 🗺️ XML Sitemap Generator - Complete Implementation

## Project Overview

A production-ready Go API for generating XML sitemaps with real-time progress tracking. Built with concurrent crawling, SSE streaming, and comprehensive client metadata tracking.

## ✨ Key Features Implemented
### 1. **Backend-Generated UUID System**

- Server generates a unique UUID for each crawl request
- UUID is used for the SSE stream connection and the file download
- Enables true multi-user support with isolated streams

### 2. **Server-Sent Events (SSE) Streaming**

- Real-time progress updates via `/stream/{uuid}`
- Event types: `connected`, `started`, `progress`, `complete`, `error`
- Non-blocking concurrent stream management
- Automatic cleanup after completion

### 3. **Concurrent Web Crawler**

- Goroutine-based parallel crawling
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited crawling (1-5 levels)
- Same-domain restriction with URL normalization
- Duplicate detection and prevention

### 4. **Client Metadata Tracking**

Automatically captured and stored in SQLite:

- IP address (with `X-Forwarded-For` support)
- User-Agent string
- Browser name & version (Chrome, Firefox, Safari, Edge, Opera)
- Operating system (Windows, macOS, Linux, Android, iOS)
- Device type (Desktop, Mobile, Tablet)
- Session ID (cookie-based persistence)
- All cookies (JSON-encoded)
- HTTP referrer

### 5. **RESTful API Endpoints**

```
POST   /generate-sitemap-xml  → Start crawl, returns UUID
GET    /stream/{uuid}         → SSE progress stream
GET    /download/{uuid}       → Download XML sitemap
GET    /sites                 → List all sitemaps
GET    /sites/{id}            → Get specific site
DELETE /sites/{id}            → Delete sitemap
GET    /health                → Health check
GET    /                      → Serve frontend HTML
```

### 6. **Beautiful Frontend UI**

- Responsive gradient design
- Real-time progress visualization
- Live connection status indicator
- Crawl statistics (pages found, depth, time)
- Activity log with color-coded entries
- Site management (view, download, delete)
- Auto-protocol addition for URLs

## 🏗️ Architecture

```
┌─────────────┐
│   Browser   │
│ (Frontend)  │
└──────┬──────┘
       │ POST /generate-sitemap-xml
       ↓
┌──────────────────────────────────┐
│   Go HTTP Server (Chi Router)    │
│                                  │
│  ┌────────────────────────────┐  │
│  │  Handler (handler.go)      │  │
│  │  - Generate UUID           │  │
│  │  - Extract metadata        │  │
│  │  - Create DB record        │  │
│  │  - Spawn crawler           │  │
│  │  - Return UUID immediately │  │
│  └─────────────┬──────────────┘  │
└────────────────┼─────────────────┘
                 │
       ┌─────────┴─────────┐
       │                   │
       ↓                   ↓
┌──────────────┐   ┌───────────────┐
│ StreamManager│   │    Crawler    │
│              │   │               │
│ UUID → Chan  │   │  Goroutines   │
│ Map storage  │←──│  Concurrent   │
│              │   │ HTTP requests │
└──────┬───────┘   └───────┬───────┘
       │                   │
       │ SSE Events        │ Save pages
       ↓                   ↓
┌──────────────────────────────────┐
│         SQLite Database          │
│  - sites (with metadata)         │
│  - pages (discovered URLs)       │
│  - sessions (tracking)           │
└──────────────────────────────────┘
```
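
The StreamManager at the center of the diagram can be little more than a mutex-guarded map from UUID to event channel. A minimal sketch — field and method names are illustrative, not the project's exact API:

```go
package main

import (
	"fmt"
	"sync"
)

// Event mirrors the SSE payload: a type tag plus arbitrary data.
type Event struct {
	Type string
	Data any
}

// StreamManager maps each crawl UUID to its own event channel, so any
// number of crawls can stream progress independently.
type StreamManager struct {
	mu      sync.Mutex
	streams map[string]chan Event
}

func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan Event)}
}

// Create registers a buffered channel for a UUID; the buffer lets the
// crawler run ahead of a slow (or not-yet-connected) SSE client.
func (m *StreamManager) Create(uuid string) chan Event {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan Event, 100)
	m.streams[uuid] = ch
	return ch
}

// Get returns the channel registered for a UUID, if any.
func (m *StreamManager) Get(uuid string) (chan Event, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// Close removes the stream and closes its channel so the SSE handler's
// range loop terminates — the "automatic cleanup after completion" step.
func (m *StreamManager) Close(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}

func main() {
	sm := NewStreamManager()
	ch := sm.Create("abc-123")
	ch <- Event{Type: "progress", Data: 1}
	sm.Close("abc-123")
	for ev := range ch {
		fmt.Println(ev.Type) // progress
	}
}
```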

## 📂 File Structure

```
sitemap-api/
├── main.go                 # HTTP server setup, routes
├── go.mod                  # Go module dependencies
├── go.sum                  # Dependency checksums
│
├── handlers/
│   └── handler.go          # All HTTP handlers
│       - GenerateSitemapXML    # POST endpoint
│       - StreamSSE             # SSE streaming
│       - DownloadSitemap       # XML generation
│       - GetSites/GetSite      # CRUD operations
│       - DeleteSite            # Cleanup
│       - StreamManager         # Concurrent stream management
│
├── crawler/
│   └── crawler.go          # Web crawler implementation
│       - Crawl()               # Main crawl logic
│       - crawlURL()            # Recursive URL processing
│       - extractLinks()        # HTML parsing
│       - normalizeURL()        # URL canonicalization
│       - isSameDomain()        # Domain checking
│       - calculatePriority()   # Sitemap priority
│
├── database/
│   └── db.go               # SQLite operations
│       - NewDB()               # Initialize DB
│       - createTables()        # Schema creation
│       - CreateSite()          # Insert site record
│       - GetSiteByUUID()       # Retrieve by UUID
│       - UpdateSiteStatus()    # Mark complete
│       - AddPage()             # Save discovered page
│       - GetPagesBySiteID()    # Retrieve all pages
│       - DeleteSite()          # Cascade delete
│
├── models/
│   └── site.go             # Data structures
│       - Site                  # Site record
│       - Page                  # Page record
│       - Event                 # SSE event
│       - ProgressData          # Progress payload
│       - CompleteData          # Completion payload
│       - ErrorData             # Error payload
│
├── static/
│   └── index.html          # Frontend application
│       - SitemapGenerator      # Main class
│       - generateSitemap()     # Initiate crawl
│       - connectToStream()     # SSE connection
│       - updateProgress()      # Live updates
│       - downloadSitemap()     # File download
│       - displaySites()        # Results listing
│
├── README.md               # Full documentation
├── QUICKSTART.md           # Quick start guide
├── Makefile                # Build automation
├── Dockerfile              # Container setup
├── run.sh                  # Startup script
├── .gitignore              # Git exclusions
└── .env.example            # Environment template
```

## 🔄 Request Flow

### 1. Generate Sitemap Request

```
User fills form → POST /generate-sitemap-xml
        ↓
Server generates UUID
        ↓
Extract IP, UA, cookies, session
        ↓
Save to database (status: processing)
        ↓
Create SSE channel in StreamManager
        ↓
Spawn goroutine for crawler (non-blocking)
        ↓
Return UUID immediately to frontend
```
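
The steps above compress into a small handler. A stdlib-only sketch — the project uses Chi and (presumably) a UUID library, so here a version-4 UUID is hand-rolled from `crypto/rand` and the crawler spawn is stubbed:

```go
package main

import (
	"crypto/rand"
	"encoding/json"
	"fmt"
	"net/http"
)

// newUUID builds a random RFC 4122 version-4 UUID from crypto/rand.
func newUUID() string {
	b := make([]byte, 16)
	rand.Read(b)
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // variant 10
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

type generateRequest struct {
	URL      string `json:"url"`
	MaxDepth int    `json:"max_depth"`
}

// generateSitemap follows the flow above: validate, mint a UUID, start
// the crawl in a goroutine, and return the UUID without waiting.
func generateSitemap(w http.ResponseWriter, r *http.Request) {
	var req generateRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.URL == "" {
		http.Error(w, "invalid request", http.StatusBadRequest)
		return
	}
	id := newUUID()
	// Real handler: extract client metadata, insert the sites row with
	// status "processing", and register the SSE channel under id here.
	go func() { /* crawler.Crawl(req.URL, req.MaxDepth, ...) */ }()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"uuid": id})
}

func main() {
	fmt.Println(len(newUUID())) // 36
}
```

With Chi this wires up as `r.Post("/generate-sitemap-xml", generateSitemap)`.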

### 2. SSE Stream Connection

```
Frontend receives UUID → GET /stream/{uuid}
        ↓
StreamManager finds channel
        ↓
Send "connected" event
        ↓
Crawler sends events to channel
        ↓
Handler forwards to browser
        ↓
Frontend updates UI in real time
```

### 3. Crawler Operation

```
Start from root URL → Fetch HTML
        ↓
Parse <a> tags for links
        ↓
Check: same domain? not visited?
        ↓
Save page to database (URL, depth, priority)
        ↓
Send "progress" event via channel
        ↓
Spawn goroutines for child URLs
        ↓
Repeat until max depth reached
        ↓
Send "complete" event
        ↓
Close channel, cleanup resources
```
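
The crawl loop above boils down to three concurrency primitives: a semaphore channel to cap parallel fetches, a mutex-guarded visited set, and a WaitGroup to know when the crawl is done. A network-free sketch with an injected fetch function (all names illustrative; the real `crawler.go` fetches and parses HTML here):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Crawler sketches the concurrency pattern only: fetch is injected so
// the example runs without network access.
type Crawler struct {
	fetch    func(url string) []string // returns same-domain links on the page
	maxDepth int
	sem      chan struct{} // caps parallel fetches
	mu       sync.Mutex    // guards visited
	visited  map[string]bool
	wg       sync.WaitGroup
}

func NewCrawler(fetch func(string) []string, maxDepth, concurrency int) *Crawler {
	return &Crawler{
		fetch:    fetch,
		maxDepth: maxDepth,
		sem:      make(chan struct{}, concurrency),
		visited:  make(map[string]bool),
	}
}

// Crawl blocks until every spawned goroutine has finished, then returns
// the discovered URLs in sorted order.
func (c *Crawler) Crawl(root string) []string {
	c.visit(root, 1)
	c.wg.Wait()
	c.mu.Lock()
	defer c.mu.Unlock()
	urls := make([]string, 0, len(c.visited))
	for u := range c.visited {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	return urls
}

func (c *Crawler) visit(url string, depth int) {
	c.mu.Lock()
	if depth > c.maxDepth || c.visited[url] {
		c.mu.Unlock()
		return // duplicate detection and depth limit in one check
	}
	c.visited[url] = true
	c.mu.Unlock()

	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		c.sem <- struct{}{} // acquire a concurrency slot
		links := c.fetch(url)
		<-c.sem // release the slot
		for _, l := range links {
			c.visit(l, depth+1)
		}
	}()
}

func main() {
	site := map[string][]string{
		"/":    {"/a", "/b"},
		"/a":   {"/a/1"},
		"/b":   {"/"},
		"/a/1": {},
	}
	c := NewCrawler(func(u string) []string { return site[u] }, 3, 5)
	fmt.Println(c.Crawl("/")) // [/ /a /a/1 /b]
}
```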

### 4. Download Request

```
User clicks download → GET /download/{uuid}
        ↓
Look up site by UUID
        ↓
Fetch all pages from database
        ↓
Generate XML sitemap
        ↓
Set Content-Disposition header
        ↓
Stream XML to browser
```

## 🔐 Security Considerations

### Implemented

- ✅ Same-domain restriction (no external crawling)
- ✅ Max depth limit (prevents infinite loops)
- ✅ HTTP timeout per request (10 seconds)
- ✅ Duplicate URL prevention
- ✅ SQLite prepared statements (SQL-injection safe)
- ✅ CORS middleware included

### Recommended for Production

- [ ] Rate limiting per IP
- [ ] Authentication/API keys
- [ ] Input validation & sanitization
- [ ] Request size limits
- [ ] robots.txt compliance
- [ ] User-Agent identification
- [ ] HTTPS enforcement
- [ ] Firewall rules

## 🚀 Performance Optimization

### Current

- Concurrent goroutines (default: 5 parallel requests)
- Non-blocking SSE streams
- Efficient channel-based communication
- In-memory visited-URL tracking
- Database connection pooling

### Possible Improvements

- Redis for distributed crawling
- Worker-pool pattern
- Content caching
- Incremental sitemap updates
- Compression for large sitemaps
- Database indexing optimization

## 📊 Database Schema

### sites table

```
id              (PK)      Auto-increment
uuid            (UNIQUE)  Server-generated UUID
domain                    Extracted from URL
url                       Full starting URL
max_depth                 Crawl depth limit
page_count                Total pages found
status                    processing / completed / failed
ip_address                Client IP
user_agent                Full UA string
browser                   Parsed browser name
browser_version           Version number
os                        Operating system
device_type               Desktop / Mobile / Tablet
session_id                Cookie-based session
cookies                   JSON of all cookies
referrer                  HTTP Referer header
created_at                Timestamp
completed_at              Completion timestamp
last_crawled              Last activity
```

### pages table

```
id              (PK)      Auto-increment
site_id         (FK)      References sites(id)
url             (UNIQUE)  Page URL
depth                     Crawl depth level
last_modified             Discovery time
priority                  Sitemap priority (0.0-1.0)
change_freq               monthly / weekly / daily / etc.
```

### sessions table

```
id              (PK)      Auto-increment
session_id      (UNIQUE)  Session UUID
uuid            (FK)      References sites(uuid)
ip_address                Client IP
created_at                First seen
last_activity             Last request
```

## 🧪 Testing

### Manual Testing

```bash
# Terminal 1: start the server
./run.sh

# Terminal 2: test the API
curl -X POST http://localhost:8080/generate-sitemap-xml \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","max_depth":2}'

# Terminal 3: watch the SSE stream (substitute the UUID returned in step 2)
curl -N http://localhost:8080/stream/{uuid}
```

### Browser Testing

1. Open multiple tabs to http://localhost:8080
2. Start different crawls simultaneously
3. Verify independent progress tracking
4. Check the database for metadata

### Database Verification

```bash
sqlite3 sitemap.db "SELECT * FROM sites ORDER BY created_at DESC LIMIT 5;"
sqlite3 sitemap.db "SELECT COUNT(*) FROM pages WHERE site_id = 1;"
```

## 📦 Deployment Options

### Option 1: Binary

```bash
go build -o sitemap-api
./sitemap-api
```

### Option 2: Docker

```bash
docker build -t sitemap-api .
docker run -p 8080:8080 sitemap-api
```

### Option 3: systemd Service

```ini
[Unit]
Description=Sitemap Generator API
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/sitemap-api
ExecStart=/opt/sitemap-api/sitemap-api
Restart=always

[Install]
WantedBy=multi-user.target
```

## 🔧 Configuration

### Environment Variables

```bash
export PORT=8080          # Server port
export DB_PATH=sitemap.db # Database file
```

### Code Constants

```go
// crawler/crawler.go
const maxConcurrent = 5   // Parallel requests
const httpTimeout = 10    // Seconds

// handlers/handler.go
const channelBuffer = 100 // SSE event buffer
```

## 📝 XML Sitemap Format

Generated sitemaps follow the sitemaps.org 0.9 standard:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
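
The priority values in this example (1.0 for the root, 0.8 one level down) suggest a depth-based scheme. One plausible sketch of `calculatePriority` — an assumption for illustration, since the actual formula lives in `crawler/crawler.go`:

```go
package main

import "fmt"

// calculatePriority assigns sitemap priority from crawl depth: the root
// gets 1.0, each extra level costs 0.2, floored at 0.1 so no page ever
// reaches zero. (Illustrative scheme, not necessarily the project's.)
func calculatePriority(depth int) float64 {
	p := 1.0 - 0.2*float64(depth-1)
	if p < 0.1 {
		p = 0.1
	}
	return p
}

func main() {
	for d := 1; d <= 6; d++ {
		fmt.Printf("depth %d → %.1f\n", d, calculatePriority(d))
	}
}
```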

## 🎯 Success Criteria

All requirements met:

- ✅ Go backend with excellent performance
- ✅ Endpoint: `/generate-sitemap-xml` with UUID response
- ✅ Endpoint: `/stream/{uuid}` for SSE
- ✅ Endpoint: `/download/{uuid}` for XML
- ✅ Multi-user concurrent support
- ✅ Client metadata tracking (IP, browser, cookies, session)
- ✅ SQLite storage
- ✅ Root route `/` serves HTML
- ✅ Real-time progress updates
- ✅ Clean, maintainable code structure

## 📚 Next Steps

To extend this project:

1. Add user authentication (JWT tokens)
2. Implement rate limiting (e.g. `golang.org/x/time/rate`)
3. Add robots.txt parsing (e.g. the `temoto/robotstxt` package)
4. Support a sitemap index for large sites
5. Add scheduling/cron jobs for recurring crawls
6. Implement incremental updates
7. Add webhook notifications
8. Create an admin dashboard
9. Export to other formats (JSON, CSV)
10. Add analytics and usage stats

---

**Ready to use! Just run `./run.sh` or `make run` to get started.**