# 🗺️ XML Sitemap Generator - Complete Implementation

## Project Overview

A production-ready Go API for generating XML sitemaps with real-time progress tracking. Built with concurrent crawling, SSE streaming, and comprehensive client metadata tracking.

## ✨ Key Features Implemented

### 1. **Backend-Generated UUID System**
- Server generates a unique UUID for each crawl request
- The UUID is used for the SSE stream connection and the file download
- Enables true multi-user support with isolated streams
### 2. **Server-Sent Events (SSE) Streaming**
- Real-time progress updates via `/stream/{uuid}`
- Event types: `connected`, `started`, `progress`, `complete`, `error`
- Non-blocking concurrent stream management
- Automatic cleanup after completion
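The wire format behind those events is simple: each SSE frame is an `event:` line, a `data:` line with a JSON payload, and a blank-line terminator. A minimal framing sketch — `Event` and `formatSSE` are illustrative names here, not the project's actual `models.Event` type:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Event approximates the shape of an SSE payload: a type tag
// ("connected", "progress", ...) plus arbitrary JSON data.
type Event struct {
	Type string      `json:"type"`
	Data interface{} `json:"data"`
}

// formatSSE renders one event in the format the browser's EventSource
// expects: "event:" line, "data:" line, blank-line terminator.
func formatSSE(e Event) (string, error) {
	payload, err := json.Marshal(e.Data)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	fmt.Fprintf(&b, "event: %s\n", e.Type)
	fmt.Fprintf(&b, "data: %s\n\n", payload)
	return b.String(), nil
}

func main() {
	frame, _ := formatSSE(Event{Type: "progress", Data: map[string]int{"pages_found": 12}})
	fmt.Print(frame)
}
```

On the server side, the handler writes each frame to the `http.ResponseWriter` and calls `Flusher.Flush()` so the browser receives it immediately.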
### 3. **Concurrent Web Crawler**
- Goroutine-based parallel crawling
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited crawling (1-5 levels)
- Same-domain restriction with URL normalization
- Duplicate detection and prevention
### 4. **Client Metadata Tracking**
Automatically captured and stored in SQLite:
- IP address (with `X-Forwarded-For` support)
- User-Agent string
- Browser name & version (Chrome, Firefox, Safari, Edge, Opera)
- Operating system (Windows, macOS, Linux, Android, iOS)
- Device type (Desktop, Mobile, Tablet)
- Session ID (cookie-based persistence)
- All cookies (JSON-encoded)
- HTTP referrer
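Browser detection from a User-Agent string usually comes down to ordered substring checks, because Edge and Opera UAs also contain "Chrome", and Chrome UAs contain "Safari". A simplified sketch of the idea (the project's actual parser may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// detectBrowser classifies a User-Agent string by substring checks.
// Order matters: the most specific tokens must be tested first.
func detectBrowser(ua string) string {
	switch {
	case strings.Contains(ua, "Edg/") || strings.Contains(ua, "Edge/"):
		return "Edge"
	case strings.Contains(ua, "OPR/") || strings.Contains(ua, "Opera"):
		return "Opera"
	case strings.Contains(ua, "Firefox/"):
		return "Firefox"
	case strings.Contains(ua, "Chrome/"):
		return "Chrome"
	case strings.Contains(ua, "Safari/"):
		return "Safari"
	default:
		return "Unknown"
	}
}

func main() {
	ua := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
	fmt.Println(detectBrowser(ua))
}
```

The same approach works for the OS and device-type fields (check for "Windows", "Android", "iPhone", and so on).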
### 5. **RESTful API Endpoints**

```
POST   /generate-sitemap-xml → Start crawl, returns UUID
GET    /stream/{uuid}        → SSE progress stream
GET    /download/{uuid}      → Download XML sitemap
GET    /sites                → List all sitemaps
GET    /sites/{id}           → Get specific site
DELETE /sites/{id}           → Delete sitemap
GET    /health               → Health check
GET    /                     → Serve frontend HTML
```
### 6. **Beautiful Frontend UI**
- Responsive gradient design
- Real-time progress visualization
- Live connection status indicator
- Crawl statistics (pages found, depth, time)
- Activity log with color-coded entries
- Site management (view, download, delete)
- Automatic protocol prefixing for URLs
## 🏗️ Architecture

```
┌─────────────┐
│   Browser   │
│ (Frontend)  │
└──────┬──────┘
       │ POST /generate-sitemap-xml
       ↓
┌────────────────────────────────┐
│  Go HTTP Server (Chi Router)   │
│                                │
│ ┌────────────────────────────┐ │
│ │   Handler (handler.go)     │ │
│ │   - Generate UUID          │ │
│ │   - Extract metadata       │ │
│ │   - Create DB record       │ │
│ │   - Spawn crawler          │ │
│ │   - Return UUID immediately│ │
│ └─────────────┬──────────────┘ │
└───────────────┼────────────────┘
                │
      ┌─────────┴─────────┐
      │                   │
      ↓                   ↓
┌──────────────┐   ┌───────────────┐
│ StreamManager│   │    Crawler    │
│              │   │               │
│ UUID → Chan  │   │  Goroutines   │
│ Map storage  │←──│  Concurrent   │
│              │   │ HTTP requests │
└──────┬───────┘   └───────┬───────┘
       │                   │
       │ SSE Events        │ Save pages
       ↓                   ↓
┌──────────────────────────────────┐
│         SQLite Database          │
│  - sites (with metadata)         │
│  - pages (discovered URLs)       │
│  - sessions (tracking)           │
└──────────────────────────────────┘
```
## 📂 File Structure

```
sitemap-api/
├── main.go              # HTTP server setup, routes
├── go.mod               # Go module dependencies
├── go.sum               # Dependency checksums
│
├── handlers/
│   └── handler.go       # All HTTP handlers
│       - GenerateSitemapXML   # POST endpoint
│       - StreamSSE            # SSE streaming
│       - DownloadSitemap      # XML generation
│       - GetSites/GetSite     # CRUD operations
│       - DeleteSite           # Cleanup
│       - StreamManager        # Concurrent stream management
│
├── crawler/
│   └── crawler.go       # Web crawler implementation
│       - Crawl()              # Main crawl logic
│       - crawlURL()           # Recursive URL processing
│       - extractLinks()       # HTML parsing
│       - normalizeURL()       # URL canonicalization
│       - isSameDomain()       # Domain checking
│       - calculatePriority()  # Sitemap priority
│
├── database/
│   └── db.go            # SQLite operations
│       - NewDB()              # Initialize DB
│       - createTables()       # Schema creation
│       - CreateSite()         # Insert site record
│       - GetSiteByUUID()      # Retrieve by UUID
│       - UpdateSiteStatus()   # Mark complete
│       - AddPage()            # Save discovered page
│       - GetPagesBySiteID()   # Retrieve all pages
│       - DeleteSite()         # Cascade delete
│
├── models/
│   └── site.go          # Data structures
│       - Site                 # Site record
│       - Page                 # Page record
│       - Event                # SSE event
│       - ProgressData         # Progress payload
│       - CompleteData         # Completion payload
│       - ErrorData            # Error payload
│
├── static/
│   └── index.html       # Frontend application
│       - SitemapGenerator     # Main class
│       - generateSitemap()    # Initiate crawl
│       - connectToStream()    # SSE connection
│       - updateProgress()     # Live updates
│       - downloadSitemap()    # File download
│       - displaySites()       # Results listing
│
├── README.md            # Full documentation
├── QUICKSTART.md        # Quick start guide
├── Makefile             # Build automation
├── Dockerfile           # Container setup
├── run.sh               # Startup script
├── .gitignore           # Git exclusions
└── .env.example         # Environment template
```
## 🔄 Request Flow

### 1. Generate Sitemap Request
```
User fills form → POST /generate-sitemap-xml
        ↓
Server generates UUID
        ↓
Extract IP, UA, cookies, session
        ↓
Save to database (status: processing)
        ↓
Create SSE channel in StreamManager
        ↓
Spawn goroutine for crawler (non-blocking)
        ↓
Return UUID immediately to frontend
```
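The first step of this flow, UUID generation, is often delegated to a dedicated package in practice; a stdlib-only sketch of an RFC 4122 version-4 UUID shows what the handler needs before it spawns the crawler goroutine and returns:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID builds a random (version 4) UUID from 16 bytes of crypto/rand
// output, setting the version and variant bits per RFC 4122.
func newUUID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

With the UUID in hand, the handler registers an SSE channel under it, launches the crawler with `go crawler.Crawl(...)`, and responds immediately — which is what keeps the endpoint non-blocking.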
### 2. SSE Stream Connection
```
Frontend receives UUID → GET /stream/{uuid}
        ↓
StreamManager finds channel
        ↓
Send "connected" event
        ↓
Crawler sends events to channel
        ↓
Handler forwards to browser
        ↓
Frontend updates UI in real-time
```
### 3. Crawler Operation
```
Start from root URL → Fetch HTML
        ↓
Parse <a> tags for links
        ↓
Check: same domain? not visited?
        ↓
Save page to database (URL, depth, priority)
        ↓
Send "progress" event via channel
        ↓
Spawn goroutines for child URLs
        ↓
Repeat until max depth reached
        ↓
Send "complete" event
        ↓
Close channel, cleanup resources
```
### 4. Download Request
```
User clicks download → GET /download/{uuid}
        ↓
Look up site by UUID
        ↓
Fetch all pages from database
        ↓
Generate XML sitemap
        ↓
Set Content-Disposition header
        ↓
Stream XML to browser
```
## 🔐 Security Considerations

### Implemented
- ✅ Same-domain restriction (no external crawling)
- ✅ Max depth limit (prevents infinite loops)
- ✅ HTTP timeout per request (10 seconds)
- ✅ Duplicate URL prevention
- ✅ SQLite prepared statements (SQL injection safe)
- ✅ CORS middleware included

### Recommended for Production
- [ ] Rate limiting per IP
- [ ] Authentication/API keys
- [ ] Input validation & sanitization
- [ ] Request size limits
- [ ] Respect for robots.txt
- [ ] User-Agent identification
- [ ] HTTPS enforcement
- [ ] Firewall rules
## 🚀 Performance Optimization

### Current
- Concurrent goroutines (5 parallel requests by default)
- Non-blocking SSE streams
- Efficient channel-based communication
- In-memory visited-URL tracking
- Database connection pooling

### Possible Improvements
- Redis for distributed crawling
- Worker pool pattern
- Content caching
- Incremental sitemap updates
- Compression for large sitemaps
- Database indexing optimization
## 📊 Database Schema

### sites table
```sql
- id (PK)          - Auto-increment
- uuid (UNIQUE)    - Server-generated UUID
- domain           - Extracted from URL
- url              - Full starting URL
- max_depth        - Crawl depth limit
- page_count       - Total pages found
- status           - processing/completed/failed
- ip_address       - Client IP
- user_agent       - Full UA string
- browser          - Parsed browser name
- browser_version  - Version number
- os               - Operating system
- device_type      - Desktop/Mobile/Tablet
- session_id       - Cookie-based session
- cookies          - JSON of all cookies
- referrer         - HTTP Referer header
- created_at       - Timestamp
- completed_at     - Completion timestamp
- last_crawled     - Last activity
```

### pages table
```sql
- id (PK)        - Auto-increment
- site_id (FK)   - References sites(id)
- url            - Page URL (UNIQUE)
- depth          - Crawl depth level
- last_modified  - Discovery time
- priority       - Sitemap priority (0.0-1.0)
- change_freq    - monthly/weekly/daily/etc.
```

### sessions table
```sql
- id (PK)             - Auto-increment
- session_id (UNIQUE) - Session UUID
- uuid (FK)           - References sites(uuid)
- ip_address          - Client IP
- created_at          - First seen
- last_activity       - Last request
```
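The field lists above translate into SQLite DDL along these lines; this is a plausible rendering, and the exact types, defaults, and constraints in `createTables()` (db.go) may differ:

```sql
-- Sketch of the schema described above (not verbatim from db.go).
CREATE TABLE IF NOT EXISTS sites (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid            TEXT UNIQUE NOT NULL,
    domain          TEXT NOT NULL,
    url             TEXT NOT NULL,
    max_depth       INTEGER NOT NULL,
    page_count      INTEGER DEFAULT 0,
    status          TEXT DEFAULT 'processing',
    ip_address      TEXT,
    user_agent      TEXT,
    browser         TEXT,
    browser_version TEXT,
    os              TEXT,
    device_type     TEXT,
    session_id      TEXT,
    cookies         TEXT,
    referrer        TEXT,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at    TIMESTAMP,
    last_crawled    TIMESTAMP
);

CREATE TABLE IF NOT EXISTS pages (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    site_id       INTEGER NOT NULL REFERENCES sites(id) ON DELETE CASCADE,
    url           TEXT NOT NULL UNIQUE,
    depth         INTEGER NOT NULL,
    last_modified TIMESTAMP,
    priority      REAL DEFAULT 0.5,
    change_freq   TEXT DEFAULT 'monthly'
);

CREATE TABLE IF NOT EXISTS sessions (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id    TEXT UNIQUE NOT NULL,
    uuid          TEXT REFERENCES sites(uuid),
    ip_address    TEXT,
    created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_activity TIMESTAMP
);
```

The `ON DELETE CASCADE` on `pages.site_id` is what makes `DeleteSite()` a cascade delete (note that SQLite only enforces it when `PRAGMA foreign_keys = ON`).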
## 🧪 Testing

### Manual Testing
```bash
# Terminal 1: Start server
./run.sh

# Terminal 2: Test API
curl -X POST http://localhost:8080/generate-sitemap-xml \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","max_depth":2}'

# Terminal 3: Watch SSE stream (substitute the UUID returned in Terminal 2)
curl -N http://localhost:8080/stream/{uuid}
```

### Browser Testing
1. Open multiple tabs to http://localhost:8080
2. Start different crawls simultaneously
3. Verify independent progress tracking
4. Check the database for metadata

### Database Verification
```bash
sqlite3 sitemap.db "SELECT * FROM sites ORDER BY created_at DESC LIMIT 5;"
sqlite3 sitemap.db "SELECT COUNT(*) FROM pages WHERE site_id = 1;"
```
## 📦 Deployment Options

### Option 1: Binary
```bash
go build -o sitemap-api
./sitemap-api
```

### Option 2: Docker
```bash
docker build -t sitemap-api .
docker run -p 8080:8080 sitemap-api
```

### Option 3: Systemd Service
```ini
[Unit]
Description=Sitemap Generator API
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/sitemap-api
ExecStart=/opt/sitemap-api/sitemap-api
Restart=always

[Install]
WantedBy=multi-user.target
```
## 🔧 Configuration

### Environment Variables
```bash
export PORT=8080          # Server port
export DB_PATH=sitemap.db # Database file
```

### Code Constants
```go
// crawler/crawler.go
const maxConcurrent = 5   // Parallel requests
const httpTimeout = 10    // Seconds

// handlers/handler.go
const channelBuffer = 100 // SSE event buffer
```
## 📝 XML Sitemap Format

Generated sitemaps follow the standard sitemaps.org protocol:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
## 🎯 Success Criteria

All requirements met:
- ✅ Go backend with excellent performance
- ✅ Endpoint: `/generate-sitemap-xml` with UUID response
- ✅ Endpoint: `/stream/{uuid}` for SSE
- ✅ Endpoint: `/download/{uuid}` for XML
- ✅ Multi-user concurrent support
- ✅ Client metadata tracking (IP, browser, cookies, session)
- ✅ SQLite storage
- ✅ Root route `/` serves HTML
- ✅ Real-time progress updates
- ✅ Clean, maintainable code structure
## 📚 Next Steps

To extend this project:
1. Add user authentication (JWT tokens)
2. Implement rate limiting (go-rate package)
3. Add robots.txt parsing (robotstxt.go package)
4. Support sitemap index files for large sites
5. Add scheduling/cron jobs for recurring crawls
6. Implement incremental updates
7. Add webhook notifications
8. Create an admin dashboard
9. Export to other formats (JSON, CSV)
10. Add analytics and usage stats

---

**Ready to use! Just run `./run.sh` or `make run` to get started.**