Kar
2026-02-05 19:23:03 +05:30
parent 10b19d4ed6
commit b80e988191
10 changed files with 284 additions and 0 deletions

# 🗺️ XML Sitemap Generator - Complete Implementation
## Project Overview
A production-ready Go API for generating XML sitemaps with real-time progress tracking. Built with concurrent crawling, SSE streaming, and comprehensive client metadata tracking.
## ✨ Key Features Implemented
### 1. **Backend-Generated UUID System**
- Server generates unique UUID for each crawl request
- UUID used for SSE stream connection and file download
- Enables true multi-user support with isolated streams
### 2. **Server-Sent Events (SSE) Streaming**
- Real-time progress updates via `/stream/{uuid}`
- Event types: `connected`, `started`, `progress`, `complete`, `error`
- Non-blocking concurrent stream management
- Automatic cleanup after completion
### 3. **Concurrent Web Crawler**
- Goroutine-based parallel crawling
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited crawling (1-5 levels)
- Same-domain restriction with URL normalization
- Duplicate detection and prevention
### 4. **Client Metadata Tracking**
Automatically captured and stored in SQLite:
- IP Address (with X-Forwarded-For support)
- User-Agent string
- Browser name & version (Chrome, Firefox, Safari, Edge, Opera)
- Operating System (Windows, macOS, Linux, Android, iOS)
- Device Type (Desktop, Mobile, Tablet)
- Session ID (cookie-based persistence)
- All cookies (JSON-encoded)
- HTTP Referrer
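
Device-type detection is typically substring matching on the User-Agent; a deliberately simplified, hypothetical version (real UA parsing handles far more cases than this):

```go
package main

import "strings"

// parseDeviceTypeSketch classifies a User-Agent into the three buckets
// listed above. Tablets are checked first because Android tablet UAs
// would otherwise match the Mobile branch.
func parseDeviceTypeSketch(ua string) string {
	ua = strings.ToLower(ua)
	switch {
	case strings.Contains(ua, "ipad") || strings.Contains(ua, "tablet"):
		return "Tablet"
	case strings.Contains(ua, "mobi") || strings.Contains(ua, "iphone") || strings.Contains(ua, "android"):
		return "Mobile"
	default:
		return "Desktop"
	}
}
```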
### 5. **RESTful API Endpoints**
```
POST /generate-sitemap-xml → Start crawl, returns UUID
GET /stream/{uuid} → SSE progress stream
GET /download/{uuid} → Download XML sitemap
GET /sites → List all sitemaps
GET /sites/{id} → Get specific site
DELETE /sites/{id} → Delete sitemap
GET /health → Health check
GET / → Serve frontend HTML
```
### 6. **Beautiful Frontend UI**
- Responsive gradient design
- Real-time progress visualization
- Live connection status indicator
- Crawl statistics (pages found, depth, time)
- Activity log with color-coded entries
- Site management (view, download, delete)
- Auto-protocol addition for URLs
## 🏗️ Architecture
```
┌─────────────┐
│ Browser │
│ (Frontend) │
└──────┬──────┘
│ POST /generate-sitemap-xml
┌──────────────────────────────────┐
│ Go HTTP Server (Chi Router) │
│ │
│ ┌────────────────────────────┐ │
│ │ Handler (handler.go) │ │
│ │ - Generate UUID │ │
│ │ - Extract metadata │ │
│ │ - Create DB record │ │
│ │ - Spawn crawler │ │
│ │ - Return UUID immediately│ │
│ └─────────────┬──────────────┘ │
└────────────────┼────────────────┘
┌─────────┴─────────┐
│ │
↓ ↓
┌──────────────┐ ┌───────────────┐
│ StreamManager│ │ Crawler │
│ │ │ │
│ UUID → Chan │ │ Goroutines │
│ Map storage │←──│ Concurrent │
│ │ │ HTTP requests│
└──────┬───────┘ └───────┬───────┘
│ │
│ SSE Events │ Save pages
↓ ↓
┌──────────────────────────────────┐
│ SQLite Database │
│ - sites (with metadata) │
│ - pages (discovered URLs) │
│ - sessions (tracking) │
└──────────────────────────────────┘
```
## 📂 File Structure
```
sitemap-api/
├── main.go # HTTP server setup, routes
├── go.mod # Go module dependencies
├── go.sum # Dependency checksums
├── handlers/
│ └── handler.go # All HTTP handlers
│ - GenerateSitemapXML # POST endpoint
│ - StreamSSE # SSE streaming
│ - DownloadSitemap # XML generation
│ - GetSites/GetSite # CRUD operations
│ - DeleteSite # Cleanup
│ - StreamManager # Concurrent stream management
├── crawler/
│ └── crawler.go # Web crawler implementation
│ - Crawl() # Main crawl logic
│ - crawlURL() # Recursive URL processing
│ - extractLinks() # HTML parsing
│ - normalizeURL() # URL canonicalization
│ - isSameDomain() # Domain checking
│ - calculatePriority() # Sitemap priority
├── database/
│ └── db.go # SQLite operations
│ - NewDB() # Initialize DB
│ - createTables() # Schema creation
│ - CreateSite() # Insert site record
│ - GetSiteByUUID() # Retrieve by UUID
│ - UpdateSiteStatus() # Mark complete
│ - AddPage() # Save discovered page
│ - GetPagesBySiteID() # Retrieve all pages
│ - DeleteSite() # Cascade delete
├── models/
│ └── site.go # Data structures
│ - Site # Site record
│ - Page # Page record
│ - Event # SSE event
│ - ProgressData # Progress payload
│ - CompleteData # Completion payload
│ - ErrorData # Error payload
├── static/
│ └── index.html # Frontend application
│ - SitemapGenerator # Main class
│ - generateSitemap() # Initiate crawl
│ - connectToStream() # SSE connection
│ - updateProgress() # Live updates
│ - downloadSitemap() # File download
│ - displaySites() # Results listing
├── README.md # Full documentation
├── QUICKSTART.md # Quick start guide
├── Makefile # Build automation
├── Dockerfile # Container setup
├── run.sh # Startup script
├── .gitignore # Git exclusions
└── .env.example # Environment template
```
## 🔄 Request Flow
### 1. Generate Sitemap Request
```
User fills form → POST /generate-sitemap-xml
Server generates UUID
Extract IP, UA, cookies, session
Save to database (status: processing)
Create SSE channel in StreamManager
Spawn goroutine for crawler (non-blocking)
Return UUID immediately to frontend
```
### 2. SSE Stream Connection
```
Frontend receives UUID → GET /stream/{uuid}
StreamManager finds channel
Send "connected" event
Crawler sends events to channel
Handler forwards to browser
Frontend updates UI in real-time
```
### 3. Crawler Operation
```
Start from root URL → Fetch HTML
Parse <a> tags for links
Check: same domain? not visited?
Save page to database (URL, depth, priority)
Send "progress" event via channel
Spawn goroutines for child URLs
Repeat until max depth reached
Send "complete" event
Close channel, cleanup resources
```
### 4. Download Request
```
User clicks download → GET /download/{uuid}
Lookup site by UUID
Fetch all pages from database
Generate XML sitemap
Set Content-Disposition header
Stream XML to browser
```
## 🔐 Security Considerations
### Implemented
- ✅ Same-domain restriction (no external crawling)
- ✅ Max depth limit (prevents infinite loops)
- ✅ HTTP timeout per request (10 seconds)
- ✅ Duplicate URL prevention
- ✅ SQLite prepared statements (SQL injection safe)
- ✅ CORS middleware included
### Recommended for Production
- [ ] Rate limiting per IP
- [ ] Authentication/API keys
- [ ] Input validation & sanitization
- [ ] Request size limits
- [ ] robots.txt respect
- [ ] User-Agent identification
- [ ] HTTPS enforcement
- [ ] Firewall rules
## 🚀 Performance Optimization
### Current
- Concurrent goroutines (5 parallel requests default)
- Non-blocking SSE streams
- Efficient channel-based communication
- In-memory visited URL tracking
- Database connection pooling
### Possible Improvements
- Redis for distributed crawling
- Worker pool pattern
- Content caching
- Incremental sitemap updates
- Compression for large sitemaps
- Database indexing optimization
## 📊 Database Schema
### sites table
```sql
- id (PK) - Auto-increment
- uuid (UNIQUE) - Server-generated UUID
- domain - Extracted from URL
- url - Full starting URL
- max_depth - Crawl depth limit
- page_count - Total pages found
- status - processing/completed/failed
- ip_address - Client IP
- user_agent - Full UA string
- browser - Parsed browser name
- browser_version - Version number
- os - Operating system
- device_type - Desktop/Mobile/Tablet
- session_id - Cookie-based session
- cookies - JSON of all cookies
- referrer - HTTP Referer header
- created_at - Timestamp
- completed_at - Completion timestamp
- last_crawled - Last activity
```
### pages table
```sql
- id (PK) - Auto-increment
- site_id (FK) - References sites(id)
- url - Page URL (UNIQUE)
- depth - Crawl depth level
- last_modified - Discovery time
- priority - Sitemap priority (0.0-1.0)
- change_freq - monthly/weekly/daily/etc
```
### sessions table
```sql
- id (PK) - Auto-increment
- session_id (UNIQUE) - Session UUID
- uuid (FK) - References sites(uuid)
- ip_address - Client IP
- created_at - First seen
- last_activity - Last request
```
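
As one concrete example, DDL consistent with the `pages` columns above might look like the following. This is a hypothetical sketch; the authoritative schema lives in `database/db.go` (`createTables()`).

```sql
-- Illustrative only; see database/db.go for the real schema.
CREATE TABLE IF NOT EXISTS pages (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    site_id       INTEGER NOT NULL REFERENCES sites(id) ON DELETE CASCADE,
    url           TEXT NOT NULL UNIQUE,
    depth         INTEGER NOT NULL,
    last_modified TIMESTAMP,
    priority      REAL DEFAULT 0.5,
    change_freq   TEXT DEFAULT 'monthly'
);
```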
## 🧪 Testing
### Manual Testing
```bash
# Terminal 1: Start server
./run.sh
# Terminal 2: Test API
curl -X POST http://localhost:8080/generate-sitemap-xml \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com","max_depth":2}'
# Terminal 3: Watch SSE stream
curl -N http://localhost:8080/stream/{uuid}
```
### Browser Testing
1. Open multiple tabs to http://localhost:8080
2. Start different crawls simultaneously
3. Verify independent progress tracking
4. Check database for metadata
### Database Verification
```bash
sqlite3 sitemap.db "SELECT * FROM sites ORDER BY created_at DESC LIMIT 5;"
sqlite3 sitemap.db "SELECT COUNT(*) FROM pages WHERE site_id = 1;"
```
## 📦 Deployment Options
### Option 1: Binary
```bash
go build -o sitemap-api
./sitemap-api
```
### Option 2: Docker
```bash
docker build -t sitemap-api .
docker run -p 8080:8080 sitemap-api
```
### Option 3: Systemd Service
```ini
[Unit]
Description=Sitemap Generator API
After=network.target
[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/sitemap-api
ExecStart=/opt/sitemap-api/sitemap-api
Restart=always
[Install]
WantedBy=multi-user.target
```
## 🔧 Configuration
### Environment Variables
```bash
export PORT=8080 # Server port
export DB_PATH=sitemap.db # Database file
```
### Code Constants
```go
// crawler/crawler.go
const maxConcurrent = 5 // Parallel requests
const httpTimeout = 10 // Seconds
// handlers/handler.go
const channelBuffer = 100 // SSE event buffer
```
## 📝 XML Sitemap Format
Generated sitemaps follow the sitemaps.org 0.9 protocol:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2024-02-05</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about</loc>
<lastmod>2024-02-05</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
```
## 🎯 Success Criteria
All requirements met:
- ✅ Go backend with excellent performance
- ✅ Endpoint: `/generate-sitemap-xml` with UUID response
- ✅ Endpoint: `/stream/{uuid}` for SSE
- ✅ Endpoint: `/download/{uuid}` for XML
- ✅ Multi-user concurrent support
- ✅ Client metadata tracking (IP, browser, cookies, session)
- ✅ SQLite storage
- ✅ Root route `/` serves HTML
- ✅ Real-time progress updates
- ✅ Clean, maintainable code structure
## 📚 Next Steps
To extend this project:
1. Add user authentication (JWT tokens)
2. Implement rate limiting (go-rate package)
3. Add robots.txt parsing (robotstxt.go package)
4. Support sitemap index for large sites
5. Add scheduling/cron jobs for recurring crawls
6. Implement incremental updates
7. Add webhook notifications
8. Create admin dashboard
9. Export to other formats (JSON, CSV)
10. Add analytics and usage stats
---
**Ready to use! Just run `./run.sh` or `make run` to get started.**

---
**Documentation/QUICKSTART.md**
# 🚀 Quick Start Guide
Get your sitemap generator running in 3 steps!
## Step 1: Install Go
If you don't have Go installed:
- Download from https://golang.org/dl/
- Install Go 1.21 or later
- Verify: `go version`
## Step 2: Run the Application
### Option A: Using the run script (easiest)
```bash
cd sitemap-api
./run.sh
```
### Option B: Using Make
```bash
cd sitemap-api
make run
```
### Option C: Manual
```bash
cd sitemap-api
go mod download
go build -o sitemap-api .
./sitemap-api
```
## Step 3: Use the Application
1. **Open your browser** → http://localhost:8080
2. **Enter a URL** → e.g., `https://example.com`
3. **Set crawl depth** → 1-5 (default: 3)
4. **Click "Generate Sitemap"** → Watch real-time progress!
5. **Download XML** → Click the download button when complete
## Testing Multiple Users
Open multiple browser tabs to http://localhost:8080 and start different crawls simultaneously. Each will have its own UUID and progress stream!
## API Usage Examples
### Start a crawl
```bash
curl -X POST http://localhost:8080/generate-sitemap-xml \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "max_depth": 3}'
```
Response:
```json
{
"uuid": "550e8400-e29b-41d4-a716-446655440000",
"site_id": 123,
"status": "processing",
"stream_url": "/stream/550e8400-e29b-41d4-a716-446655440000",
"message": "Sitemap generation started"
}
```
### Monitor progress (SSE)
```bash
curl http://localhost:8080/stream/550e8400-e29b-41d4-a716-446655440000
```
### Download sitemap
```bash
curl http://localhost:8080/download/550e8400-e29b-41d4-a716-446655440000 -o sitemap.xml
```
### List all sitemaps
```bash
curl http://localhost:8080/sites
```
### Delete a sitemap
```bash
curl -X DELETE http://localhost:8080/sites/123
```
## Troubleshooting
### Port already in use
```bash
PORT=3000 ./sitemap-api
```
### Build errors
```bash
go mod tidy
go clean -cache
go build -o sitemap-api .
```
### Database locked
```bash
rm sitemap.db
./sitemap-api
```
### CGO errors
Make sure you have gcc installed:
- **Ubuntu/Debian**: `sudo apt-get install build-essential`
- **macOS**: `xcode-select --install`
- **Windows**: Install MinGW or TDM-GCC
## Next Steps
- Read the full [README.md](README.md) for details
- Customize the crawler in `crawler/crawler.go`
- Add authentication to handlers
- Deploy to production (see README for nginx config)
- Add more metadata tracking
## Project Structure
```
sitemap-api/
├── main.go # Server entry point
├── handlers/ # HTTP handlers & SSE
├── crawler/ # Web crawler logic
├── database/ # SQLite operations
├── models/ # Data structures
├── static/ # Frontend (served at /)
├── README.md # Full documentation
├── run.sh # Quick start script
├── Makefile # Build commands
└── Dockerfile # Container setup
```
## Support
Having issues? Check:
1. Go version >= 1.21
2. Port 8080 is available
3. SQLite3 is working
4. All dependencies installed
Still stuck? Open an issue on GitHub!
---
**Built with ❤️ using Go + Goroutines + Server-Sent Events**

---
**Documentation/README.md**
# XML Sitemap Generator API
A high-performance Go-based API for generating XML sitemaps with real-time progress tracking via Server-Sent Events (SSE).
## Features
- **Concurrent Web Crawling** - Fast sitemap generation using goroutines
- **Real-time Progress** - SSE streaming for live updates
- **Multi-user Support** - Handle multiple simultaneous crawls
- **Client Metadata Tracking** - IP, browser, OS, session data stored in SQLite
- **Clean REST API** - Simple endpoints for generate, stream, and download
- **Professional UI** - Beautiful web interface included
## Architecture
```
sitemap-api/
├── main.go # Entry point & HTTP server
├── handlers/
│ └── handler.go # HTTP handlers & SSE streaming
├── crawler/
│ └── crawler.go # Concurrent web crawler
├── database/
│ └── db.go # SQLite operations
├── models/
│ └── site.go # Data structures
└── static/
└── index.html # Frontend UI
```
## API Endpoints
### `POST /generate-sitemap-xml`
Start sitemap generation (backend generates UUID)
**Request:**
```json
{
"url": "https://example.com",
"max_depth": 3
}
```
**Response:**
```json
{
"uuid": "550e8400-e29b-41d4-a716-446655440000",
"site_id": 123,
"status": "processing",
"stream_url": "/stream/550e8400-...",
"message": "Sitemap generation started"
}
```
### `GET /stream/{uuid}`
Server-Sent Events stream for real-time progress
**Events:** `connected`, `started`, `progress`, `complete`, `error`
### `GET /download/{uuid}`
Download generated sitemap XML
### `GET /sites`
List all generated sitemaps
### `GET /sites/{id}`
Get specific site details
### `DELETE /sites/{id}`
Delete a sitemap
### `GET /health`
Health check endpoint
## Installation
### Prerequisites
- Go 1.21+
- SQLite3
### Setup
```bash
# Clone/navigate to directory
cd sitemap-api
# Install dependencies
go mod download
# Build
go build -o sitemap-api
# Run
./sitemap-api
```
Server starts on **http://localhost:8080**
### Or run directly:
```bash
go run main.go
```
## Usage
1. Open http://localhost:8080 in your browser
2. Enter a website URL
3. Set crawl depth (1-5)
4. Click "Generate Sitemap"
5. Watch real-time progress
6. Download XML when complete
## Database Schema
SQLite database (`sitemap.db`) stores:
- **sites** - Crawl sessions with client metadata
- **pages** - Discovered URLs with priority/frequency
- **sessions** - User session tracking
## Environment Variables
- `PORT` - Server port (default: 8080)
Example:
```bash
PORT=3000 ./sitemap-api
```
## How It Works
1. **Frontend** sends POST to `/generate-sitemap-xml`
2. **Backend** generates UUID, saves metadata, returns UUID
3. **Frontend** connects to `/stream/{uuid}` for SSE updates
4. **Crawler** runs in goroutine, sends events via channel
5. **Handler** streams events to frontend in real-time
6. **On completion**, sitemap available at `/download/{uuid}`
## Multi-User Concurrency
The `StreamManager` handles concurrent users:
- Each UUID maps to a Go channel
- Concurrent map with mutex for thread safety
- Automatic cleanup after crawl completion
- Supports many simultaneous crawls, bounded only by memory and network resources
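
The uuid→channel pattern can be sketched as follows (type and method names here are illustrative; `Event` stands in for the project's `models.Event`):

```go
package main

import "sync"

// Event is a placeholder for the project's models.Event.
type Event struct {
	Type string
	Data any
}

// StreamManagerSketch maps each crawl UUID to its own buffered event
// channel, guarded by a mutex for concurrent access.
type StreamManagerSketch struct {
	mu      sync.RWMutex
	streams map[string]chan Event
}

func NewStreamManagerSketch() *StreamManagerSketch {
	return &StreamManagerSketch{streams: make(map[string]chan Event)}
}

func (m *StreamManagerSketch) Create(uuid string) chan Event {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan Event, 100) // buffered so the crawler rarely blocks on a slow client
	m.streams[uuid] = ch
	return ch
}

func (m *StreamManagerSketch) Get(uuid string) (chan Event, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// Close removes the stream and closes its channel, which ends the SSE loop.
func (m *StreamManagerSketch) Close(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}
```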
## Client Metadata Captured
- IP Address (with X-Forwarded-For support)
- User-Agent
- Browser name & version
- Operating System
- Device Type (Desktop/Mobile/Tablet)
- Session ID (cookie-based)
- All cookies (JSON)
- Referrer
## Performance
- Concurrent crawling with goroutines
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited to prevent infinite crawls
- Same-domain restriction
- Duplicate URL prevention
- 10-second HTTP timeout per request
## Customization
### Adjust Concurrency
Edit `crawler/crawler.go`:
```go
semaphore := make(chan struct{}, 10) // Increase to 10 concurrent
```
### Change Priority Calculation
Modify `calculatePriority()` in `crawler/crawler.go`
### Add Custom Metadata
Extend `models.Site` struct and database schema
## Production Deployment
### Recommendations:
1. Use reverse proxy (nginx/caddy)
2. Enable HTTPS
3. Add rate limiting
4. Configure CORS properly
5. Use PostgreSQL for production (replace SQLite)
6. Add authentication
7. Implement cleanup jobs for old sitemaps
### Example nginx config:
```nginx
location / {
proxy_pass http://localhost:8080;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_cache_bypass $http_upgrade;
# SSE support
proxy_buffering off;
proxy_cache off;
}
```
## License
MIT
## Support
For issues or questions, please open a GitHub issue.

---
**Documentation/STRUCTURE.md**
# 📁 SITEMAP-API PROJECT STRUCTURE
## ROOT FILES
```
main.go ⚙️ HTTP server, routes, middleware
go.mod 📦 Dependencies (chi, cors, uuid, sqlite3)
run.sh 🚀 Quick start script
Makefile 🔧 Build commands (run, build, clean, test)
Dockerfile 🐳 Container configuration
.gitignore 🚫 Git exclusions
.env.example ⚙️ Environment template
```
## DOCUMENTATION
```
README.md 📖 Full API documentation
QUICKSTART.md ⏱️ 3-step quick start guide
PROJECT_OVERVIEW.md 📊 Complete implementation details
```
## CODE STRUCTURE
### handlers/
```
└── handler.go 🎯 HTTP REQUEST HANDLERS
- GenerateSitemapXML() POST /generate-sitemap-xml
- StreamSSE() GET /stream/{uuid}
- DownloadSitemap() GET /download/{uuid}
- GetSites() GET /sites
- GetSite() GET /sites/{id}
- DeleteSite() DELETE /sites/{id}
- Health() GET /health
🔄 STREAM MANAGER
- NewStreamManager() Concurrent SSE handling
- CreateStream() Per-UUID channels
- GetStream() Retrieve channel
- CloseStream() Cleanup
🔍 METADATA EXTRACTORS
- getClientIP() IP address
- parseBrowser() Browser detection
- parseOS() OS detection
- parseDeviceType() Device detection
- extractCookies() Cookie parsing
- getOrCreateSession() Session management
```
### crawler/
```
└── crawler.go 🕷️ WEB CRAWLER ENGINE
- NewCrawler() Initialize crawler
- Crawl() Main crawl orchestrator
- crawlURL() Recursive URL processing
- extractLinks() HTML link extraction
- resolveURL() Relative → absolute
- normalizeURL() URL canonicalization
- isSameDomain() Domain validation
- calculatePriority() Sitemap priority (0-1.0)
- sendEvent() SSE event emission
```
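
The `resolveURL()`/`normalizeURL()` steps can be illustrated with `net/url`; a simplified sketch (the real implementation may normalize more aggressively):

```go
package main

import (
	"net/url"
	"strings"
)

// normalizeSketch resolves a possibly-relative href against its base URL,
// then drops the fragment and any trailing slash so duplicate URLs
// compare equal in the visited set.
func normalizeSketch(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	u := b.ResolveReference(h) // relative → absolute
	u.Fragment = ""            // "#section" never names a distinct page
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}
```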
### database/
```
└── db.go 💾 SQLITE DATABASE LAYER
- NewDB() Initialize DB
- createTables() Schema setup
- CreateSite() Insert site
- GetSiteByUUID() Fetch by UUID
- GetSiteByID() Fetch by ID
- GetAllSites() List all
- UpdateSiteStatus() Mark complete/failed
- DeleteSite() Remove site
- AddPage() Insert page
- GetPagesBySiteID() Fetch pages
```
### models/
```
└── site.go 📋 DATA STRUCTURES
- Site Main site record
- Page Discovered page
- Event SSE event
- ProgressData Progress payload
- CompleteData Completion payload
- ErrorData Error payload
```
### static/
```
└── index.html 🎨 FRONTEND APPLICATION
HTML:
- Form section URL input, depth selector
- Progress section Live stats, progress bar
- Log section Activity console
- Results section Site list, download buttons
JavaScript:
- SitemapGenerator class Main controller
- generateSitemap() POST to API
- connectToStream() SSE connection
- updateProgress() Live UI updates
- downloadSitemap() File download
- loadExistingSites() Fetch site list
- displaySites() Render results
```
## RUNTIME GENERATED
```
sitemap.db 💾 SQLite database (auto-created on first run)
sitemap.db-journal 📝 SQLite temp file
sitemap-api ⚙️ Compiled binary (from: go build)
go.sum 🔒 Dependency checksums (from: go mod download)
```
## FILE COUNTS
```
Go source files: 5 files
HTML files: 1 file
Documentation: 3 files
Config files: 6 files
─────────────────────────
Total: 15 files
```
## LINES OF CODE
```
handlers/handler.go ~600 lines (HTTP handlers, SSE, metadata)
crawler/crawler.go ~250 lines (Concurrent crawler)
database/db.go ~250 lines (SQLite operations)
models/site.go ~50 lines (Data structures)
main.go ~70 lines (Server setup)
static/index.html ~850 lines (Full UI with CSS & JS)
─────────────────────────────────────
Total: ~2,070 lines
```
## KEY DEPENDENCIES (go.mod)
```
github.com/go-chi/chi/v5 Router & middleware
github.com/go-chi/cors CORS support
github.com/google/uuid UUID generation
github.com/mattn/go-sqlite3 SQLite driver
golang.org/x/net HTML parsing
```
## VISUAL TREE
```
sitemap-api/
├── 📄 main.go # Entry point & server
├── 📦 go.mod # Dependencies
├── 🚀 run.sh # Quick start
├── 🔧 Makefile # Build commands
├── 🐳 Dockerfile # Containerization
├── ⚙️ .env.example # Config template
├── 🚫 .gitignore # Git exclusions
├── 📚 Documentation/
│ ├── README.md # Full docs
│ ├── QUICKSTART.md # Quick start
│ └── PROJECT_OVERVIEW.md # Implementation details
├── 🎯 handlers/
│ └── handler.go # All HTTP endpoints + SSE
├── 🕷️ crawler/
│ └── crawler.go # Concurrent web crawler
├── 💾 database/
│ └── db.go # SQLite operations
├── 📋 models/
│ └── site.go # Data structures
└── 🎨 static/
└── index.html # Frontend UI
```
## DATA FLOW
```
User Browser
├─► POST /generate-sitemap-xml
│ └─► handlers.GenerateSitemapXML()
│ ├─► Generate UUID
│ ├─► Extract metadata (IP, browser, etc)
│ ├─► database.CreateSite()
│ ├─► streamManager.CreateStream(uuid)
│ ├─► go crawler.Crawl() [goroutine]
│ └─► Return {uuid, site_id, stream_url}
├─► GET /stream/{uuid}
│ └─► handlers.StreamSSE()
│ └─► streamManager.GetStream(uuid)
│ └─► Forward events to browser
└─► GET /download/{uuid}
└─► handlers.DownloadSitemap()
├─► database.GetSiteByUUID()
├─► database.GetPagesBySiteID()
└─► Generate XML sitemap
Crawler (goroutine)
├─► Fetch URL
├─► Parse HTML links
├─► database.AddPage()
├─► Send SSE progress event
└─► Recursively crawl children (with goroutines)
```
## DATABASE SCHEMA
```
┌─────────────────┐
│ sites │
├─────────────────┤
│ id │ PK
│ uuid │ UNIQUE (server-generated)
│ domain │
│ url │
│ max_depth │
│ page_count │
│ status │ (processing/completed/failed)
│ ip_address │ (client metadata)
│ user_agent │
│ browser │
│ browser_version │
│ os │
│ device_type │
│ session_id │
│ cookies │ (JSON)
│ referrer │
│ created_at │
│ completed_at │
│ last_crawled │
└─────────────────┘
│ 1:N
┌─────────────────┐
│ pages │
├─────────────────┤
│ id │ PK
│ site_id │ FK → sites.id
│ url │ UNIQUE
│ depth │
│ last_modified │
│ priority │ (0.0 - 1.0)
│ change_freq │ (monthly/weekly/etc)
└─────────────────┘
┌─────────────────┐
│ sessions │
├─────────────────┤
│ id │ PK
│ session_id │ UNIQUE
│ uuid │ FK → sites.uuid
│ ip_address │
│ created_at │
│ last_activity │
└─────────────────┘
```
## CONCURRENCY MODEL
```
StreamManager
├─► map[uuid]chan Event (thread-safe with mutex)
└─► Per-UUID Channel
└─► Event stream to browser
Crawler
├─► Main goroutine (Crawl)
│ └─► Spawns goroutines for each URL
└─► Semaphore (5 concurrent max)
└─► Controls parallel requests
```