Commit: init

**File: Documentation/PROJECT_OVERVIEW.md** (new file, 447 lines)
# 🗺️ XML Sitemap Generator - Complete Implementation

## Project Overview

A production-ready Go API for generating XML sitemaps with real-time progress tracking. Built with concurrent crawling, SSE streaming, and comprehensive client metadata tracking.

## ✨ Key Features Implemented

### 1. **Backend-Generated UUID System**
- Server generates a unique UUID for each crawl request
- UUID is used for the SSE stream connection and file download
- Enables true multi-user support with isolated streams

### 2. **Server-Sent Events (SSE) Streaming**
- Real-time progress updates via `/stream/{uuid}`
- Event types: `connected`, `started`, `progress`, `complete`, `error`
- Non-blocking concurrent stream management
- Automatic cleanup after completion

### 3. **Concurrent Web Crawler**
- Goroutine-based parallel crawling
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited crawling (1-5 levels)
- Same-domain restriction with URL normalization
- Duplicate detection and prevention

### 4. **Client Metadata Tracking**
Automatically captured and stored in SQLite:
- IP address (with X-Forwarded-For support)
- User-Agent string
- Browser name & version (Chrome, Firefox, Safari, Edge, Opera)
- Operating system (Windows, macOS, Linux, Android, iOS)
- Device type (Desktop, Mobile, Tablet)
- Session ID (cookie-based persistence)
- All cookies (JSON-encoded)
- HTTP referrer

### 5. **RESTful API Endpoints**
```
POST   /generate-sitemap-xml  → Start crawl, returns UUID
GET    /stream/{uuid}         → SSE progress stream
GET    /download/{uuid}       → Download XML sitemap
GET    /sites                 → List all sitemaps
GET    /sites/{id}            → Get specific site
DELETE /sites/{id}            → Delete sitemap
GET    /health                → Health check
GET    /                      → Serve frontend HTML
```

### 6. **Beautiful Frontend UI**
- Responsive gradient design
- Real-time progress visualization
- Live connection status indicator
- Crawl statistics (pages found, depth, time)
- Activity log with color-coded entries
- Site management (view, download, delete)
- Automatically adds a protocol to bare URLs
## 🏗️ Architecture

```
        ┌─────────────┐
        │   Browser   │
        │ (Frontend)  │
        └──────┬──────┘
               │ POST /generate-sitemap-xml
               ↓
┌──────────────────────────────────┐
│   Go HTTP Server (Chi Router)    │
│                                  │
│  ┌────────────────────────────┐  │
│  │  Handler (handler.go)      │  │
│  │  - Generate UUID           │  │
│  │  - Extract metadata        │  │
│  │  - Create DB record        │  │
│  │  - Spawn crawler           │  │
│  │  - Return UUID immediately │  │
│  └─────────────┬──────────────┘  │
└────────────────┼─────────────────┘
                 │
       ┌─────────┴─────────┐
       │                   │
       ↓                   ↓
┌──────────────┐   ┌───────────────┐
│ StreamManager│   │    Crawler    │
│              │   │               │
│ UUID → Chan  │   │  Goroutines   │
│ Map storage  │←──│  Concurrent   │
│              │   │ HTTP requests │
└──────┬───────┘   └───────┬───────┘
       │                   │
       │ SSE Events        │ Save pages
       ↓                   ↓
┌──────────────────────────────────┐
│         SQLite Database          │
│  - sites (with metadata)         │
│  - pages (discovered URLs)       │
│  - sessions (tracking)           │
└──────────────────────────────────┘
```
## 📂 File Structure

```
sitemap-api/
├── main.go                    # HTTP server setup, routes
├── go.mod                     # Go module dependencies
├── go.sum                     # Dependency checksums
│
├── handlers/
│   └── handler.go             # All HTTP handlers
│       - GenerateSitemapXML   #   POST endpoint
│       - StreamSSE            #   SSE streaming
│       - DownloadSitemap      #   XML generation
│       - GetSites/GetSite     #   CRUD operations
│       - DeleteSite           #   Cleanup
│       - StreamManager        #   Concurrent stream management
│
├── crawler/
│   └── crawler.go             # Web crawler implementation
│       - Crawl()              #   Main crawl logic
│       - crawlURL()           #   Recursive URL processing
│       - extractLinks()       #   HTML parsing
│       - normalizeURL()       #   URL canonicalization
│       - isSameDomain()       #   Domain checking
│       - calculatePriority()  #   Sitemap priority
│
├── database/
│   └── db.go                  # SQLite operations
│       - NewDB()              #   Initialize DB
│       - createTables()       #   Schema creation
│       - CreateSite()         #   Insert site record
│       - GetSiteByUUID()      #   Retrieve by UUID
│       - UpdateSiteStatus()   #   Mark complete
│       - AddPage()            #   Save discovered page
│       - GetPagesBySiteID()   #   Retrieve all pages
│       - DeleteSite()         #   Cascade delete
│
├── models/
│   └── site.go                # Data structures
│       - Site                 #   Site record
│       - Page                 #   Page record
│       - Event                #   SSE event
│       - ProgressData         #   Progress payload
│       - CompleteData         #   Completion payload
│       - ErrorData            #   Error payload
│
├── static/
│   └── index.html             # Frontend application
│       - SitemapGenerator     #   Main class
│       - generateSitemap()    #   Initiate crawl
│       - connectToStream()    #   SSE connection
│       - updateProgress()     #   Live updates
│       - downloadSitemap()    #   File download
│       - displaySites()       #   Results listing
│
├── README.md                  # Full documentation
├── QUICKSTART.md              # Quick start guide
├── Makefile                   # Build automation
├── Dockerfile                 # Container setup
├── run.sh                     # Startup script
├── .gitignore                 # Git exclusions
└── .env.example               # Environment template
```
## 🔄 Request Flow

### 1. Generate Sitemap Request
```
User fills form → POST /generate-sitemap-xml
        ↓
Server generates UUID
        ↓
Extract IP, UA, cookies, session
        ↓
Save to database (status: processing)
        ↓
Create SSE channel in StreamManager
        ↓
Spawn goroutine for crawler (non-blocking)
        ↓
Return UUID immediately to frontend
```
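This first flow is the crux of the design: the handler answers before the crawl finishes. A stdlib-only sketch of that pattern (the real handler uses github.com/google/uuid and chi, and also writes metadata to the database; `newID` here is a stand-in, not the project's code):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"net/http"
)

// newID stands in for github.com/google/uuid in this sketch.
func newID() string {
	b := make([]byte, 16)
	rand.Read(b) // error ignored for brevity in this sketch
	return hex.EncodeToString(b)
}

// generateSitemap illustrates the non-blocking pattern: the crawl runs
// in a goroutine while the handler returns the UUID immediately.
func generateSitemap(w http.ResponseWriter, r *http.Request) {
	id := newID()

	go func() {
		// the crawl would run here, publishing events keyed by id
	}()

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{
		"uuid":       id,
		"status":     "processing",
		"stream_url": "/stream/" + id,
	})
}

func main() {
	fmt.Println(len(newID())) // 32 hex characters
}
```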
### 2. SSE Stream Connection
|
||||
```
|
||||
Frontend receives UUID → GET /stream/{uuid}
|
||||
↓
|
||||
StreamManager finds channel
|
||||
↓
|
||||
Send "connected" event
|
||||
↓
|
||||
Crawler sends events to channel
|
||||
↓
|
||||
Handler forwards to browser
|
||||
↓
|
||||
Frontend updates UI in real-time
|
||||
```
|
||||
|
||||
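On the wire, each forwarded event is a plain-text SSE frame. A small sketch of the framing (the helper name `formatSSE` is hypothetical; the actual handler may build frames inline, and must call `http.Flusher.Flush()` after writing each one so the browser sees it immediately):

```go
package main

import "fmt"

// formatSSE renders one Server-Sent Events frame. The browser's
// EventSource dispatches on the "event:" name and delivers the
// "data:" payload; a blank line terminates the frame.
func formatSSE(event, data string) string {
	return fmt.Sprintf("event: %s\ndata: %s\n\n", event, data)
}

func main() {
	fmt.Print(formatSSE("progress", `{"pages_found":12,"current_depth":2}`))
}
```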
### 3. Crawler Operation
```
Start from root URL → Fetch HTML
        ↓
Parse <a> tags for links
        ↓
Check: same domain? not visited?
        ↓
Save page to database (URL, depth, priority)
        ↓
Send "progress" event via channel
        ↓
Spawn goroutines for child URLs
        ↓
Repeat until max depth reached
        ↓
Send "complete" event
        ↓
Close channel, clean up resources
```
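The fan-out step above is bounded by a counting semaphore so at most N fetches run at once. A self-contained sketch of that pattern with the fetch/parse step stubbed out (`crawlAll` is illustrative, not the project's `Crawl()`):

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll visits every URL with at most maxConcurrent goroutines in
// flight, mirroring the semaphore pattern described above. The HTTP
// fetch is stubbed; visited guards against duplicate URLs.
func crawlAll(urls []string, maxConcurrent int) int {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	visited := make(map[string]bool)
	var mu sync.Mutex
	var wg sync.WaitGroup
	count := 0

	for _, u := range urls {
		mu.Lock()
		if visited[u] {
			mu.Unlock()
			continue
		}
		visited[u] = true
		mu.Unlock()

		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			// fetch + parse u here; discovered links would be re-queued
			mu.Lock()
			count++
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return count
}

func main() {
	pages := crawlAll([]string{"/", "/about", "/", "/contact"}, 5)
	fmt.Println(pages) // prints 3 (duplicate "/" skipped)
}
```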
### 4. Download Request
```
User clicks download → GET /download/{uuid}
        ↓
Look up site by UUID
        ↓
Fetch all pages from database
        ↓
Generate XML sitemap
        ↓
Set Content-Disposition header
        ↓
Stream XML to browser
```
## 🔐 Security Considerations

### Implemented
- ✅ Same-domain restriction (no external crawling)
- ✅ Max depth limit (prevents infinite loops)
- ✅ HTTP timeout per request (10 seconds)
- ✅ Duplicate URL prevention
- ✅ SQLite prepared statements (SQL injection safe)
- ✅ CORS middleware included

### Recommended for Production
- [ ] Rate limiting per IP
- [ ] Authentication/API keys
- [ ] Input validation & sanitization
- [ ] Request size limits
- [ ] Respect for robots.txt
- [ ] Identifying User-Agent on crawler requests
- [ ] HTTPS enforcement
- [ ] Firewall rules
## 🚀 Performance Optimization

### Current
- Concurrent goroutines (5 parallel requests by default)
- Non-blocking SSE streams
- Efficient channel-based communication
- In-memory visited-URL tracking
- Database connection pooling

### Possible Improvements
- Redis for distributed crawling
- Worker pool pattern
- Content caching
- Incremental sitemap updates
- Compression for large sitemaps
- Database indexing optimization
## 📊 Database Schema

### sites table
```
id (PK)           - Auto-increment
uuid (UNIQUE)     - Server-generated UUID
domain            - Extracted from URL
url               - Full starting URL
max_depth         - Crawl depth limit
page_count        - Total pages found
status            - processing/completed/failed
ip_address        - Client IP
user_agent        - Full UA string
browser           - Parsed browser name
browser_version   - Version number
os                - Operating system
device_type       - Desktop/Mobile/Tablet
session_id        - Cookie-based session
cookies           - JSON of all cookies
referrer          - HTTP Referer header
created_at        - Timestamp
completed_at      - Completion timestamp
last_crawled      - Last activity
```

### pages table
```
id (PK)           - Auto-increment
site_id (FK)      - References sites(id)
url               - Page URL (UNIQUE)
depth             - Crawl depth level
last_modified     - Discovery time
priority          - Sitemap priority (0.0-1.0)
change_freq       - monthly/weekly/daily/etc
```

### sessions table
```
id (PK)           - Auto-increment
session_id (UNIQUE) - Session UUID
uuid (FK)         - References sites(uuid)
ip_address        - Client IP
created_at        - First seen
last_activity     - Last request
```
## 🧪 Testing

### Manual Testing
```bash
# Terminal 1: Start server
./run.sh

# Terminal 2: Test API
curl -X POST http://localhost:8080/generate-sitemap-xml \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","max_depth":2}'

# Terminal 3: Watch SSE stream (use the UUID from the previous response)
curl -N http://localhost:8080/stream/{uuid}
```

### Browser Testing
1. Open multiple tabs to http://localhost:8080
2. Start different crawls simultaneously
3. Verify independent progress tracking
4. Check the database for metadata

### Database Verification
```bash
sqlite3 sitemap.db "SELECT * FROM sites ORDER BY created_at DESC LIMIT 5;"
sqlite3 sitemap.db "SELECT COUNT(*) FROM pages WHERE site_id = 1;"
```
## 📦 Deployment Options

### Option 1: Binary
```bash
go build -o sitemap-api
./sitemap-api
```

### Option 2: Docker
```bash
docker build -t sitemap-api .
docker run -p 8080:8080 sitemap-api
```

### Option 3: Systemd Service
```ini
[Unit]
Description=Sitemap Generator API
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/sitemap-api
ExecStart=/opt/sitemap-api/sitemap-api
Restart=always

[Install]
WantedBy=multi-user.target
```
## 🔧 Configuration

### Environment Variables
```bash
export PORT=8080          # Server port
export DB_PATH=sitemap.db # Database file
```

### Code Constants
```go
// crawler/crawler.go
const maxConcurrent = 5   // Parallel requests
const httpTimeout = 10    // Seconds

// handlers/handler.go
const channelBuffer = 100 // SSE event buffer
```
## 📝 XML Sitemap Format

Generated sitemaps follow the standard sitemaps.org protocol:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-02-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
## 🎯 Success Criteria

All requirements met:
- ✅ Go backend with excellent performance
- ✅ Endpoint: `/generate-sitemap-xml` with UUID response
- ✅ Endpoint: `/stream/{uuid}` for SSE
- ✅ Endpoint: `/download/{uuid}` for XML
- ✅ Multi-user concurrent support
- ✅ Client metadata tracking (IP, browser, cookies, session)
- ✅ SQLite storage
- ✅ Root route `/` serves HTML
- ✅ Real-time progress updates
- ✅ Clean, maintainable code structure
## 📚 Next Steps

To extend this project:
1. Add user authentication (JWT tokens)
2. Implement rate limiting (go-rate package)
3. Add robots.txt parsing (robotstxt.go package)
4. Support sitemap index for large sites
5. Add scheduling/cron jobs for recurring crawls
6. Implement incremental updates
7. Add webhook notifications
8. Create an admin dashboard
9. Export to other formats (JSON, CSV)
10. Add analytics and usage stats

---

**Ready to use! Just run `./run.sh` or `make run` to get started.**
**File: Documentation/QUICKSTART.md** (new file, 152 lines)
# 🚀 Quick Start Guide

Get your sitemap generator running in 3 steps!

## Step 1: Install Go

If you don't have Go installed:
- Download from https://golang.org/dl/
- Install Go 1.21 or later
- Verify: `go version`

## Step 2: Run the Application

### Option A: Using the run script (easiest)
```bash
cd sitemap-api
./run.sh
```

### Option B: Using Make
```bash
cd sitemap-api
make run
```

### Option C: Manual
```bash
cd sitemap-api
go mod download
go build -o sitemap-api .
./sitemap-api
```

## Step 3: Use the Application

1. **Open your browser** → http://localhost:8080
2. **Enter a URL** → e.g., `https://example.com`
3. **Set crawl depth** → 1-5 (default: 3)
4. **Click "Generate Sitemap"** → Watch real-time progress!
5. **Download XML** → Click the download button when complete

## Testing Multiple Users

Open multiple browser tabs to http://localhost:8080 and start different crawls simultaneously. Each will have its own UUID and progress stream!
## API Usage Examples

### Start a crawl
```bash
curl -X POST http://localhost:8080/generate-sitemap-xml \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_depth": 3}'
```

Response:
```json
{
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "site_id": 123,
  "status": "processing",
  "stream_url": "/stream/550e8400-e29b-41d4-a716-446655440000",
  "message": "Sitemap generation started"
}
```

### Monitor progress (SSE)
```bash
curl -N http://localhost:8080/stream/550e8400-e29b-41d4-a716-446655440000
```

### Download sitemap
```bash
curl http://localhost:8080/download/550e8400-e29b-41d4-a716-446655440000 -o sitemap.xml
```

### List all sitemaps
```bash
curl http://localhost:8080/sites
```

### Delete a sitemap
```bash
curl -X DELETE http://localhost:8080/sites/123
```
## Troubleshooting

### Port already in use
```bash
PORT=3000 ./sitemap-api
```

### Build errors
```bash
go mod tidy
go clean -cache
go build -o sitemap-api .
```

### Database locked
```bash
rm sitemap.db   # note: this deletes all previously generated sitemaps
./sitemap-api
```

### CGO errors
The SQLite driver (mattn/go-sqlite3) requires CGO, so make sure gcc is installed:
- **Ubuntu/Debian**: `sudo apt-get install build-essential`
- **macOS**: `xcode-select --install`
- **Windows**: Install MinGW or TDM-GCC
## Next Steps

- Read the full [README.md](README.md) for details
- Customize the crawler in `crawler/crawler.go`
- Add authentication to handlers
- Deploy to production (see README for nginx config)
- Add more metadata tracking

## Project Structure

```
sitemap-api/
├── main.go        # Server entry point
├── handlers/      # HTTP handlers & SSE
├── crawler/       # Web crawler logic
├── database/      # SQLite operations
├── models/        # Data structures
├── static/        # Frontend (served at /)
├── README.md      # Full documentation
├── run.sh         # Quick start script
├── Makefile       # Build commands
└── Dockerfile     # Container setup
```

## Support

Having issues? Check:
1. Go version >= 1.21
2. Port 8080 is available
3. SQLite3 is working
4. All dependencies are installed

Still stuck? Open an issue on GitHub!

---

**Built with ❤️ using Go + Goroutines + Server-Sent Events**
**File: Documentation/README.md** (new file, 213 lines)
# XML Sitemap Generator API

A high-performance Go-based API for generating XML sitemaps with real-time progress tracking via Server-Sent Events (SSE).

## Features

- ✅ **Concurrent Web Crawling** - Fast sitemap generation using goroutines
- ✅ **Real-time Progress** - SSE streaming for live updates
- ✅ **Multi-user Support** - Handle multiple simultaneous crawls
- ✅ **Client Metadata Tracking** - IP, browser, OS, session data stored in SQLite
- ✅ **Clean REST API** - Simple endpoints for generate, stream, and download
- ✅ **Professional UI** - Beautiful web interface included

## Architecture

```
sitemap-api/
├── main.go              # Entry point & HTTP server
├── handlers/
│   └── handler.go       # HTTP handlers & SSE streaming
├── crawler/
│   └── crawler.go       # Concurrent web crawler
├── database/
│   └── db.go            # SQLite operations
├── models/
│   └── site.go          # Data structures
└── static/
    └── index.html       # Frontend UI
```
## API Endpoints

### `POST /generate-sitemap-xml`
Starts sitemap generation (the backend generates the UUID).

**Request:**
```json
{
  "url": "https://example.com",
  "max_depth": 3
}
```

**Response:**
```json
{
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "site_id": 123,
  "status": "processing",
  "stream_url": "/stream/550e8400-...",
  "message": "Sitemap generation started"
}
```

### `GET /stream/{uuid}`
Server-Sent Events stream for real-time progress.

**Events:** `connected`, `started`, `progress`, `complete`, `error`

### `GET /download/{uuid}`
Download the generated sitemap XML.

### `GET /sites`
List all generated sitemaps.

### `GET /sites/{id}`
Get specific site details.

### `DELETE /sites/{id}`
Delete a sitemap.

### `GET /health`
Health check endpoint.
## Installation

### Prerequisites
- Go 1.21+
- SQLite3

### Setup

```bash
# Clone/navigate to directory
cd sitemap-api

# Install dependencies
go mod download

# Build
go build -o sitemap-api

# Run
./sitemap-api
```

The server starts on **http://localhost:8080**.

### Or run directly:
```bash
go run main.go
```

## Usage

1. Open http://localhost:8080 in your browser
2. Enter a website URL
3. Set crawl depth (1-5)
4. Click "Generate Sitemap"
5. Watch real-time progress
6. Download XML when complete
## Database Schema

The SQLite database (`sitemap.db`) stores:
- **sites** - Crawl sessions with client metadata
- **pages** - Discovered URLs with priority/frequency
- **sessions** - User session tracking
## Environment Variables

- `PORT` - Server port (default: 8080)

Example:
```bash
PORT=3000 ./sitemap-api
```
## How It Works

1. **Frontend** sends POST to `/generate-sitemap-xml`
2. **Backend** generates UUID, saves metadata, returns UUID
3. **Frontend** connects to `/stream/{uuid}` for SSE updates
4. **Crawler** runs in a goroutine and sends events via a channel
5. **Handler** streams events to the frontend in real time
6. **On completion**, the sitemap is available at `/download/{uuid}`
## Multi-User Concurrency

The `StreamManager` handles concurrent users:
- Each UUID maps to a Go channel
- Concurrent map with a mutex for thread safety
- Automatic cleanup after crawl completion
- Supports many simultaneous crawls (bounded only by server resources)
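A minimal sketch of what such a manager can look like (method names follow those described in this project, but this is illustrative code, not the actual handlers/handler.go):

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a simplified SSE payload for this sketch.
type Event struct {
	Type string
	Data string
}

// StreamManager maps each crawl UUID to its own event channel,
// guarded by a mutex so concurrent handlers stay safe.
type StreamManager struct {
	mu      sync.Mutex
	streams map[string]chan Event
}

func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan Event)}
}

func (m *StreamManager) CreateStream(uuid string) chan Event {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan Event, 100) // buffered so the crawler rarely blocks
	m.streams[uuid] = ch
	return ch
}

func (m *StreamManager) GetStream(uuid string) (chan Event, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// CloseStream closes and removes a stream once the crawl completes.
func (m *StreamManager) CloseStream(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}

func main() {
	m := NewStreamManager()
	ch := m.CreateStream("abc")
	ch <- Event{Type: "progress", Data: "{}"}
	fmt.Println((<-ch).Type) // prints progress
}
```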
## Client Metadata Captured

- IP address (with X-Forwarded-For support)
- User-Agent
- Browser name & version
- Operating system
- Device type (Desktop/Mobile/Tablet)
- Session ID (cookie-based)
- All cookies (JSON)
- Referrer
## Performance

- Concurrent crawling with goroutines
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited to prevent infinite crawls
- Same-domain restriction
- Duplicate URL prevention
- 10-second HTTP timeout per request
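The same-domain and duplicate checks both lean on `net/url`; one plausible sketch of the two helpers (the project's `normalizeURL()` and `isSameDomain()` may differ in details, e.g. how query strings are handled):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL drops fragments and trailing slashes so the same page
// is only counted once by the visited-URL set.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// isSameDomain keeps the crawl on the starting host.
func isSameDomain(base, link string) bool {
	b, err1 := url.Parse(base)
	l, err2 := url.Parse(link)
	return err1 == nil && err2 == nil && b.Hostname() == l.Hostname()
}

func main() {
	n, _ := normalizeURL("https://example.com/about/#team")
	fmt.Println(n) // prints https://example.com/about
	fmt.Println(isSameDomain("https://example.com", "https://example.com/x"))
}
```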
## Customization

### Adjust Concurrency
Edit `crawler/crawler.go`:
```go
semaphore := make(chan struct{}, 10) // allow 10 concurrent requests
```

### Change Priority Calculation
Modify `calculatePriority()` in `crawler/crawler.go`.

### Add Custom Metadata
Extend the `models.Site` struct and the database schema.
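For the priority calculation, one plausible depth-based scheme for illustration (root gets 1.0, each level subtracts 0.2, floored at 0.2; an assumption, not necessarily what the project's `calculatePriority()` does):

```go
package main

import "fmt"

// calculatePriority maps crawl depth to a sitemap priority in 0.0-1.0:
// the root page gets 1.0, and each level down loses 0.2, floored at 0.2.
func calculatePriority(depth int) float64 {
	p := 1.0 - 0.2*float64(depth)
	if p < 0.2 {
		p = 0.2
	}
	return p
}

func main() {
	for d := 0; d <= 5; d++ {
		fmt.Printf("depth %d → priority %.1f\n", d, calculatePriority(d))
	}
}
```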
## Production Deployment

### Recommendations:
1. Use a reverse proxy (nginx/caddy)
2. Enable HTTPS
3. Add rate limiting
4. Configure CORS properly
5. Use PostgreSQL for production (replace SQLite)
6. Add authentication
7. Implement cleanup jobs for old sitemaps

### Example nginx config:
```nginx
location / {
    proxy_pass http://localhost:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;

    # SSE support
    proxy_buffering off;
    proxy_cache off;
}
```
## License

MIT

## Support

For issues or questions, please open a GitHub issue.
**File: Documentation/STRUCTURE.md** (new file, 280 lines)
# 📁 SITEMAP-API PROJECT STRUCTURE

## ROOT FILES
```
main.go        ⚙️  HTTP server, routes, middleware
go.mod         📦 Dependencies (chi, cors, uuid, sqlite3)
run.sh         🚀 Quick start script
Makefile       🔧 Build commands (run, build, clean, test)
Dockerfile     🐳 Container configuration
.gitignore     🚫 Git exclusions
.env.example   ⚙️  Environment template
```

## DOCUMENTATION
```
README.md             📖 Full API documentation
QUICKSTART.md         ⏱️  3-step quick start guide
PROJECT_OVERVIEW.md   📊 Complete implementation details
```
## CODE STRUCTURE

### handlers/
```
└── handler.go                  🎯 HTTP REQUEST HANDLERS
    - GenerateSitemapXML()      POST   /generate-sitemap-xml
    - StreamSSE()               GET    /stream/{uuid}
    - DownloadSitemap()         GET    /download/{uuid}
    - GetSites()                GET    /sites
    - GetSite()                 GET    /sites/{id}
    - DeleteSite()              DELETE /sites/{id}
    - Health()                  GET    /health

    🔄 STREAM MANAGER
    - NewStreamManager()        Concurrent SSE handling
    - CreateStream()            Per-UUID channels
    - GetStream()               Retrieve channel
    - CloseStream()             Cleanup

    🔍 METADATA EXTRACTORS
    - getClientIP()             IP address
    - parseBrowser()            Browser detection
    - parseOS()                 OS detection
    - parseDeviceType()         Device detection
    - extractCookies()          Cookie parsing
    - getOrCreateSession()      Session management
```
### crawler/
```
└── crawler.go                  🕷️ WEB CRAWLER ENGINE
    - NewCrawler()              Initialize crawler
    - Crawl()                   Main crawl orchestrator
    - crawlURL()                Recursive URL processing
    - extractLinks()            HTML link extraction
    - resolveURL()              Relative → absolute
    - normalizeURL()            URL canonicalization
    - isSameDomain()            Domain validation
    - calculatePriority()       Sitemap priority (0.0-1.0)
    - sendEvent()               SSE event emission
```
### database/
```
└── db.go                       💾 SQLITE DATABASE LAYER
    - NewDB()                   Initialize DB
    - createTables()            Schema setup
    - CreateSite()              Insert site
    - GetSiteByUUID()           Fetch by UUID
    - GetSiteByID()             Fetch by ID
    - GetAllSites()             List all
    - UpdateSiteStatus()        Mark complete/failed
    - DeleteSite()              Remove site
    - AddPage()                 Insert page
    - GetPagesBySiteID()        Fetch pages
```
### models/
```
└── site.go                     📋 DATA STRUCTURES
    - Site                      Main site record
    - Page                      Discovered page
    - Event                     SSE event
    - ProgressData              Progress payload
    - CompleteData              Completion payload
    - ErrorData                 Error payload
```
### static/
```
└── index.html                  🎨 FRONTEND APPLICATION
    HTML:
    - Form section              URL input, depth selector
    - Progress section          Live stats, progress bar
    - Log section               Activity console
    - Results section           Site list, download buttons

    JavaScript:
    - SitemapGenerator class    Main controller
    - generateSitemap()         POST to API
    - connectToStream()         SSE connection
    - updateProgress()          Live UI updates
    - downloadSitemap()         File download
    - loadExistingSites()       Fetch site list
    - displaySites()            Render results
```
## RUNTIME GENERATED
```
sitemap.db          💾 SQLite database (auto-created on first run)
sitemap.db-journal  📝 SQLite temp file
sitemap-api         ⚙️  Compiled binary (from: go build)
go.sum              🔒 Dependency checksums (from: go mod download)
```

## FILE COUNTS
```
Go source files:   5 files
HTML files:        1 file
Documentation:     3 files
Config files:      6 files
─────────────────────────
Total:            15 files
```

## LINES OF CODE
```
handlers/handler.go   ~600 lines  (HTTP handlers, SSE, metadata)
crawler/crawler.go    ~250 lines  (Concurrent crawler)
database/db.go        ~250 lines  (SQLite operations)
models/site.go         ~50 lines  (Data structures)
main.go                ~70 lines  (Server setup)
static/index.html     ~850 lines  (Full UI with CSS & JS)
─────────────────────────────────────
Total:              ~2,070 lines
```

## KEY DEPENDENCIES (go.mod)
```
github.com/go-chi/chi/v5     Router & middleware
github.com/go-chi/cors       CORS support
github.com/google/uuid       UUID generation
github.com/mattn/go-sqlite3  SQLite driver
golang.org/x/net             HTML parsing
```
## VISUAL TREE
```
sitemap-api/
│
├── 📄 main.go                  # Entry point & server
├── 📦 go.mod                   # Dependencies
├── 🚀 run.sh                   # Quick start
├── 🔧 Makefile                 # Build commands
├── 🐳 Dockerfile               # Containerization
├── ⚙️ .env.example             # Config template
├── 🚫 .gitignore               # Git exclusions
│
├── 📚 Documentation/
│   ├── README.md               # Full docs
│   ├── QUICKSTART.md           # Quick start
│   └── PROJECT_OVERVIEW.md     # Implementation details
│
├── 🎯 handlers/
│   └── handler.go              # All HTTP endpoints + SSE
│
├── 🕷️ crawler/
│   └── crawler.go              # Concurrent web crawler
│
├── 💾 database/
│   └── db.go                   # SQLite operations
│
├── 📋 models/
│   └── site.go                 # Data structures
│
└── 🎨 static/
    └── index.html              # Frontend UI
```
## DATA FLOW
```
User Browser
│
├─► POST /generate-sitemap-xml
│     └─► handlers.GenerateSitemapXML()
│           ├─► Generate UUID
│           ├─► Extract metadata (IP, browser, etc.)
│           ├─► database.CreateSite()
│           ├─► streamManager.CreateStream(uuid)
│           ├─► go crawler.Crawl()   [goroutine]
│           └─► Return {uuid, site_id, stream_url}
│
├─► GET /stream/{uuid}
│     └─► handlers.StreamSSE()
│           └─► streamManager.GetStream(uuid)
│                 └─► Forward events to browser
│
└─► GET /download/{uuid}
      └─► handlers.DownloadSitemap()
            ├─► database.GetSiteByUUID()
            ├─► database.GetPagesBySiteID()
            └─► Generate XML sitemap

Crawler (goroutine)
│
├─► Fetch URL
├─► Parse HTML links
├─► database.AddPage()
├─► Send SSE progress event
└─► Recursively crawl children (with goroutines)
```
## DATABASE SCHEMA
```
┌─────────────────┐
│      sites      │
├─────────────────┤
│ id              │ PK
│ uuid            │ UNIQUE (server-generated)
│ domain          │
│ url             │
│ max_depth       │
│ page_count      │
│ status          │ (processing/completed/failed)
│ ip_address      │ (client metadata)
│ user_agent      │
│ browser         │
│ browser_version │
│ os              │
│ device_type     │
│ session_id      │
│ cookies         │ (JSON)
│ referrer        │
│ created_at      │
│ completed_at    │
│ last_crawled    │
└─────────────────┘
        │
        │ 1:N
        ↓
┌─────────────────┐
│      pages      │
├─────────────────┤
│ id              │ PK
│ site_id         │ FK → sites.id
│ url             │ UNIQUE
│ depth           │
│ last_modified   │
│ priority        │ (0.0 - 1.0)
│ change_freq     │ (monthly/weekly/etc)
└─────────────────┘

┌─────────────────┐
│    sessions     │
├─────────────────┤
│ id              │ PK
│ session_id      │ UNIQUE
│ uuid            │ FK → sites.uuid
│ ip_address      │
│ created_at      │
│ last_activity   │
└─────────────────┘
```
## CONCURRENCY MODEL
```
StreamManager
  ├─► map[uuid]chan Event   (thread-safe with mutex)
  │
  └─► Per-UUID Channel
        └─► Event stream to browser

Crawler
  ├─► Main goroutine (Crawl)
  │     └─► Spawns goroutines for each URL
  │
  └─► Semaphore (5 concurrent max)
        └─► Controls parallel requests
```