Kar
2026-02-05 19:23:03 +05:30
parent 10b19d4ed6
commit b80e988191
10 changed files with 284 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,4 @@
sitemap.db
sitemap.db-journal
sitemap-api
go.sum

Documentation/STRUCTURE.md Normal file

@@ -0,0 +1,280 @@
# 📁 SITEMAP-API PROJECT STRUCTURE
## ROOT FILES
```
main.go ⚙️ HTTP server, routes, middleware
go.mod 📦 Dependencies (chi, cors, uuid, sqlite3)
run.sh 🚀 Quick start script
Makefile 🔧 Build commands (run, build, clean, test)
Dockerfile 🐳 Container configuration
.gitignore 🚫 Git exclusions
.env.example ⚙️ Environment template
```
## DOCUMENTATION
```
README.md 📖 Full API documentation
QUICKSTART.md ⏱️ 3-step quick start guide
PROJECT_OVERVIEW.md 📊 Complete implementation details
```
## CODE STRUCTURE
### handlers/
```
└── handler.go 🎯 HTTP REQUEST HANDLERS
- GenerateSitemapXML() POST /generate-sitemap-xml
- StreamSSE() GET /stream/{uuid}
- DownloadSitemap() GET /download/{uuid}
- GetSites() GET /sites
- GetSite() GET /sites/{id}
- DeleteSite() DELETE /sites/{id}
- Health() GET /health
🔄 STREAM MANAGER
- NewStreamManager() Concurrent SSE handling
- CreateStream() Per-UUID channels
- GetStream() Retrieve channel
- CloseStream() Cleanup
🔍 METADATA EXTRACTORS
- getClientIP() IP address
- parseBrowser() Browser detection
- parseOS() OS detection
- parseDeviceType() Device detection
- extractCookies() Cookie parsing
- getOrCreateSession() Session management
```
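The stream manager above is essentially a mutex-guarded map of per-UUID channels. A minimal sketch of that shape, assuming a buffered channel and a two-field `Event` like the one listed under models/ (the buffer size and exact fields are assumptions):

```go
package main

import "sync"

// Event stands in for the SSE payload type from models/site.go.
type Event struct {
	Type string
	Data any
}

// StreamManager maps each crawl UUID to its own event channel,
// guarded by a mutex so concurrent handlers can use it safely.
type StreamManager struct {
	mu      sync.Mutex
	streams map[string]chan Event
}

func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan Event)}
}

// CreateStream registers a buffered channel for a UUID; the buffer
// keeps the crawler from blocking on a briefly slow SSE client.
func (m *StreamManager) CreateStream(uuid string) chan Event {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan Event, 64)
	m.streams[uuid] = ch
	return ch
}

// GetStream retrieves the channel for a UUID, if one exists.
func (m *StreamManager) GetStream(uuid string) (chan Event, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// CloseStream closes the channel and removes it from the map.
func (m *StreamManager) CloseStream(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}
```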
### crawler/
```
└── crawler.go 🕷️ WEB CRAWLER ENGINE
- NewCrawler() Initialize crawler
- Crawl() Main crawl orchestrator
- crawlURL() Recursive URL processing
- extractLinks() HTML link extraction
- resolveURL() Relative → absolute
- normalizeURL() URL canonicalization
- isSameDomain() Domain validation
- calculatePriority() Sitemap priority (0-1.0)
- sendEvent() SSE event emission
```
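Two of the helpers above lend themselves to short sketches. The exact canonicalization rules and the depth-based priority curve below are assumptions for illustration, not the project's actual logic:

```go
package main

import (
	"net/url"
	"strings"
)

// normalizeURL canonicalizes a URL so the same page is not crawled
// twice: lowercases the host, drops the fragment, trims a trailing slash.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// calculatePriority maps crawl depth to a sitemap priority: the root
// page gets 1.0, each level halves it, floored at 0.1. The halving
// rule is an assumed example of a depth-to-priority mapping.
func calculatePriority(depth int) float64 {
	p := 1.0
	for i := 0; i < depth; i++ {
		p /= 2
	}
	if p < 0.1 {
		p = 0.1
	}
	return p
}
```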
### database/
```
└── db.go 💾 SQLITE DATABASE LAYER
- NewDB() Initialize DB
- createTables() Schema setup
- CreateSite() Insert site
- GetSiteByUUID() Fetch by UUID
- GetSiteByID() Fetch by ID
- GetAllSites() List all
- UpdateSiteStatus() Mark complete/failed
- DeleteSite() Remove site
- AddPage() Insert page
- GetPagesBySiteID() Fetch pages
```
### models/
```
└── site.go 📋 DATA STRUCTURES
- Site Main site record
- Page Discovered page
- Event SSE event
- ProgressData Progress payload
- CompleteData Completion payload
- ErrorData Error payload
```
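A sketch of what these structures might look like; the field names and JSON tags are assumptions inferred from the API endpoints and the database schema, not copied from models/site.go:

```go
package main

import "time"

// Site is the main site record, one row per crawl request.
type Site struct {
	ID        int64     `json:"id"`
	UUID      string    `json:"uuid"`
	Domain    string    `json:"domain"`
	URL       string    `json:"url"`
	MaxDepth  int       `json:"max_depth"`
	PageCount int       `json:"page_count"`
	Status    string    `json:"status"` // processing | completed | failed
	CreatedAt time.Time `json:"created_at"`
}

// Page is one discovered URL belonging to a Site.
type Page struct {
	ID         int64   `json:"id"`
	SiteID     int64   `json:"site_id"`
	URL        string  `json:"url"`
	Depth      int     `json:"depth"`
	Priority   float64 `json:"priority"`    // 0.0 - 1.0
	ChangeFreq string  `json:"change_freq"` // e.g. "monthly"
}

// Event is the envelope pushed over SSE; Data carries one of the
// progress/complete/error payloads.
type Event struct {
	Type string `json:"type"` // progress | complete | error
	Data any    `json:"data"`
}
```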
### static/
```
└── index.html 🎨 FRONTEND APPLICATION
HTML:
- Form section URL input, depth selector
- Progress section Live stats, progress bar
- Log section Activity console
- Results section Site list, download buttons
JavaScript:
- SitemapGenerator class Main controller
- generateSitemap() POST to API
- connectToStream() SSE connection
- updateProgress() Live UI updates
- downloadSitemap() File download
- loadExistingSites() Fetch site list
- displaySites() Render results
```
## RUNTIME GENERATED
```
sitemap.db 💾 SQLite database (auto-created on first run)
sitemap.db-journal 📝 SQLite temp file
sitemap-api ⚙️ Compiled binary (from: go build)
go.sum 🔒 Dependency checksums (from: go mod download)
```
## FILE COUNTS
```
Go source files: 5 files
HTML files: 1 file
Documentation: 3 files
Config files: 6 files
─────────────────────────
Total: 15 files
```
## LINES OF CODE
```
handlers/handler.go ~600 lines (HTTP handlers, SSE, metadata)
crawler/crawler.go ~250 lines (Concurrent crawler)
database/db.go ~250 lines (SQLite operations)
models/site.go ~50 lines (Data structures)
main.go ~70 lines (Server setup)
static/index.html ~850 lines (Full UI with CSS & JS)
─────────────────────────────────────
Total: ~2,070 lines
```
## KEY DEPENDENCIES (go.mod)
```
github.com/go-chi/chi/v5 Router & middleware
github.com/go-chi/cors CORS support
github.com/google/uuid UUID generation
github.com/mattn/go-sqlite3 SQLite driver
golang.org/x/net HTML parsing
```
## VISUAL TREE
```
sitemap-api/
├── 📄 main.go # Entry point & server
├── 📦 go.mod # Dependencies
├── 🚀 run.sh # Quick start
├── 🔧 Makefile # Build commands
├── 🐳 Dockerfile # Containerization
├── ⚙️ .env.example # Config template
├── 🚫 .gitignore # Git exclusions
├── 📚 Documentation/
│ ├── README.md # Full docs
│ ├── QUICKSTART.md # Quick start
│ └── PROJECT_OVERVIEW.md # Implementation details
├── 🎯 handlers/
│ └── handler.go # All HTTP endpoints + SSE
├── 🕷️ crawler/
│ └── crawler.go # Concurrent web crawler
├── 💾 database/
│ └── db.go # SQLite operations
├── 📋 models/
│ └── site.go # Data structures
└── 🎨 static/
└── index.html # Frontend UI
```
## DATA FLOW
```
User Browser
├─► POST /generate-sitemap-xml
│ └─► handlers.GenerateSitemapXML()
│ ├─► Generate UUID
│ ├─► Extract metadata (IP, browser, etc)
│ ├─► database.CreateSite()
│ ├─► streamManager.CreateStream(uuid)
│ ├─► go crawler.Crawl() [goroutine]
│ └─► Return {uuid, site_id, stream_url}
├─► GET /stream/{uuid}
│ └─► handlers.StreamSSE()
│ └─► streamManager.GetStream(uuid)
│ └─► Forward events to browser
└─► GET /download/{uuid}
└─► handlers.DownloadSitemap()
├─► database.GetSiteByUUID()
├─► database.GetPagesBySiteID()
└─► Generate XML sitemap
Crawler (goroutine)
├─► Fetch URL
├─► Parse HTML links
├─► database.AddPage()
├─► Send SSE progress event
└─► Recursively crawl children (with goroutines)
```
## DATABASE SCHEMA
```
┌─────────────────┐
│ sites │
├─────────────────┤
│ id │ PK
│ uuid │ UNIQUE (server-generated)
│ domain │
│ url │
│ max_depth │
│ page_count │
│ status │ (processing/completed/failed)
│ ip_address │ (client metadata)
│ user_agent │
│ browser │
│ browser_version │
│ os │
│ device_type │
│ session_id │
│ cookies │ (JSON)
│ referrer │
│ created_at │
│ completed_at │
│ last_crawled │
└─────────────────┘
│ 1:N
┌─────────────────┐
│ pages │
├─────────────────┤
│ id │ PK
│ site_id │ FK → sites.id
│ url │ UNIQUE
│ depth │
│ last_modified │
│ priority │ (0.0 - 1.0)
│ change_freq │ (monthly/weekly/etc)
└─────────────────┘
┌─────────────────┐
│ sessions │
├─────────────────┤
│ id │ PK
│ session_id │ UNIQUE
│ uuid │ FK → sites.uuid
│ ip_address │
│ created_at │
│ last_activity │
└─────────────────┘
```
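The diagram translates into DDL roughly as below, returned here as a string the way database.createTables() might execute it against SQLite. Column types and constraints are assumptions inferred from the annotations:

```go
package main

// createTablesSQL returns DDL matching the schema diagram above.
// Types and defaults are inferred, not copied from database/db.go.
func createTablesSQL() string {
	return `
CREATE TABLE IF NOT EXISTS sites (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid            TEXT NOT NULL UNIQUE,
    domain          TEXT NOT NULL,
    url             TEXT NOT NULL,
    max_depth       INTEGER NOT NULL,
    page_count      INTEGER DEFAULT 0,
    status          TEXT DEFAULT 'processing',
    ip_address      TEXT,
    user_agent      TEXT,
    browser         TEXT,
    browser_version TEXT,
    os              TEXT,
    device_type     TEXT,
    session_id      TEXT,
    cookies         TEXT,          -- JSON blob
    referrer        TEXT,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at    TIMESTAMP,
    last_crawled    TIMESTAMP
);
CREATE TABLE IF NOT EXISTS pages (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    site_id       INTEGER NOT NULL REFERENCES sites(id),
    url           TEXT NOT NULL UNIQUE,
    depth         INTEGER NOT NULL,
    last_modified TIMESTAMP,
    priority      REAL,            -- 0.0 - 1.0
    change_freq   TEXT             -- monthly/weekly/etc
);
CREATE TABLE IF NOT EXISTS sessions (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id    TEXT NOT NULL UNIQUE,
    uuid          TEXT REFERENCES sites(uuid),
    ip_address    TEXT,
    created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_activity TIMESTAMP
);`
}
```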
## CONCURRENCY MODEL
```
StreamManager
├─► map[uuid]chan Event (thread-safe with mutex)
└─► Per-UUID Channel
└─► Event stream to browser
Crawler
├─► Main goroutine (Crawl)
│ └─► Spawns goroutines for each URL
└─► Semaphore (5 concurrent max)
└─► Controls parallel requests
```