init
.gitignore (vendored, Normal file, 4 lines)
@@ -0,0 +1,4 @@
sitemap.db
sitemap.db-journal
sitemap-api
go.sum

Documentation/STRUCTURE.md (Normal file, 280 lines)
@@ -0,0 +1,280 @@

# 📁 SITEMAP-API PROJECT STRUCTURE

## ROOT FILES
```
main.go          ⚙️  HTTP server, routes, middleware
go.mod           📦  Dependencies (chi, cors, uuid, sqlite3)
run.sh           🚀  Quick start script
Makefile         🔧  Build commands (run, build, clean, test)
Dockerfile       🐳  Container configuration
.gitignore       🚫  Git exclusions
.env.example     ⚙️  Environment template
```

## DOCUMENTATION
```
README.md              📖  Full API documentation
QUICKSTART.md          ⏱️  3-step quick start guide
PROJECT_OVERVIEW.md    📊  Complete implementation details
```

## CODE STRUCTURE

### handlers/
```
└── handler.go                  🎯 HTTP REQUEST HANDLERS
    - GenerateSitemapXML()      POST   /generate-sitemap-xml
    - StreamSSE()               GET    /stream/{uuid}
    - DownloadSitemap()         GET    /download/{uuid}
    - GetSites()                GET    /sites
    - GetSite()                 GET    /sites/{id}
    - DeleteSite()              DELETE /sites/{id}
    - Health()                  GET    /health

    🔄 STREAM MANAGER
    - NewStreamManager()        Concurrent SSE handling
    - CreateStream()            Per-UUID channels
    - GetStream()               Retrieve channel
    - CloseStream()             Cleanup

    🔍 METADATA EXTRACTORS
    - getClientIP()             IP address
    - parseBrowser()            Browser detection
    - parseOS()                 OS detection
    - parseDeviceType()         Device detection
    - extractCookies()          Cookie parsing
    - getOrCreateSession()      Session management
```

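The metadata extractors are only listed by name above. As a rough illustration of what a helper like `getClientIP()` might do, the sketch below prefers proxy headers and falls back to the connection address. The helper names `pickClientIP` and `clientIP`, and the choice of the `X-Forwarded-For` and `X-Real-IP` headers, are assumptions for this example and are not taken from `handler.go`.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// pickClientIP is a hypothetical helper: it chooses a client address from
// the X-Forwarded-For value, the X-Real-IP value, or the socket address.
func pickClientIP(forwardedFor, realIP, remoteAddr string) string {
	// X-Forwarded-For may hold a comma-separated chain; the first entry
	// is conventionally the original client.
	if first := strings.TrimSpace(strings.Split(forwardedFor, ",")[0]); first != "" {
		return first
	}
	if ip := strings.TrimSpace(realIP); ip != "" {
		return ip
	}
	// RemoteAddr is normally "host:port"; strip the port when present.
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

// clientIP adapts pickClientIP to an *http.Request.
func clientIP(r *http.Request) string {
	return pickClientIP(
		r.Header.Get("X-Forwarded-For"),
		r.Header.Get("X-Real-IP"),
		r.RemoteAddr,
	)
}

func main() {
	r := httptest.NewRequest(http.MethodGet, "/health", nil)
	r.RemoteAddr = "192.0.2.10:54321"
	fmt.Println(clientIP(r)) // no proxy headers: socket address is used

	r.Header.Set("X-Forwarded-For", "203.0.113.7, 10.0.0.1")
	fmt.Println(clientIP(r)) // first X-Forwarded-For entry wins
}
```

Proxy headers can be spoofed by clients, so a real deployment would likely trust them only when the request arrives through a known reverse proxy.
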
### crawler/
```
└── crawler.go                  🕷️ WEB CRAWLER ENGINE
    - NewCrawler()              Initialize crawler
    - Crawl()                   Main crawl orchestrator
    - crawlURL()                Recursive URL processing
    - extractLinks()            HTML link extraction
    - resolveURL()              Relative → absolute
    - normalizeURL()            URL canonicalization
    - isSameDomain()            Domain validation
    - calculatePriority()       Sitemap priority (0.0 - 1.0)
    - sendEvent()               SSE event emission
```

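The URL helpers listed for the crawler can be sketched with the standard `net/url` package. The function names mirror the list above, but the signatures, the normalization rules (lowercase host, dropped fragment, trimmed trailing slash) and the depth-based priority formula are assumptions made for illustration; the actual `crawler.go` may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveURL turns a possibly relative href into an absolute URL.
func resolveURL(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(h).String(), nil
}

// normalizeURL applies assumed canonicalization rules so that the same
// page is not stored twice under slightly different spellings.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment, u.RawFragment = "", "" // fragments never reach the server
	if u.Path == "" {
		u.Path = "/"
	} else if len(u.Path) > 1 {
		u.Path = strings.TrimSuffix(u.Path, "/")
	}
	return u.String(), nil
}

// isSameDomain reports whether both URLs share a hostname (case-insensitive).
func isSameDomain(a, b string) bool {
	ua, errA := url.Parse(a)
	ub, errB := url.Parse(b)
	if errA != nil || errB != nil {
		return false
	}
	return strings.EqualFold(ua.Hostname(), ub.Hostname())
}

// calculatePriority maps crawl depth to a sitemap priority in [0.1, 1.0].
// The 0.2-per-level step is an assumption, not the project's formula.
func calculatePriority(depth int) float64 {
	p := 1.0 - 0.2*float64(depth)
	if p < 0.1 {
		return 0.1
	}
	return p
}

func main() {
	abs, _ := resolveURL("https://example.com/docs/intro", "../about")
	norm, _ := normalizeURL("HTTPS://Example.com/Docs/#intro")
	fmt.Println(abs)
	fmt.Println(norm)
	fmt.Println(isSameDomain(abs, "https://other.test/"))
	fmt.Printf("%.1f\n", calculatePriority(2))
}
```
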
### database/
```
└── db.go                       💾 SQLITE DATABASE LAYER
    - NewDB()                   Initialize DB
    - createTables()            Schema setup
    - CreateSite()              Insert site
    - GetSiteByUUID()           Fetch by UUID
    - GetSiteByID()             Fetch by ID
    - GetAllSites()             List all
    - UpdateSiteStatus()        Mark complete/failed
    - DeleteSite()              Remove site
    - AddPage()                 Insert page
    - GetPagesBySiteID()        Fetch pages
```

### models/
```
└── site.go                     📋 DATA STRUCTURES
    - Site                      Main site record
    - Page                      Discovered page
    - Event                     SSE event
    - ProgressData              Progress payload
    - CompleteData              Completion payload
    - ErrorData                 Error payload
```

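To make the relationship between `Event` and its payload types more concrete, here is a minimal sketch of how an event could be encoded as a Server-Sent Events frame. The struct fields and JSON tags are assumptions for illustration (the real definitions in `site.go` are not shown here), and `formatSSE` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is an assumed shape for the SSE event model: a type name plus
// an arbitrary payload such as ProgressData.
type Event struct {
	Type string      `json:"type"`
	Data interface{} `json:"data"`
}

// ProgressData is an assumed progress payload; field names are illustrative.
type ProgressData struct {
	UUID       string `json:"uuid"`
	PagesFound int    `json:"pages_found"`
	CurrentURL string `json:"current_url"`
}

// formatSSE renders an event in the text/event-stream wire format:
// an "event:" line, a "data:" line, and a blank line ending the frame.
func formatSSE(ev Event) (string, error) {
	payload, err := json.Marshal(ev.Data)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("event: %s\ndata: %s\n\n", ev.Type, payload), nil
}

func main() {
	frame, err := formatSSE(Event{
		Type: "progress",
		Data: ProgressData{UUID: "abc", PagesFound: 3, CurrentURL: "https://example.com/a"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(frame)
}
```

Because `json.Marshal` emits a single line, the payload fits in one `data:` field; multi-line payloads would need one `data:` line per line of content.
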
### static/
```
└── index.html                  🎨 FRONTEND APPLICATION
    HTML:
    - Form section              URL input, depth selector
    - Progress section          Live stats, progress bar
    - Log section               Activity console
    - Results section           Site list, download buttons

    JavaScript:
    - SitemapGenerator class    Main controller
    - generateSitemap()         POST to API
    - connectToStream()         SSE connection
    - updateProgress()          Live UI updates
    - downloadSitemap()         File download
    - loadExistingSites()       Fetch site list
    - displaySites()            Render results
```

## RUNTIME GENERATED
```
sitemap.db            💾  SQLite database (auto-created on first run)
sitemap.db-journal    📝  SQLite rollback journal (temporary file)
sitemap-api           ⚙️  Compiled binary (from: go build)
go.sum                🔒  Dependency checksums (from: go mod download)
```

## FILE COUNTS
```
Go source files:     5 files
HTML files:          1 file
Documentation:       3 files
Config files:        6 files
─────────────────────────
Total:              15 files
```

## LINES OF CODE
```
handlers/handler.go    ~600 lines  (HTTP handlers, SSE, metadata)
crawler/crawler.go     ~250 lines  (Concurrent crawler)
database/db.go         ~250 lines  (SQLite operations)
models/site.go          ~50 lines  (Data structures)
main.go                 ~70 lines  (Server setup)
static/index.html      ~850 lines  (Full UI with CSS & JS)
─────────────────────────────────────
Total:               ~2,070 lines
```

## KEY DEPENDENCIES (go.mod)
```
github.com/go-chi/chi/v5       Router & middleware
github.com/go-chi/cors         CORS support
github.com/google/uuid         UUID generation
github.com/mattn/go-sqlite3    SQLite driver
golang.org/x/net               HTML parsing
```

## VISUAL TREE
```
sitemap-api/
│
├── 📄 main.go                     # Entry point & server
├── 📦 go.mod                      # Dependencies
├── 🚀 run.sh                      # Quick start
├── 🔧 Makefile                    # Build commands
├── 🐳 Dockerfile                  # Containerization
├── ⚙️ .env.example                # Config template
├── 🚫 .gitignore                  # Git exclusions
│
├── 📚 Documentation/
│   ├── README.md                  # Full docs
│   ├── QUICKSTART.md              # Quick start
│   └── PROJECT_OVERVIEW.md        # Implementation details
│
├── 🎯 handlers/
│   └── handler.go                 # All HTTP endpoints + SSE
│
├── 🕷️ crawler/
│   └── crawler.go                 # Concurrent web crawler
│
├── 💾 database/
│   └── db.go                      # SQLite operations
│
├── 📋 models/
│   └── site.go                    # Data structures
│
└── 🎨 static/
    └── index.html                 # Frontend UI
```

## DATA FLOW
```
User Browser
    │
    ├─► POST /generate-sitemap-xml
    │   └─► handlers.GenerateSitemapXML()
    │       ├─► Generate UUID
    │       ├─► Extract metadata (IP, browser, etc.)
    │       ├─► database.CreateSite()
    │       ├─► streamManager.CreateStream(uuid)
    │       ├─► go crawler.Crawl() [goroutine]
    │       └─► Return {uuid, site_id, stream_url}
    │
    ├─► GET /stream/{uuid}
    │   └─► handlers.StreamSSE()
    │       └─► streamManager.GetStream(uuid)
    │           └─► Forward events to browser
    │
    └─► GET /download/{uuid}
        └─► handlers.DownloadSitemap()
            ├─► database.GetSiteByUUID()
            ├─► database.GetPagesBySiteID()
            └─► Generate XML sitemap

Crawler (goroutine)
    │
    ├─► Fetch URL
    ├─► Parse HTML links
    ├─► database.AddPage()
    ├─► Send SSE progress event
    └─► Recursively crawl children (with goroutines)
```

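The last step of the download path, "Generate XML sitemap", could look roughly like the sketch below, which uses the standard `encoding/xml` package and the sitemaps.org 0.9 namespace. The `urlEntry` and `urlSet` types and the `buildSitemap` function are hypothetical names for this example; how `DownloadSitemap()` actually builds its output is not shown in this document.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// urlEntry is a hypothetical view of one stored page as a <url> element.
type urlEntry struct {
	Loc        string `xml:"loc"`
	LastMod    string `xml:"lastmod,omitempty"`
	ChangeFreq string `xml:"changefreq,omitempty"`
	Priority   string `xml:"priority,omitempty"`
}

// urlSet is the <urlset> root element of a sitemap.
type urlSet struct {
	XMLName xml.Name   `xml:"urlset"`
	Xmlns   string     `xml:"xmlns,attr"`
	URLs    []urlEntry `xml:"url"`
}

// buildSitemap serializes the entries with the XML declaration prepended.
func buildSitemap(entries []urlEntry) ([]byte, error) {
	set := urlSet{
		Xmlns: "http://www.sitemaps.org/schemas/sitemap/0.9",
		URLs:  entries,
	}
	body, err := xml.MarshalIndent(set, "", "  ")
	if err != nil {
		return nil, err
	}
	return append([]byte(xml.Header), body...), nil
}

func main() {
	out, err := buildSitemap([]urlEntry{
		{Loc: "https://example.com/", ChangeFreq: "weekly", Priority: fmt.Sprintf("%.1f", 1.0)},
		{Loc: "https://example.com/about", ChangeFreq: "monthly", Priority: fmt.Sprintf("%.1f", 0.8)},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Priority is kept as a preformatted string here so that values such as 1.0 are written with one decimal place rather than as `1`.
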
## DATABASE SCHEMA
```
┌─────────────────┐
│      sites      │
├─────────────────┤
│ id              │ PK
│ uuid            │ UNIQUE (server-generated)
│ domain          │
│ url             │
│ max_depth       │
│ page_count      │
│ status          │ (processing/completed/failed)
│ ip_address      │ (client metadata)
│ user_agent      │
│ browser         │
│ browser_version │
│ os              │
│ device_type     │
│ session_id      │
│ cookies         │ (JSON)
│ referrer        │
│ created_at      │
│ completed_at    │
│ last_crawled    │
└─────────────────┘
        │
        │ 1:N
        ↓
┌─────────────────┐
│      pages      │
├─────────────────┤
│ id              │ PK
│ site_id         │ FK → sites.id
│ url             │ UNIQUE
│ depth           │
│ last_modified   │
│ priority        │ (0.0 - 1.0)
│ change_freq     │ (monthly/weekly/etc)
└─────────────────┘

┌─────────────────┐
│    sessions     │
├─────────────────┤
│ id              │ PK
│ session_id      │ UNIQUE
│ uuid            │ FK → sites.uuid
│ ip_address      │
│ created_at      │
│ last_activity   │
└─────────────────┘
```

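For readers who prefer DDL to box diagrams, the `pages` and `sessions` tables above could be declared roughly as follows. This is a sketch only: the column types, `NOT NULL` constraints, defaults and the `ON DELETE CASCADE` clause are assumptions that the diagram does not specify, `schemaStatements` is a hypothetical helper, and the `sites` table is omitted for brevity. In the project, statements like these would presumably be executed by `createTables()`.

```go
package main

import "fmt"

// schemaStatements returns assumed SQLite DDL for two of the tables in
// the diagram. Only column names and key relationships come from the
// diagram; types and constraints are illustrative guesses.
func schemaStatements() []string {
	return []string{
		`CREATE TABLE IF NOT EXISTS pages (
			id            INTEGER PRIMARY KEY AUTOINCREMENT,
			site_id       INTEGER NOT NULL REFERENCES sites(id) ON DELETE CASCADE,
			url           TEXT    NOT NULL UNIQUE,
			depth         INTEGER NOT NULL DEFAULT 0,
			last_modified DATETIME,
			priority      REAL    NOT NULL DEFAULT 0.5,
			change_freq   TEXT    NOT NULL DEFAULT 'monthly'
		)`,
		`CREATE TABLE IF NOT EXISTS sessions (
			id            INTEGER PRIMARY KEY AUTOINCREMENT,
			session_id    TEXT NOT NULL UNIQUE,
			uuid          TEXT REFERENCES sites(uuid),
			ip_address    TEXT,
			created_at    DATETIME DEFAULT CURRENT_TIMESTAMP,
			last_activity DATETIME DEFAULT CURRENT_TIMESTAMP
		)`,
	}
}

func main() {
	// Printing stands in for db.Exec(stmt) so the sketch needs no driver.
	for i, stmt := range schemaStatements() {
		fmt.Printf("-- statement %d\n%s;\n\n", i+1, stmt)
	}
}
```

Note that SQLite enforces foreign keys only when `PRAGMA foreign_keys = ON` is set on the connection.
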
## CONCURRENCY MODEL
```
StreamManager
    ├─► map[uuid]chan Event (thread-safe with mutex)
    │
    └─► Per-UUID Channel
        └─► Event stream to browser

Crawler
    ├─► Main goroutine (Crawl)
    │   └─► Spawns goroutines for each URL
    │
    └─► Semaphore (5 concurrent max)
        └─► Controls parallel requests
```

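The two pieces of the concurrency model can be sketched together: a mutex-guarded map of per-UUID channels, and a buffered channel used as a counting semaphore. The method names follow the handler list in this document, but the signatures, the channel buffer size of 100, the simplified `Event` struct and the `crawlAll` helper are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a simplified stand-in for the SSE event model.
type Event struct {
	Type string
	Data string
}

// StreamManager guards a map of per-UUID event channels with a mutex.
type StreamManager struct {
	mu      sync.RWMutex
	streams map[string]chan Event
}

func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan Event)}
}

// CreateStream registers a buffered channel for uuid (buffer size assumed).
func (m *StreamManager) CreateStream(uuid string) chan Event {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan Event, 100)
	m.streams[uuid] = ch
	return ch
}

// GetStream looks up the channel for uuid, if any.
func (m *StreamManager) GetStream(uuid string) (chan Event, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// CloseStream closes and forgets the channel for uuid.
func (m *StreamManager) CloseStream(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}

// crawlAll runs fetch for every URL with at most limit calls in flight
// and returns the peak concurrency it observed.
func crawlAll(urls []string, limit int, fetch func(string)) int {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()
			fetch(u)
			mu.Lock()
			inFlight--
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return peak
}

func main() {
	sm := NewStreamManager()
	ch := sm.CreateStream("demo-uuid")
	urls := make([]string, 20)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://example.com/page-%d", i)
	}
	peak := crawlAll(urls, 5, func(u string) {
		ch <- Event{Type: "progress", Data: u}
	})
	sm.CloseStream("demo-uuid")
	n := 0
	for range ch { // drain events buffered before the close
		n++
	}
	fmt.Printf("events=%d peak<=5: %v\n", n, peak <= 5)
}
```

Closing the channel only after `crawlAll` returns avoids sending on a closed channel, which would panic; a real handler would need the same ordering between the crawl finishing and `CloseStream()`.
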