sitemap-generator-xml-golang/Documentation/STRUCTURE.md
2026-02-05 19:23:03 +05:30

📁 SITEMAP-API PROJECT STRUCTURE

ROOT FILES

main.go              ⚙️  HTTP server, routes, middleware
go.mod               📦  Dependencies (chi, cors, uuid, sqlite3)
run.sh               🚀  Quick start script
Makefile             🔧  Build commands (run, build, clean, test)
Dockerfile           🐳  Container configuration
.gitignore           🚫  Git exclusions
.env.example         ⚙️  Environment template

DOCUMENTATION

README.md            📖  Full API documentation
QUICKSTART.md        ⏱️  3-step quick start guide
PROJECT_OVERVIEW.md  📊  Complete implementation details

CODE STRUCTURE

handlers/

└── handler.go       🎯  HTTP REQUEST HANDLERS
                        - GenerateSitemapXML()     POST /generate-sitemap-xml
                        - StreamSSE()              GET  /stream/{uuid}
                        - DownloadSitemap()        GET  /download/{uuid}
                        - GetSites()               GET  /sites
                        - GetSite()                GET  /sites/{id}
                        - DeleteSite()             DELETE /sites/{id}
                        - Health()                 GET  /health
                        
                        🔄 STREAM MANAGER
                        - NewStreamManager()       Concurrent SSE handling
                        - CreateStream()           Per-UUID channels
                        - GetStream()              Retrieve channel
                        - CloseStream()            Cleanup
                        
                        🔍 METADATA EXTRACTORS
                        - getClientIP()            IP address
                        - parseBrowser()           Browser detection
                        - parseOS()                OS detection
                        - parseDeviceType()        Device detection
                        - extractCookies()         Cookie parsing
                        - getOrCreateSession()     Session management

crawler/

└── crawler.go       🕷️  WEB CRAWLER ENGINE
                        - NewCrawler()             Initialize crawler
                        - Crawl()                  Main crawl orchestrator
                        - crawlURL()               Recursive URL processing
                        - extractLinks()           HTML link extraction
                        - resolveURL()             Relative → absolute
                        - normalizeURL()           URL canonicalization
                        - isSameDomain()           Domain validation
                        - calculatePriority()      Sitemap priority (0.0-1.0)
                        - sendEvent()              SSE event emission

database/

└── db.go            💾  SQLITE DATABASE LAYER
                        - NewDB()                  Initialize DB
                        - createTables()           Schema setup
                        - CreateSite()             Insert site
                        - GetSiteByUUID()          Fetch by UUID
                        - GetSiteByID()            Fetch by ID
                        - GetAllSites()            List all
                        - UpdateSiteStatus()       Mark complete/failed
                        - DeleteSite()             Remove site
                        - AddPage()                Insert page
                        - GetPagesBySiteID()       Fetch pages

models/

└── site.go          📋  DATA STRUCTURES
                        - Site                     Main site record
                        - Page                     Discovered page
                        - Event                    SSE event
                        - ProgressData             Progress payload
                        - CompleteData             Completion payload
                        - ErrorData                Error payload

static/

└── index.html       🎨  FRONTEND APPLICATION
                        HTML:
                        - Form section             URL input, depth selector
                        - Progress section         Live stats, progress bar
                        - Log section              Activity console
                        - Results section          Site list, download buttons
                        
                        JavaScript:
                        - SitemapGenerator class   Main controller
                        - generateSitemap()        POST to API
                        - connectToStream()        SSE connection
                        - updateProgress()         Live UI updates
                        - downloadSitemap()        File download
                        - loadExistingSites()      Fetch site list
                        - displaySites()           Render results

RUNTIME GENERATED

sitemap.db           💾  SQLite database (auto-created on first run)
sitemap.db-journal   📝  SQLite temp file
sitemap-api          ⚙️  Compiled binary (from: go build)
go.sum               🔒  Dependency checksums (from: go mod download)

FILE COUNTS

Go source files:     5 files
HTML files:          1 file
Documentation:       3 files
Config files:        6 files
─────────────────────────
Total:               15 files

LINES OF CODE

handlers/handler.go    ~600 lines    (HTTP handlers, SSE, metadata)
crawler/crawler.go     ~250 lines    (Concurrent crawler)
database/db.go         ~250 lines    (SQLite operations)
models/site.go         ~50  lines    (Data structures)
main.go                ~70  lines    (Server setup)
static/index.html      ~850 lines    (Full UI with CSS & JS)
─────────────────────────────────────
Total:                 ~2,070 lines

KEY DEPENDENCIES (go.mod)

github.com/go-chi/chi/v5       Router & middleware
github.com/go-chi/cors         CORS support
github.com/google/uuid         UUID generation
github.com/mattn/go-sqlite3    SQLite driver
golang.org/x/net               HTML parsing

VISUAL TREE

sitemap-api/
│
├── 📄 main.go                    # Entry point & server
├── 📦 go.mod                     # Dependencies
├── 🚀 run.sh                     # Quick start
├── 🔧 Makefile                   # Build commands
├── 🐳 Dockerfile                 # Containerization
├── ⚙️  .env.example               # Config template
├── 🚫 .gitignore                 # Git exclusions
│
├── 📚 Documentation/
│   ├── README.md                 # Full docs
│   ├── QUICKSTART.md             # Quick start
│   └── PROJECT_OVERVIEW.md       # Implementation details
│
├── 🎯 handlers/
│   └── handler.go                # All HTTP endpoints + SSE
│
├── 🕷️  crawler/
│   └── crawler.go                # Concurrent web crawler
│
├── 💾 database/
│   └── db.go                     # SQLite operations
│
├── 📋 models/
│   └── site.go                   # Data structures
│
└── 🎨 static/
    └── index.html                # Frontend UI

DATA FLOW

User Browser
    │
    ├─► POST /generate-sitemap-xml
    │       └─► handlers.GenerateSitemapXML()
    │               ├─► Generate UUID
    │               ├─► Extract metadata (IP, browser, etc.)
    │               ├─► database.CreateSite()
    │               ├─► streamManager.CreateStream(uuid)
    │               ├─► go crawler.Crawl() [goroutine]
    │               └─► Return {uuid, site_id, stream_url}
    │
    ├─► GET /stream/{uuid}
    │       └─► handlers.StreamSSE()
    │               └─► streamManager.GetStream(uuid)
    │                       └─► Forward events to browser
    │
    └─► GET /download/{uuid}
            └─► handlers.DownloadSitemap()
                    ├─► database.GetSiteByUUID()
                    ├─► database.GetPagesBySiteID()
                    └─► Generate XML sitemap

Crawler (goroutine)
    │
    ├─► Fetch URL
    ├─► Parse HTML links
    ├─► database.AddPage()
    ├─► Send SSE progress event
    └─► Recursively crawl children (with goroutines)

DATABASE SCHEMA

┌─────────────────┐
│     sites       │
├─────────────────┤
│ id              │ PK
│ uuid            │ UNIQUE (server-generated)
│ domain          │
│ url             │
│ max_depth       │
│ page_count      │
│ status          │ (processing/completed/failed)
│ ip_address      │ (client metadata)
│ user_agent      │
│ browser         │
│ browser_version │
│ os              │
│ device_type     │
│ session_id      │
│ cookies         │ (JSON)
│ referrer        │
│ created_at      │
│ completed_at    │
│ last_crawled    │
└─────────────────┘
        │
        │ 1:N
        ↓
┌─────────────────┐
│     pages       │
├─────────────────┤
│ id              │ PK
│ site_id         │ FK → sites.id
│ url             │ UNIQUE
│ depth           │
│ last_modified   │
│ priority        │ (0.0 - 1.0)
│ change_freq     │ (monthly/weekly/etc.)
└─────────────────┘

┌─────────────────┐
│    sessions     │
├─────────────────┤
│ id              │ PK
│ session_id      │ UNIQUE
│ uuid            │ FK → sites.uuid
│ ip_address      │
│ created_at      │
│ last_activity   │
└─────────────────┘

CONCURRENCY MODEL

StreamManager
    ├─► map[uuid]chan Event      (thread-safe with mutex)
    │
    └─► Per-UUID Channel
            └─► Event stream to browser

Crawler
    ├─► Main goroutine (Crawl)
    │       └─► Spawns goroutines for each URL
    │
    └─► Semaphore (5 concurrent max)
            └─► Controls parallel requests