Kar
2026-02-20 20:19:16 +05:30
commit 995c49518f
6 changed files with 861 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,118 @@
# siliconpin_spider
A Go-based web crawler with per-domain SQLite storage, robots.txt compliance,
randomised polite delays, and Server-Sent Events (SSE) for real-time progress.

---
## Requirements
| Tool | Notes |
|------|-------|
| Go 1.21+ | |
| GCC | Required by `go-sqlite3` (CGO) |
```bash
# Ubuntu / Debian
apt install gcc
# macOS (Xcode CLI tools)
xcode-select --install
```
---
## Run
```bash
go mod tidy
go run main.go
# Server → http://localhost:8080
```
---
## On startup
- Creates **`siliconpin_spider.sqlite`** (domains registry)
- Serves `./static/` at `/`
- **Resumes** any crawls that were previously registered
---
## API
### `POST /api/add_domain`
Register a domain and immediately start crawling it.
```bash
curl -X POST http://localhost:8080/api/add_domain \
-H "Content-Type: application/json" \
-d '{"domain":"siliconpin.com","Crawl-delay":"20"}'
```
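The same request can be made from Go. The sketch below is illustrative only: the helper names (`encodeAddDomain`, `decodeAddDomain`, `addDomain`) and the struct types are hypothetical and not part of the project, the field names are copied from the request and response bodies documented in this section, and `Crawl-delay` is sent as a string because the curl example does so. Call it as `addDomain("http://localhost:8080", "siliconpin.com", "20")`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// addDomainRequest mirrors the JSON body of the curl example.
type addDomainRequest struct {
	Domain     string `json:"domain"`
	CrawlDelay string `json:"Crawl-delay,omitempty"` // omitted when empty, so the server default applies
}

// addDomainResponse mirrors the documented 201 response.
type addDomainResponse struct {
	Message  string `json:"message"`
	Domain   string `json:"domain"`
	Interval int    `json:"interval"`
	DBFile   string `json:"db_file"`
	SSE      string `json:"sse"`
}

// encodeAddDomain builds the JSON request body.
func encodeAddDomain(domain, delay string) ([]byte, error) {
	return json.Marshal(addDomainRequest{Domain: domain, CrawlDelay: delay})
}

// decodeAddDomain parses a response body into addDomainResponse.
func decodeAddDomain(data []byte) (*addDomainResponse, error) {
	var out addDomainResponse
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return &out, nil
}

// addDomain posts to baseURL + "/api/add_domain" and expects 201 Created.
func addDomain(baseURL, domain, delay string) (*addDomainResponse, error) {
	body, err := encodeAddDomain(domain, delay)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/api/add_domain", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return nil, fmt.Errorf("add_domain: unexpected status %s", resp.Status)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return decodeAddDomain(data)
}
```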
**Body fields**

| Field | Required | Default | Notes |
|-------|----------|---------|-------|
| `domain` | ✅ | — | bare domain, scheme/www stripped automatically |
| `Crawl-delay` | ❌ | `60` | seconds; actual delay is random in `[N, N*2]` |

**Response `201`**
```json
{
"message": "domain added, crawler started",
"domain": "siliconpin.com",
"interval": 20,
"db_file": "siliconpin.com.sqlite",
"sse": "/api/sse/siliconpin.com"
}
```
Creates **`siliconpin.com.sqlite`** with table:
```
urls(id, url UNIQUE, created_at, updated_at)
```
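To make the "scheme/www stripped automatically" rule and the derived `db_file` and `sse` values concrete, here is a minimal sketch. The function names are hypothetical and this is not the project's actual code. Lowercasing the host and dropping any path, query, or fragment are assumptions that the text above does not confirm.

```go
package main

import "strings"

// normalizeDomain reduces input such as "https://www.Example.com/path"
// to a bare host ("example.com").
func normalizeDomain(input string) string {
	d := strings.ToLower(strings.TrimSpace(input)) // lowercasing is an assumption
	// Drop the scheme if one is present.
	if i := strings.Index(d, "://"); i >= 0 {
		d = d[i+3:]
	}
	// Drop anything after the host: path, query, or fragment (assumption).
	if i := strings.IndexAny(d, "/?#"); i >= 0 {
		d = d[:i]
	}
	return strings.TrimPrefix(d, "www.")
}

// dbFileFor follows the "<domain>.sqlite" pattern seen in the db_file field.
func dbFileFor(domain string) string {
	return normalizeDomain(domain) + ".sqlite"
}

// ssePathFor follows the "/api/sse/<domain>" pattern seen in the sse field.
func ssePathFor(domain string) string {
	return "/api/sse/" + normalizeDomain(domain)
}
```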
---
### `GET /api/sse/{domain}`
Stream crawl events for any registered domain as **Server-Sent Events**.
```bash
curl -N http://localhost:8080/api/sse/siliconpin.com
curl -N http://localhost:8080/api/sse/cicdhosting.com
```
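Each `data:` line carries a JSON object with an `event` name and a `data` payload; lines that begin with `:` (such as `: keepalive`) are SSE comments and carry no event: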
```
data: {"event":"connected", "data":{"domain":"siliconpin.com"}}
data: {"event":"status", "data":{"msg":"fetching robots.txt"}}
data: {"event":"robots", "data":{"disallowed":["/admin/"],"robots_delay":10,"effective_delay":20}}
data: {"event":"waiting", "data":{"url":"https://siliconpin.com/about","delay_s":27,"queue":4}}
data: {"event":"fetching", "data":{"url":"https://siliconpin.com/about"}}
data: {"event":"saved", "data":{"url":"…","status":200,"content_type":"text/html"}}
data: {"event":"links_found","data":{"url":"…","found":12,"new":8,"queue_len":12}}
data: {"event":"skipped", "data":{"url":"…","reason":"robots.txt"}}
data: {"event":"error", "data":{"url":"…","msg":"…"}}
data: {"event":"done", "data":{"domain":"siliconpin.com","msg":"crawl complete"}}
: keepalive
```
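A stream in this shape can be consumed from Go with a line scanner. The following is a sketch under stated assumptions, not the project's client: `crawlEvent`, `readEvents`, and `parseEvents` are hypothetical names, and the code assumes every event fits on a single `data:` line, as in the listing above. In practice you would pass the `Body` of an `http.Get` response for `/api/sse/{domain}` to `readEvents`.

```go
package main

import (
	"bufio"
	"encoding/json"
	"io"
	"strings"
)

// crawlEvent matches the envelope in the listing above. Data is kept raw
// because its shape differs per event type.
type crawlEvent struct {
	Event string          `json:"event"`
	Data  json.RawMessage `json:"data"`
}

// readEvents scans an SSE stream line by line and calls handle for each
// decoded event. It stops at the first malformed JSON payload.
func readEvents(r io.Reader, handle func(crawlEvent)) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		// Blank lines separate events; lines starting with ":" are comments
		// (the ": keepalive" line above).
		if line == "" || strings.HasPrefix(line, ":") {
			continue
		}
		payload, ok := strings.CutPrefix(line, "data:")
		if !ok {
			continue // ignore other SSE fields such as "event:" or "id:"
		}
		var ev crawlEvent
		if err := json.Unmarshal([]byte(strings.TrimSpace(payload)), &ev); err != nil {
			return err
		}
		handle(ev)
	}
	return sc.Err()
}

// parseEvents is a convenience wrapper that collects events from a string.
func parseEvents(stream string) ([]crawlEvent, error) {
	var out []crawlEvent
	err := readEvents(strings.NewReader(stream), func(ev crawlEvent) {
		out = append(out, ev)
	})
	return out, err
}
```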
Multiple browser tabs / curl processes can listen to the **same** domain stream simultaneously.

---
## Crawl behaviour
1. Fetches `robots.txt`; respects `Disallow` paths and `Crawl-delay`
   - If `robots.txt` specifies a higher delay than you set, the higher value wins
2. Maintains a BFS queue of same-host HTML links only
3. Random delay between requests: **`interval` to `interval × 2`** seconds
4. Skips already-visited URLs (checked against the domain's SQLite database)
5. On restart, existing domains resume from where they left off: unvisited URLs are re-queued starting from the start URL, and already-saved URLs are skipped
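
The rules above can be sketched as a few small helpers. This is an illustration, not the crawler's actual code: all function names are hypothetical, treating `www.` as the same host is an assumption, and `allowed` uses plain prefix matching, which is a simplification of full `robots.txt` semantics (no wildcards or `Allow` rules).

```go
package main

import (
	"math/rand"
	"net/url"
	"strings"
)

// effectiveDelay returns the larger of the user-supplied delay and the
// robots.txt Crawl-delay, both in seconds.
func effectiveDelay(userDelay, robotsDelay int) int {
	if robotsDelay > userDelay {
		return robotsDelay
	}
	return userDelay
}

// politeDelaySeconds picks a random wait in [interval, interval*2], inclusive.
func politeDelaySeconds(interval int) int {
	if interval <= 0 {
		return 0
	}
	return interval + rand.Intn(interval+1)
}

// sameHost reports whether link points at the crawled domain.
// Treating "www." as the same host is an assumption.
func sameHost(domain, link string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	return strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.") == domain
}

// allowed applies simple prefix matching against Disallow paths.
func allowed(path string, disallowed []string) bool {
	for _, rule := range disallowed {
		if rule != "" && strings.HasPrefix(path, rule) {
			return false
		}
	}
	return true
}
```

With the values from the sample `robots` event (user delay 20, `robots_delay` 10), `effectiveDelay` yields 20, so each wait falls between 20 and 40 seconds; a caller would sleep for `time.Duration(politeDelaySeconds(20)) * time.Second` before the next fetch.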