This commit is contained in:
Kar
2026-03-11 21:18:24 +05:30
commit 26e70981ee
7 changed files with 695 additions and 0 deletions

147
README.md Normal file
View File

@@ -0,0 +1,147 @@
# Internet Domain Database
This project maintains two separate tables for internet domain data:
- `domain_root`: Top-Level Domains (TLDs) from IANA
- `domain_suffix`: Public Suffix List from Mozilla
## Database Connection
- **Host**: l2
- **Port**: 3306
- **User**: root
- **Database**: sp_spider
## Table Structure
### domain_root
Contains IANA TLD data with unique root domains and soft delete functionality.
```sql
CREATE TABLE domain_root (
id INT AUTO_INCREMENT PRIMARY KEY,
root VARCHAR(63) NOT NULL UNIQUE,
removed BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_root (root),
INDEX idx_removed (removed)
);
```
### domain_suffix
Contains Public Suffix List data with unique suffixes and soft delete functionality.
```sql
CREATE TABLE domain_suffix (
id INT AUTO_INCREMENT PRIMARY KEY,
suffix VARCHAR(255) NOT NULL UNIQUE,
removed BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_suffix (suffix),
INDEX idx_removed (removed)
);
```
## Setup
1. Install dependencies:
```bash
pip install -r requirements.txt
```
2. Create the tables:
```bash
mariadb -h l2 -u root -p0000 --ssl=FALSE sp_spider < create_separate_tables.sql
```
3. Add the removed column (for existing installations):
```bash
mariadb -h l2 -u root -p0000 --ssl=FALSE sp_spider < add_removed_column.sql
```
4. Populate the tables:
```bash
python populate_separate_tables.py 0000
```
5. Update tables with soft delete functionality:
```bash
python update.py 0000
```
## Data Sources
- **TLD Data**: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
- Contains official Top-Level Domains
- Currently: 1,436 TLDs
- **Public Suffix List**: https://publicsuffix.org/list/public_suffix_list.dat
- Contains public domain suffixes including TLDs, ccTLDs, and private domain suffixes
- Currently: 10,067 suffixes
## Usage
Query active TLDs:
```sql
SELECT * FROM domain_root WHERE removed = FALSE AND root = 'com';
```
Query removed TLDs:
```sql
SELECT * FROM domain_root WHERE removed = TRUE;
```
Query active suffixes:
```sql
SELECT * FROM domain_suffix WHERE removed = FALSE AND suffix LIKE '%.com';
```
Query removed suffixes:
```sql
SELECT * FROM domain_suffix WHERE removed = TRUE;
```
Get statistics:
```sql
SELECT
'domain_root' as table_name,
COUNT(*) as total,
SUM(CASE WHEN removed = FALSE THEN 1 ELSE 0 END) as active,
SUM(CASE WHEN removed = TRUE THEN 1 ELSE 0 END) as removed
FROM domain_root
UNION ALL
SELECT
'domain_suffix' as table_name,
COUNT(*) as total,
SUM(CASE WHEN removed = FALSE THEN 1 ELSE 0 END) as active,
SUM(CASE WHEN removed = TRUE THEN 1 ELSE 0 END) as removed
FROM domain_suffix;
```
## Project Files
- `create_separate_tables.sql` - Table creation script
- `add_removed_column.sql` - Script to add removed column for soft delete functionality
- `populate_separate_tables.py` - Initial data population script
- `update.py` - Update script with soft delete functionality
- `requirements.txt` - Python dependencies
- `README.md` - This documentation
## Soft Delete Functionality
The `update.py` script provides soft delete functionality that:
1. **Marks entries as removed**: Sets `removed = TRUE` for entries no longer found in source data
2. **Adds new entries**: Inserts new entries from source with `removed = FALSE`
3. **Restores entries**: Sets `removed = FALSE` for previously removed entries that reappear in source
4. **Provides statistics**: Shows counts of active, removed, new, and restored entries
### Update Process
The update script:
- Fetches latest data from IANA TLD and Public Suffix List
- Compares with current database entries
- Performs batch updates for efficiency
- Handles duplicate entries gracefully with `INSERT IGNORE`
- Updates `updated_at` timestamp for all changes
Run the update script periodically to keep the database synchronized with source data.