148 lines
4.0 KiB
Markdown
148 lines
4.0 KiB
Markdown
# Internet Domain Database
|
|
|
|
This project maintains two separate tables for internet domain data:
|
|
- `domain_root`: Top-Level Domains (TLDs) from IANA
|
|
- `domain_suffix`: Public Suffix List from Mozilla
|
|
|
|
## Database Connection
|
|
|
|
- **Host**: l2
|
|
- **Port**: 3306
|
|
- **User**: root
|
|
- **Database**: sp_spider
|
|
|
|
## Table Structure
|
|
|
|
### domain_root
|
|
Contains IANA TLD data with unique root domains and soft delete functionality.
|
|
```sql
|
|
CREATE TABLE domain_root (
|
|
id INT AUTO_INCREMENT PRIMARY KEY,
|
|
root VARCHAR(63) NOT NULL UNIQUE,
|
|
removed BOOLEAN DEFAULT FALSE,
|
|
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
|
|
INDEX idx_root (root),
|
|
INDEX idx_removed (removed)
|
|
);
|
|
```
|
|
|
|
### domain_suffix
|
|
Contains Public Suffix List data with unique suffixes and soft delete functionality.
|
|
```sql
|
|
CREATE TABLE domain_suffix (
|
|
id INT AUTO_INCREMENT PRIMARY KEY,
|
|
suffix VARCHAR(255) NOT NULL UNIQUE,
|
|
removed BOOLEAN DEFAULT FALSE,
|
|
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
|
|
INDEX idx_suffix (suffix),
|
|
INDEX idx_removed (removed)
|
|
);
|
|
```
|
|
|
|
## Setup
|
|
|
|
1. Install dependencies:
|
|
```bash
|
|
pip install -r requirements.txt
|
|
```
|
|
|
|
2. Create the tables:
|
|
```bash
|
|
mariadb -h l2 -u root -p0000 --ssl=FALSE sp_spider < create_separate_tables.sql
|
|
```
|
|
|
|
3. Add the removed column (for existing installations):
|
|
```bash
|
|
mariadb -h l2 -u root -p0000 --ssl=FALSE sp_spider < add_removed_column.sql
|
|
```
|
|
|
|
4. Populate the tables:
|
|
```bash
|
|
python populate_separate_tables.py 0000
|
|
```
|
|
|
|
5. Update tables with soft delete functionality:
|
|
```bash
|
|
python update.py 0000
|
|
```
|
|
|
|
## Data Sources
|
|
|
|
- **TLD Data**: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
|
|
- Contains official Top-Level Domains
|
|
- Currently: 1,436 TLDs
|
|
|
|
- **Public Suffix List**: https://publicsuffix.org/list/public_suffix_list.dat
|
|
- Contains public domain suffixes including TLDs, ccTLDs, and private domain suffixes
|
|
- Currently: 10,067 suffixes
|
|
|
|
## Usage
|
|
|
|
Query active TLDs:
|
|
```sql
|
|
SELECT * FROM domain_root WHERE removed = FALSE AND root = 'com';
|
|
```
|
|
|
|
Query removed TLDs:
|
|
```sql
|
|
SELECT * FROM domain_root WHERE removed = TRUE;
|
|
```
|
|
|
|
Query active suffixes:
|
|
```sql
|
|
SELECT * FROM domain_suffix WHERE removed = FALSE AND suffix LIKE '%.com';
|
|
```
|
|
|
|
Query removed suffixes:
|
|
```sql
|
|
SELECT * FROM domain_suffix WHERE removed = TRUE;
|
|
```
|
|
|
|
Get statistics:
|
|
```sql
|
|
SELECT
|
|
'domain_root' as table_name,
|
|
COUNT(*) as total,
|
|
SUM(CASE WHEN removed = FALSE THEN 1 ELSE 0 END) as active,
|
|
SUM(CASE WHEN removed = TRUE THEN 1 ELSE 0 END) as removed
|
|
FROM domain_root
|
|
UNION ALL
|
|
SELECT
|
|
'domain_suffix' as table_name,
|
|
COUNT(*) as total,
|
|
SUM(CASE WHEN removed = FALSE THEN 1 ELSE 0 END) as active,
|
|
SUM(CASE WHEN removed = TRUE THEN 1 ELSE 0 END) as removed
|
|
FROM domain_suffix;
|
|
```
|
|
|
|
## Project Files
|
|
|
|
- `create_separate_tables.sql` - Table creation script
|
|
- `add_removed_column.sql` - Script to add removed column for soft delete functionality
|
|
- `populate_separate_tables.py` - Initial data population script
|
|
- `update.py` - Update script with soft delete functionality
|
|
- `requirements.txt` - Python dependencies
|
|
- `README.md` - This documentation
|
|
|
|
## Soft Delete Functionality
|
|
|
|
The `update.py` script provides soft delete functionality that:
|
|
|
|
1. **Marks entries as removed**: Sets `removed = TRUE` for entries no longer found in source data
|
|
2. **Adds new entries**: Inserts new entries from source with `removed = FALSE`
|
|
3. **Restores entries**: Sets `removed = FALSE` for previously removed entries that reappear in source
|
|
4. **Provides statistics**: Shows counts of active, removed, new, and restored entries
|
|
|
|
### Update Process
|
|
|
|
The update script:
|
|
- Fetches latest data from IANA TLD and Public Suffix List
|
|
- Compares with current database entries
|
|
- Performs batch updates for efficiency
|
|
- Handles duplicate entries gracefully with `INSERT IGNORE`
|
|
- Updates `updated_at` timestamp for all changes
|
|
|
|
Run the update script periodically to keep the database synchronized with source data.
|