NetCrawl
by TnYtCoder

Website Crawler & Directory Discovery Tool

Fast • Stealthy • Professional

100+ Common Paths
30x Faster
2 Output Formats

Why NetCrawl?

⚡

Lightning Fast

Multi-threaded architecture crawls hundreds of URLs per minute. Configurable thread count for optimal performance.

🎨

Beautiful Output

Color-coded logs with real-time progress bars. Easy to read, easy to debug.

🛡️

Smart & Stealthy

Rate limiting, user-agent rotation, and robots.txt compliance help keep your crawls from getting blocked.

📁

Directory Discovery

Finds hidden directories, admin panels, backup files, and API endpoints automatically.

📊

Rich Reports

TXT format for humans, JSON format for automation. Complete statistics and categorization.

🗺️

Sitemap Support

Parses robots.txt and sitemap.xml to discover even more URLs.

See It In Action

  • NetCrawl banner and help menu: clean interface with ASCII art
  • Crawling in progress: real-time crawling with progress bar
  • Final report: comprehensive summary report
  • JSON export: machine-readable JSON output

Installation

Quick Install
# Clone the repository
git clone https://github.com/TnYtCoder/NetCrawl.git
cd NetCrawl

# Install dependencies
pip install requests beautifulsoup4 colorama

# Verify installation
python netcrawl.py --help

Requirements

  • Python 3.7 or higher
  • pip package manager
  • Internet connection

Usage Guide

Basic Syntax

python netcrawl.py <target_url> [options]

Quick Examples

# Basic crawl
python netcrawl.py https://example.com

# Deep crawl
python netcrawl.py https://example.com --depth 5 --threads 20

# Fast scan
python netcrawl.py https://example.com --threads 30 --delay 0.1

# Limited crawl
python netcrawl.py https://example.com --max-urls 500

Command Reference

Option         Description                  Default   Example
--depth N      Maximum crawl depth          3         --depth 5
--threads N    Concurrent threads           10        --threads 20
--max-urls N   URL limit                    10000     --max-urls 5000
--timeout N    Request timeout (seconds)    15        --timeout 10
--delay N      Delay between requests (s)   0.5       --delay 0.1
--no-color     Disable colored output       off       --no-color

Real-World Examples

WordPress Site Discovery

$ python netcrawl.py https://wordpress-site.com

[→] 10:30:15 - Crawling: https://wordpress-site.com (Depth: 0)
[+] 10:30:16 - Found: https://wordpress-site.com/wp-admin (Status: 302)
[+] 10:30:16 - Found: https://wordpress-site.com/wp-content/ (Status: 403)
[+] 10:30:17 - Found: https://wordpress-site.com/wp-json/ (Status: 200)
[+] 10:30:18 - Found: https://wordpress-site.com/xmlrpc.php (Status: 405)
[+] 10:30:19 - Found in robots.txt: /wp-admin/
[+] 10:30:20 - Found in sitemap: https://wordpress-site.com/about

Progress: 47/234 URLs ████░░░░░░░░░░

API Endpoint Discovery

$ python netcrawl.py https://api.example.com --depth 2 --delay 0.3

[→] 10:31:20 - Crawling: https://api.example.com (Depth: 0)
[+] 10:31:21 - Found: https://api.example.com/v1/users (Status: 401)
[+] 10:31:21 - Found: https://api.example.com/v1/products (Status: 200)
[+] 10:31:22 - Found: https://api.example.com/v2/auth (Status: 404)
[+] 10:31:22 - Found: https://api.example.com/docs (Status: 200)
[+] 10:31:23 - Found: https://api.example.com/graphql (Status: 400)
[+] 10:31:24 - Found: https://api.example.com/swagger (Status: 200)

Progress: 23/89 URLs ██░░░░░░░░░░░░

Security Audit Findings

$ python netcrawl.py https://testsite.com --depth 4 --threads 15

[→] 10:32:10 - Crawling: https://testsite.com (Depth: 0)
[✓] 10:32:11 - Found: https://testsite.com/.env (Status: 403)
[✓] 10:32:11 - Found: https://testsite.com/backup.zip (Status: 200)
[✓] 10:32:12 - Found: https://testsite.com/admin (Status: 200)
[✓] 10:32:12 - Found: https://testsite.com/phpinfo.php (Status: 200)
[✓] 10:32:13 - Found: https://testsite.com/.git/config (Status: 404)

⚠️  Sensitive files detected! Review with caution.

Sample Report

======================================================================
📊 CRAWL REPORT
======================================================================

🎯 Target: https://example.com
πŸ• Duration: 45.23 seconds
πŸ“Š Requests: 1,245
πŸ“¦ Data: 12.4 MB
❌ Errors: 3

📈 SUMMARY
----------------------------------------
Total URLs: 1,042
Total Directories: 156
Total Files: 886

πŸ“ FILES BY TYPE
----------------------------------------
▸ HTML: 342 files
▸ JavaScript: 89 files
▸ CSS: 45 files
▸ Images: 234 files
▸ Documents: 12 files
▸ API Endpoints: 78 files
▸ Other: 86 files

📂 TOP DIRECTORIES
----------------------------------------
  /                         ████████████████░░░░ 342
  /assets/                  ████████░░░░░░░░░░░░ 156
  /api/                     ██████░░░░░░░░░░░░░░ 134
  /images/                  █████░░░░░░░░░░░░░░░ 98
  /admin/                   ████░░░░░░░░░░░░░░░░ 67
  /backup/                  ██░░░░░░░░░░░░░░░░░░ 23

API Reference

NetCrawl Class

class NetCrawl:
    """
    Main crawler class for website discovery.
    
    Args:
        target_url (str): Website to crawl
        max_depth (int): Maximum recursion depth (default: 3)
        max_threads (int): Concurrent threads (default: 10)
        max_urls (int): URL limit (default: 10000)
        timeout (int): Request timeout (default: 15)
        delay (float): Delay between requests (default: 0.5)
    """

start_crawl()

Begins the crawling process. Handles robots.txt, sitemaps, and recursive crawling.
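The robots.txt handling can be sketched with the standard library's `urllib.robotparser`; this is an illustrative reconstruction of the compliance check, not NetCrawl's actual implementation:

```python
from urllib.robotparser import RobotFileParser

# Parse the rules once (here from literal lines instead of a live fetch),
# then consult them before every request.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
])

# Disallowed path is skipped; everything else stays crawlable.
print(robots.can_fetch("NetCrawl", "https://example.com/wp-admin/"))  # False
print(robots.can_fetch("NetCrawl", "https://example.com/about"))      # True
```

In a real crawl the parser would be pointed at `https://<target>/robots.txt` via `set_url()` and `read()` before the queue is processed.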

generate_report()

Displays final statistics and discovered resources in the console.

save_results()

Prompts user for format (TXT/JSON) and exports results to file.

_process_url(url, depth)

Internal method that processes individual URLs and extracts links.
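Link extraction of this kind can be sketched with only the standard library (NetCrawl itself depends on BeautifulSoup, but the idea is the same): walk the HTML, collect `href` attributes from `<a>` tags, and resolve them against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags (stdlib-only sketch)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the current page.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/blog/")
extractor.feed('<a href="/admin">Admin</a> <a href="post-1.html">Post</a>')
print(extractor.links)
# ['https://example.com/admin', 'https://example.com/blog/post-1.html']
```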

Output Formats

TXT Format

Human-readable format with full URL listing and categorization.

netcrawl_example.com_20240317_143022.txt
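The filename pattern above (`netcrawl_<host>_<YYYYMMDD_HHMMSS>.txt`) can be reproduced with the standard library; this is an illustrative sketch, not the tool's exact code:

```python
from datetime import datetime
from urllib.parse import urlparse

# Build a timestamped, per-host report filename.
host = urlparse("https://example.com").netloc
filename = f"netcrawl_{host}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"
print(filename)  # e.g. netcrawl_example.com_20240317_143022.txt
```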

JSON Format

Machine-parsable format for integration with other tools.

{
  "tool": "NetCrawl",
  "author": "TnYtCoder",
  "target": "https://example.com",
  "urls": [...]
}
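Because the export is plain JSON, other scripts can consume it with the standard `json` module. The keys below follow the sample above; a real export may contain additional fields:

```python
import json

# Load a NetCrawl JSON export and filter the discovered URLs.
report = json.loads("""
{
  "tool": "NetCrawl",
  "author": "TnYtCoder",
  "target": "https://example.com",
  "urls": ["https://example.com/", "https://example.com/admin"]
}
""")

admin_pages = [u for u in report["urls"] if "/admin" in u]
print(report["target"], len(report["urls"]), admin_pages)
```

In practice you would pass the saved file to `json.load(open(path))` instead of an inline string.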

Performance Guide

Thread Recommendations

Threads   Speed          Use Case
5-10      Conservative   Production sites, rate-limited
15-25     Aggressive     Testing, development
30+       Extreme        Local/authorized only
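The threading model behind these numbers can be sketched with `concurrent.futures`; `fetch()` here is a stand-in for the real HTTP request, and `max_workers` plays the role of `--threads`:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for an HTTP GET; a real worker would use requests.get(url).
    return url, 200

urls = [f"https://example.com/page{i}" for i in range(5)]

# A pool of configurable size drains the URL queue concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:  # --threads 10
    results = list(pool.map(fetch, urls))

print(results[0])  # ('https://example.com/page0', 200)
```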

Delay Recommendations

Delay      Risk Level   Use Case
0.1s       High         Local testing
0.3-0.5s   Medium       General crawling
1.0s+      Low          Stealth mode
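Combining the delay setting with user-agent rotation might look like the following sketch (the agent strings and the shortened delay are illustrative, not NetCrawl's actual values):

```python
import itertools
import time

# Rotate through a pool of User-Agent strings, one per request.
USER_AGENTS = itertools.cycle([
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
])
DELAY = 0.01  # plays the role of --delay, shortened so the example runs fast

headers_sent = []
for url in ["https://example.com/a", "https://example.com/b"]:
    headers_sent.append({"User-Agent": next(USER_AGENTS)})
    time.sleep(DELAY)  # rate limiting between requests

print([h["User-Agent"][:20] for h in headers_sent])
```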

Troubleshooting

Common Issues

  • Module not found: reinstall dependencies with pip install -r requirements.txt
  • Crawl too slow: raise the thread count and lower the delay, e.g. --threads 20 --delay 0.2
  • Timeouts: increase the request timeout, e.g. --timeout 30
  • High memory usage: lower the URL limit, e.g. --max-urls 2000
  • No colors: your terminal doesn't support ANSI codes; run with --no-color

Error Messages

  • ConnectionError: network issue; check your connection or increase --timeout
  • SSLError: certificate problem; update certifi (disabling verification is not recommended)
  • TooManyRedirects: redirect loop; reduce --depth or inspect the target
  • MemoryError: too many URLs held in memory; lower --max-urls

Legal Disclaimer

⚠️

IMPORTANT: Read Before Using

This tool performs ACTIVE crawling on target websites. Usage is permitted ONLY on:

  • Websites you own
  • Websites with explicit written permission
  • Local development environments

Unauthorized crawling may violate:

  • Computer Fraud and Abuse Act (CFAA) - United States
  • Computer Misuse Act 1990 - United Kingdom
  • Similar cybercrime laws worldwide

By using this software, you assume all liability for your actions. The author accepts no responsibility for misuse.