🚀 Universal Business Directory Scraper Framework

A scalable, modular Python framework for scraping business directories worldwide. Built with Playwright for robust automation and designed for easy expansion to new sites.

📁 Project Structure

📦 Business Directory Scraper
├── 🔧 Core Framework Files
│   ├── scraper_framework.py           # Main framework with BaseScraper & configs
│   ├── yellowpages_family_scrapers.py # Template for YellowPages-style sites
│   ├── google_maps_scraper.py         # Google Maps specific scraper
│   └── scrape_final.py               # Original JustDial scraper (standalone)
│
├── 🧪 Demo & Testing Files
│   ├── final_updated_demo.py         # Complete framework demonstration
│   ├── test_updated_selectors.py     # Selector validation & testing
│   └── template_demo.py              # Template pattern demonstration
│
├── 📊 Data & Utilities
│   ├── score_leads.py                # Lead scoring system
│   ├── README.md                     # This documentation
│   └── *.csv                         # Output data files
│
└── 🖼️ Debug Files
    └── *.png                         # Debug screenshots

🎯 Quick Start

1. Install Dependencies

pip install playwright pandas
playwright install chromium

2. Run the Framework Demo

python final_updated_demo.py

3. Test Individual Sites

# Test YellowPages only
python test_updated_selectors.py

# Test original JustDial scraper
python scrape_final.py

🏗️ Framework Architecture

Core Components

1. BaseScraper (Abstract Base Class)

class BaseScraper(ABC):
    def build_search_url(self, query: str, location: str) -> str
    def extract_business_cards(self, page) -> List
    def extract_business_data(self, card, detail_page=None) -> Dict
    def scrape(self, query: str, location: str) -> List[Dict]

2. SiteConfig (Configuration Dataclass)

@dataclass
class SiteConfig:
    name: str
    base_url: str
    search_url_template: str
    selectors: Dict[str, str]
    pagination_type: str
    requires_detail_page: bool = True
    rate_limit: float = 0.5

3. ScraperFactory (Site Creation)

scraper = ScraperFactory.create_scraper('yellowpages_us')
results = scraper.scrape("restaurants", "new york")

🌍 Currently Supported Sites

✅ Fully Working

Site	Country	Status	Results
YellowPages.com	🇺🇸 USA	✅ Active	30+ businesses
Google Maps	🌍 Global	✅ Active	6+ businesses
JustDial.com	🇮🇳 India	✅ Active	50+ businesses

🚧 Template Ready

Site	Country	Status	Notes
YellowPages.ca	🇨🇦 Canada	🔧 Configured	Needs testing
YellowPages.com.au	🇦🇺 Australia	🔧 Configured	Needs testing
Yell.com	🇬🇧 UK	🔧 Configured	Needs testing
GelbeSeiten.de	🇩🇪 Germany	🔧 Configured	Needs testing
PagesJaunes.fr	🇫🇷 France	🔧 Configured	Needs testing

📝 How to Add New Sites

Method 1: Using YellowPages Template (Recommended)

If the new site has a similar structure to YellowPages:

Step 1: Add Site Configuration

# In scraper_framework.py, add to SITE_CONFIGS:
'new_site_name': SiteConfig(
    name="NewSite.com (Country)",
    base_url="https://www.newsite.com",
    search_url_template="https://www.newsite.com/search?q={query}&loc={location}",
    selectors={
        **YELLOWPAGES_TEMPLATE['selectors'],  # Inherit from template
        'business_cards': '.listing-item',     # Override if different
        'name': '.business-title a',
        'phone': '.contact-phone'
    },
    pagination_type='pagination',
    requires_detail_page=False,
    rate_limit=0.4
)

Step 2: Create Scraper Class (Optional)

# In yellowpages_family_scrapers.py:
class NewSiteScraper(BaseYellowPagesScraper):
    """Scraper for NewSite.com"""

    def build_search_url(self, query: str, location: str) -> str:
        """Custom URL building if needed"""
        return self.config.search_url_template.format(
            query=query.replace(' ', '%20'),
            location=location.replace(' ', '%20')
        )

Step 3: Test Your New Site

# Test script
from scraper_framework import SITE_CONFIGS
from yellowpages_family_scrapers import NewSiteScraper

config = SITE_CONFIGS['new_site_name']
scraper = NewSiteScraper(config)
results = scraper.scrape("restaurants", "london")
print(f"Found {len(results)} businesses")

Method 2: Custom Site Scraper

For unique sites that don't follow the YellowPages pattern:

Step 1: Create New Scraper File

# new_site_scraper.py
from scraper_framework import BaseScraper

class NewSiteScraper(BaseScraper):
    def build_search_url(self, query: str, location: str) -> str:
        # Custom URL building logic
        return f"https://newsite.com/search/{query}/{location}"

    def extract_business_cards(self, page) -> list:
        # Custom card extraction
        page.wait_for_selector('.business-item')
        return page.query_selector_all('.business-item')

    def extract_business_data(self, card, detail_page=None) -> dict:
        # Custom data extraction
        name = self.safe_extract(card.query_selector('.title'))
        phone = self.safe_extract(card.query_selector('.phone'))

        return {
            'name': name,
            'phone': phone,
            'address': '',
            'website': '',
            'link': ''
        }

Step 2: Add to Factory

# In scraper_framework.py, add to ScraperFactory:
elif site_name == 'new_site':
    from new_site_scraper import NewSiteScraper
    return NewSiteScraper(SITE_CONFIGS[site_name])

🔧 Selector Discovery & Debugging

Finding the Right Selectors

Step 1: Inspect the Target Site

Open the site in Chrome/Edge
Search for a business type
Right-click on business cards → "Inspect Element"
Find CSS selectors for:
- Business card container
- Business name
- Phone number
- Address
- Website link

Step 2: Test Selectors

# Use the debug mode in test_updated_selectors.py
python test_updated_selectors.py
# Choose option 4 for "Debug selectors"

Step 3: Update Selectors

# In scraper_framework.py, update the selectors:
'selectors': {
    'business_cards': '.search-result',     # Main container
    'name': '.business-name a',             # Business name link
    'phone': '.phone-number',               # Phone number
    'address': '.business-address',         # Full address
    'website': '.website-link',             # Website URL
    'link': '.business-name a'              # Detail page link
}

Common Selector Patterns

Element	Common Selectors
Business Cards	`.listing`, `.search-result`, `.business-item`, `.company-item`
Business Name	`.business-name`, `.company-name`, `.listing-title`, `h2 a`, `h3 a`
Phone	`.phone`, `.telephone`, `.contact-phone`, `[href^="tel:"]`
Address	`.address`, `.location`, `.business-address`, `.adr`
Website	`.website`, `.url`, `[href^="http"]:not([href*="yellowpages"])`

🚀 Running Multi-Site Scraping

Basic Multi-Site Usage

from scraper_framework import multi_site_scrape

# Scrape multiple sites at once
results = multi_site_scrape(
    query="italian restaurants",
    location="chicago",
    sites=['yellowpages_us', 'google_maps']
)

print(f"Total results: {len(results)}")

Advanced Multi-Site with Custom Sites

# Add your new site to the available sites
results = multi_site_scrape(
    query="coffee shops",
    location="seattle",
    sites=['yellowpages_us', 'google_maps', 'new_site_name']
)

📊 Data Output Format

All scrapers return a standardized dictionary format:

{
    'name': 'Business Name',
    'phone': '(555) 123-4567',
    'address': '123 Main St, City, State 12345',
    'full_address': '123 Main St, City, State 12345',
    'website': 'https://business-website.com',
    'link': 'https://directory-site.com/business-page',
    'rating': '4.5',
    'rating_count': '(123)',
    'query': 'restaurants',
    'location': 'new york',
    'source': 'YellowPages.com (USA)'
}

🧪 Testing New Sites

1. Selector Validation

# Test if selectors find elements
python test_updated_selectors.py

2. Data Quality Check

# Check data completeness
results = scraper.scrape("test query", "test location")
phone_coverage = sum(1 for r in results if r.get('phone')) / len(results)
print(f"Phone coverage: {phone_coverage:.1%}")

3. Debug Mode

# Enable debug mode for screenshots and detailed logging
scraper = NewSiteScraper(config, debug=True)

🛠️ Troubleshooting

Common Issues & Solutions

❌ "No business cards found"

# Solution: Test different selectors
selectors_to_try = ['.listing', '.result', '.business', '.company']
for selector in selectors_to_try:
    elements = page.query_selector_all(selector)
    print(f"{selector}: {len(elements)} found")

❌ "Found cards but no data extracted"

# Solution: Check if name selector is working
name_elem = card.query_selector('.business-name')
if not name_elem:
    # Try different name selectors
    name_elem = card.query_selector('h2 a') or card.query_selector('h3 a')

❌ "Phone/Website not found"

# Solution: Some sites require clicking to reveal contact info
# Set requires_detail_page=True in SiteConfig

Debug Tips

Use browser developer tools to find correct selectors
Enable debug mode for screenshots and logging
Test with simple queries first (e.g., "restaurants")
Check rate limiting if getting blocked
Verify user agent is set properly

📈 Performance Optimization

Rate Limiting

# Adjust rate limiting per site
SiteConfig(
    rate_limit=0.5,  # 500ms between requests
    # More aggressive sites may need 1.0-2.0 seconds
)

Pagination Handling

# Set pagination type based on site behavior
pagination_type='infinite_scroll'  # For sites that load on scroll
pagination_type='pagination'       # For sites with page numbers
pagination_type='load_more'        # For sites with "Load More" buttons

🎯 Lead Scoring Integration

The framework includes a lead scoring system:

# Score leads after scraping
from score_leads import score_business_leads

scored_results = score_business_leads(results)

📞 Support & Contributing

Adding New Sites

Follow the template pattern above
Test thoroughly with different queries
Update this README with new site info
Submit with sample results

Reporting Issues

When reporting issues, include:

Site name and URL
Query and location used
Error messages
Debug screenshots (if available)

🏆 Success Metrics

Current Framework Performance:

✅ YellowPages: 30+ businesses, 83% phone coverage
✅ Google Maps: 6+ businesses, 100% rating coverage
✅ JustDial: 50+ businesses, dynamic selector system
✅ Template System: 5+ international sites ready
✅ Data Quality: 83% phone, 64% website coverage

Ready for production use! 🚀

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
__pycache__		__pycache__
README.md		README.md
example_add_new_site.py		example_add_new_site.py
final_updated_demo.py		final_updated_demo.py
google_maps_scraper.py		google_maps_scraper.py
multi_site_results_interior designers_new york.csv		multi_site_results_interior designers_new york.csv
score_leads.py		score_leads.py
scrape_final.py		scrape_final.py
scraper_framework.py		scraper_framework.py
template_demo.py		template_demo.py
test_updated_selectors.py		test_updated_selectors.py
usage_guide.py		usage_guide.py
yellowpages_family_scrapers.py		yellowpages_family_scrapers.py

himanchal08/Lead

Folders and files

Latest commit

History

Repository files navigation