Browser listening on: ws://127.0.0.1:59948 #35

Open
CoderStylus opened this issue Jan 20, 2025 · 7 comments

@CoderStylus
Contributor

I've had this issue a few times during development of an app utilising PyPartPicker, but it's usually resolved itself as I've fixed other problems. Now, though, I'm at a bit of a stalemate.

I've just implemented proxy rotation and asyncio functions as the documentation suggests, and it seems to be working fine. I moved the proxy debugging into the main function to test the proxies, but had to move it back into the response retriever since, for some reason, it just fell over with the debugging in the main function.

Anyway, it goes through about 10 parts and then just hangs on "2025-01-20 19:15:56,761 - INFO - Browser listening on: ws://127.0.0.1:59948/devtools/browser/f358eb04-9c23-4952-b9c6-a92e91b1fe9b". I know it's a localhost loopback IP, but I just can't see why it hangs. It should retry if the connection is unsuccessful anyway!

I've scoured pretty much all of StackOverflow and the requests-html documentation but can't seem to find anything on it at all. At first I thought it had something to do with rate limiting, but now I'm having second thoughts. I left it sitting for about 45 minutes and eventually it closed the connection.

Then I thought that perhaps it was something to do with the chromium processes not terminating after completing the scrape, so I added some logic that closes them manually. I don't think it's the custom response retriever either.
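
(Roughly the kind of cleanup I mean; a simplified sketch using psutil rather than my exact code:)

import psutil

def kill_stray_chromium():
    # Terminate leftover chromium processes; matching by process name
    # is crude and may also catch unrelated browsers
    stray = [
        p for p in psutil.process_iter(["name"])
        if "chromium" in (p.info["name"] or "").lower()
    ]
    for p in stray:
        try:
            p.terminate()
        except psutil.NoSuchProcess:
            pass  # already exited between listing and terminating
    # Give them a few seconds to exit cleanly, then force-kill stragglers
    gone, alive = psutil.wait_procs(stray, timeout=3)
    for p in alive:
        p.kill()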

Anyway, I don't know if this is a library-wide thing or just an issue with my program, but since nothing else seems to have any answers I thought I'd give it a shot here. Let me know if anyone has any ideas :)

Program below:

import pypartpicker
import time
import random
from supabase import create_client, Client
import logging
from requests.exceptions import HTTPError, RequestException
import requests_html
import asyncio
from itertools import cycle

proxy_list = [
    "socks5://**CENSORED**:**CENSORED**@198.23.239.134:6540",
    "socks5://**CENSORED**:**CENSORED**@207.244.217.165:6712",
    "socks5://**CENSORED**:**CENSORED**@107.172.163.27:6543",
    "socks5://**CENSORED**:**CENSORED**@64.137.42.112:5157",
    "socks5://**CENSORED**:**CENSORED**@173.211.0.148:6641",
    "socks5://**CENSORED**:**CENSORED**@161.123.152.115:6360",
    "socks5://**CENSORED**:**CENSORED**[email protected]:6754",
    "socks5://**CENSORED**:**CENSORED**@154.36.110.199:6853",
    "socks5://**CENSORED**:**CENSORED**@173.0.9.70:5653",
    "socks5://**CENSORED**:**CENSORED**@173.0.9.209:5792",
]
proxy_cycle = cycle(proxy_list)
    
session = requests_html.HTMLSession()
# session.browser.args = ["--no-sandbox", "--disable-setuid-sandbox", "--disable-dev-shm-usage"]
session.browser.timeout = 5


# Setup logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Initialize Supabase client
url = "https://*CENSORED*.supabase.co" 
key = "*CENSORED*"  
try:
    supabase: Client = create_client(url, key)
    logging.info("Connected to Supabase successfully.")
except Exception as e:
    logging.error(f"Failed to connect to Supabase: {e}")
    exit(1)

# Supported Product Types
CPU_PRODUCTS = ["Intel Core i3", "Intel Core i5", "Intel Core i7", "Intel Core i9", "Intel Xeon", "AMD Ryzen 3", "AMD Ryzen 5", "AMD Ryzen 7", "AMD Ryzen 9", "AMD Threadripper", "AMD Athlon 3000G", "AMD Athlon 200GE", "Intel Pentium G6400", "Intel Pentium 5", "Intel Pentium 6", "Intel Pentium 7", "Intel Celeron G4", "Intel Celeron G5", "Intel Celeron G6"]    

def response_retriever(url):
    retries = 5
    backoff_time = 2  # Initial backoff in seconds
    proxy_api_url = "https://api.datascrape.tech/latest/ip"

    for _ in range(retries):
        proxy = next(proxy_cycle)  # Rotate proxy
        try:
            logging.info(f"Attempting to use proxy {proxy} for {url}")

            # Verify outgoing IP using the external service
            proxy_check_response = session.get(
                proxy_api_url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
            if proxy_check_response.status_code != 200:
                logging.warning(f"Failed to verify proxy {proxy}: Status {proxy_check_response.status_code}")
                continue
            
            try:
                proxy_data = proxy_check_response.json()
                outgoing_ip = proxy_data.get("ip")
                if not outgoing_ip:
                    logging.warning(f"Proxy {proxy} did not return a valid IP.")
                    continue
            except Exception as e:
                logging.error(f"Error parsing proxy verification response: {e}")
                continue
            
            # Make the actual request through the rotated proxy
            response = session.get(url, proxies={"http": proxy, "https": proxy})
            if response.status_code == 200:
                logging.info(f"Successfully retrieved response from {url} using proxy {outgoing_ip}")
                return response
            elif response.status_code == 429:
                logging.warning(f"Rate limited (429) for {url}. Retrying in {backoff_time}s...")
                time.sleep(backoff_time)
                backoff_time *= random.uniform(1.5, 2.0)  # Exponential backoff
            else:
                logging.warning(f"Unexpected status code {response.status_code} for {url}")
                return None
        except RequestException as e:
            logging.error(f"Proxy {proxy} failed for {url}: {e}")
            time.sleep(backoff_time)
            backoff_time *= random.uniform(1.5, 2.0)  # Exponential backoff

    logging.error(f"Failed to retrieve a valid response after {retries} attempts for {url}")
    return None


# Initialize PyPartPicker Client with the custom response retriever
pcpp = pypartpicker.Client(response_retriever=response_retriever)

async def fetch_top_parts():
    async with pypartpicker.AsyncClient() as pcpp:
        # Counters for debugging
        error_count = 0
        warning_count = 0
        skipped_parts = []



        # Iterate through each product type and fetch all results
        for product in CPU_PRODUCTS:
            page = 1
            while True:
                try:
                    logging.info(f"Fetching {product} parts on page {page}...")
                    result = await pcpp.get_part_search(product, page=page, region="au")
                    if result and result.parts:
                        for part_summary in result.parts:
                            if part_summary and part_summary.url:
                                while True:
                                    try:
                                        proxy = next(proxy_cycle)
                                        logging.info(f"Using proxy {proxy} for {part_summary.url}")
                                        part = await pcpp.get_part(part_summary.url)
                                        if part:
                                            # Validate and prepare data for insertion
                                            in_stock_vendors = [
                                                vendor for vendor in part.vendors if vendor.in_stock
                                            ] if part.vendors else []
                                            in_stock_vendors.sort(key=lambda v: v.price.total if v.price else float('inf'))

                                            cheapest_vendor = in_stock_vendors[0] if in_stock_vendors else None
                                            data = {
                                                "part_type": "processor",
                                                "name": part.name if part.name else None,
                                                "total_price": cheapest_vendor.price.total if cheapest_vendor and cheapest_vendor.price else None,
                                                "base_price": cheapest_vendor.price.base if cheapest_vendor and cheapest_vendor.price else None,
                                                "discounts": cheapest_vendor.price.discounts if cheapest_vendor and cheapest_vendor.price else None,
                                                "shipping_price": cheapest_vendor.price.shipping if cheapest_vendor and cheapest_vendor.price else None,
                                                "tax_price": cheapest_vendor.price.tax if cheapest_vendor and cheapest_vendor.price else None,
                                                "vendor_store": getattr(cheapest_vendor, "name", "N/A") if cheapest_vendor else None,
                                                "store_product_url": getattr(cheapest_vendor, "buy_url", "N/A") if cheapest_vendor else None,
                                                "vendor_logo_url": getattr(cheapest_vendor, "logo_url", "N/A") if cheapest_vendor else None,
                                                "in_stock": bool(in_stock_vendors) and cheapest_vendor is not None,
                                                "product_url": getattr(part, "url", "N/A"),
                                                "image_urls": part.image_urls if part.image_urls else None,
                                                "manufacturer": part.specs.get("Manufacturer", None) if part.specs else None,
                                                "part_number": part.specs.get("Part #", None) if part.specs else None,
                                                "series": part.specs.get("Series", None) if part.specs else None,
                                                "microarchitecture": part.specs.get("Microarchitecture", None) if part.specs else None,
                                                "core_family": part.specs.get("Core Family", None) if part.specs else None,
                                                "socket": part.specs.get("Socket", None) if part.specs else None,
                                                "core_count": part.specs.get("Core Count", None) if part.specs else None,
                                                "thread_count": part.specs.get("Thread Count", None) if part.specs else None,
                                                "performance_core_clock": part.specs.get("Performance Core Clock", None) if part.specs else None,
                                                "performance_core_boost_clock": part.specs.get("Performance Core Boost Clock", None) if part.specs else None,
                                                "l2_cache": part.specs.get("L2 Cache", None) if part.specs else None,
                                                "l3_cache": part.specs.get("L3 Cache", None) if part.specs else None,
                                                "tdp": part.specs.get("TDP", None) if part.specs else None,
                                                "integrated_graphics": part.specs.get("Integrated Graphics", None) if part.specs else None,
                                                "maximum_supported_memory": part.specs.get("Maximum Supported Memory", None) if part.specs else None,
                                                "ecc_support": part.specs.get("ECC Support", None) if part.specs else None,
                                                "includes_cooler": part.specs.get("Includes Cooler", None) if part.specs else None,
                                                "packaging": part.specs.get("Packaging", None) if part.specs else None,
                                                "lithography": part.specs.get("Lithography", None) if part.specs else None,
                                                "simultaneous_multithreading": part.specs.get("Simultaneous Multithreading", None) if part.specs else None,
                                                "rating_average": getattr(part.rating, "average", None) if part.rating else None,
                                            }
                                            try:
                                                response = supabase.table("cpus").insert([data]).execute()
                                                logging.info(f"Inserted {data['name']} into database.")
                                            except Exception as e:
                                                logging.error(f"Failed to insert {data['name']} into database: {e}")
                                        else:
                                            warning_count += 1
                                            logging.warning("Part details could not be fetched.")
                                        break  # Exit the retry loop if successful
                                    except AttributeError as e:
                                        if "'NoneType' object has no attribute 'text'" in str(e):
                                            input("Verify link and press Enter to continue...")
                                        else:
                                            raise e
                                    except Exception as e:
                                        error_count += 1
                                        logging.error(f"Error fetching part details: {e}")
                                        break
                    else:
                        logging.info(f"No more results for {product} on page {page}.")
                        break  # Exit loop if no more results
                    page += 1
                    await asyncio.sleep(4)  # Prevent hitting rate limits
                except HTTPError as e:
                    error_count += 1
                    logging.error(f"HTTP error occurred: {e}")
                    await asyncio.sleep(10)  # Wait before retrying
                except Exception as e:
                    error_count += 1
                    logging.error(f"Unexpected error: {e}")
                    await asyncio.sleep(10)  # Short wait before retrying
                    continue

        # Final Debug Summary
        logging.info("\nDebug Summary:")
        logging.info(f"Total Errors: {error_count}")
        logging.info(f"Total Warnings: {warning_count}")
        logging.info(f"Total Skipped Parts: {len(skipped_parts)}")
        if skipped_parts:
            for name, part_number in skipped_parts:
                logging.info(f"Skipped Part: {name} | Part Number: {part_number}")



# Run the main function
asyncio.run(fetch_top_parts())
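
(For reference, one way to guarantee the shared requests-html session, and any pyppeteer browser it launched, gets shut down at exit; this is a sketch, not part of the program above. HTMLSession.close() closes the browser if one was created:)

import atexit

# Close the module-level HTMLSession at interpreter exit; this also shuts
# down the pyppeteer browser if one was launched
atexit.register(session.close)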

@thefakequake
Owner

Could you try running the code again with no_js=True passed to the Client constructor?
I think this error is due to pyppeteer; I'm considering removing it from the library by default, as it seems to cause more problems than it fixes.
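
i.e. something like:

pcpp = pypartpicker.Client(response_retriever=response_retriever, no_js=True)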

@CoderStylus
Contributor Author

CoderStylus commented Jan 21, 2025

Nope, even with pcpp = pypartpicker.Client(response_retriever=response_retriever, no_js=True) I'm still getting the same error. This is the only info I can find on it: https://stackoverflow.com/questions/47392423/python-selenium-devtools-listening-on-ws-127-0-0-1?rq=2

@thefakequake
Owner

Just to check, are you on the latest version?

@CoderStylus
Contributor Author

Yep, v2.0.5

@thefakequake
Owner

I'm not fully sure whether it's requests-html in the library or requests-html in your code that's causing the problem. If you disable proxy rotation entirely and use the default response_retriever, do you still get the same error?
This may be a bit difficult to test; thank you for bearing with me here.
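
i.e. roughly:

pcpp = pypartpicker.Client(no_js=True)  # default response_retriever, no proxies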

@CoderStylus
Contributor Author

Nope. Exact same thing with the default response_retriever and no proxies.

However, while it's running I can see multiple (sometimes ten to fifteen) chromium processes that just sit there. It seems like they're not being closed properly, even though I added code to close them manually when they weren't closing on their own. Other than that, no ideas.

[Screenshot: multiple chromium processes still running]

@CoderStylus
Contributor Author

The program without proxies, if you are curious:

import pypartpicker
import time
import random
from supabase import create_client, Client
import logging
from requests.exceptions import HTTPError
import asyncio

# Setup logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Initialize Supabase client
url = "CENSORED"  
key = "CENSORED"
try:
    supabase: Client = create_client(url, key)
    logging.info("Connected to Supabase successfully.")
except Exception as e:
    logging.error(f"Failed to connect to Supabase: {e}")
    exit(1)

# Supported Product Types
CPU_PRODUCTS = ["Intel Core i3", "Intel Core i5", "Intel Core i7", "Intel Core i9", "Intel Xeon", "AMD Ryzen 3", "AMD Ryzen 5", "AMD Ryzen 7", "AMD Ryzen 9", "AMD Threadripper", "AMD Athlon 3000G", "AMD Athlon 200GE", "Intel Pentium G6400", "Intel Pentium 5", "Intel Pentium 6", "Intel Pentium 7", "Intel Celeron G4", "Intel Celeron G5", "Intel Celeron G6"]

# Initialize PyPartPicker Client
pcpp = pypartpicker.Client(no_js=True)

async def fetch_top_parts():
    async with pypartpicker.AsyncClient() as pcpp:
        # Counters for debugging
        error_count = 0
        warning_count = 0
        skipped_parts = []

        # Iterate through each product type and fetch all results
        for product in CPU_PRODUCTS:
            page = 1
            while True:
                try:
                    logging.info(f"Fetching {product} parts on page {page}...")
                    result = await pcpp.get_part_search(product, page=page, region="au")
                    if result and result.parts:
                        for part_summary in result.parts:
                            if part_summary and part_summary.url:
                                while True:
                                    try:
                                        logging.info(f"Fetching details for {part_summary.url}")
                                        part = await pcpp.get_part(part_summary.url)
                                        if part:
                                            # Validate and prepare data for insertion
                                            in_stock_vendors = [
                                                vendor for vendor in part.vendors if vendor.in_stock
                                            ] if part.vendors else []
                                            in_stock_vendors.sort(key=lambda v: v.price.total if v.price else float('inf'))

                                            cheapest_vendor = in_stock_vendors[0] if in_stock_vendors else None
                                            data = {
                                                "part_type": "processor",
                                                "name": part.name if part.name else None,
                                                "total_price": cheapest_vendor.price.total if cheapest_vendor and cheapest_vendor.price else None,
                                                "base_price": cheapest_vendor.price.base if cheapest_vendor and cheapest_vendor.price else None,
                                                "discounts": cheapest_vendor.price.discounts if cheapest_vendor and cheapest_vendor.price else None,
                                                "shipping_price": cheapest_vendor.price.shipping if cheapest_vendor and cheapest_vendor.price else None,
                                                "tax_price": cheapest_vendor.price.tax if cheapest_vendor and cheapest_vendor.price else None,
                                                "vendor_store": getattr(cheapest_vendor, "name", "N/A") if cheapest_vendor else None,
                                                "store_product_url": getattr(cheapest_vendor, "buy_url", "N/A") if cheapest_vendor else None,
                                                "vendor_logo_url": getattr(cheapest_vendor, "logo_url", "N/A") if cheapest_vendor else None,
                                                "in_stock": bool(in_stock_vendors) and cheapest_vendor is not None,
                                                "product_url": getattr(part, "url", "N/A"),
                                                "image_urls": part.image_urls if part.image_urls else None,
                                                "manufacturer": part.specs.get("Manufacturer", None) if part.specs else None,
                                                "part_number": part.specs.get("Part #", None) if part.specs else None,
                                                "series": part.specs.get("Series", None) if part.specs else None,
                                                "microarchitecture": part.specs.get("Microarchitecture", None) if part.specs else None,
                                                "core_family": part.specs.get("Core Family", None) if part.specs else None,
                                                "socket": part.specs.get("Socket", None) if part.specs else None,
                                                "core_count": part.specs.get("Core Count", None) if part.specs else None,
                                                "thread_count": part.specs.get("Thread Count", None) if part.specs else None,
                                                "performance_core_clock": part.specs.get("Performance Core Clock", None) if part.specs else None,
                                                "performance_core_boost_clock": part.specs.get("Performance Core Boost Clock", None) if part.specs else None,
                                                "l2_cache": part.specs.get("L2 Cache", None) if part.specs else None,
                                                "l3_cache": part.specs.get("L3 Cache", None) if part.specs else None,
                                                "tdp": part.specs.get("TDP", None) if part.specs else None,
                                                "integrated_graphics": part.specs.get("Integrated Graphics", None) if part.specs else None,
                                                "maximum_supported_memory": part.specs.get("Maximum Supported Memory", None) if part.specs else None,
                                                "ecc_support": part.specs.get("ECC Support", None) if part.specs else None,
                                                "includes_cooler": part.specs.get("Includes Cooler", None) if part.specs else None,
                                                "packaging": part.specs.get("Packaging", None) if part.specs else None,
                                                "lithography": part.specs.get("Lithography", None) if part.specs else None,
                                                "simultaneous_multithreading": part.specs.get("Simultaneous Multithreading", None) if part.specs else None,
                                                "rating_average": getattr(part.rating, "average", None) if part.rating else None,
                                            }
                                            try:
                                                response = supabase.table("cpus").insert([data]).execute()
                                                logging.info(f"Inserted {data['name']} into database.")
                                            except Exception as e:
                                                logging.error(f"Failed to insert {data['name']} into database: {e}")
                                        else:
                                            warning_count += 1
                                            logging.warning("Part details could not be fetched.")
                                        break  # Exit the retry loop if successful
                                    except AttributeError as e:
                                        if "'NoneType' object has no attribute 'text'" in str(e):
                                            input("Verify link and press Enter to continue...")
                                        else:
                                            raise e
                                    except Exception as e:
                                        error_count += 1
                                        logging.error(f"Error fetching part details: {e}")
                                        break
                    else:
                        logging.info(f"No more results for {product} on page {page}.")
                        break  # Exit loop if no more results
                    page += 1
                    await asyncio.sleep(4)  # Prevent hitting rate limits
                except HTTPError as e:
                    error_count += 1
                    logging.error(f"HTTP error occurred: {e}")
                    await asyncio.sleep(10)  # Wait before retrying
                except Exception as e:
                    error_count += 1
                    logging.error(f"Unexpected error: {e}")
                    await asyncio.sleep(10)  # Short wait before retrying
                    continue

        # Final Debug Summary
        logging.info("\nDebug Summary:")
        logging.info(f"Total Errors: {error_count}")
        logging.info(f"Total Warnings: {warning_count}")
        logging.info(f"Total Skipped Parts: {len(skipped_parts)}")
        if skipped_parts:
            for name, part_number in skipped_parts:
                logging.info(f"Skipped Part: {name} | Part Number: {part_number}")

# Run the main function
asyncio.run(fetch_top_parts())
