HttpResponse of Crawled Page Returns Null in Abot2 #243

Open
Nitish0949 opened this issue Jan 2, 2025 · 1 comment

Comments

@Nitish0949

I encountered an issue while crawling webpages with the Abot2 package. The HttpResponseMessage property of the crawled page (entity.CrawledPage.HttpResponseMessage) is intermittently null for some pages.

Key Observations
This issue does not occur for all pages but only for certain ones.
The number of pages with a null HttpResponseMessage varies from one crawl run to the next.

Expected Behavior
The HttpResponseMessage should provide the HTTP response for all crawled pages.

Actual Behavior
The HttpResponseMessage is null for some pages, and the occurrence of these pages is inconsistent between crawl runs.

Additional Information
Error Message: When the issue occurs, the following error is logged:
The SSL connection could not be established, see inner exception.
Inner Exception: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.

This issue appears to be related to handling HTTPS connections or certain server configurations.
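
To separate the pages with no response from the ones that succeeded, a handler along these lines could log the request-level exception alongside each null response. This is only a minimal sketch: it assumes Abot2's PageCrawlCompleted event and the CrawledPage.HttpRequestException property, and the class and handler names are hypothetical and not taken from the crawler code in this report.

```csharp
using System;
using Abot2.Crawler;
using Abot2.Poco;

public static class CrawlDiagnostics
{
    // Hypothetical handler; subscribe with:
    //   crawler.PageCrawlCompleted += CrawlDiagnostics.OnPageCrawlCompleted;
    public static void OnPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
    {
        CrawledPage page = e.CrawledPage;

        if (page.HttpResponseMessage == null)
        {
            // No response was received, so read the recorded exception
            // (assumed to be populated when the request itself fails).
            Exception cause = page.HttpRequestException;
            Console.WriteLine($"No response for {page.Uri}: {cause?.Message ?? "no exception recorded"}");
            Console.WriteLine($"  inner: {cause?.InnerException?.Message ?? "none"}");
            return;
        }

        // A response exists, so reading its status code is safe here.
        Console.WriteLine($"{(int)page.HttpResponseMessage.StatusCode} {page.Uri}");
    }
}
```

Logging the URL together with the inner exception on every run should show whether the null responses always coincide with the SSL error above or also occur without it.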

Steps Already Taken
Verified SSL/TLS settings and configurations.
Checked network connectivity and ensured the target URLs are reachable.
Observed that this issue is not URL-specific but varies across crawl runs.
Ran the crawler with different configurations. Two of the configurations used are shown below.
Config 1:

public CrawlConfiguration CrawlConfig(int maxPages)
{
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    CrawlConfiguration crawlConfig = new()
    {
        MaxConcurrentThreads = 1,
        MinCrawlDelayPerDomainMilliSeconds = 1000,
        IsSslCertificateValidationEnabled = false,
        MaxPagesToCrawl = 5000,
        HttpRequestTimeoutInSeconds = 30,
        MaxRetryCount = 5,
        MinRetryDelayInMilliseconds = 5,
        CrawlTimeoutSeconds = 5000,
    };
    return crawlConfig;
}

Config 2:

public CrawlConfiguration CrawlConfig(int maxPages)
{
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    CrawlConfiguration crawlConfig = new()
    {
        MaxConcurrentThreads = 1,
        MinCrawlDelayPerDomainMilliSeconds = 1000,
        IsSslCertificateValidationEnabled = true,
        MaxPagesToCrawl = 5000,
        HttpRequestTimeoutInSeconds = 300,
        MaxRetryCount = 5,
        MinRetryDelayInMilliseconds = 5000,
        CrawlTimeoutSeconds = 5000,
    };
    return crawlConfig;
}

I would appreciate help diagnosing and resolving this issue, or confirmation that it is a known bug.

@Nitish0949

@sjdirect, could you provide any suggestions regarding this issue?
