Are 304 responses properly handled?
replaysMike opened this issue · comments
I've tried the following code as an example, but I always get a timeout from this particular URL (other URLs I've tried work fine). The URL being requested returns a 304 Not Modified; I checked with curl, which gets the body content with no problem, and there don't appear to be any redirects. The PageCrawlDisallowed event is triggered with the reason "Page has no content", but the PageCrawlCompleted event shows a failure/timeout trying to read the body content.
e.CrawledPage.HttpRequestException:
exception occurred with the originating request: 'Request timeout occurred The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing. The operation was canceled
// .NET 6 console app
var config = new CrawlConfiguration
{
    HttpRequestTimeoutInSeconds = 60,
    CrawlTimeoutSeconds = 60,
    MaxConcurrentThreads = 10,
};
using var crawler = new PoliteWebCrawler(config);
crawler.PageCrawlCompleted += PageCrawlCompleted;
crawler.PageCrawlDisallowed += PageCrawlDisallowed;
var uri = new Uri("https://www.ti.com/amplifier-circuit/current-sense/analog-output/overview.html");
var crawlResult = await crawler.CrawlAsync(uri);

// Event handler (a member of the containing class)
private void PageCrawlCompleted(object? sender, PageCrawlCompletedArgs e)
{
    var httpStatus = e.CrawledPage.HttpResponseMessage?.StatusCode;
    var rawPageText = e.CrawledPage.Content?.Text;
    // e.CrawledPage.HttpResponseMessage is null; e.CrawledPage.HttpRequestException is set
}
Is this a bug with processing certain URLs or am I missing something silly?
Problem solved; this is not an issue with the library. The website appears to have some special blocking for the UserAgent I was using, Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html), and it simply black-holes the connection.
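For anyone hitting the same timeout, here is a minimal sketch of the same crawl with a different user agent, set through the UserAgentString property of CrawlConfiguration. The user agent value below is a made-up placeholder (an assumption, not something from this thread), and there is no guarantee that this particular site will accept it; it only illustrates where the setting lives.

```csharp
// Sketch only: requires the Abot NuGet package (Abot2 namespaces).
using System;
using Abot2.Crawler;
using Abot2.Poco;

var config = new CrawlConfiguration
{
    HttpRequestTimeoutInSeconds = 60,
    CrawlTimeoutSeconds = 60,
    MaxConcurrentThreads = 10,
    // Placeholder value (assumption): identify your own crawler instead of Googlebot.
    UserAgentString = "MyCrawler/1.0 (+https://example.com/bot-info)",
};

using var crawler = new PoliteWebCrawler(config);
crawler.PageCrawlCompleted += (sender, e) =>
{
    // HttpResponseMessage is null when the request itself failed (for example, on a timeout).
    var status = e.CrawledPage.HttpResponseMessage?.StatusCode;
    Console.WriteLine($"{e.CrawledPage.Uri}: {status?.ToString() ?? "no response"}");
};

var uri = new Uri("https://www.ti.com/amplifier-circuit/current-sense/analog-output/overview.html");
await crawler.CrawlAsync(uri);
```

If the status still prints as "no response", requesting the same URL with curl and the same user agent string should show whether the site is dropping the connection for that agent.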
Thanks for the follow-up.