rverton / webanalyze

Port of Wappalyzer (uncovers technologies used on websites) to automate mass scanning.

More workers = More "Failed to Retrieve"

earwickerh opened this issue · comments

When I use more workers, I get "Failed to Retrieve" for a lot of URLs that worked with fewer workers. The more workers I add, the more "Failed to Retrieve" errors I see. Any ideas why this might be happening?

Hi @earwickerh,
there is a hardcoded timeout for retrieving content, currently at 8 seconds (https://github.com/rverton/webanalyze/blob/master/webanalyze.go#L21). Why 8? To be honest, I don't know. It has worked very well for me so far.
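For illustration, a hardcoded per-request timeout in Go typically looks something like the sketch below (the wiring and names here are assumptions, not the project's actual code); any host that takes longer than the timeout to respond comes back as an error instead of a result:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Sketch only: an 8 second client timeout, as described above.
        client := &http.Client{Timeout: 8 * time.Second}

        resp, err := client.Get("https://example.com")
        if err != nil {
            fmt.Printf("error retrieving: %v\n", err) // slow hosts end up here
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }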

My guess is that when you increase the number of workers, you are reaching the limit of your bandwidth, and this results in some hosts taking more time to respond.

Are you using the source code release? If so, you can try increasing the hardcoded value. If that makes a difference, I can expose it as an option.

Great insight, this seemed to do the trick. However, I am running more tests to make sure I'm not introducing new variables into the experiments. I'll follow up with something more conclusive.

I don't think that's it after all. After further testing, I believe it is related to CSV output (probably indirectly). A resource limit of some sort is being hit; once it is reached, I get "Failed to Retrieve" for every remaining URL, even though those same URLs resolve properly when run with fewer workers, a smaller set of URLs, or as individual hosts.

The issue occurs particularly with the combination of more workers AND csv output. When I run with stdout as the output, the issue seems to go away or be significantly reduced, even with 100+ workers, which leads me to believe CSV isn't the root cause, but rather a contributing factor in reaching a resource limit.
I hope this won't be too difficult to replicate on your end.

Small notes, for which I can create pull requests/other issues for tracking:

  • stdout still produces a line of output when a site is successfully loaded but no matches are found. This would be a nice addition to the csv output as well, signaling that the site was reached but no matches were found.

-"search" argument. It's great that it goes through subdomains but ssl vs no ssl isn't taken into account. My hosts file contains the http://, https:// and https://www. versions of the URLs I want to scan to get around this. Having the tool test ssl/non-ssl versions could prove useful.

Hi,
yeah, I'm not quite sure how I can debug this. I'll ask around and let you know when I find a good way to benchmark/debug this.

Regarding your other suggestions: I'm happy for every contribution :)

Greetings

@earwickerh can you patch the code at this line:

return nil, links, fmt.Errorf("Failed to retrieve")

and print out the real error (fmt.Printf("error retrieving: %v\n", err)) and run it again?
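For clarity, a minimal sketch of what that patch might look like (the function name, signature and links variable below are assumptions made for this sketch; the point is only the added fmt.Printf before the existing return):

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // fetchHost stands in for the function containing the line quoted above.
    func fetchHost(client *http.Client, host string) (*http.Response, []string, error) {
        var links []string
        resp, err := client.Get(host)
        if err != nil {
            fmt.Printf("error retrieving: %v\n", err) // print the real error first
            return nil, links, fmt.Errorf("Failed to retrieve")
        }
        return resp, links, nil
    }

    func main() {
        client := &http.Client{Timeout: 8 * time.Second}
        if _, _, err := fetchHost(client, "https://example.com"); err != nil {
            fmt.Println(err)
        }
    }

Note that fmt.Printf returns two values, so it cannot simply replace fmt.Errorf inside the return statement itself; it has to go on its own line before the return.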

Great, thank you! I'll do this and report back shortly. Thanks again

I gave this a shot but I get the following compile error "webanalyze/webanalyze.go:208:33: multiple-value fmt.Printf() in single-value context" after changing fmt.Errorf("Failed to retrieve") to fmt.Printf("error retrieving: %v\n", err). Thanks for your help

I just committed the improved error reporting; you can pull the changes and then test again.

5c9aebf

whoa, that was quick, thanks! It's running. I'll let you know what I find. Thanks again

After implementing this, I only see one new error being displayed (and only one instance of it), but it doesn't seem related, as it showed up a good few minutes before my issue began to recur: 'Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 100 Continue\r\n\r\n"; err=

Here's the command I'm using: webanalyze -hosts crm-url-cleaner.txt -worker 200 -crawl 12 -output csv > results-200-c12-csvout.csv 2> err-w200-c12-csvout.txt
After a while, the results just stop being written to the csv, while the errors continue to be written to the error file.

The same command, without the "-output csv" works fine...

Should we be trying to report errors for the csv output specifically, around line 144 here? (My apologies for my ignorance.)

Thanks for taking a look

The error handling for retrieval does not depend on the output method; as you can see, it's done before the output is handled.

Maybe writing the csv is failing because it's done from multiple goroutines. It's odd, because we just write to os.Stdout. We can catch the error from csv.Write and see if there is anything wrong here.

Can you try this:

    err := outWriter.Write(
        []string{
            result.Host,
            strings.Join(m.CatNames, ","),
            m.AppName,
            m.Version,
        },
    )
    if err != nil {
        log.Printf("error writing csv: %v\n", err)
    }
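
One hedged addition to that check: encoding/csv buffers its output, so a failing write may only surface once the buffer is flushed and the writer's deferred error is inspected. Continuing the snippet above:

    outWriter.Flush()
    if err := outWriter.Error(); err != nil {
        log.Printf("error flushing csv: %v\n", err)
    }

Since csv.Writer makes no guarantee about concurrent use, serializing the writes from the worker goroutines (for example behind a sync.Mutex) would also help rule out the multi-goroutine angle mentioned above.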

Any update on this? Otherwise I will close this due to inactivity.