Not detecting 404s on certain pages

mattman00000 opened this issue 8 years ago · comments

mattman00000 commented 8 years ago

On some sites, wget reports a 404 but Fuskr still loads an image. Examples include, I think, all of foto.my.mail.ru.

Dan Atkinson · Answer 1 · Mon Apr 11 2016 16:19:05 GMT+0800 (China Standard Time)

Please can you provide an example of a 404 image url?

mattman00000 · Answer 2 · Thu Apr 21 2016 15:55:32 GMT+0800 (China Standard Time)

Sorry for the delay, http://content.foto.mail.ru/mail/thereisnowhere/_myphoto/s-[1-11].jpg

Dan Atkinson · Answer 3 · Thu Apr 21 2016 18:46:04 GMT+0800 (China Standard Time)

@jbolster,

Any ideas? I can see the problem, but I don't think jqLite returns the status code on the 'load' event, so I don't see how we could interrogate the response from the server.

Currently, we only care if the file has been loaded, not whether it's been loaded but with an error.

Example Fusk url:
chrome-extension://balbojkopkiehjjnmpohcobpejmioppl/Html/images.htm#/fusk/http://content.foto.mail.ru/mail/thereisnowhere/_myphoto/s-[1-3].jpg

Thanks, Dan

mattman00000 · Answer 4 · Thu Apr 21 2016 19:15:42 GMT+0800 (China Standard Time)

I did make a stopgap bookmarklet a while back (no idea if it works with the Angular update) that removes images with the same dimensions as a selected image:

javascript:if ((getSelection().rangeCount==1)&&(getSelection().getRangeAt(0).commonAncestorContainer.getElementsByTagName("img").length!=0)){var twidth = getSelection().getRangeAt(0).commonAncestorContainer.getElementsByTagName("img")[0].width;var theight = getSelection().getRangeAt(0).commonAncestorContainer.getElementsByTagName("img")[0].height;var fuskImages = document.getElementsByClassName("fuskImage");for (var i = 0;i<fuskImages.length;i++){if ((fuskImages[i].width==twidth)&&(fuskImages[i].height==theight)){fuskImages[i].parentNode.parentNode.className="hide wrap error" ;}}}else {alert("select a picture first");}

I had a thought that it might not be a bad idea to have some form of "remove images whose dimensions (or perhaps other properties) match certain criteria" functionality anyway. That could be fancy upper and lower bound sliders like those of "Image Downloader" (chrome extension id cnpniohnfphhjihaiiggeabnkjhpaldj), or just width and height comparison operator dropdowns with an HTML5 number input field (an and/or dropdown wouldn't hurt either). I could give it a try, but I'm a bit rusty on chrome extensions and I'm tied up until some time in May.
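As a rough illustration of the dimension-criteria idea above, here is a minimal sketch. It assumes each image can be described by a plain `{ width, height }` object; the names `matchesDimensions`, `partitionByDimensions` and the `criteria` fields are hypothetical and not part of Fuskr.

```javascript
// Hypothetical helper (not Fuskr code): true when the image's size falls
// inside every bound supplied in `criteria`. Missing bounds are open-ended.
function matchesDimensions(image, criteria) {
  const {
    minWidth = 0,
    maxWidth = Infinity,
    minHeight = 0,
    maxHeight = Infinity,
  } = criteria;
  return (
    image.width >= minWidth &&
    image.width <= maxWidth &&
    image.height >= minHeight &&
    image.height <= maxHeight
  );
}

// Splits images into those to keep and those to hide. Images that match
// the criteria are the ones to hide.
function partitionByDimensions(images, criteria) {
  const keep = [];
  const hide = [];
  for (const image of images) {
    (matchesDimensions(image, criteria) ? hide : keep).push(image);
  }
  return { keep, hide };
}
```

With upper bounds of 100 for both width and height, a 100x100 placeholder would land in `hide` while an 800x600 photo stays in `keep`.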

Dan Atkinson · Answer 5 · Thu Apr 21 2016 19:34:54 GMT+0800 (China Standard Time)

Hi Matt,

This is a really tricky thing to do properly and I actually considered this a few years ago to handle hotlink warnings. Ultimately though, Fuskr seemed to 'beat' hotlink checks so I didn't bother implementing it.

I'm all for adding the functionality but it seems like this other extension (GitHub project page) (which I'd never seen before today!) does it in a way I probably wouldn't have. I think filtering by dimensions is okay, but there are probably other ways of doing it, like grabbing a hash of an image and removing images with a duplicate hash without the requirement for user intervention. Possibly an overhead here though.

Also, it allows the user to create a folder when downloading. I didn't think that was possible in the API. I may have to look at that functionality a little further. I like the popup idea, but couldn't implement it in Fuskr due to Google's 'single purpose' rules on extensions which this might break.

Thanks, Dan

image-downloader GitHub project page
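The hashing idea mentioned above could look something like the sketch below. It assumes the raw bytes of each image are already available as a `Uint8Array`, and it reads "duplicate hash" as "any image whose bytes occur more than once is probably a placeholder", which is an interpretation rather than anything Fuskr does. FNV-1a is used only because it is short; `fnv1a` and `dropRepeatedImages` are hypothetical names.

```javascript
// 32-bit FNV-1a over raw bytes: a simple non-cryptographic hash.
function fnv1a(bytes) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of bytes) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return hash >>> 0;
}

// Hypothetical: `images` is an array of { url, bytes }.
// Returns only the images whose hash is unique within the set.
function dropRepeatedImages(images) {
  const hashes = images.map((image) => fnv1a(image.bytes));
  const counts = new Map();
  for (const h of hashes) {
    counts.set(h, (counts.get(h) || 0) + 1);
  }
  return images.filter((_, index) => counts.get(hashes[index]) === 1);
}
```

Hashing every image does add work per gallery, which is the overhead concern raised above; a real implementation would probably also want a stronger hash to reduce accidental collisions.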

Jonathon Bolster · Answer 6 · Fri Apr 22 2016 18:27:49 GMT+0800 (China Standard Time)

This might actually be possible.

Originally I was going to comment that it wouldn't be possible due to CORS, but according to the XHR documentation, extensions aren't limited in the same way (hurrah) as long as you declare permissions for the sites in question. And we happen to already have permissions for all sites.

This surprised me, but it's great to know!

So I'm thinking of these options:

1. Request the images via JS first, then display them as we do currently (there will be a second request per image, but it will hit the cache).
2. Do what we do already and, before it counts as 'success', do a quick request for the image to get the headers.
3. Request the image and stick the data in a canvas.

I'm thinking of option 2 here. If we do a request after the image has loaded, then it will go straight to the cache anyway, and just acts as a confirmation.
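A minimal sketch of that confirmation request, under a few assumptions: it uses `fetch` instead of the XHR mentioned above, it sends a HEAD request so only headers come back, and the name `fetchImageStatus` is hypothetical. The `fetchImpl` parameter exists only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper (not Fuskr code): after the <img> has fired 'load',
// ask the server for the headers of the same URL and report the HTTP status.
// `fetchImpl` defaults to the global fetch and can be swapped for a stub.
async function fetchImageStatus(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, { method: 'HEAD' });
  return response.status;
}
```

The caller would then decide, based on the returned status, whether the already-loaded image should still count as a success.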

Dan Atkinson · Answer 7 · Fri Apr 22 2016 18:38:49 GMT+0800 (China Standard Time)

I looked at option 1 previously which does sound like option 2. Both involve more than one request. If you're just doing a HEAD request, that should be quick and return the relevant status, but this feels like we're just trying to get around the problem by making twice the number of requests.

Canvas is one way to go and would allow us to do hashing and some rudimentary image analysis more easily, but I imagine that the cost is quite high.

Jonathon Bolster · Answer 8 · Fri Apr 22 2016 18:47:32 GMT+0800 (China Standard Time)

Option 1 is JS before the img element makes the request
Option 2 is allowing the img tag to load before failing/succeeding

Both do make twice the number of requests, but if the image was previously successful then it would just hit the cache anyway. This is why option 2 is now out for me, as the browser doesn't look at the cache when it previously failed (so an actual second network request).

With option 3, we could just stick the response data in an image element which could work: http://stackoverflow.com/a/10687544/261677

The fact that we can actually make these CORS requests in JS opens up a new option, of potentially zipping up images before saving (though that's another issue altogether)

Dan Atkinson · Answer 9 · Fri Apr 22 2016 18:56:05 GMT+0800 (China Standard Time)

That's fine though. We don't need to perform requests on failures - just successes (loaded image) that return a 4xx (client error) or 5xx (server error). So we just check whether the status code is in the 400-599 range.

2xx (success) or 3xx (redirection) should be considered successful along with everything else.
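The rule described here is small enough to write down directly. This is only a sketch of the check as stated in this comment; `isFailedStatus` is a hypothetical name, not the function used in the actual fix.

```javascript
// Treat 4xx (client error) and 5xx (server error) as failures.
// Everything else, including 2xx and 3xx, counts as a successful load.
function isFailedStatus(status) {
  return status >= 400 && status < 600;
}
```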

Re your zipping suggestion, what do you mean? Would you mind breaking it out into a separate issue or perhaps discuss in Gitter?

Dan Atkinson · Answer 10 · Fri Apr 22 2016 21:02:58 GMT+0800 (China Standard Time)

Fixed by #20.

Dan Atkinson · Answer 11 · Thu May 03 2018 21:13:33 GMT+0800 (China Standard Time)

I've looked at how this can be fixed again and have made a small change to detect data types. I've found a couple of cases where calling non-existent images results in a 200 response, but the response is actually a web page which can't be rendered as an image.

We could/should also check for blob data size as well, but for now, this should stop image galleries being loaded where the response is clearly invalid.
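For illustration, a check along these lines might look like the sketch below. It assumes the response's Content-Type header value and the body size in bytes are both available; `looksLikeImage` is a hypothetical name, and the size check is the possible extra step mentioned above, not something the change is known to include.

```javascript
// Hypothetical helper (not Fuskr code): accept a response only when its
// Content-Type is an image type and the body is not empty.
function looksLikeImage(contentType, byteLength) {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const type = (contentType || '').split(';')[0].trim().toLowerCase();
  return type.startsWith('image/') && byteLength > 0;
}
```

A 200 response carrying `text/html` would be rejected by this check, which covers the "web page instead of an image" case described above.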