webp-sh / webp_server_go

Go version of WebP Server. A tool that will serve your JPG/PNG/BMP/SVGs as WebP/AVIF format with compression, on-the-fly.

Home Page:https://docs.webp.sh

Some WebP images are corrupted on first access when using an nginx reverse proxy

bhzhu203 opened this issue · comments

When using an nginx reverse proxy to forward requests to webp_server, some WebP images often appear corrupted in the browser (a Ctrl+F5 force refresh is needed to fix them).
But when accessing webp_server directly, the images are always fine. Does the nginx reverse proxy not support the ETag?

[Screenshots: 2023-05-23_11-15, 2023-05-23_10-26]

nginx reverse configuration

      location @webp {
        resolver 223.5.5.5 ipv6=off valid=60s;
        proxy_pass http://static-webp.selleroa.com;

        #Proxy Settings
        proxy_redirect     off;
        proxy_set_header   Host           static-webp.selleroa.com ;
        proxy_set_header   X-Real-IP        $remote_addr;
        proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
        proxy_set_header   accept 'image/webp';
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_connect_timeout      19;
        proxy_send_timeout         19;
        proxy_read_timeout         19;
        proxy_buffer_size          32k;
        proxy_buffers              8 64k;
        proxy_busy_buffers_size    164k;
        proxy_temp_file_write_size 164k;

#        proxy_cache            cdn2;
#        proxy_cache_valid      200 206 304  60s;
#        add_header X-Proxy-Cache $upstream_cache_status;
#        proxy_cache_revalidate on;

#        proxy_cache_background_update on;
#        proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;

        proxy_hide_header Access-Control-Allow-Origin;
        proxy_hide_header Access-Control-Allow-Methods;
        proxy_hide_header Access-Control-Allow-Headers;
        proxy_hide_header Access-Control-Max-Age;
        proxy_hide_header etag;

        add_header 'Access-Control-Allow-Methods' 'GET,POST,OPTIONS' always;
        add_header 'Access-Control-Allow-Origin' $http_origin always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Max-Age' '172800' always;

        proxy_ignore_headers Vary;
      }

Update:
1. Releases 0.60~0.7.0 do not have this issue, but from 0.8.0 onward I can reproduce it.
2. When I delete/clean the "EXHAUST_PATH" directory, a refresh returns normal WebP images.

Thanks for feedback, from the info you've provided, your scenario is as below:

Am I understanding right?


there are some corrupted webp images in the browser often(need to do a ctrl+F5 to force refresh ).

Could you please give us some more details about this, like the Response Headers when webp images are not displaying normally, and are there any error logs on the WebP Server?

Yes, you are right. You can test with this abnormal WebP image.

Please add this entry to your hosts file:
47.110.152.133 cdn2.selleroa.com

http://cdn2.selleroa.com/null/skupub/DCBJ16672/img/f16fd3df9425ceeeb3948a9b9e94b570.jpg

curl http://cdn2.selleroa.com/null/skupub/DCBJ16672/img/f16fd3df9425ceeeb3948a9b9e94b570.jpg -H 'accept: image/webp'

It is odd: it often happens the first time you access an image.

Here is a sample of the WebP images. It seems that nginx downloads only the early part of the WebP file, not the complete file.
test.zip

I have found the commit that causes this issue: ccc99e2

As far as I know, the good releases are 0.60~0.7.0, so I tried reverting the commits between 0.7.0 and 0.8.0 one by one.

[screenshot]

After I reverted commit ccc99e2, everything is fine.

Only this commit affects proxy mode.

Continuing from issue #215, what does [I need to do a hard refresh to show it manually] mean?

Forgot to tell, you might need to clean up remote-raw and exhaust directory before using code in PR #216

I have cleaned up these two directories, but it does not help. It seems that files are incomplete the first time they are accessed; I have to wait a while and then do a Ctrl+F5 hard refresh.

[screenshots]

For example, this image is corrupted at first, and the response status is 304:
[screenshot]
After one response with status 200, the image displays normally.

Could we revert commit ccc99e2? I solved this issue by reverting it.

No, we cannot simply revert commits; we need to figure out the underlying problems and fix them. @bugfest, could you please take some time to look at this issue?

I recall from your first comment:

But when accessing webp_server directly, the images are always fine. Does the nginx reverse proxy not support the ETag?

Could this be a problem with your Nginx config?🤔

Sure @n0vad3v. @bhzhu203 can you share:

  • your webp_server_go config file?
  • the reverse proxy config you're using to expose webp_server_go (http://static-webp.selleroa.com)

Some more questions:

  • are you running webp_server_go directly via systemd or in docker? Is the daemon up all the time when you reproduce the issue, or does it get restarted at any point?
  • can you enable logging, restart all the proxies and the daemon (in debug mode with the -v flag), then try to reproduce the issue using curl with just that single image, and share the logs of all the components involved (front proxy, webp proxy, webp, backend/origin)?

Also, have you reproduced the issue with your browser in private mode? Those 304s are HTTP Not Modified responses, so your proxy might not have been able to serve their content properly on the first try.

Ok, I have a theory

  1. You request http://cdn2.selleroa.com/null/skupub/DCBJ16672/img/f16fd3df9425ceeeb3948a9b9e94b570.jpg
  2. Then the reverse proxy sends the request to http://static-webp.selleroa.com/yfni/test/1684547034231096980.jpg
  3. Then that other reverse proxy sends it to webp_server_go, which then logs the following:
Remote Addr is //static-erp.oss-cn-shenzhen-internal.aliyuncs.com/yfni/test/1684547034231096980.jpg, fetching info...
<time> HTTP_CODE ...

The log entry is generated at

log.Infof("Remote Addr is %s, fetching...", realRemoteAddr)

That HTTP_CODE is the one returned by webp_server_go's HTTP server, so it is the code returned to the client, all the way back to the browser. No further log entries are generated after it. Reading the code, that leaves only the case where your backend is responding (for some reason) with HTTP status code 304, and the code block being executed in that case is

msg := fmt.Sprintf("Remote returned %d status code!", statusCode)

Until your backend returns a 200/OK, the webp daemon keeps responding with an error blob; if you download the broken image, you will probably find it contains the text message "Remote returned 304 status code!".

Why does it work for you in the previous version? I'm not sure; I'll probably have to read the code more carefully, but I'd say that by the time you change versions, the backend cache has expired and you get a 200/OK on the first attempt, so the image gets cached properly on your server.

I'd say you should first fix the issue with your backend and investigate why it's responding with all those 304s.

@n0vad3v, on our side I think we should consider a retry method, or just fail with a 500 error if the configured backend is not behaving properly.

I think we should consider a retry method, or just fail with a 500 error if the configured backend is not behaving properly.

Agree, maybe we should add additional check on fetchRemoteImage to make sure remote is correctly returning 200 status code before downloading and try converting images. (Maybe it can be done within PR #216)

A little bit off-topic, when checking remote backend, res.StatusCode != 404 or res.StatusCode == 200 which one will be better?🤔
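The stricter of the two checks discussed here can be sketched as follows. This is only an illustration under stated assumptions: `fetchBody` and `demo` are hypothetical names, not the project's `fetchRemoteImage`, and the origin is simulated with a local `httptest` server. The point is that any status other than 200 becomes an error instead of being stored as image data.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchBody downloads url and returns the body only when the origin answers
// 200 OK. Any other status is reported as an error rather than treated as
// image data. (Hypothetical helper, not the project's fetchRemoteImage.)
func fetchBody(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("remote returned %d status code", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// demo runs fetchBody against a local test server that always answers with
// the given status, and returns the number of bytes fetched.
func demo(status int) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		if status == http.StatusOK {
			w.Write([]byte("fake-image-bytes"))
		}
	}))
	defer srv.Close()
	body, err := fetchBody(srv.URL)
	return len(body), err
}

func main() {
	fmt.Println(demo(http.StatusOK))          // succeeds with a non-empty body
	fmt.Println(demo(http.StatusNotModified)) // fails with an error
}
```

With `!= 404` instead, a 304 or 500 from the origin would pass the check and its (empty or error) body could end up in the cache, which is why the `== 200` form looks safer in this sketch.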

old stable version
 curl http://xx.xx.xx.xx:3333/yfni/test/16777485345260ae9ec.jpg -I  -H 'accept: image/webp'
HTTP/1.1 200 OK
Server: Webp Server Go
Date: Fri, 26 May 2023 02:28:37 GMT
Content-Type: image/webp
Content-Length: 415050
Accept-Ranges: bytes
Last-Modified: Fri, 26 May 2023 02:20:17 GMT


new version
curl http://xx.xx.xx.xx:3333/yfni/test/16777485345260ae9ec.jpg -I  -H 'accept: image/webp'
HTTP/1.1 200 OK
Server: Webp Server Go
Date: Fri, 26 May 2023 02:30:46 GMT
Content-Type: image/jpeg
Content-Length: 336947
Etag: W/"336947-EE8C5322"
X-Compression-Rate: 1.00
Accept-Ranges: bytes
Last-Modified: Thu, 25 May 2023 12:42:43 GMT

I have found that the new version adds an ETag header. Is nginx not compatible with the short (weak) ETag?

Also, the Content-Type is wrong.

I've tested on my machine for /yfni/test/16777485345260ae9ec.jpg, log as below:

WebP@80%: remote-raw/8c266bc7f9b0cc31e4b06199c35852ce0bed7e92-etag-9160cf7aba11b66406abeb06de0eec70da4c8e3b->exhaust/yfni/test/16777485345260ae9ec.jpg.1685082313.webp_width=0&height=0 336947->356156 105.70% deflated

Since the converted WebP image is bigger than the original image, the original image is returned here, hence Content-Type: image/jpeg
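The "serve whichever is smaller" rule described above can be sketched in a few lines. The `candidate` type and `pickSmaller` function are hypothetical names used only for illustration, not the project's actual code; the sizes are the ones from the log line.

```go
package main

import "fmt"

// candidate describes one servable variant of an image (hypothetical type).
type candidate struct {
	ContentType string
	Size        int64
}

// pickSmaller returns the converted variant only when it is strictly smaller
// than the original; otherwise the original is served unchanged.
func pickSmaller(original, converted candidate) candidate {
	if converted.Size < original.Size {
		return converted
	}
	return original
}

func main() {
	orig := candidate{ContentType: "image/jpeg", Size: 336947}
	webp := candidate{ContentType: "image/webp", Size: 356156}
	chosen := pickSmaller(orig, webp)
	fmt.Println(chosen.ContentType, chosen.Size) // prints "image/jpeg 336947"
}
```

Under this rule, a Content-Type of image/jpeg on a request that accepts WebP is expected behavior rather than a bug.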

Hello, I have found that this issue only affects images that are being converted for the first time.

In nginx I set proxy_set_header if-modified-since ""; to disable 304 responses, but the result is the same.

After the images have been converted, the issue is gone, and any kind of refresh works fine.

I think the webp server streams the content dynamically the first time, before the output is complete, and nginx cannot handle the weak ETag. Here is what the images look like on first access:

[Screenshots: 2023-05-26_14-00, 2023-05-26_14-28]

The ETag has no effect: a request with only If-None-Match does not return 304.


bash-4.3$ curl 'http://39.108.154.116:3333/yfni/test/1632390011172e9f85c.jpg' \
>   -H 'Accept: image/webp' \
>   -H 'Connection: keep-alive' \
>   -H 'If-Modified-Since: Fri, 26 May 2023 08:00:06 GMT' \
>   -H 'If-None-Match: W/"481228-49F1ED1C"' \
>  -I;
HTTP/1.1 304 Not Modified
Server: Webp Server Go
Date: Fri, 26 May 2023 08:25:10 GMT

bash-4.3$ 
bash-4.3$ 
bash-4.3$ curl 'http://39.108.154.116:3333/yfni/test/1632390011172e9f85c.jpg' \
>   -H 'Accept: image/webp' \
>   -H 'Connection: keep-alive' \
>   -H 'If-None-Match: W/"481228-49F1ED1C"' \
>  -I;
HTTP/1.1 200 OK
Server: Webp Server Go
Date: Fri, 26 May 2023 08:25:11 GMT
Content-Type: image/webp
Content-Length: 481228
Etag: W/"481228-49F1ED1C"
X-Compression-Rate: 0.21
Accept-Ranges: bytes
Last-Modified: Fri, 26 May 2023 08:00:06 GMT

First of all, please notice we use two different HTTP methods here:

  • HEAD is used unconditionally to check the image metadata (it is always sent, regardless of whether we have a cached version or not) -
    statusCode, _, _ := getRemoteImageInfo(realRemoteAddr)
  • GET is used to fetch the image when it is not found in the local cache or we need to update it -
    err := fetchRemoteImage(localRawImagePath, realRemoteAddr)

@bhzhu203's issues are a mix of backend misbehaviour (304 codes in HEAD requests) and network issues (due to high latency between the webp_server_go instance and the backend?) that prevent the image from being downloaded completely.

IMHO, the backend issues are totally on @bhzhu203 to solve and are not this project's problem. Also, please provide debug logs of your server in the future; it's very hard to troubleshoot from screenshots, and there is a lot of guessing here.

We could, though, improve the image fetching process. I propose we do the following:

  • Check if the image is in the local cache
    • Found in cache:
      • Check whether it is still valid: use HEAD (current method) or (even better) GET + "If-Modified-Since/If-None-Match" (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match) requests to the backend to download a fresher version.
      • If the backend has a newer version, download it and refresh the local cache (ideally we should do this in a separate thread so we serve the cached/local version in the meantime; this might be too complex for us to develop at the moment)
    • Not found in cache:
      • Download the image with GET

If we implement "If-Modified-Since/If-None-Match", we could end up with a simpler process:

  • Get cache file timestamp and etag. If it does not exist return empty
  • Send a GET to the backend with If-Modified-Since/If-None-Match headers
  • If response code is 200, download body and replace/create cache. Check that the downloaded byte stream length matches the content length in the response headers.
  • If response code is 304 (If-Modified-Since/If-None-Match headers were not empty), serve the local file
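The four steps above could be sketched roughly as below. Everything here is an assumption made for illustration: `cacheEntry` is an in-memory stand-in for the cached file and its validators, `revalidate` and `demo` are hypothetical names, and the origin is a local `httptest` server rather than a real backend.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// cacheEntry is a hypothetical in-memory stand-in for a cached file plus the
// validators (ETag, Last-Modified) stored alongside it.
type cacheEntry struct {
	Body         []byte
	ETag         string
	LastModified string
}

// revalidate sends a conditional GET. It returns the entry to serve and
// whether that entry was freshly downloaded.
func revalidate(url string, cached *cacheEntry) (*cacheEntry, bool, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, false, err
	}
	if cached != nil { // no cache yet: send an unconditional GET
		if cached.ETag != "" {
			req.Header.Set("If-None-Match", cached.ETag)
		}
		if cached.LastModified != "" {
			req.Header.Set("If-Modified-Since", cached.LastModified)
		}
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNotModified:
		if cached == nil {
			return nil, false, fmt.Errorf("got 304 without a cached copy")
		}
		return cached, false, nil // serve the local copy
	case http.StatusOK:
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, false, err
		}
		// ContentLength is -1 when the origin did not announce a length.
		if resp.ContentLength >= 0 && int64(len(body)) != resp.ContentLength {
			return nil, false, fmt.Errorf("short download: got %d of %d bytes", len(body), resp.ContentLength)
		}
		fresh := &cacheEntry{
			Body:         body,
			ETag:         resp.Header.Get("ETag"),
			LastModified: resp.Header.Get("Last-Modified"),
		}
		return fresh, true, nil
	default:
		return nil, false, fmt.Errorf("remote returned %d status code", resp.StatusCode)
	}
}

// demo fetches once without a cache and once with the cached validators.
func demo() (bool, bool, error) {
	const etag = `"v1"`
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("ETag", etag)
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Write([]byte("image-bytes"))
	}))
	defer srv.Close()

	entry, first, err := revalidate(srv.URL, nil)
	if err != nil {
		return false, false, err
	}
	_, second, err := revalidate(srv.URL, entry)
	return first, second, err
}

func main() {
	fmt.Println(demo())
}
```

In this sketch the first call downloads the body and the second call is answered with 304, so the cached entry is reused. A real implementation would persist the validators next to the cached file instead of keeping them in memory.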

Please notice that even after doing all this, it might not solve @bhzhu203's issues, as I believe these errors are very specific to their setup and network conditions. I suggest they use an https backend to see if that helps with the network problems.

A little bit off-topic, when checking remote backend, res.StatusCode != 404 or res.StatusCode == 200 which one will be better?🤔

@n0vad3v Probably res.StatusCode == 200, as it is more specific to a successful response.

The backend server is on a very low-latency internal network, with latency under 70ms.

Here is just the webp-server, no nginx. You can see the ETag is not working in webp-server:

curl 'http://39.108.154.116:3333/yfni/test/1632390011172e9f85c.jpg' \
  -H 'Accept: image/webp' \
  -H 'Connection: keep-alive' \
  -H 'If-Modified-Since: Fri, 26 May 2023 08:00:06 GMT' \
  -H 'If-None-Match: W/"481228-49F1ED1C"' ;


curl 'http://39.108.154.116:3333/yfni/test/1632390011172e9f85c.jpg' \
  -H 'Accept: image/webp' \
  -H 'Connection: keep-alive' \
  -H 'If-None-Match: W/"481228-49F1ED1C"' ;
bash-4.3$ curl 'http://39.108.154.116:3333/yfni/test/1632390011172e9f85c.jpg' \
>   -H 'Accept: image/webp' \
>   -H 'Connection: keep-alive' \
>   -H 'If-Modified-Since: Fri, 26 May 2023 08:00:06 GMT' \
>   -H 'If-None-Match: W/"481228-49F1ED1C"' ;
bash-4.3$ 
bash-4.3$ 
bash-4.3$ curl 'http://39.108.154.116:3333/yfni/test/1632390011172e9f85c.jpg' \
>   -H 'Accept: image/webp' \
>   -H 'Connection: keep-alive' \
>   -H 'If-None-Match: W/"481228-49F1ED1C"' ;
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

webp server log

INFO[2023-05-26 09:59:09][120:main.proxyHandler()] Remote Addr is http://static-erp.oss-cn-shenzhen-internal.aliyuncs.com/yfni/test/1632390011172e9f85c.jpg, fetching info... 
09:59:08 | 304 |    71ms | 220.184.151.149 | GET     | /yfni/test/1632390011172e9f85c.jpg 
INFO[2023-05-26 09:59:10][120:main.proxyHandler()] Remote Addr is http://static-erp.oss-cn-shenzhen-internal.aliyuncs.com/yfni/test/1632390011172e9f85c.jpg, fetching info... 
09:59:10 | 200 |    15ms | 220.184.151.149 | GET     | /yfni/test/1632390011172e9f85c.jpg 

Hi @bhzhu203, thanks for these; can you confirm this log is from the server in debug mode?

We have the following:

  • your backend is fast/close
  • as I can see in the logs you're not getting the != 200 warning
  • when the request has a proper If-Modified-Since header you get 304

My initial theory might not be totally correct in this case.

The incomplete images might be due to consecutive requests reaching the server:

  1. A first request hits the server for an image that is not in the local cache. The server starts fetching it from the remote. It then creates the download file and afterwards dumps the response body into it:

    out, err := os.Create(filepath)

  2. A second request hits the server, but the write buffer (?) of the previous, still ongoing request has not been completely flushed. The file is found but the server cannot find the webp version, so it calls the convert function, which creates the invalid image.

Regarding the odd behavior with the If-Modified-Since and If-None-Match headers: it might be due to the http server library: https://github.com/gofiber/fiber/blob/bf31f1f3c6e31a434d6489af3b904921820d7bb4/ctx.go#L534

I'm preparing a fix proposal

A second request hits the server, but the write buffer (?) of the previous, still ongoing request has not been completely flushed.

@bugfest Agree with this possibility, nice catch!
Maybe we can use an in-app-cache for a lock-like operation before the write operation is done, and check if the requesting path has incomplete file write in imageExists function, I'm gonna try to implement one continuing on PR #216.

Hi @n0vad3v,

* For the etag I want to propose two improvements
  
  * the first one using `fiber` etag middleware to handle the client side headers https://github.com/bugfest/webp_server_go/blob/bug213/webp-server.go#L115-L117 - this solves the `If-Modified-Since/If-None-Match` problem on that side.

Opened #218 for this piece
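For readers unfamiliar with client-side conditional requests, here is a minimal sketch of the behavior the middleware is meant to provide. Note the assumption: it uses Go's standard `net/http` (`http.ServeContent`), not `fiber` and not the code from the PR. `serveImage`, `status`, and `demo` are hypothetical names; the ETag value is the one from the curl output earlier in the thread.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// serveImage answers with body and lets http.ServeContent evaluate
// If-None-Match / If-Modified-Since against the ETag header and modTime.
func serveImage(body []byte, etag string, modTime time.Time) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("ETag", etag)
		w.Header().Set("Content-Type", "image/webp")
		http.ServeContent(w, r, "", modTime, bytes.NewReader(body))
	}
}

// status performs one in-memory GET and returns the response code.
func status(h http.Handler, ifNoneMatch string) int {
	req := httptest.NewRequest(http.MethodGet, "/image.jpg", nil)
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

// demo requests the image once without and once with a matching validator.
func demo() (int, int) {
	etag := `W/"481228-49F1ED1C"`
	modTime := time.Date(2023, 5, 26, 8, 0, 6, 0, time.UTC)
	h := serveImage([]byte("fake-webp-bytes"), etag, modTime)
	return status(h, ""), status(h, etag)
}

func main() {
	fmt.Println(demo())
}
```

In this sketch a request carrying only a matching If-None-Match is answered with 304, which is the behavior the second curl call earlier in the thread did not show.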

I've solved it by writing to a temp file first and then moving/renaming it (that one should be an atomic operation in the filesystem)
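A minimal sketch of the temp-file-then-rename idea, assuming a POSIX-like filesystem where a rename within one directory is atomic. `writeFileAtomic` and `demo` are hypothetical names, not the code from the PR.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// writeFileAtomic streams src into a temp file in dst's directory and renames
// it into place only after the copy has fully succeeded, so readers never see
// a partially written dst. (Hypothetical helper for illustration.)
func writeFileAtomic(dst string, src io.Reader) error {
	// Same directory as dst, so the rename stays on one filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(dst), ".download-*")
	if err != nil {
		return err
	}
	// Cleans up on failure; after a successful rename the temp name is gone
	// and this call simply returns an ignored error.
	defer os.Remove(tmp.Name())

	if _, err := io.Copy(tmp, src); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), dst)
}

// demo writes a fake image into a scratch directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	dst := filepath.Join(dir, "image.jpg")
	if err := writeFileAtomic(dst, strings.NewReader("fake-image-bytes")); err != nil {
		return "", err
	}
	data, err := os.ReadFile(dst)
	return string(data), err
}

func main() {
	fmt.Println(demo())
}
```

With this approach a concurrent request either finds no file at all or finds the complete file, but never a half-written one.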

With this implementation, I'm wondering what happens when the first request's _, err = io.Copy(out, resp.Body) operation has not finished and a second request arrives and finds that the local image doesn't exist: will this spawn another thread to download the file again?

My implementation on PR after previous comment is as https://github.com/webp-sh/webp_server_go/pull/216/files#diff-23c2816372c1361306d02f4c71938a8b2b4474cb4b26d711524167ae1175bf87R70-R80, this will create a KV pair to mark the download has not been finished, and will make the latter request on same image hold until it succeed, awaiting @BennyThink 's review on it.
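The "mark the download as unfinished and make later requests wait" idea can be sketched as below. This is not the code from PR #216: the `inflight` type, its `do` method, and the example key are hypothetical, and the download is simulated with a sleep.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// inflight tracks downloads in progress, keyed by image path (hypothetical).
type inflight struct {
	mu      sync.Mutex
	pending map[string]chan struct{}
}

func newInflight() *inflight {
	return &inflight{pending: make(map[string]chan struct{})}
}

// do runs download for key unless another goroutine is already running it, in
// which case it blocks until that run finishes. It reports whether this call
// ran the download itself.
func (f *inflight) do(key string, download func()) bool {
	f.mu.Lock()
	if done, ok := f.pending[key]; ok {
		f.mu.Unlock()
		<-done // wait for the download that is already in progress
		return false
	}
	done := make(chan struct{})
	f.pending[key] = done
	f.mu.Unlock()

	defer func() {
		f.mu.Lock()
		delete(f.pending, key)
		f.mu.Unlock()
		close(done) // release every waiter
	}()
	download()
	return true
}

// demo fires several concurrent requests for the same image and returns how
// many real downloads happened.
func demo(callers int) int32 {
	f := newInflight()
	var downloads int32
	cached := false // stands in for "file exists in the local cache"
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f.do("yfni/test/example.jpg", func() {
				if cached {
					return // an earlier run already filled the cache
				}
				time.Sleep(50 * time.Millisecond) // simulated slow download
				atomic.AddInt32(&downloads, 1)
				cached = true
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&downloads)
}

func main() {
	fmt.Println(demo(8)) // prints 1
}
```

Runs for the same key are serialized by the mutex, which is why the plain `cached` flag is safe here. A waiter that returns `false` should still re-check the cache afterwards, because the download it waited for may have failed.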

#218 looks good to me, @BennyThink will you please take a look at this PR too?

I've solved it by writing to a temp file first and then moving/renaming it (that one should be an atomic operation in the filesystem)

With this implementation, I'm wondering what happens when the first request's _, err = io.Copy(out, resp.Body) operation has not finished and a second request arrives and finds that the local image doesn't exist: will this spawn another thread to download the file again?

The second request won't find the cached version, so it triggers a new parallel download and overwrites the cached file created by the first one.

My implementation on PR after previous comment is as https://github.com/webp-sh/webp_server_go/pull/216/files#diff-23c2816372c1361306d02f4c71938a8b2b4474cb4b26d711524167ae1175bf87R70-R80, this will create a KV pair to mark the download has not been finished, and will make the latter request on same image hold until it succeed, awaiting @BennyThink 's review on it.

I prefer your approach tbh; not sure about the sync wait time though; I wonder if there's an async way to resume that operation

The latest commit in branch fix-proxy-mode still does not fix the issue. But after I additionally merged the branch bugfest:etag-client-304 (branch fix-proxy-mode + bugfest:etag-client-304), the pictures are nearly fine; only one or two pictures still hit this issue.

Hi @bhzhu203, can you check the uptime of the service? Can you check if it has been killed at some point?

# Assuming Linux systemd system:
sudo journalctl -k -g '(?i)killed'

# Unix
sudo dmesg | grep -i killed

Nothing was killed. It is running stably now; memory usage is often around 2.4GB.

[screenshot]

But after I additionally merged the branch bugfest:etag-client-304 (branch fix-proxy-mode + bugfest:etag-client-304), the pictures are nearly fine

Nice, I've merged those PRs and am ready to release 0.8.3.

0.8.3 released: https://github.com/webp-sh/webp_server_go/releases/tag/0.8.3
In my understanding, issues initially addressed here are now resolved, and currently the only problem now is high RAM usage? @bhzhu203
If so, we can close this PR and discuss RAM usage problem on #198.

I haven't seen corrupted images these days with the latest version; it is stable now.

Nice, now I'm closing this issue.