Make readiness timeout configurable

Question

Make readiness timeout configurable

mohammedi-haroune opened this issue 3 years ago · comments

I'm having an issue integrating cdsdashboards to our ML platform http://iko.ai, my dashboard server is timing out during readiness and there is no way to configure the time out value because it's hardcoded here: https://github.com/ideonate/jhsingle-native-proxy/blob/master/jhsingle_native_proxy/proxyhandlers.py#L700

I'd like to be able to configure the readiness timeout value through command-line argument and/or environment variables?

Debug logs:

INFO:tornado.application:SuperviseAndProxyHandler http_get 55527
DEBUG:tornado.application:Allowing Hub user USER_NAME
DEBUG:tornado.application:Storing origin host SERVER_NAME
INFO:tornado.application:['streamlit', 'run', '/home/USER_NAME/./vosk_app.py', '--server.port=55527', '--server.headless=True', '--browser.             serverAddress=SERVER_NAME', '--browser.gatherUsageStats=false']
DEBUG:tornado.application:Trying to start mainprocess
DEBUG:tornado.application:Started mainprocess
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 0.00495457649230957 seconds, next check in 0.01s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 0.017713308334350586 seconds, next check in 0.02s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 0.04067730903625488 seconds, next check in 0.04s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 0.08370685577392578 seconds, next check in 0.08s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 0.1665360927581787 seconds, next check in 0.16s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 0.329953670501709 seconds, next check in 0.32s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 0.6533911228179932 seconds, next check in 0.64s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 1.2969000339508057 seconds, next check in 1.28s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 2.5811028480529785 seconds, next check in 2.56s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 5.144717454910278 seconds, next check in 4.847672929763794s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 9.998777389526367 seconds, next check in -0.0035436248779296877s
DEBUG:tornado.application:Connection to http://localhost:55527/ refused
DEBUG:tornado.application:Readyness: False after 10.007751226425171 seconds, next check in -0.007087249755859375s

Dan Lester · Answer 1 · Mon Apr 12 2021 19:26:38 GMT+0800 (China Standard Time)

This idea probably makes sense, and I would like to add it as an option at some point.

However, it may be more efficient for you to use the --ready-check-path=/ready-check argument instead. This allows you to specify an alternative URL (e.g. /ready-check) that will tell jhsingle-native-proxy if the web server is ready. It is inefficient to request the main Streamlit app to do this really - maybe you can choose the URL to a smaller static asset from Streamlit that at least would indicate the Streamlit web server is ready to receive requests. This should be able to respond much quicker, and ultimately means that JupyterHub can redirect users to the dashboard sooner, so their request for the main Streamlit app can begin sooner too.

Please let me know if that helps.

Haroune Mohammedi · Answer 2 · Mon Apr 12 2021 19:59:27 GMT+0800 (China Standard Time)

I'll try to set --ready-check-path=/healthz for streamlit and what this what it gives

Haroune Mohammedi · Answer 3 · Tue Apr 13 2021 16:06:08 GMT+0800 (China Standard Time)

Unfortunately, I couldn't reproduce the bug again, even though --ready-check-path=/healthz is working for streamlit dashboards, we can't be sure it fixes the bug until reproducing it.

I'll get back whenever the bug is reproduced.

Dan Lester · Answer 4 · Tue Apr 13 2021 18:51:19 GMT+0800 (China Standard Time)

Thanks for the update, but just to clarify do you mean:

Adding --ready-check-path=/healthz seems to work well for your dashboard starting OK. But that you no longer have the problem anyway, even if you drop --ready-check-path again.

?

Haroune Mohammedi · Answer 5 · Tue Apr 13 2021 19:05:11 GMT+0800 (China Standard Time)

Thanks for the update, but just to clarify do you mean:

Adding --ready-check-path=/healthz seems to work well for your dashboard starting OK. But that you no longer have the problem anyway, even if you drop --ready-check-path again.

?

Yes, that's it.

Haroune Mohammedi · Answer 6 · Tue Apr 13 2021 22:30:33 GMT+0800 (China Standard Time)

The bug is still happening even with --ready-check-path=/healthz

Haroune Mohammedi · Answer 7 · Thu Apr 15 2021 21:16:48 GMT+0800 (China Standard Time)

I reproduced the bug with the following flask app:

from flask import Flask
from time import sleep

app = Flask(__name__)

@app.route('/')
def hello_world():
    sleep(30)
    return 'Hello, World!'

and theses configurations:

c.CDSDashboardsConfig.presentation_types = ['flask']

c.VariableMixin.extra_presentation_launchers = {
    'flask': {
        'args': [
            'flask', 'run', '{--}port={port}'
        ],
        'env': {'FLASK_APP': 'app.py'},
    }
}

Haroune Mohammedi · Answer 8 · Thu Apr 15 2021 22:29:10 GMT+0800 (China Standard Time)

A better reproduction app:

from flask import Flask
from time import sleep, time

app = Flask(__name__)

started_at = time()
ready_after = 30

@app.route('/')
def hello_world():
    now = time()
    if now - started_at > ready_after:
        return 'Appliation ready!'
    else:
        # sleep for more that 1 seconds to make the _ready_check call fail: 
        # https://github.com/yuvipanda/simpervisor/blob/master/simpervisor/process.py#L196
        # Even status code 400 is considered as ready as per https://github.com/ideonate/jhsingle-native-proxy/blob/master/jhsingle_native_proxy/proxyhandlers.py#L707
        # return 'Application not yet ready! Please try later', 400
        sleep(5)

Why better?

The first app will always fail to start even if we increase the ready_timeout, it fails because the ready_func call is only waited for 1 second in https://github.com/yuvipanda/simpervisor/blob/master/simpervisor/process.py#L196

Dan Lester · Answer 9 · Tue Apr 20 2021 21:15:56 GMT+0800 (China Standard Time)

By the way, it would be great if you'd like to take a look at the experimental Flask framework:

https://cdsdashboards.readthedocs.io/en/stable/chapters/userguide/frameworks/flask.html

That should be more flexible for being able to specify a flask script to run, and it does it through gunicorn instead of the dev server. There may be some tweaks needed...