Fetching latest checkpoint can get stuck

Question

morph-dev opened this issue 9 months ago · comments

Many tests on CircleCI were failing yesterday, in most cases because there was no output for 10 minutes.
This doesn't happen frequently, but it's not the first time it has happened.
It also isn't completely random: when it starts happening, it affects most tests running at that time (so the only solution is to wait and try again a few hours later).

Luckily, it happened on my machine as well and I noticed that one test was completely stuck: test_fetch_latest_checkpoints.

After a bit of digging, it seems that we are fetching checkpoints from multiple sources, and if any of them is stuck, the entire function (fetch_latest_checkpoints) is stuck too. I'm guessing that one of the sources has issues every now and then and we don't recover properly.

Another issue is that we are using actual data from the internet and spamming real servers for no reason (we should use mock data in tests).

I propose two improvements, and I think we should do both:

1. Add a timeout to our requests. I don't know why they don't fail on their own (maybe the library doesn't support that?).
   - improves app functionality
   - probably fixes the tests (because we have multiple sources); at least they will not get stuck, but we will still spam real servers when we run tests

2. Don't fetch data from the internet during tests (use a mock server).
   - fixes our tests
   - prevents us from unnecessarily spamming real servers
   - doesn't improve our app (the actual app can still get stuck). That might not be too bad, as it happens only once and only in the fallback case (at least to my understanding)
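To illustrate the first proposal, here is a minimal sketch in Python (chosen only for illustration; it is not the project's code). It assumes each source can be wrapped as an async callable. The names fetch_with_timeout, fetch_checkpoints_with_timeout, healthy_source, and stuck_source are hypothetical. Each source gets its own deadline, so one stuck source no longer blocks the whole fetch.

```python
import asyncio


async def fetch_with_timeout(source, timeout):
    """Await one source; return None if it errors or exceeds the timeout."""
    try:
        return await asyncio.wait_for(source(), timeout)
    except (asyncio.TimeoutError, OSError):
        # A stuck or unreachable source is skipped instead of blocking the caller.
        return None


async def fetch_checkpoints_with_timeout(sources, timeout=5.0):
    """Query all sources concurrently and keep only the ones that answered in time."""
    results = await asyncio.gather(
        *(fetch_with_timeout(source, timeout) for source in sources)
    )
    return [result for result in results if result is not None]


# Hypothetical stand-ins for real checkpoint sources.
async def healthy_source():
    await asyncio.sleep(0.01)
    return "0xabc"


async def stuck_source():
    await asyncio.sleep(3600)  # simulates a server that never responds
    return "never returned"


if __name__ == "__main__":
    checkpoints = asyncio.run(
        fetch_checkpoints_with_timeout([healthy_source, stuck_source], timeout=0.1)
    )
    print(checkpoints)
```

With this shape, the caller still has to decide what to do when every source times out and the returned list is empty.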

Nick Gheorghita · Answer 1 · Thu Feb 01 2024 22:48:31 GMT+0800 (China Standard Time)

Just to add a bit of context - #1099 (comment)

I do agree that implementing both solutions is ideal, but I'd start with solution 1. IIUC, solution 2 will be somewhat involved since we have to mock HTTP requests to many servers. Maybe there's a way around this? But yeah, the fact that the test is failing again so suddenly after being fixed is a strong indication that this tech debt needs addressing.
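One possible way around mocking many servers, sketched in Python purely for illustration: if the list of source URLs is configurable, tests could point every source at a local mock server. Everything here is an assumption, including the JSON payload shape and the names MockCheckpointHandler, start_mock_server, and fetch_checkpoint; the real sources may respond in a different format.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockCheckpointHandler(BaseHTTPRequestHandler):
    """Serves a fixed, made-up checkpoint payload for any GET request."""

    def do_GET(self):
        body = json.dumps({"checkpoint": "0xabc"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep test output quiet.
        pass


def start_mock_server():
    """Start a mock server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), MockCheckpointHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_checkpoint(url, timeout=2.0):
    """Fetch one checkpoint; the timeout keeps a dead server from hanging the test."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["checkpoint"]


if __name__ == "__main__":
    server = start_mock_server()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        print(fetch_checkpoint(url))
    finally:
        server.shutdown()
        server.server_close()
```

A test that needs several distinct sources could start one such server per source, each on its own port, without touching the real servers.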