stellar / quickstart

Home of the stellar/quickstart docker image for development and testing

Is the public archive / quickstart image for v20 good now for Horizon ingestion?!

jun0tpyrc opened this issue

What version are you using?

quickstart: 85a2c8b

What did you do?

used the stellar/quickstart image and tuned only two things:

export HISTORY_RETENTION_COUNT=1500000
export PER_HOUR_RATE_LIMIT="0"
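
For reference, a rough sketch of how those two overrides can be passed into the quickstart container; the volume path and port mapping below are illustrative assumptions (not taken from this report), and it assumes the image passes these environment variables through to Horizon, as the report implies:

# hedged sketch: pass the two tuned settings via -e; volume and ports are examples only
# network selection flag depends on the quickstart version (e.g. --pubnet or --network pubnet)
docker run -d \
  --name stellar \
  -e HISTORY_RETENTION_COUNT=1500000 \
  -e PER_HOUR_RATE_LIMIT=0 \
  -v /data/stellar:/opt/stellar \
  -p 8000:8000 \
  stellar/quickstart --pubnet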

What did you expect to see?

nodes can get in sync

What did you see instead?

  • ingestion fails and keeps looping; even after pruning the bucket / bucket-cache directories for Horizon's captive-core, we get the same result (multiple times)
time="2024-01-29T23:42:29.889Z" level=info msg="Processing ledger entry changes" pid=86 processed_entries=36500000 progress="83.95%" sequence=50141695 service=ingest source=historyArchive
time="2024-01-29T23:45:38.843Z" level=info msg="History: Catching up to ledger 50141759: Download & apply checkpoints: num checkpoints left to apply:0 (100% done)" pid=86 service=ingest s
ubservice=stellar-core
time="2024-01-29T23:45:43.082Z" level=info msg="History: Catching up to ledger 50142591: downloading ledger files 13/13 (100%)" pid=86 service=ingest subservice=stellar-core
time="2024-01-29T23:45:46.321Z" level=info msg="History: Catching up to ledger 50142591: Download & apply checkpoints: num checkpoints left to apply:13 (0% done)" pid=86 service=ingest su
bservice=stellar-core
time="2024-01-29T23:45:46.321Z" level=info msg="History: Catching up to ledger 50142591: Download & apply checkpoints: num checkpoints left to apply:13 (0% done)" pid=86 service=ingest su
bservice=stellar-core
time="2024-01-29T23:46:22.160Z" level=info msg="History: Catching up to ledger 50142591: Download & apply checkpoints: num checkpoints left to apply:12 (7% done)" pid=86 service=ingest su
bservice=stellar-core
...
(the catch-up breaks at a different checkpoint boundary each time and processed_entries restarts from the beginning, for example:)
...
time="2024-01-29T23:53:21.015Z" level=info msg="Processing ledger entry changes" pid=86 processed_entries=100000 progress="0.35%" sequence=50142591 service=ingest source=historyArchive

Can you upload an unabridged chunk of the Horizon and Stellar Core logs, at least where the restart is occurring?

stellar-core is in sync with the network head, but Horizon can't ingest:

  "horizon_version": "horizon-v2.28.0-(built-from-source)",
  "core_version": "v20.1.0",
  "ingest_latest_ledger": 0,
  "history_latest_ledger": 0,
  "history_latest_ledger_closed_at": "0001-01-01T00:00:00Z",
  "history_elder_ledger": 0,
  "core_latest_ledger": 50158275,
  "network_passphrase": "Public Global Stellar Network ; September 2015",
  "current_protocol_version": 19,
  "supported_protocol_version": 20,
  "core_supported_protocol_version": 20

Truncated logs, excluding HTTP API request logs (grep -v method=GET), are in the attachment below; these lines highlight the ingestion loop restarting:

time="2024-01-30T14:37:05.751Z" level=info msg="Processing ledger entry changes" pid=218 processed_entries=2700000 progress="11.36%" sequence=50151423
...
time="2024-01-30T14:38:41.316Z" level=info msg="Processing ledger entry changes" pid=218 processed_entries=50000 progress="0.34%" sequence=50151551

example-logs-fail-sync-loop.txt
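
For context, the trimmed view above was produced with roughly this kind of filtering (the log file name here is hypothetical):

# drop HTTP request noise, then follow the ingestion progress lines to see processed_entries reset
grep -v 'method=GET' horizon.log | grep 'Processing ledger entry changes'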

Hmm... it's hard to debug because it starts off in an error state, but the logs suggest something might be up with the cache. Can you try the following? Stop Horizon, remove the bucket cache, and start it again:

supervisorctl stop horizon
rm -rf /opt/stellar/horizon/captive-core/bucket-cache/
supervisorctl start horizon

If this fixes it, it may be a bug with how caching works. To be more certain, could you provide another chunk of logs but leave more entries prior to the restart? Before trying the above, ideally.

We have tried this multiple times, pruning not only /horizon/captive-core/bucket-cache/ but the whole set of

bucket-cache  captive-core/buckets  stellar.db  stellar.db-shm  stellar.db-wal

folders (buckets, bucket-cache, and the Horizon DB), and over several days it never got in sync.
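
Roughly, that fuller reset looks like the following; the exact paths under the captive-core directory are assumptions pieced together from this thread, not verified against the image layout:

# hedged sketch of the more aggressive reset described above; adjust paths to your layout
supervisorctl stop horizon
rm -rf /opt/stellar/horizon/captive-core/bucket-cache/ \
       /opt/stellar/horizon/captive-core/buckets/ \
       /opt/stellar/horizon/captive-core/stellar.db*
supervisorctl start horizon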

FYI, the impact may not be limited to a fresh v20 sync: a teammate also reported that a node upgraded from v19 to v20 "is stuck in a bucket download + ledger process loop", which might be a similar issue.

@Shaptic, can this be closed now that stellar/go#5197 is merged? It looks like the fix is headed into the upcoming Horizon 2.28.2.

I think we can only close it once a new quickstart is released 👍 @jun0tpyrc can then reopen if the issue persists after upgrading.

This should be closed by #565, please reopen if not!

Confirmed the latest quickstart image is working for a quick sync, thanks.