near / nearcore

Reference client for NEAR Protocol

Home Page: https://near.org


Do not panic upon receiving invalid state witness

walnut-the-cat opened this issue · comments

Relevant discussion: link

As of now, when a chunk validator receives an invalid state witness (e.g. one with missing contract code), it panics and crashes. We should change the behavior so that it simply logs the issue and moves on without panicking.
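The desired behavior can be sketched as follows. This is a minimal illustration with a hypothetical stand-in error type (`WitnessError` is not nearcore's actual API): log the invalid witness and surface it as an error, so the validator thread keeps running.

```rust
// Hypothetical stand-in error type, for illustration only.
#[derive(Debug)]
enum WitnessError {
    MissingContractCode,
}

// Instead of unwrapping (and panicking) on a bad witness, log the
// problem and return an Err so the caller can discard the witness.
fn process_witness(result: Result<(), WitnessError>) -> Result<(), WitnessError> {
    if let Err(err) = &result {
        eprintln!("invalid state witness, rejecting: {err:?}");
    }
    result
}
```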

The backtrace from the linked Zulip thread suggests that the panic happened inside <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes

2024-04-16T20:21:23.545144Z DEBUG chunk_tracing{chunk_hash=HnFSQEoLMEnMXK2pxnnnbv7GkwFobanyrd7JJbNS2Rrj}:new_chunk{shard_id=3}:apply_chunk{shard_id=3}:process_state_update:apply{protocol_version=84 num_transactions=19}:process_receipt{receipt_id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP predecessor=operator.meta-pool.near receiver=lockup-meta-pool.near id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP}:run{code.hash=EXekfV3kpFHHsTi4JUDh2MVLCKS3hpKdPbXMuRirxrvY vm_kind=NearVm}: vm: close time.busy=49.3µs time.idle=3.42µs
thread '<unnamed>' panicked at core/store/src/trie/trie_storage.rs:317:16:
!!!CRASH!!!: MissingTrieValue(TrieMemoryPartialStorage, 5FWvfWAJxH1mbCHuzLGwBfL9EYjH8YWVin6Pmp3H8gdM)
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes
   4: near_store::trie::Trie::internal_retrieve_trie_node
   5: near_store::trie::Trie::retrieve_raw_node
   6: near_store::trie::Trie::lookup_from_state_column
   7: near_store::trie::Trie::get_optimized_ref
   8: near_store::trie::Trie::get
   9: near_store::trie::update::TrieUpdate::get
  10: near_store::get_code
  11: node_runtime::actions::execute_function_call
  12: node_runtime::Runtime::apply_action
  13: node_runtime::Runtime::apply_action_receipt
  14: node_runtime::Runtime::apply::{{closure}}
  15: node_runtime::Runtime::apply
  16: <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk
  17: near_chain::update_shard::apply_new_chunk
  18: core::ops::function::FnOnce::call_once{{vtable.shim}}
  19: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
  20: rayon_core::registry::WorkerThread::wait_until_cold

But looking at master, there are no unwraps in this function 0_o:

impl TrieStorage for TrieMemoryPartialStorage {
    fn retrieve_raw_bytes(&self, hash: &CryptoHash) -> Result<Arc<[u8]>, StorageError> {
        let result =
            self.recorded_storage.get(hash).cloned().ok_or(StorageError::MissingTrieValue(
                MissingTrieValueContext::TrieMemoryPartialStorage,
                *hash,
            ));
        if result.is_ok() {
            self.visited_nodes.borrow_mut().insert(*hash);
        }
        result
    }
}

That's pretty confusing 0_O.

@staffik you mentioned that the code was at commit b62b6a, but I don't know where to find it. Is it on some private branch?

This was commit from Alex PR: b62b6a3.
Maybe this unwrap was removed recently in master.

This was commit from Alex PR: b62b6a3.

Ah thanks for the link, now I see it 👍

Maybe this unwrap was removed recently in master.

I took a look and it's the same code as on master.
Mysterious 0_O

I see that this line (reached through validate_chunk_state_witness --> apply_new_chunk --> apply_chunk) may also get to MissingTrieValue:

_ => panic!("{err}"),

Oh, we used tracing. It was: https://github.com/near/nearcore/pull/10843/files#diff-e073548a40d97af14f75cf143fab41a1cffe61d159e0b9a6297daeab0b2a5d45R317

Ahh ok, so the panic was caused by custom code that was added for debug purposes. The code on master doesn't have expect("!!!CRASH!!!"), so there's nothing to fix there.
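To make the contrast concrete, here is a minimal sketch (with hypothetical stand-in types, not the real trie storage API) of the debug-commit style versus what master does:

```rust
// Hypothetical stand-ins for the real trie types, for illustration only.
type Bytes = Vec<u8>;
#[derive(Debug)]
struct MissingTrieValue;

fn retrieve_raw_bytes(found: bool) -> Result<Bytes, MissingTrieValue> {
    if found { Ok(vec![1, 2, 3]) } else { Err(MissingTrieValue) }
}

// Debug-commit style: expect() panics the thread when the value is missing.
#[allow(dead_code)]
fn retrieve_debug() -> Bytes {
    retrieve_raw_bytes(false).expect("!!!CRASH!!!")
}

// Master style: `?` propagates the error to the caller instead of panicking.
fn retrieve_master() -> Result<Bytes, MissingTrieValue> {
    Ok(retrieve_raw_bytes(false)?)
}
```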

I see that this line (reached through validate_chunk_state_witness --> apply_new_chunk --> apply_chunk) may also get to MissingTrieValue:

_ => panic!("{err}"),

Err(e) => match e {
    Error::StorageError(err) => match &err {
        StorageError::FlatStorageBlockNotSupported(_)
        | StorageError::MissingTrieValue(..) => Err(err.into()),
        _ => panic!("{err}"),
    },
    _ => Err(e),
},

This won't panic on MissingTrieValue, it'll return an error.
It could panic if it hits another StorageError that isn't accounted for, but I don't know if that's even possible 0_o
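One way to rule that out is a helper that matches exhaustively over every variant, so nothing can fall through to a panic. This is only a sketch over a simplified copy of the enum (fields trimmed), and the fatal/recoverable split below is an assumption, not nearcore's actual classification:

```rust
// Simplified copy of StorageError for illustration (fields trimmed).
#[derive(Debug)]
enum StorageError {
    StorageInternalError,
    MissingTrieValue,
    UnexpectedTrieValue,
    StorageInconsistentState(String),
    FlatStorageBlockNotSupported(String),
    MemTrieLoadingError(String),
}

// Exhaustive over all variants: true means "an invalid witness could
// plausibly cause this, reject the witness"; false means "local fault".
// No wildcard arm, so adding a variant forces a compile-time decision.
fn is_witness_fault(err: &StorageError) -> bool {
    match err {
        StorageError::MissingTrieValue
        | StorageError::FlatStorageBlockNotSupported(_) => true,
        StorageError::StorageInternalError
        | StorageError::UnexpectedTrieValue
        | StorageError::StorageInconsistentState(_)
        | StorageError::MemTrieLoadingError(_) => false,
    }
}
```

The point of the missing wildcard arm is that a newly added StorageError variant becomes a compile error here, rather than silently hitting a catch-all panic at runtime.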

It seems that most StorageError variants are fatal errors (corrupted database, etc.):

/// Errors which may occur during working with trie storages, storing
/// trie values (trie nodes and state values) by their hashes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StorageError {
    /// Key-value db internal failure
    StorageInternalError,
    /// Requested trie value by its hash which is missing in storage.
    MissingTrieValue(MissingTrieValueContext, CryptoHash),
    /// Found trie node which shouldn't be part of state. Raised during
    /// validation of state sync parts where incorrect node was passed.
    /// TODO (#8997): consider including hash of trie node.
    UnexpectedTrieValue,
    /// Either invalid state or key-value db is corrupted.
    /// For PartialStorage it cannot be corrupted.
    /// Error message is unreliable and for debugging purposes only. It's also probably ok to
    /// panic in every place that produces this error.
    /// We can check if db is corrupted by verifying everything in the state trie.
    StorageInconsistentState(String),
    /// Flat storage error, meaning that it doesn't support some block anymore.
    /// We guarantee that such block cannot become final, thus block processing
    /// must resume normally.
    FlatStorageBlockNotSupported(String),
    /// In-memory trie could not be loaded for some reason.
    MemTrieLoadingError(String),
}

Maybe we shouldn't panic on UnexpectedTrieValue, that looks like something that could be triggered by an invalid witness. But OTOH we have tests which check for this error and they don't trigger the panic, so it's probably a different code path. /cc @Longarithm

Idk, it doesn't feel very productive to read the code in hopes of finding a possible panic. AFAIU the panic that spawned this issue can't happen on master, so there's nothing concrete to fix.

I remember that we wanted to fuzz the validation code, maybe that'd be a quicker way to find possible crashes in validation?
And good validation tests would ensure that the validation doesn't crash in the future, when the code changes.

Made an issue about fuzzing: #11132

This issue is no longer valid. Closing it.