spack / spack-infrastructure

Spack Kubernetes instance and services running there (GitLab, CDash, spack.io)

Error taxonomy: detail errors in "spack_error"

alalazo opened this issue

I started looking at failures from the last week, and I have some proposals for extending the taxonomy:

# Spack timed out while waiting for a lock
lock_timeout:
    grep_for:
    - "Timed out waiting for a write lock after"

# A file was not found during installation    
file_not_found:
    grep_for:
    - "FileNotFoundError"
    
# A command on the system exited non-zero
system_command_failure:
    grep_for:
    - "ProcessError: Command exited with status"
    
# Couldn't fetch an archive to build a spec
fetch_error:
    grep_for:
    - "FetchError:"

# Some method in a build_system base class failed
build_systems:
    grep_for:
    - "File \"/builds/spack/spack/lib/spack/spack/build_systems"

# Some attribute in a recipe was not found
attribute_error:
    grep_for:
    - "AttributeError:"

# Relocation refused to grow a string from buildcache
cannot_grow_string:
    grep_for:
    - "CannotGrowString"

I also noticed a few socket.timeout errors, but I see that those are already taken care of in #439.

Notes

  • The lock timeout, so far, seems to happen only on POWER
  • I have seen a few cases of CMake returning -4. Does that have a particular meaning?

I will work on these. In the meantime, can you send the failed jobs that you found these in? That way I can make sure the classification order and regexes are working correctly.

To get the categorization above, I checked the raw logs for all the "spack_error" failures from last week. Posting one link per refined category:

@alalazo do you have numbers on the frequency of each of these?

Number of logs in each category (out of ~500 logs downloaded):

  • lock_timeout: 210 (165 on POWER)
  • file_not_found: 18
  • system_command_failure: 171
  • fetch_error: 37
  • build_systems: 6
  • attribute_error: 12
  • cannot_grow_string: 29

Also, I forgot to mention:

Looking at the system command failures: the one I linked seems to be a case of scheduling to the wrong node, so (thanks @haampie for the suggestion) in most of these cases the failure might be an illegal instruction.1

For https://gitlab.spack.io/spack/spack/-/jobs/6366035 specifically I wonder:

  • Why are we having a skylake_avx512 buildcache? Was there some transient error in spack/spack#36045 that caused that?
  • Why are we mapping an AVX512 build to a zen2 (I think) node? A quick compatibility check is sketched below.
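
On the illegal-instruction hypothesis: if the -4 that CMake reports follows the usual negative-returncode convention of Python's subprocess (process killed by signal |N|), it would correspond to signal 4, SIGILL, which fits an AVX-512 binary landing on a zen2 runner. A rough way to check whether a node can execute binaries built for a given target is archspec, which Spack already uses for microarchitecture detection; the sketch below is illustrative and assumes archspec's partial-order comparison between targets, nothing else.

# Illustrative sketch using archspec (https://github.com/archspec/archspec).
# Assumption: "target <= host" holds when the host microarchitecture is the
# target itself or a descendant of it, i.e. the host supports at least the
# target's feature set; unrelated families (zen2 vs. skylake_avx512) compare
# as incompatible.
import archspec.cpu

def can_run(target_name):
    """True if the host can execute binaries built for target_name."""
    return archspec.cpu.TARGETS[target_name] <= archspec.cpu.host()

if __name__ == "__main__":
    print("host:", archspec.cpu.host().name)
    for name in ("zen2", "skylake_avx512"):
        print(name, "->", "ok" if can_run(name) else "NOT runnable here")
    # On a zen2 node this should report skylake_avx512 as not runnable,
    # consistent with SIGILL / illegal instruction failures.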

Footnotes

  1. Hopefully all of them, since the other hypothesis is errors in relocation.

I fixed that here: spack/spack@dba57ff