Error taxonomy: detail errors in "spack_error"
alalazo opened this issue
I started looking at the failures from the last week, and I have some proposals for extending the taxonomy:
# Spack timed out while waiting for a lock
lock_timeout:
  grep_for:
    - "Timed out waiting for a write lock after"
# A file was not found during installation
file_not_found:
  grep_for:
    - "FileNotFoundError"
# A command on the system exited non-zero
system_command_failure:
  grep_for:
    - "ProcessError: Command exited with status"
# Couldn't fetch an archive to build a spec
fetch_error:
  grep_for:
    - "FetchError:"
# Some method in a build_system base class failed
build_systems:
  grep_for:
    - "File \"/builds/spack/spack/lib/spack/spack/build_systems"
# Some attribute in a recipe was not found
attribute_error:
  grep_for:
    - "AttributeError:"
# Relocation refused to grow a string from buildcache
cannot_grow_string:
  grep_for:
    - "CannotGrowString"
I also noticed a few socket.timeout errors, but I see that those are already taken care of in #439.
Notes
- The lock timeout, so far, seems to happen only on POWER
- I have seen a few cases of CMake returning -4. Does that have a particular meaning?
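On the -4 question, one possible reading, assuming the status is reported the way Python's `subprocess` module reports it: a negative return code `-N` means the child process was terminated by signal `N` (POSIX only). The sketch below decodes such a status with the standard `signal` module; the helper name `describe_exit_status` is hypothetical and not part of Spack.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a subprocess-style return code in words.

    Assumption: the status follows the Python ``subprocess`` convention,
    where a negative value -N means "terminated by signal N".
    """
    if status >= 0:
        return f"exited with status {status}"
    try:
        # Map the signal number to its symbolic name on this platform.
        name = signal.Signals(-status).name
    except ValueError:
        # Unknown signal number on this platform: fall back to the raw number.
        name = f"signal {-status}"
    return f"killed by {name}"


print(describe_exit_status(-4))
```

On Linux, signal 4 is SIGILL (illegal instruction), so under this assumption a CMake "status -4" would point to a binary using CPU instructions that the node does not support.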
I will work on these. In the meantime, can you send the failed jobs that you found these in? That way I can make sure the classification order and regexes are working correctly.
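To illustrate why classification order matters, here is a minimal sketch of a first-match classifier over the proposed patterns. It assumes that each `grep_for` entry is treated as a regular expression, that categories are tried in the order listed, and that unmatched logs fall back to "spack_error"; none of this is confirmed to be how the actual classifier works. The function name `classify` and the sample log are hypothetical.

```python
import re

# Ordered list of (category, patterns), copied from the proposal above.
# In this sketch the first category with a matching pattern wins.
TAXONOMY = [
    ("lock_timeout", [r"Timed out waiting for a write lock after"]),
    ("file_not_found", [r"FileNotFoundError"]),
    ("system_command_failure", [r"ProcessError: Command exited with status"]),
    ("fetch_error", [r"FetchError:"]),
    ("build_systems", [r'File "/builds/spack/spack/lib/spack/spack/build_systems']),
    ("attribute_error", [r"AttributeError:"]),
    ("cannot_grow_string", [r"CannotGrowString"]),
]


def classify(log_text, taxonomy=TAXONOMY, fallback="spack_error"):
    """Return the first category whose patterns match the log text."""
    for category, patterns in taxonomy:
        if any(re.search(pattern, log_text) for pattern in patterns):
            return category
    # Assumption: logs matching nothing stay in the generic bucket.
    return fallback


# Hypothetical log excerpt that matches two categories at once.
sample = (
    'File "/builds/spack/spack/lib/spack/spack/build_systems/cmake.py", line 1\n'
    "ProcessError: Command exited with status 2"
)
print(classify(sample))  # system_command_failure (listed before build_systems)
```

A log like the sample matches both system_command_failure and build_systems, so whichever category is listed first decides the result. Testing against real failed jobs is what would show whether the chosen order gives the intended classification.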
To get the categorization above, I checked the raw logs for all the "spack_error" jobs of last week. Posting one link per refined category:
- lock_timeout: https://gitlab.spack.io/spack/spack/-/jobs/6344522
- file_not_found: https://gitlab.spack.io/spack/spack/-/jobs/6374147
- system_command_failure: https://gitlab.spack.io/spack/spack/-/jobs/6366035
- fetch_error: https://gitlab.spack.io/spack/spack/-/jobs/6312534
- build_systems: https://gitlab.spack.io/spack/spack/-/jobs/6336481
- attribute_error: https://gitlab.spack.io/spack/spack/-/jobs/6322911
- cannot_grow_string: https://gitlab.spack.io/spack/spack/-/jobs/6317754
@alalazo do you have numbers on the frequency of each of these?
Number of logs in each category (out of ~500 logs downloaded):
- lock_timeout: 210 (165 on POWER)
- file_not_found: 18
- system_command_failure: 171
- fetch_error: 37
- build_systems: 6
- attribute_error: 12
- cannot_grow_string: 29
Also, forgot to mention:
- os_error (grep for "OSError", got 4 of these): https://gitlab.spack.io/spack/spack/-/jobs/6309020
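For a quick sense of proportions, the counts above can be turned into shares. This is only arithmetic on the numbers posted in this thread; it assumes each log was counted in exactly one category, which the thread does not state.

```python
# Counts as posted above (os_error included).
counts = {
    "lock_timeout": 210,
    "file_not_found": 18,
    "system_command_failure": 171,
    "fetch_error": 37,
    "build_systems": 6,
    "attribute_error": 12,
    "cannot_grow_string": 29,
    "os_error": 4,
}
lock_timeout_on_power = 165

total = sum(counts.values())  # 487, close to the ~500 logs downloaded

# Print categories from most to least frequent with their share of the total.
for category, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{category:24s} {n:4d}  {100 * n / total:5.1f}%")

power_share = 100 * lock_timeout_on_power / counts["lock_timeout"]
print(f"lock timeouts on POWER: {power_share:.1f}%")
```

Under that assumption, lock_timeout and system_command_failure together account for 381 of the 487 categorized logs, and roughly 79% of the lock timeouts were on POWER.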
So, looking at system command failures: the one I linked seems to be a case of scheduling to the wrong node, so (thanks @haampie for the suggestion) most of these cases might be illegal instructions.[1]
For https://gitlab.spack.io/spack/spack/-/jobs/6366035 specifically I wonder:
- Why do we have a skylake_avx512 buildcache? Was there some transient error in spack/spack#36045 that caused that?
- Why are we mapping an AVX512 build to a zen2 (I think) node?
Footnotes
1. Hopefully all of them, since the other hypothesis is errors in relocation.
I fixed that here: spack/spack@dba57ff