adaptivecomputing / torque

Torque Repository

pbs_server segmentation fault triggered by qdel

widyono-cets opened this issue

os_release
openSUSE Leap 42.2

cpu_info
model name : Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc

pbs_server --version
Version: 6.1.1
Commit: 4ccb078

./configure --with-rcp=scp --with-tcl --prefix=/opt/torque/6.1.1-gcc --with-server-home=/var/spool/torque --with-pam=/lib64/security --enable-gui --enable-syslog

to reproduce:
submit simple job file
qdel the resulting job

what happens:
pbs_server coredumps

what is expected:
no coredump

trace:
Core was generated by `/opt/torque/6.1.1-gcc/sbin/pbs_server -F -d /var/spool/torque'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f5fc6d51490 in _xend () at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:33
33 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f5f667fc700 (LWP 24164))]
Missing separate debuginfos, use: zypper install libxml2-2-debuginfo-2.9.4-3.1.x86_64 libz1-debuginfo-1.2.8-10.1.x86_64
(gdb) bt
#0 0x00007f5fc6d51490 in _xend () at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:33
#1 __lll_unlock_elision (lock=0x7f5f34081b90, private=0) at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#2 0x000000000049e83e in unlock_ji_mutex (pjob=0x7f5f34082bb0, id=0x539398 <relay_to_mom(job**, batch_request*, void (*)(work_task*))::__func__> "relay_to_mom", msg=0x0, logging=7) at svr_jobfunc.c:4052
#3 0x000000000043337a in relay_to_mom (pjob_ptr=0x7f5f667f4eb0, request=0x7f5f667f3590, func=0x0) at issue_request.c:213
#4 0x000000000048fbac in issue_signal (pjob_ptr=0x7f5f667f4ef8, signame=0x54394f "SIGTERM", func=0x4690d8 <post_delete_mom1(batch_request*)>, extra=0x7f5f500830b0, extend=0x7f5f50083090 "0") at req_signal.c:272
#5 0x000000000046812c in execute_job_delete (pjob=0x7f5f34082bb0, Msg=0x0, preq=0x7f5f667f9040) at req_delete.c:696
#6 0x0000000000468ae3 in single_delete_work (preq=0x7f5f667f9040) at req_delete.c:987
#7 0x0000000000468c99 in handle_single_delete (preq=0x7f5f667f9040, Msg=0x0) at req_delete.c:1067
#8 0x0000000000468f0f in req_deletejob (preq=0x7f5f667f9040) at req_delete.c:1170
#9 0x00000000004638d5 in dispatch_request (sfds=9, request=0x7f5f667f9040) at process_request.c:801
#10 0x00000000004636b6 in process_request (chan=0x7f5f500008c0) at process_request.c:701
#11 0x00000000004bc451 in process_pbs_server_port (sock=9, is_scheduler_port=0, args=0x7f5fbc0009a0) at incoming_request.c:162
#12 0x00000000004bc66c in start_process_pbs_server_port (new_sock=0x7f5fbc0009a0) at incoming_request.c:270
#13 0x00000000004fd582 in work_thread (a=0x6d118e0) at u_threadpool.c:318
#14 0x00007f5fc6d47734 in start_thread (arg=0x7f5f667fc700) at pthread_create.c:334
#15 0x00007f5fc5fe6d3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

tangential comment:
I've confirmed that commit 8ab2669 is necessary on this system to fix TRQ-3841; without it, pbs_server segfaults at that location instead when multiple jobs are submitted.

This should be fixed in 6.0-dev, 6.1-dev, and develop.

I applied the patch to 6.1.1 and confirmed it works on openSUSE Leap 42.2 with glibc-2.22-3.7, thanks! I will not be able to test further, since I've moved the production cluster to a Fermi SL 7.3 system, which does not trigger the mutex unlock bugs. I spun up a quick local testbed on an openSUSE Leap 42.2 system just to make sure qdel no longer triggers the bug.

Thank you for confirming the fix.