[ci] [R-package] macOS clang CMake jobs failing with segfaults
jameslamb opened this issue · comments
Description
The r-package (macos-13, clang, R 4.3, cmake) CI jobs are failing with a segfault like this:
* checking examples ... ERROR
Running examples in ‘lightgbm-Ex.R’ failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: lgb.cv
> ### Title: Main CV logic for LightGBM
> ### Aliases: lgb.cv
>
> ### ** Examples
>
> ## No test:
> ## Don't show:
> setLGBMthreads(2L)
> ## End(Don't show)
> ## Don't show:
> data.table::setDTthreads(1L)
> ## End(Don't show)
> data(agaricus.train, package = "lightgbm")
*** caught segfault ***
address 0x540, cause 'memory not mapped'
Traceback:
1: load(zfile, envir = tmp_env)
2: data(agaricus.train, package = "lightgbm")
An irrecoverable exception occurred. R is aborting now ...
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
Running ‘testthat.R’/Library/Frameworks/R.framework/Resources/bin/BATCH: line 60: 12462 Segmentation fault: 11 ${R_HOME}/bin/R -f ${in} ${opts} ${R_BATCH_OPTIONS} > ${out} 2>&1
[20s/12s]
[20s/12s] ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
*** caught segfault ***
address 0x7fb41c200034, cause 'memory not mapped'
Warning: stack imbalance in 'lazyLoadDBfetch', 15 then 17
Warning: stack imbalance in 'c', 40 then 38
Warning: stack imbalance in 'lapply', 16 then 17
Traceback:
1: (function () expr)()
2: test_files_serial(test_dir = test_dir, test_package = test_package, test_paths = test_paths, load_helpers = load_helpers, reporter = reporter, env = env, stop_on_failure = stop_on_failure, stop_on_warning = stop_on_warning, desc = desc, load_package = load_package, error_call = error_call)
3: test_files(test_dir = path, test_paths = test_paths, test_package = package, reporter = reporter, load_helpers = load_helpers, env = env, stop_on_failure = stop_on_failure, stop_on_warning = stop_on_warning, load_package = load_package, parallel = parallel)
4: test_dir("testthat", package = package, reporter = reporter, ..., load_package = "installed")
5: test_check(package = "lightgbm", stop_on_failure = TRUE, stop_on_warning = FALSE, reporter = testthat::SummaryReporter$new())
An irrecoverable exception occurred. R is aborting now ...
Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
sh: line 1: 12492 Segmentation fault: 11 R_LIBS=/var/folders/zf/5wcpcvh91p9g4jn5srhsmzv40000gn/T//Rtmpr6e2Sh/RLIBS_29c22c8c2289 R_ENVIRON_USER='' R_LIBS_USER='NULL' R_LIBS_SITE='NULL' '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla --no-echo > '/Users/runner/work/LightGBM/LightGBM/lightgbm.Rcheck/build_vignettes.log' 2>&1 < '/var/folders/zf/5wcpcvh91p9g4jn5srhsmzv40000gn/T//Rtmpr6e2Sh/file29c23693a3f9'
* checking re-building of vignette outputs ... ERROR
Error(s) in re-building vignettes:
...
--- re-building ‘basic_walkthrough.Rmd’ using knitr
Reproducible example
This is happening on all PRs. For example, see this build from #6625: https://github.com/microsoft/LightGBM/actions/runs/10581950637/job/29392431659?pr=6625.
On that PR, I manually re-triggered that job 3 times over the last 24 hours.
Additional Comments
It's worth noting that:
- all other R jobs are passing (including multiple on macOS, across multiple R versions)
- all non-R CI jobs (including swig, Python, and C++ tests) are passing, many also with CMake + clang
- this job uses a fixed version of R (4.3.1), so new changes in R-devel couldn't cause this
- there are 0 issues reported in the CRAN checks on {lightgbm} v4.5.0: https://cran.r-project.org/web/checks/check_results_lightgbm.html
I just tried re-running ALL R jobs on #6625, let's see if any others fail: https://github.com/microsoft/LightGBM/actions/runs/10581950637?pr=6625
I strongly suspect this is related to a release of one of {lightgbm}'s dependencies. I looked through the list from CI logs and checked their releases on CRAN, here's what I found:
- {brio} = 2024-04-24
- {callr} = 2024-03-25
- {cli} = 2024-06-21
- {commonmark} = 2024-01-30
- {crayon} = 2024-06-20
- {data.table}: 2024-08-27 (yesterday)
- {desc} = 2023-12-10
- {diffobj} = 2021-10-05
- {digest} = 2024-08-19
- {evaluate} = 2024-06-10
- {fansi} = 2023-12-08
- {fs} = 2024-04-25
- {glue} = 2024-01-09
- {highr} = 2024-05-26
- {jsonlite} = 2023-12-04
- {knitr} = 2024-07-07
- {lifecycle} = 2023-11-07
- {magrittr} = 2022-03-30
- {markdown} = 2024-06-04
- {Matrix} = 2024-04-26
- {pillar} = 2023-03-22
- {pkgbuild} = 2024-03-17
- {pkgconfig} = 2019-09-22
- {pkgload} = 2024-06-28
- {praise} = 2015-08-11
- {processx} = 2024-03-16
- {ps} = 2024-07-02
- {R6} = 2021-08-19
- {rematch2} = 2020-05-01
- {RhpcBLASctl} = 2023-02-11
- {rlang} = 2024-06-04
- {rprojroot} = 2023-11-05
- {testthat} = 2024-04-14
- {tibble} = 2023-03-20
- {utf8} = 2023-10-22
- {vctrs} = 2023-12-01
- {waldo} = 2024-08-23
- {withr} = 2024-07-31
- {xfun} = 2024-08-17
- {yaml} = 2024-07-26
So I think the new {data.table} release is a suspect. And experience tells me it's probably that release + something related to OpenMP 😭
On my Mac (M2, Sonoma 14.4.1), I built the latest {lightgbm} (fde0157) from source.
Rscript build_r.R --no-build-vignettes -j4
I found that, in combination with the latest {data.table}, the following is enough to reproduce the segfault.
cat > test.R <<EOF
library(lightgbm)
data(agaricus.train, package = "lightgbm")
lgb.Dataset(
data = agaricus.train\$data
, label = agaricus.train\$label
)\$construct()
EOF
# fails
Rscript test.R
The error does not occur if I disable OpenMP parallelism.
# succeeds
OMP_NUM_THREADS=1 Rscript test.R
Downgrading to the prior release of {data.table} also resolves it.
Rscript -e "remove.packages('data.table')"
Rscript --vanilla -e "install.packages(c('https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.15.4.tar.gz'), repos = NULL)"
# succeeds
Rscript test.R
# also succeeds
OMP_NUM_THREADS=1 Rscript test.R
So it does look like it's something related to the latest {data.table} release. And since this is only happening on macOS, with clang, for CMake-based builds, I suspect it's related to the changes from #6391 and #6489 as well.
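The comparison above (default threading vs. OMP_NUM_THREADS=1) can be scripted so the two cases are reported side by side. The following is a minimal sketch, not something from the LightGBM repo: `describe_status` and `run_case` are hypothetical helper names, and it assumes a shell that reports a child killed by signal N as exit status 128 + N (so SIGSEGV, signal 11, shows up as 139).

```shell
#!/bin/sh
# Hypothetical helpers for illustration (not part of LightGBM).

# Translate an exit status into a short label.
# Assumption: "killed by signal N" is reported as 128 + N,
# so a SIGSEGV (11) appears as 139.
describe_status() {
    case "$1" in
        0)   echo "ok" ;;
        139) echo "segfault" ;;
        *)   echo "failed ($1)" ;;
    esac
}

# Run a command with a given OMP_NUM_THREADS value and print one summary line.
# Passing "default" runs the command with OMP_NUM_THREADS unset.
run_case() {
    threads="$1"; shift
    if [ "$threads" = "default" ]; then
        ( unset OMP_NUM_THREADS; "$@" ) >/dev/null 2>&1
    else
        OMP_NUM_THREADS="$threads" "$@" >/dev/null 2>&1
    fi
    status=$?
    echo "OMP_NUM_THREADS=$threads: $(describe_status "$status")"
}

# Usage, with test.R as written above:
#   run_case default Rscript test.R
#   run_case 1 Rscript test.R
```

Running both cases before and after swapping the {data.table} version gives the same 2x2 picture as the manual steps above.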
I noticed that when I build {data.table} 1.15.4 from source, it isn't passing OpenMP flags.
Building 1.16.0 from source, it does. I see lines like this:
clang -arch arm64 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I/opt/R/arm64/include -I/opt/homebrew/opt/libomp/include -Xclang -fopenmp -DNOZLIB -fPIC -falign-functions=64 -Wall -g -O2 -c wrappers.c -o wrappers.o
It looks like {data.table}'s shared library is linking to R's OpenMP.
R_LIB=/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
otool -L "${R_LIB}/data.table/libs/data_table.so"
data_table.so (compatibility version 0.0.0, current version 0.0.0)
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libR.dylib (compatibility version 4.3.0, current version 4.3.3)
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 2420.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1345.100.2)
{lightgbm} has an RPATH entry:
otool -L "${R_LIB}/lightgbm/libs/lightgbm.so"
@rpath/lightgbm.so (compatibility version 0.0.0, current version 0.0.0)
@rpath/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libR.dylib (compatibility version 4.3.0, current version 4.3.3)
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1700.255.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1345.100.2)
Which will search in this order:
Lines 829 to 833 in fde0157
So I suspect that this is our old friend, the "multiple versions of OpenMP loaded in the same session" problem.
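One way to check this from the `otool -L` output shown above is to pull out just the libomp entries for each shared library and compare them. The sketch below is illustrative only: `libomp_refs` is a hypothetical helper name, and it reads `otool -L` output on stdin so it makes no assumption about where the libraries are installed.

```shell
#!/bin/sh
# Hypothetical helper: print the libomp install names found in
# `otool -L` output read from stdin, one per line.
libomp_refs() {
    # Dependency lines look like:
    #   <indent>/path/to/libomp.dylib (compatibility version ..., current version ...)
    # Keep the first field of every line whose first field mentions libomp.
    awk '$1 ~ /libomp/ { print $1 }'
}

# Usage on macOS (R_LIB as defined earlier):
#   otool -L "${R_LIB}/data.table/libs/data_table.so" | libomp_refs
#   otool -L "${R_LIB}/lightgbm/libs/lightgbm.so" | libomp_refs
```

With the outputs above, the first call would print an absolute path inside R.framework and the second would print @rpath/libomp.dylib. Different strings do not by themselves prove that two copies get loaded, since @rpath is only resolved at load time, but they show where to look next.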
Things we could try:
- making {data.table} a Depends dependency of {lightgbm}, so it'll be loaded first (and then {lightgbm} will just find that via its @rpath/libomp.dylib entry)
- ensuring that R's library directories are added to the list of libomp RPATH entries for CMake-based R builds
- ensuring that R's library directories are added to e.g. CMAKE_PREFIX_PATH (docs), so that find_package() / find_library() will check there
Adding a Depends entry with {data.table} in DESCRIPTION did not solve this.
But I found that adding R's main library directory at the beginning of the OpenMP RPATH list did! 🎉
Opened #6629 proposing that.
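To see which directories a built lightgbm.so will search for @rpath/libomp.dylib, and in what order, the LC_RPATH load commands can be listed with `otool -l`. A minimal sketch, assuming the usual `otool -l` layout (a `cmd LC_RPATH` line followed shortly by a `path ... (offset N)` line); `rpath_entries` is a hypothetical helper name, and paths containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print LC_RPATH paths, in load-command order,
# from `otool -l` output read on stdin.
rpath_entries() {
    awk '
        # Each load command has a "cmd <NAME>" line; remember whether
        # the current one is an LC_RPATH.
        $1 == "cmd"             { in_rpath = ($2 == "LC_RPATH"); next }
        # Inside an LC_RPATH command, the "path" line carries the directory.
        in_rpath && $1 == "path" { print $2; in_rpath = 0 }
    '
}

# Usage on macOS (R_LIB as defined earlier):
#   otool -l "${R_LIB}/lightgbm/libs/lightgbm.so" | rpath_entries
```

With a change like the one described above, one would expect the directory holding R's own libomp.dylib to appear ahead of the other libomp locations in this list; that expectation is an inference from the description here, not something checked against #6629 itself.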
Summary
- R for macOS (from CRAN) vendors libomp.dylib
- CRAN's pre-compiled binaries for macOS embed an absolute-path install name pointing at that vendored library when compiled with -fopenmp
- CMake-based builds of {lightgbm} do not find that libomp.dylib at build time
  - because they use CMake's find_package(), and R ships just the library, not CMake config files and possibly not even headers
- {data.table}'s newest release, v1.16.0, fixes its OpenMP detection and now CRAN's macOS binaries of that library load the R-vendored libomp.dylib at runtime
- macOS CMake builds of {lightgbm} use RPATH-based search for libomp.dylib ... and R's library directory is not included in its list
- library(data.table) and library(lightgbm) therefore load 2 different libomp.dylib into the process, leading to segfaults 🙃
Impact
Building {lightgbm} on macOS with clang, with OpenMP support enabled, from source using Rscript build_r.R, will probably generate a package that immediately encounters segfaults at runtime if used together with {data.table} >= 1.16.0. Upgrading to a version which contains the changes in #6629 fixes that.
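As a rough way to tell whether an installed {data.table} falls in that range, its version string can be compared against 1.16.0. This is a sketch only: `version_ge` is a hypothetical helper, and it assumes a `sort` that supports `-V` (version sort), which POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 >= version $2.
# Assumption: `sort -V` (version sort) is available.
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 exactly
    # when that smaller one is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (requires R; packageVersion() comes from R's utils package):
#   dt_version=$(Rscript -e 'cat(as.character(packageVersion("data.table")))')
#   if version_ge "$dt_version" "1.16.0"; then
#       echo "data.table ${dt_version} is in the range described above"
#   fi
```

A match only matters for the build configuration described above (macOS, clang, CMake-based build with OpenMP enabled).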
The {lightgbm} distributed via CRAN is unaffected (it uses autotools).
Windows and Linux users are unaffected.
Building with gcc is unaffected.
Just for awareness, tagging some folks who might be interested (no action required... this is a LightGBM problem, not a {data.table} problem): @hcho3 @kevinushey @MichaelChirico
Messy! Glad you've found a fix. Linking our recent updates about configuring OpenMP on macOS since they're probably related & I don't see them here yet:
Rdatatable/data.table#6034
Rdatatable/data.table#6283
Rdatatable/data.table#6418
#6418 is in dev only, but we'll probably put it in a patch release soon.