Tests require 'make install' first

Question

ferrouswheel opened this issue 5 years ago

It should be possible to test opencog before potentially polluting the system libraries with a broken build.

I cleaned out my system libraries, and did a clean build of cogutil, atomspace, and opencog. Cogutil and Atomspace can pass tests without make installing first. OpenCog repo can't...

For OpenCog these tests fail if make install is not run:

The following tests FAILED:
	 12 - ShellUTest (Failed)
	 17 - AnaphoraTest (Child aborted)
	 19 - SuRealUTest (Failed)
	 20 - MicroplanningUTest (Failed)
	 22 - PLNRulesUTest (Failed)
	 23 - OpenPsiRulesUTest (Failed)
	 24 - OpenPsiImplicatorUTest (Failed)
	 25 - OpenPsiSCMUTest (Failed)
	 30 - OpenPsiTest (Failed)
	 35 - MinerUTest (Failed)
	 36 - SurprisingnessUTest (SEGFAULT)
	 37 - GhostSyntaxUTest (Failed)
	 38 - GhostProcedureUTest (Failed)
	 39 - GhostUTest (Failed)

vs running make test after make install:

The following tests FAILED:
	 17 - AnaphoraTest (Child aborted)
	 30 - OpenPsiTest (Failed)

My main concern is that this will make it difficult to ensure we are testing the current build vs whatever is in /usr/local at the time.

Linas Vepštas · Answer 1 · Mon Jun 10 2019 09:57:45 GMT+0800 (China Standard Time)

> AnaphoraTest

Huh. This is passing just fine for me. I fixed it just a few days ago ...!

JP · Answer 2 · Mon Jun 10 2019 10:04:20 GMT+0800 (China Standard Time)

The subtests in AnaphoraTest that were failing now pass for me ok.

But AnaphoraTest aborts at: "Testing the propose function ... Too many root sets".

Linas Vepštas · Answer 3 · Mon Jun 10 2019 10:35:08 GMT+0800 (China Standard Time)

> Too many root sets

Ugh. That is a garbage collector limitation. It can be gotten around by recompiling the garbage collector to use the "huge memory model" instead of the default "large memory model". But I assume you are using whatever apt-get install provided. Yuck. Oh well. (It's a googlable error message, if you're curious).

(I'm using guile-2.9.2 not guile-2.2; it seems faster. I also manually set up the "huge memory model" cause my datasets are .. huge. Well, given Moore's law, not as huge as they used to be... )

JP · Answer 4 · Mon Jun 10 2019 10:55:23 GMT+0800 (China Standard Time)

I installed guile-2.2.4 from source but didn't configure it in any special way. I tried googling but couldn't find anything obvious on how to set up guile to use a "huge memory model".

Can you clarify if the GC is part of guile, or if guile makes use of some external GC library I have to mess with? I didn't see anything in guile's configure script help.

The CI build uses 2.2.3, so that should probably also be updated as AnaphoraTest gets the same error there. Should we bump the requirements on guile or add some instructions on how to handle this?

Linas Vepštas · Answer 5 · Mon Jun 10 2019 11:01:07 GMT+0800 (China Standard Time)

GC is not a part of guile; it's Boehm GC. On Debian it's the libgc-dev and libgc1c2 packages. It's a prereq for guile, and most other things that use GC, except for Java, which has its own thing going. The source is here: https://www.hboehm.info/gc/ --- https://en.wikipedia.org/wiki/Boehm_garbage_collector --- https://github.com/ivmai/bdwgc

Linas Vepštas · Answer 6 · Mon Jun 10 2019 11:07:54 GMT+0800 (China Standard Time)

When I last built it a few years ago, I did it by running ./configure --enable-large-config

JP · Answer 7 · Mon Jun 10 2019 11:09:02 GMT+0800 (China Standard Time)

related ivmai/bdwgc#83

So I guess the question is: is there any way to make AnaphoraTest pass, in a sensible way, without a custom bdwgc build? (I know nothing about how it works currently, so have no idea if it's just the nature of the problem or not)

Linas Vepštas · Answer 8 · Mon Jun 10 2019 11:18:24 GMT+0800 (China Standard Time)

Well, that's interesting! The code in Anaphora looks entirely reasonable, and should not put any harsh demands on the system. Certainly, the quantity and complexity of the scheme code in subsystems like ghost and relex2logic and pln is far greater. My current guess is that the mixture of python and guile has something to do with it ... the GC is having to crawl over the memory managed by python, maybe it's discovering tens of thousands of smart-pointers in python, and overflowing some array. Why we hit it here, and not in some Hanson Robotics ROS+ghost stack, I don't know. It might depend on which library got initialized first. When guile initializes, it snapshots the C stack, etc., precisely so that it knows what RAM it's supposed to manage, and what to leave alone.

Linas Vepštas · Answer 9 · Mon Jun 10 2019 11:21:41 GMT+0800 (China Standard Time)

> the question is: is there any way to make AnaphoraTest pass, in a sensible way, without a custom bdwgc build?

Not that I know of.

Linas Vepštas · Answer 10 · Wed Jun 19 2019 13:38:28 GMT+0800 (China Standard Time)

See also #2088

JP · Answer 11 · Mon Jun 24 2019 11:17:00 GMT+0800 (China Standard Time)

I was mistaken about AtomSpace. I had to make install to get all AtomSpace tests working too.

My cleaning script nukes these directories:

clean() {
    sudo rm -r /home/$USER/.cache/guile/ccache
    sudo rm -r /usr/local/lib/opencog/
    sudo rm -r /usr/local/lib/python3/dist-packages/opencog
    sudo rm -r /usr/local/lib/python3.5/dist-packages/opencog
    sudo rm -r /usr/local/lib/python3.6/dist-packages/opencog
    sudo rm -r /usr/local/share/opencog
    sudo rm -r /usr/local/share/guile/site/2.2/opencog*
    sudo rm -r ~/.virtualenvs/opencog/lib/python3.6/site-packages/opencog
}
...

After that, all I do is:

run make install inside the cogutil build dir
then make a new build dir for the atomspace
cd build && make && make test and boom 41 failures.
sudo make install && make test and boom, all successes.

Strangely, the circleci build doesn't need to do make install. But I was unable to determine what the opencog-deps docker image does differently to avoid running make install. I tried copying its ~/.guile config:

; Add path to OpenCog modules
(add-to-load-path "/usr/local/share/opencog/scm")

; Add present directory
(add-to-load-path ".")

; To make working with arrow keys easier
(use-modules (ice-9 readline))
(activate-readline)

; Enable showing of backtrace on error
(debug-enable 'backtrace)

; Record positions of source code expressions.
(read-enable 'positions)

But this had no effect. This hidden config in the build image isn't confidence inspiring. If such environmental config is necessary then it should be in the circleci build config or in cmake. Given I only discovered this file by accident, I have low confidence there isn't some magic configuration that I'm missing.

Linas Vepštas · Answer 12 · Mon Jun 24 2019 11:37:40 GMT+0800 (China Standard Time)

> (add-to-load-path "/usr/local/share/opencog/scm")

This does nothing; it's an obsolete path, nothing is there any more. It should be removed from code and documentation.

> boom 41 failures

This is surprising; the scheme infrastructure explicitly adds these lines:

./guile/SchemeSmob.cc:	scm_c_eval_string("(add-to-load-path \"" PROJECT_SOURCE_DIR "/opencog/scm\")");
./guile/SchemeSmob.cc:	scm_c_eval_string("(add-to-load-path \"" PROJECT_BINARY_DIR "\")");

They are there so that unit tests can pass without the install. I think this is a horribly hacky way of making unit tests pass without install .. but whatever. This should have been enough. Perhaps your CMake is setting PROJECT_SOURCE_DIR or PROJECT_BINARY_DIR to some unexpected locations?
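
One way to check where CMake thinks those directories are is to look at the cache in the build directory, which records a source and binary directory entry per project. A minimal sketch, assuming the usual `<project>_SOURCE_DIR:STATIC=` / `<project>_BINARY_DIR:STATIC=` cache entries; the helper name `cache_dirs` and the `build/` location are illustrative assumptions:

```shell
# cache_dirs: read a CMakeCache.txt on stdin and print only the recorded
# <project>_SOURCE_DIR and <project>_BINARY_DIR entries.
cache_dirs() {
    grep -E '^[^:#/]*_(SOURCE|BINARY)_DIR:STATIC='
}

# Usage (the build directory location is an example):
#   cache_dirs < build/CMakeCache.txt
```

If the printed paths point somewhere other than the tree being built, the load paths compiled into SchemeSmob.cc would point there too.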

Linas Vepštas · Answer 13 · Mon Jun 24 2019 11:39:59 GMT+0800 (China Standard Time)

Also I think the ~/.guile file is used only if you run the REPL shell; otherwise it would be ignored. I think there is only one unit test that uses the REPL shell.

Nil Geisweiller · Answer 14 · Mon Jun 24 2019 12:49:39 GMT+0800 (China Standard Time)

@ferrouswheel you probably should add /usr/local/include/opencog to your clean function. Here's my script BTW https://github.com/ngeiswei/ocbld that I don't especially share cause it's somewhat personalized.

JP · Answer 15 · Mon Jun 24 2019 13:38:59 GMT+0800 (China Standard Time)

Thanks folks.

I got annoyed with the dependencies of the current CI system, so I'm building a cleanroom docker image that doesn't do any magic or rely on weird ad hoc scripts. All dependencies/modifications from stock ubuntu will be right there in the dockerfile and I'll be able to build from a clean image whenever I like.

With this image I was able to reproduce the test failures when make install isn't run first. I'll post more tomorrow as I figure out where the problem/difference is.

Vitaly Bogdanov · Answer 16 · Mon Jun 24 2019 17:43:11 GMT+0800 (China Standard Time)

@ferrouswheel, one of the problems could be the fact that LD_LIBRARY_PATH doesn't override the shared library loading path for guile. For the first build on a clean system, before make install, guile loads libraries from the build directory. But if you ran make install, checked out another branch, then ran make and make test, guile will still load shared libraries from INSTALL_PREFIX, not from the build directory.

The reason is that guile uses the libltdl library, and libltdl has its own LTDL_LIBRARY_PATH to override the search path. I am not sure whether this is an issue or expected behaviour of libltdl, but the atomspace and opencog builds suffer from this behaviour.
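
As a stopgap, one could put the build-tree library directories at the front of LTDL_LIBRARY_PATH before running the tests. A minimal shell sketch; the helper name `prepend_path` and the directory in the usage comment are illustrative assumptions, and whether the build-tree copy actually wins over an installed one would still need checking:

```shell
# prepend_path DIR LIST: print the colon-separated LIST with DIR placed in
# front, without leaving a dangling ':' when LIST is empty.
prepend_path() {
    if [ -z "$2" ]; then
        printf '%s\n' "$1"
    else
        printf '%s:%s\n' "$1" "$2"
    fi
}

# Usage (the directory is an example only):
#   LTDL_LIBRARY_PATH=$(prepend_path "$PWD/opencog/guile" "${LTDL_LIBRARY_PATH:-}")
#   export LTDL_LIBRARY_PATH
#   make test
```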

JP · Answer 17 · Tue Jun 25 2019 08:37:16 GMT+0800 (China Standard Time)

Thanks @vsbogd - that was certainly part of it.

I picked a random failing test to focus on, in this case tests/query/PresentLinkUTest.

After tracing with LD_DEBUG, I've discovered part of the problem is due to libexec.so dynamic library of ExecutionLink not being in the LTDL_LIBRARY_PATH...

However this fails in a special way, because libexec.so is also a system library 😭 . So it pretends to load but doesn't define any of the expected scheme variables. To avoid this we should probably rename our libraries to not have the same names as common system libraries! (Edit: I'm mistaken, see a couple of comments below.)
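
For anyone repeating the LD_DEBUG tracing: glibc's loader prints a "find library=NAME" line followed by the search paths and files it tries. A small filter can cut that output down to a single library. This is a sketch that assumes the glibc `LD_DEBUG=libs` output format; the helper name `lib_attempts` is made up for illustration:

```shell
# lib_attempts NAME: read LD_DEBUG=libs output on stdin and print the block
# of lines that starts at "find library=NAME" and ends just before the next
# "find library=" line.
lib_attempts() {
    awk -v name="$1" '
        /find library=/ { show = (index($0, "find library=" name " ") > 0) }
        show { print }
    '
}

# Usage (glibc only; LD_DEBUG writes to stderr):
#   LD_DEBUG=libs ./tests/query/PresentLinkUTest 2>&1 | lib_attempts libexec.so
```

Libraries opened by absolute path may not produce a "find library=" line at all, so an empty result does not by itself prove that the library was never loaded.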

Inspecting the test, it has a sensible RUNPATH:

~/src/atomspace/build/tests $ objdump -x query/PresentUTest | grep RUNPATH
  RUNPATH              /home/joel/work/opencog/atomspace/build/opencog/atomspace:/home/joel/work/opencog/atomspace/build/opencog/query:/home/joel/work/opencog/atomspace/build/opencog/util:/home/joel/work/opencog/atomspace/build/opencog/guile/modules:/home/joel/work/opencog/atomspace/build/opencog/ure:/home/joel/work/opencog/atomspace/build/opencog/atomspaceutils:/home/joel/work/opencog/atomspace/build/opencog/unify:/home/joel/work/opencog/atomspace/build/opencog/guile:/home/joel/work/opencog/atomspace/build/opencog/atoms/pattern:/home/joel/work/opencog/atomspace/build/opencog/atoms/execution:/home/joel/work/opencog/atomspace/build/opencog/atoms/reduct:/home/joel/work/opencog/atomspace/build/opencog/cython:/home/joel/work/opencog/atomspace/build/opencog/atoms/core:/home/joel/work/opencog/atomspace/build/opencog/atoms/base:/home/joel/work/opencog/atomspace/build/opencog/atoms/truthvalue:/home/joel/work/opencog/atomspace/build/opencog/atoms/value:/home/joel/work/opencog/atomspace/build/opencog/atoms/atom_types:/usr/local/lib:/usr/local/lib/opencog

Ideally I'd grab that string and set this as the LTDL_LIBRARY_PATH environment variable for any tests that need it. However I haven't figured out how to get that information from a cmake target before it gets built (none of the rpath variables mentioned here were helpful)
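
Once the test binary exists, pulling that string out is straightforward; the sketch below only covers this after-the-build case, not the CMake-time lookup discussed above. The helper name `extract_runpath` is an illustrative assumption, and paths containing spaces are not handled:

```shell
# extract_runpath: read `objdump -x` output on stdin and print the value of
# the first RUNPATH (or legacy RPATH) entry, if there is one.
extract_runpath() {
    awk '$1 == "RUNPATH" || $1 == "RPATH" { print $2; exit }'
}

# Usage, from the build's tests directory:
#   LTDL_LIBRARY_PATH=$(objdump -x query/PresentUTest | extract_runpath)
#   export LTDL_LIBRARY_PATH
```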

So what I'm currently thinking is that I'll create a new ADD_GUILE_TEST macro that manually sets these runpaths. OpenCog repo would also have one with different build paths.

JP · Answer 18 · Tue Jun 25 2019 08:38:14 GMT+0800 (China Standard Time)

Actually I just had a thought that the reason the rpath is empty might be that our cmake test targets shadow the executable target (they have the same name). I'll explore some more...

Linas Vepštas · Answer 19 · Tue Jun 25 2019 08:53:09 GMT+0800 (China Standard Time)

because libexec.so is also system library

?

which one ? (none-such on ubuntu/debian)

dpkg -S libexec
dpkg-query: no path found matching pattern *libexec*

and

sudo ldconfig -p | grep libexe
	libexempi.so.3 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libexempi.so.3

anyway, if it conflicts with some other lib from some other package, we can rename it.

JP · Answer 20 · Tue Jun 25 2019 08:59:22 GMT+0800 (China Standard Time)

Oops, my mistake. I misinterpreted the LD_DEBUG output and then got confused with the libexec executable that seems to be associated with clang/llvm. Sorry.

dpkg -S libexec shows me ~30 matches on Ubuntu 18.04.

Linas Vepštas · Answer 21 · Tue Jun 25 2019 09:00:16 GMT+0800 (China Standard Time)

ok, well libexec is a terrible name anyway.

JP · Answer 22 · Tue Jun 25 2019 11:49:29 GMT+0800 (China Standard Time)

I went back to the Ubuntu 16.04 image and tests ran successfully without make install.

Inspecting library loading with LD_DEBUG, I found it used RPATH instead of RUNPATH.

RUNPATH and RPATH are not equivalent in terms of transitive library loading - someone changed the default and this must have propagated in Ubuntu 18.04.

To restore the old behaviour one can specify -Wl,--disable-new-dtags to the linker.
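
A sketch of how that might be tried out and verified. The CMake variables in the usage comment are the standard linker-flag variables; the helper name `dtag_kind` and the test binary path are illustrative assumptions:

```shell
# dtag_kind: read `readelf -d` output on stdin and report which search-path
# tag the binary carries. RUNPATH is reported if both are present, since the
# loader ignores RPATH in that case.
dtag_kind() {
    awk '
        /\(RUNPATH\)/ { kind = "RUNPATH" }
        /\(RPATH\)/   { if (kind == "") kind = "RPATH" }
        END { print (kind == "" ? "none" : kind) }
    '
}

# Usage, from a fresh build directory (binary path is an example):
#   cmake -DCMAKE_EXE_LINKER_FLAGS="-Wl,--disable-new-dtags" \
#         -DCMAKE_SHARED_LINKER_FLAGS="-Wl,--disable-new-dtags" \
#         -DCMAKE_MODULE_LINKER_FLAGS="-Wl,--disable-new-dtags" ..
#   make
#   readelf -d tests/query/PresentLinkUTest | dtag_kind
```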

Some other related resources:

JP · Answer 23 · Tue Jun 25 2019 12:23:56 GMT+0800 (China Standard Time)

One specific thing to note, mentioned in the Qt blog reference, is that setting an LD_LIBRARY_PATH overrides any RUNPATH which is the reverse of how RPATH worked (RPATH takes priority over LD_LIBRARY_PATH).
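
For reference, the lookup order as documented in the ld.so(8) man page can be written down as a small table. This is only a summary aid, not something the build uses; the function name `search_order` is made up:

```shell
# search_order TAG: print, one per line, the order in which the glibc loader
# consults library locations for a binary whose dynamic section carries TAG
# (RPATH, RUNPATH, or anything else for "neither").
search_order() {
    case "$1" in
        RPATH)   printf '%s\n' DT_RPATH LD_LIBRARY_PATH ld.so.cache default-dirs ;;
        RUNPATH) printf '%s\n' LD_LIBRARY_PATH DT_RUNPATH ld.so.cache default-dirs ;;
        *)       printf '%s\n' LD_LIBRARY_PATH ld.so.cache default-dirs ;;
    esac
}

# Usage:
#   search_order RUNPATH
```

In addition to the ordering, DT_RUNPATH is applied only when resolving an object's own direct dependencies, whereas DT_RPATH is also consulted for the dependencies of its dependencies, which matches the transitive-loading difference noted above.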

I'm not sure if opencog.scm always setting LTDL_LIBRARY_PATH is similarly short-circuiting the RUNPATH of libsmob.so. I'll need to do some experiments when I come back to this.

Linas Vepštas · Answer 24 · Wed Jun 26 2019 02:17:20 GMT+0800 (China Standard Time)

The setting of LTDL_LIBRARY_PATH in opencog.scm is a kind-of-ish hack, meant only to allow the guile loader to find the needed shlibs. Between RUNPATH, RPATH, the two LIBRARY_PATH's, and several choices for where to install the libraries, it's hard to see what the best solution is. A decade ago, it seemed like there was a clear-cut answer, with standards committees e.g. the LSB telling you exactly how to be conformant. Since then, it feels like different distros each went their own way, major subsystems invented their own clashing policies and so... Beats me, I can't keep track. There is at least one distro which tries to avoid this by installing each and every app in its own fenced-off playground: viz. "nix" .. we have https://github.com/opencog/opencog-nix for it. Also https://en.wikipedia.org/wiki/Nix_package_manager I have not actually tried it.

JP · Answer 25 · Fri Jun 28 2019 06:52:00 GMT+0800 (China Standard Time)

So I've come up with what I consider a more reliable system that gets rid of LTDL_LIBRARY_PATH and follows what Guile suggests here, i.e. using an explicit path. This path is generated by CMake and stored in a generated guile module (opencog as-config). The config file is configured for the build directory, and when running make install it is configured for the install prefix. This means no more guessing, and library loading works with the modern RUNPATH tag.

It also entirely gets rid of the issue that @vsbogd has mentioned with LTDL_LIBRARY_PATH, since it's not used.

The config is called as-config because it only provides paths for the atomspace. OpenCog repo will have to generate its own config at build time, and I was planning to call that oc-config. These are internal modules that are not intended to be seen by users, but if atomspace moves to its own namespace one day, then they can both be named config to make it more consistent.

I've also made these config files abort if an env var indicates testing is in progress, but the config is being loaded from the system path. This has highlighted where some of our tests are misconfigured and will use whatever is installed in the system dirs in preference to the build dir.
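
The real check lives in the generated guile config, but the same idea can be expressed as a shell guard, for instance in a test wrapper script. A hedged sketch: the variable name `OC_TESTING`, the function name `config_origin_ok`, and the paths in the usage comment are all hypothetical and not taken from the PR:

```shell
# config_origin_ok CONFIG_PATH BUILD_DIR: succeed unless a test run is in
# progress (OC_TESTING is set; hypothetical variable name) and CONFIG_PATH
# lies outside BUILD_DIR.
config_origin_ok() {
    if [ -z "${OC_TESTING:-}" ]; then
        return 0    # not testing: any location is acceptable
    fi
    case "$1" in
        "$2"/*) return 0 ;;
        *)      echo "config loaded from outside the build dir: $1" >&2
                return 1 ;;
    esac
}

# Usage (paths are examples only):
#   OC_TESTING=1 config_origin_ok "$loaded_config" "$PWD/build" || exit 1
```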

I've also discovered that some tests will test scm files that are in the cmake build dir, and some will test scm files from the source dir. I can see this may be nice for editing scm files in-place, and rerunning test executables without having to run make... but this is another layer of confusion about what code is actually being tested. I'm not sure if I should allow this use case or not (it's not difficult to do so, but adds cognitive load to figuring out which tests use which version of which code... I'd personally prefer certainty over convenience).

I'll make a PR soon to show what I'm suggesting.

JP · Answer 26 · Fri Jun 28 2019 11:12:49 GMT+0800 (China Standard Time)

See PR opencog/atomspace#2238 - if this is acceptable I can implement the same for OpenCog.

Linas Vepštas · Answer 27 · Sat Jun 29 2019 03:03:54 GMT+0800 (China Standard Time)

Well, I have two primary complaints about opencog/atomspace#2238 -- one is that it is trying to over-ride default decisions already made in OpenCogGuile.cmake -- if the defaults need to be changed, it would be better to just change them in OpenCogGuile.cmake

The other would be a more principled development -- so, some of the CMakefiles already copy scm files into the build dir, precisely so that unit tests can run without a prior install. Maybe the failing tests are exactly the ones that are not doing this copy. Either all CMakes should perform the copy, or none of them should -- it should not be half-n-half.

The other hack is that many/most of the unit tests contain lines like scm_c_eval_string("(add-to-load-path \"../../..\")"); because they are trying to guess where the source dir is, so that the needed scm files are found. That path-guessing is both hacky, and fragile -- it keeps breaking every so often, and needs adjustment as files move around. So the right place to start would be to bulk-remove all the scm_c_eval_string("(add-to-load-path \"../../..\")"); hackery, and then move forward.

I'm thinking the right fix is to twiddle OpenCogGuile.cmake so that the library paths are set up correctly .. viz .. OpenCogGuile.cmake is presumably incomplete or buggy -- the LTDL mechanism is older than OpenCogGuile.cmake, and when this cmakefile was introduced, the LTDL mechanism was never updated/removed.

JP · Answer 28 · Sat Jun 29 2019 03:24:53 GMT+0800 (China Standard Time)

I agree with a lot of what you're saying, and considered using OpenCogGuile.cmake. However, I was struggling to remember how to maintain cmake state across directories such that I could fill in the template. It would definitely be preferable to define the module and have cmake do the work, and I can work on improving that. Another part is that OpenCogGuile.cmake, as I remember it, only deals with scm files right now. Guile extensions would need a new macro to define them, or the scm module definition would have to explicitly indicate what extension they will want to load come runtime. I just woke up and it's the weekend, so I'll respond to the PR in more detail later.
