opencog / opencog

A framework for integrated Artificial Intelligence & Artificial General Intelligence (AGI)

Home Page: http://wiki.opencog.org/w/Development

Tests require 'make install' first

ferrouswheel opened this issue · comments

commented

It should be possible to test opencog before potentially polluting the system libraries with a broken build.

I cleaned out my system libraries, and did a clean build of cogutil, atomspace, and opencog. Cogutil and Atomspace can pass tests without make installing first. OpenCog repo can't...

For OpenCog these tests fail if make install is not run:

The following tests FAILED:
	 12 - ShellUTest (Failed)
	 17 - AnaphoraTest (Child aborted)
	 19 - SuRealUTest (Failed)
	 20 - MicroplanningUTest (Failed)
	 22 - PLNRulesUTest (Failed)
	 23 - OpenPsiRulesUTest (Failed)
	 24 - OpenPsiImplicatorUTest (Failed)
	 25 - OpenPsiSCMUTest (Failed)
	 30 - OpenPsiTest (Failed)
	 35 - MinerUTest (Failed)
	 36 - SurprisingnessUTest (SEGFAULT)
	 37 - GhostSyntaxUTest (Failed)
	 38 - GhostProcedureUTest (Failed)
	 39 - GhostUTest (Failed)

vs running make test after make install:

The following tests FAILED:
	 17 - AnaphoraTest (Child aborted)
	 30 - OpenPsiTest (Failed)

My main concern is that this will make it difficult to ensure we are testing the current build vs whatever is in /usr/local at the time.

AnaphoraTest

Huh. This is passing just fine for me. I fixed it just a few days ago ...!

commented

The subtests in AnaphoraTest that were failing now pass for me ok.

But AnaphoraTest aborts at: "Testing the propose function ... Too many root sets".

Too many root sets

Ugh. That is a garbage collector limitation. It can be gotten around by recompiling the garbage collector to use the "huge memory model" instead of the default "large memory model". But I assume you are using whatever apt-get install provided. Yuck. Oh well. (It's a googlable error message, if you're curious).

(I'm using guile-2.9.2 not guile-2.2; it seems faster. I also manually set up the "huge memory model" cause my datasets are .. huge. Well, given Moore's law, not as huge as they used to be... )

commented

I installed guile-2.2.4 from source but didn't configure it in any special way. I tried googling but couldn't find anything obvious on how to set up guile to use a "huge memory model".

Can you clarify if the GC is part of guile, or if guile makes use of some external GC library I have to mess with? I didn't see anything in guile's configure script help.

The CI build uses 2.2.3, so that should probably also be updated as AnaphoraTest gets the same error there. Should we bump the requirements on guile or add some instructions on how to handle this?

GC is not a part of guile; it's Boehm-GC. On Debian it's the libgc-dev and libgc1c2 packages. It's a prereq for guile, and most other things that use GC, except for Java, which has its own thing going. The source is here: https://www.hboehm.info/gc/ --- https://en.wikipedia.org/wiki/Boehm_garbage_collector --- https://github.com/ivmai/bdwgc

When I last built it a few years ago, I did it by saying ./configure --enable-large-config

commented

related ivmai/bdwgc#83

So I guess the question is: is there any way to make AnaphoraTest pass, in a sensible way, without a custom bdwgc build? (I know nothing about how it works currently, so have no idea if it's just the nature of the problem or not)

Well, that's interesting! The code in Anaphora looks entirely reasonable, and should not put any harsh demands on the system. Certainly, the quantity and complexity of the scheme code in subsystems like ghost and relex2logic and pln is far greater. My current guess is that the mixture of python and guile has to do with it ... the GC is having to crawl over the memory managed by python, maybe it's discovering tens of thousands of smart-pointers in python, and overflowing some array. Why we hit it here, and not in some Hanson Robotics ROS+ghost stack, I don't know. Might be which library got initialized first. When guile initializes, it snapshots the C stack etc. precisely so that it knows what RAM it's supposed to manage, and what to leave alone.

the question is: is there any way to make AnaphoraTest pass, in a sensible way, without a custom bdwgc build?

Not that I know of.

See also #2088

commented

I was mistaken about AtomSpace. I had to make install to get all AtomSpace tests working too.

My cleaning script nukes these directories:

# Remove installed and cached OpenCog artifacts so the next build starts clean.
# -f keeps rm quiet when a path is already absent.
clean() {
    sudo rm -rf "/home/$USER/.cache/guile/ccache"
    sudo rm -rf /usr/local/lib/opencog/
    sudo rm -rf /usr/local/lib/python3/dist-packages/opencog
    sudo rm -rf /usr/local/lib/python3.5/dist-packages/opencog
    sudo rm -rf /usr/local/lib/python3.6/dist-packages/opencog
    sudo rm -rf /usr/local/share/opencog
    sudo rm -rf /usr/local/share/guile/site/2.2/opencog*
    sudo rm -rf ~/.virtualenvs/opencog/lib/python3.6/site-packages/opencog
}
...

After that, all I do is:

  • run make install inside the cogutil build dir
  • then make a new build dir for the atomspace
  • cd build && make && make test and boom 41 failures.
  • sudo make install && make test and boom, all successes.

Strangely, the circleci build doesn't need to do make install. But I was unable to determine what the opencog-deps docker image does differently to avoid running make install. I tried copying its ~/.guile config:

; Add path to OpenCog modules
(add-to-load-path "/usr/local/share/opencog/scm")

; Add present directory
(add-to-load-path ".")

; To make working with arrow keys easier
(use-modules (ice-9 readline))
(activate-readline)

; Enable showing of backtrace on error
(debug-enable 'backtrace)

; Record positions of source code expressions.
(read-enable 'positions)

But this had no effect. This hidden config in the build image isn't confidence inspiring. If such environmental config is necessary then it should be in the circleci build config or in cmake. Given I only discovered this file by accident, I have low confidence there isn't some magic configuration that I'm missing.

(add-to-load-path "/usr/local/share/opencog/scm")

This does nothing; it's an obsolete path, nothing is there any more. It should be removed from code and documentation.

boom 41 failures

This is surprising; the scheme infrastructure explicitly adds these lines:

./guile/SchemeSmob.cc:	scm_c_eval_string("(add-to-load-path \"" PROJECT_SOURCE_DIR "/opencog/scm\")");
./guile/SchemeSmob.cc:	scm_c_eval_string("(add-to-load-path \"" PROJECT_BINARY_DIR "\")");

They are there so that unit tests can pass without the install. I think this is a horribly hacky way of making unit tests pass without install .. but whatever. This should have been enough. Perhaps your CMake is setting PROJECT_SOURCE_DIR or PROJECT_BINARY_DIR to some unexpected locations?
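One way to check that guess is to look at what CMake recorded in the build directory's cache. The sketch below assumes the C++ macros PROJECT_SOURCE_DIR and PROJECT_BINARY_DIR are derived from CMake's variables of the same name; the function name `show_project_dirs` is made up for illustration.

```shell
# Print the source and binary directories recorded in a CMake cache file.
# Usage: show_project_dirs path/to/build/CMakeCache.txt
show_project_dirs() {
    cache_file="$1"
    if [ ! -f "$cache_file" ]; then
        echo "no such cache file: $cache_file" >&2
        return 1
    fi
    # CMake stores one <project>_SOURCE_DIR and one <project>_BINARY_DIR
    # entry per project() call, in the form NAME:TYPE=VALUE.
    grep -E '^[A-Za-z0-9_]+_(SOURCE|BINARY)_DIR:' "$cache_file"
}
```

If either entry points somewhere other than the checkout and its build directory, the load paths compiled into SchemeSmob.cc would be wrong too.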

Also I think the ~/.guile file is used only if you run the REPL shell; otherwise it would be ignored. I think there is only one unit test that uses the REPL shell.

@ferrouswheel you probably should add /usr/local/include/opencog to your clean function. Here's my script BTW https://github.com/ngeiswei/ocbld that I don't especially share cause it's somewhat personalized.

commented

Thanks folks.

I got annoyed with the dependencies of the current CI system, so I'm building a cleanroom docker image that doesn't do any magic or rely on weird ad hoc scripts. All dependencies/modifications from stock ubuntu will be right there in the dockerfile and I'll be able to build from a clean image whenever I like.

With this image I was able to reproduce the test failures when make install isn't run first. I'll post more tomorrow as I figure out where the problem/difference is.

@ferrouswheel, one of the problems could be that LD_LIBRARY_PATH doesn't override the shared-library loading path for guile. On the first build on a clean system, before make install, guile loads libraries from the build directory. But if you did make install, checked out another branch, then ran make and make test, guile will still load shared libraries from INSTALL_PREFIX, not from the build directory.

The reason is that guile uses the libltdl library, and libltdl has its own LTDL_LIBRARY_PATH to override the search path. I am not sure whether this is a bug or expected behaviour of libltdl, but the atomspace and opencog builds suffer from it.
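For anyone who wants to experiment with this, here is a minimal sketch that collects every directory in a build tree containing a shared library and joins them into a value suitable for LTDL_LIBRARY_PATH. The function name `build_lib_path` is hypothetical, and it assumes no directory name contains a newline.

```shell
# Print a colon-separated list of every directory under $1 that
# contains at least one shared library (*.so).
build_lib_path() {
    build_dir="$1"
    find "$build_dir" -type f -name '*.so' -exec dirname {} \; \
        | sort -u \
        | paste -s -d ':' -
}

# Possible use (not run here):
#   LTDL_LIBRARY_PATH="$(build_lib_path "$PWD/build")" make test
```

Whether this is enough on its own depends on how the scheme modules treat an already-set LTDL_LIBRARY_PATH, so treat it as a diagnostic aid rather than a fix.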

commented

Thanks @vsbogd - that was certainly part of it.

I picked a random failing test to focus on, in this case tests/query/PresentLinkUTest.

After tracing with LD_DEBUG, I've discovered part of the problem is due to libexec.so dynamic library of ExecutionLink not being in the LTDL_LIBRARY_PATH...

However this fails in a special way, because libexec.so is also system library 😭 . So it pretends to load but doesn't define any of the expected scheme variables. To avoid this we should probably rename our libraries to not have the same names as common system libraries! I'm mistaken, see a couple of comments below.

Inspecting the test, it has a sensible RUNPATH:

~/src/atomspace/build/tests $ objdump -x query/PresentUTest | grep RUNPATH
  RUNPATH              /home/joel/work/opencog/atomspace/build/opencog/atomspace:/home/joel/work/opencog/atomspace/build/opencog/query:/home/joel/work/opencog/atomspace/build/opencog/util:/home/joel/work/opencog/atomspace/build/opencog/guile/modules:/home/joel/work/opencog/atomspace/build/opencog/ure:/home/joel/work/opencog/atomspace/build/opencog/atomspaceutils:/home/joel/work/opencog/atomspace/build/opencog/unify:/home/joel/work/opencog/atomspace/build/opencog/guile:/home/joel/work/opencog/atomspace/build/opencog/atoms/pattern:/home/joel/work/opencog/atomspace/build/opencog/atoms/execution:/home/joel/work/opencog/atomspace/build/opencog/atoms/reduct:/home/joel/work/opencog/atomspace/build/opencog/cython:/home/joel/work/opencog/atomspace/build/opencog/atoms/core:/home/joel/work/opencog/atomspace/build/opencog/atoms/base:/home/joel/work/opencog/atomspace/build/opencog/atoms/truthvalue:/home/joel/work/opencog/atomspace/build/opencog/atoms/value:/home/joel/work/opencog/atomspace/build/opencog/atoms/atom_types:/usr/local/lib:/usr/local/lib/opencog

Ideally I'd grab that string and set this as the LTDL_LIBRARY_PATH environment variable for any tests that need it. However I haven't figured out how to get that information from a cmake target before it gets built (none of the rpath variables mentioned here were helpful)
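As a stopgap on an already-built binary, the string can be pulled out of the objdump output shown above. This is only a sketch: `extract_runpath` is a made-up helper, it assumes the path list contains no spaces, and it does nothing for the CMake-time problem of getting the paths before the target is built.

```shell
# Read `objdump -x` output on stdin and print the first RUNPATH
# (or RPATH) value found in the dynamic section.
extract_runpath() {
    awk '$1 == "RUNPATH" || $1 == "RPATH" { print $2; exit }'
}

# Possible use (not run here):
#   LTDL_LIBRARY_PATH="$(objdump -x query/PresentUTest | extract_runpath)" \
#       ./query/PresentUTest
```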

So what I'm currently thinking is that I'll create a new ADD_GUILE_TEST macro that manually sets these runpaths. OpenCog repo would also have one with different build paths.

commented

Actually I just had a thought that the reason the rpath is empty might be that our cmake test targets shadow the executable target (they have the same name). I'll explore some more...

because libexec.so is also system library

?

which one ? (none-such on ubuntu/debian)

dpkg -S libexec
dpkg-query: no path found matching pattern *libexec*

and

sudo ldconfig -p | grep libexe
	libexempi.so.3 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libexempi.so.3

anyway, if it conflicts with some other lib from some other package, we can rename it.

commented

Oops, my mistake. I misinterpreted the LD_DEBUG output and then got confused with the libexec executable that seems to be associated with clang/llvm. Sorry.

dpkg -S libexec shows me ~30 matches on Ubuntu 18.04.

ok, well libexec is a terrible name anyway.

commented

I went back to the Ubuntu 16.04 image and tests ran successfully without make install

Inspecting library loading with LD_DEBUG I found it used RPATH instead of RUNPATH

RUNPATH and RPATH are not equivalent in terms of transitive library loading - someone changed the default and this must have propagated in Ubuntu 18.04.

To restore the old behaviour one can specify -Wl,--disable-new-dtags to the linker.
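To see which tag a given toolchain produced, the dynamic section of a test binary can be inspected. The helper below is a sketch with a made-up name, `dtag_kind`; it reads `objdump -x` output and reports the tag it finds. The comment at the end shows one possible way to pass the linker flag through CMake's standard linker-flag variables, which is an assumption about how the build would be configured, not something taken from the repo.

```shell
# Read `objdump -x` output on stdin and report which search-path tag
# the binary carries: RUNPATH, RPATH, or none.
dtag_kind() {
    awk '
        $1 == "RUNPATH"              { kind = "RUNPATH" }
        $1 == "RPATH" && kind == ""  { kind = "RPATH" }
        END { print (kind == "" ? "none" : kind) }
    '
}

# Possible use (not run here):
#   objdump -x tests/query/PresentLinkUTest | dtag_kind
#
# One way to ask for the old tag at configure time:
#   cmake -DCMAKE_EXE_LINKER_FLAGS="-Wl,--disable-new-dtags" \
#         -DCMAKE_SHARED_LINKER_FLAGS="-Wl,--disable-new-dtags" ..
```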

Some other related resources:

commented

One specific thing to note, mentioned in the Qt blog reference, is that setting an LD_LIBRARY_PATH overrides any RUNPATH which is the reverse of how RPATH worked (RPATH takes priority over LD_LIBRARY_PATH).
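The ordering is easy to mix up, so here it is spelled out as a small lookup. This follows the search order documented for the glibc dynamic linker; `search_order` is just an illustrative name. Keep in mind that RUNPATH is also only consulted for an object's direct dependencies, which is the transitive-loading difference mentioned earlier.

```shell
# Print the order in which the glibc dynamic linker consults each
# location, given the tag carried by the binary (RPATH, RUNPATH, or none).
search_order() {
    case "$1" in
        RPATH)   echo "RPATH LD_LIBRARY_PATH ld.so.cache default-dirs" ;;
        RUNPATH) echo "LD_LIBRARY_PATH RUNPATH ld.so.cache default-dirs" ;;
        *)       echo "LD_LIBRARY_PATH ld.so.cache default-dirs" ;;
    esac
}
```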

I'm not sure if opencog.scm always setting a LTDL_LIBRARY_PATH is similarly short circuiting the RUNPATH of libsmob.so. I'll need to do some experiments when I come back to this.

The setting of LTDL_LIBRARY_PATH in opencog.scm is a kind-of-ish hack, meant only to allow the guile loader to find the needed shlibs. Between RUNPATH, RPATH, the two LIBRARY_PATH's, and several choices for where to install the libraries, it's hard to see what the best solution is. A decade ago, it seemed like there was a clear-cut answer, with standards committees e.g. the LSB telling you exactly how to be conformant. Since then, it feels like different distros each went their own way, major subsystems invented their own clashing policies and so... Beats me, I can't keep track. There is at least one distro which tries to avoid this by installing each and every app in its own fenced-off playground: viz. "nix" .. we have https://github.com/opencog/opencog-nix for it. Also https://en.wikipedia.org/wiki/Nix_package_manager I have not actually tried it.

commented

So I've come up with what I consider a more reliable system that gets rid of LTDL_LIBRARY_PATH and follows what Guile suggests here, i.e. using an explicit path. This path is generated by CMake and stored in a generated guile module (opencog as-config). The config file is configured for the build directory, and when running make install it is configured again for the install prefix. This means no more guessing, and library loading works with the modern RUNPATH tag.

It also entirely gets rid of the issue that @vsbogd has mentioned with LTDL_LIBRARY_PATH, since it's not used.

The config is called as-config because it only provides paths for the atomspace. OpenCog repo will have to generate its own config at build time, and I was planning to call that oc-config. These are internal modules that are not intended to be seen by users, but if atomspace moves to its own namespace one day, then they can both be named config to make it more consistent.

I've also made these config files abort if an env var indicates testing is in progress, but the config is being loaded from the system path. This has highlighted where some of our tests are misconfigured and will use whatever is installed in the system dirs in preference to the build dir.
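The guard logic can be sketched in a few lines. The real check lives in the generated guile config; the shell version below only illustrates the decision, and both the variable name OC_TESTING and the function `config_origin_ok` are hypothetical.

```shell
# Return 0 if loading the config found at $1 is acceptable, 1 otherwise.
# $2 is the build directory. OC_TESTING is a made-up variable name that
# stands in for whatever env var marks a test run.
config_origin_ok() {
    config_path="$1"
    build_dir="$2"
    # Outside of a test run, any location is acceptable.
    [ -n "${OC_TESTING:-}" ] || return 0
    case "$config_path" in
        "$build_dir"/*) return 0 ;;
        *)
            echo "refusing config outside build dir: $config_path" >&2
            return 1
            ;;
    esac
}
```

Failing loudly here is what surfaces tests that would otherwise silently pick up the installed copy.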

I've also discovered that some tests will test scm files that are in the cmake build dir, and some will test scm files from the source dir. I can see this may be nice for editing scm files in-place, and rerunning test executables without having to run make... but this is another layer of confusion about what code is actually being tested. I'm not sure if I should allow this use case or not (it's not difficult to do so, but adds cognitive load to figuring which tests use which version of which code... I'd personally prefer certainty over convenience).

I'll make a PR soon to show what I'm suggesting.

commented

See PR opencog/atomspace#2238 - if this is acceptable I can implement the same for OpenCog.

Well, I have two primary complaints about opencog/atomspace#2238 -- one is that it is trying to over-ride default decisions already made in OpenCogGuile.cmake -- if the defaults need to be changed, it would be better to just change them in OpenCogGuile.cmake

The other would be a more principled development -- so, some of the CMakefiles already copy scm files into the build dir, precisely so that unit tests can run without a prior install. Maybe the failing tests are exactly the ones that are not doing this copy. Either all CMakes should perform the copy, or none of them should -- it should not be half-n-half.

The other hack is that many/most of the unit tests contain lines like scm_c_eval_string("(add-to-load-path \"../../..\")"); because they are trying to guess where the source dir is, so that the needed scm files are found. That path-guessing is both hacky, and fragile -- it keeps breaking every so often, and needs adjustment as files move around. So the right place to start would be to bulk-remove all the scm_c_eval_string("(add-to-load-path \"../../..\")"); hackery, and then move forward.
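A first step toward that bulk removal is simply listing every occurrence. The sketch below searches a tree for relative-path add-to-load-path calls embedded in C string literals; `find_load_path_hacks` is a made-up name, and it relies on `grep -r`, which GNU and BSD grep both provide.

```shell
# List every file and line under $1 that adds a relative ("..") path
# to the guile load path from inside a C string literal.
find_load_path_hacks() {
    search_dir="$1"
    # -F treats the pattern as a fixed string, so the backslash, quote
    # and dots are matched literally.
    grep -rnF 'add-to-load-path \"..' "$search_dir"
}

# Possible use (not run here):
#   find_load_path_hacks tests/
```

Because the pattern requires `..` right after the escaped quote, the absolute-path calls in SchemeSmob.cc are not reported.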

I'm thinking the right fix is to twiddle OpenCogGuile.cmake so that the library paths are set up correctly .. viz .. OpenCogGuile.cmake is presumably incomplete or buggy -- the LTDL mechanism is older than OpenCogGuile.cmake, and when this cmakefile was introduced, the LTDL mechanism was never updated/removed.

commented