larroy / mycelium

An open source information retrieval system written in C++11 and Python. It aspires to be an alternative to Nutch / Lucene and uses MongoDB as a storage engine.

Home Page: http://larroy.github.com/mycelium

The Mycelium Information retrieval system
=========================================

The latest user documentation is at http://pedro.larroy.com/mycelium/sphinx/

For the impatient
-----------------
::

    $ ./bootstrap.sh
    $ ./build.py
    $ scons
    $ build/release/crawler
    $ echo 'http://google.com' | nc localhost 1024
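
The last command pipes a URL into the crawler's listening port with nc. The same step could be sketched in Python as follows. This assumes the crawler reads newline-terminated URLs from a plain TCP connection on port 1024, as the nc command suggests; the function name ``submit_urls`` is hypothetical and not part of mycelium.

```python
import socket


def submit_urls(urls, host="localhost", port=1024, timeout=5.0):
    """Send newline-terminated URLs to a listening crawler; return bytes sent.

    Assumption: the crawler reads one URL per line from a plain TCP
    connection, mirroring `echo URL | nc localhost 1024`.
    """
    # Strip stray whitespace and terminate every URL with a newline.
    payload = "".join(url.strip() + "\n" for url in urls).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
    return len(payload)
```

With the crawler running, ``submit_urls(["http://google.com"])`` would send the same bytes as the nc command above.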


Status of the modules / Features
================================
- Crawler:
    
    Feature complete.

- Tokenizer/Stemmer:
    
    Work in progress.

- Inverted index:
    
    TODO

- Search frontend:

    TODO



How to use it
-------------

- Initialize the git submodules::

    $ git submodule init
    $ git submodule update

- Build the 3rd-party libraries::

    $ ./build.py

- Compile the sources with SCons::

    $ scons

- Alternatively, build against the system curl::

    $ scons --system_curl

  This is not recommended unless the system curl has been compiled with
  c-ares, as the curl built by build.py is. Without c-ares, DNS resolution
  is synchronous, which blocks the crawler and harms performance.

Running
-------

The following environment variables affect configuration:

- Specific to the crawler:

  * MYCELIUM_CRAWLER_PORT: port on which the crawler listens for URLs
  * MYCELIUM_CRAWLER_PARALLEL: number of parallel crawlers to run

- General, for all the tools that interact with the DB:

  * MYCELIUM_DB_HOST: MongoDB host for storing the documents, defaults to "localhost"
  * MYCELIUM_DB_NS: namespace in the form database.collection, defaults to "mycelium.crawl"
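
As an illustration of how a Python tool might read these settings, here is a minimal sketch. It is not mycelium's actual configuration code: the function name ``load_config`` and the returned keys are hypothetical, and parsing the crawler values as integers is an assumption. Only the two DB defaults come from the list above.

```python
import os


def load_config(env=None):
    """Collect the MYCELIUM_* settings from an environment mapping."""
    env = os.environ if env is None else env
    # Defaults for the DB settings are the documented ones.
    db_host = env.get("MYCELIUM_DB_HOST", "localhost")
    db_ns = env.get("MYCELIUM_DB_NS", "mycelium.crawl")
    # A namespace is "database.collection"; split on the first dot only,
    # since MongoDB collection names may themselves contain dots.
    database, _, collection = db_ns.partition(".")
    # No defaults are documented for the crawler settings, so unset
    # variables are reported as None rather than guessed.
    port = env.get("MYCELIUM_CRAWLER_PORT")
    parallel = env.get("MYCELIUM_CRAWLER_PARALLEL")
    return {
        "db_host": db_host,
        "database": database,
        "collection": collection,
        "crawler_port": int(port) if port is not None else None,
        "crawler_parallel": int(parallel) if parallel is not None else None,
    }
```

Passing a plain dict instead of relying on ``os.environ`` keeps the function easy to exercise without touching the real environment.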

Dependencies
============

The software is built on Debian / Ubuntu systems, although it should be fairly
easy to port to other platforms.

A (possibly incomplete) list of the required libraries:

- z
- boost_filesystem
- boost_system
- boost_regex
- log4cxx
- pthread
- curl
- event
- ssl
- libidn11-dev

Other software:

- scons
- flex
- autoconf (for building curl and c-ares)
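
Before building, it can help to check that these tools are on the PATH. The sketch below uses only the Python standard library. The helper name ``missing_tools`` is hypothetical, and the tool names are taken as the executable names, which is an assumption.

```python
import shutil

# Build tools named in the list above (assumed to match executable names).
REQUIRED_TOOLS = ("scons", "flex", "autoconf")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH, in the given order."""
    return [tool for tool in tools if which(tool) is None]
```

Calling ``missing_tools()`` returns an empty list when everything is installed. A result such as ``["flex"]`` points at the package to install before running scons.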


Troubleshooting
===============

::

    AttributeError: 'SConsEnvironment' object has no attribute 'CXXFile':
    File "/home/piotr/devel/mycelium/SConstruct", line 210:
    variant_dir='build/{0}'.format(build), duplicate=0)
    File "/usr/lib/scons/SCons/Script/SConscript.py", line 614:
    return method(*args, **kw)
    File "/usr/lib/scons/SCons/Script/SConscript.py", line 551:
    return _SConscript(self.fs, *files, **subst_kw)
    File "/usr/lib/scons/SCons/Script/SConscript.py", line 260:
    exec _file_ in call_stack[-1].globals
    File "/home/piotr/devel/mycelium/src/SConscript", line 11:
    env.CXXFile(target='Robots_flex.cc',source='robots.ll')

To solve this problem, make sure flex is installed, then run scons again.
