NYStud / phinde

Self-hosted search engine for your static blog

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

phinde - generic web search engine

Self-hosted search engine you can use for your static blog or about any other website you want search functionality for.

My live instance is at http://search.cweiske.de/ and indexes my website, blog and all linked URLs.

Features

  • Crawler and indexer with the ability to run many in parallel
  • Shows and highlights text that contains search words
  • Boolean search queries:
    • foo bar searches for foo AND bar
    • foo OR bar
    • title:foo searches for foo only in the page title
  • Facets for tag, domain, language and type
  • Date search:
    • before:2016-08-30 - modification date before that day
    • after:2016-08-30 - modified after that day
    • date::2016-08-30 - exact modification day match
  • Site search
    • Query: foo bar site:example.org/dir/
    • or use the site GET parameter: /?q=foo&site=example.org/dir
  • OpenSearch support with HTML and Atom result lists
  • Instant indexing with WebSub (formerly PubSubHubbub)

Dependencies

  • PHP 5.5+
  • Elasticsearch 2.0
  • MySQL or MariaDB for WebSub subscriptions
  • Gearman (Debian 9: gearman-job-server, not gearman-server)
  • PHP Gearman extension
  • Console_CommandLine
  • Net_URL2
  • Twig 1.x

Setup

  1. Install and run Elasticsearch and Gearman

  2. Install php-gearman

  3. Get a local copy of the code:

    $ git clone https://git.cweiske.de/phinde.git phinde
    
  4. Install dependencies via composer:

    $ composer install
    
  5. Point your webserver's document root to phinde's www directory

  6. Copy data/config.php.dist to data/config.php and adjust it. Make sure your add your domain to the crawl whitelist.

  7. Create a MySQL database and import the schema from data/schema.sql

  8. Run bin/setup.php which sets up the Elasticsearch schema

  9. Put your homepage into the queue:

    $ ./bin/process.php http://example.org/
    
  10. Start at least one worker to process the crawl+index queue:

    $ ./bin/phinde-worker.php
    
  11. Check phinde's status page in your browser. The number of open tasks should be > 0, the number of workers also.

Re-index when your site changes

When your site changed, the search engine needs to re-crawl and re-index the pages.

Simply tell phinde that something changed by running:

$ ./bin/process.php http://example.org/foo.htm

phinde supports HTML pages and Atom feeds, so if your blog has a feed it's enough to let phinde reindex that one. It will find all linked pages automatically.

Website integration

Adding a simple search form to your website is easy. It needs two things:

  • <form> tag with an action that points to the phinde instance
  • Search text field with name of q.

Example:

<form method="get" action="http://phinde.example.org">
  <input type="text" name="q" placeholder="Search text"/>
  <button type="submit">Search</button>
</form>

System service

When using systemd, you can let it run multiple worker instances when the system boots up:

  1. Copy files data/systemd/phinde*.service into /etc/systemd/system/

  2. Adjust user and group names, and the work directories

  3. Enable three worker processes:

    $ systemctl daemon-reload
    $ systemctl enable phinde@1
    $ systemctl enable phinde@2
    $ systemctl enable phinde@3
    $ systemctl enable phinde
    $ systemctl start phinde
    
  4. Now three workers are running. Restarting the phinde service also restarts the workers.

Cron job

Run bin/renew-subscriptions.php once a day with cron. It will renew the WebSub subscriptions.

Howto

Delete index data from one domain:

$ curl -iv -XDELETE -H 'Content-Type: application/json' -d '{"query":{"term":{"domain":"example.org"}}}' http://127.0.0.1:9200/phinde/_query

That's delete-by-query 2.0, see https://www.elastic.co/guide/en/elasticsearch/plugins/2.0/delete-by-query-usage.html

Subscribe to a website/feed

Phinde supports WebSub to get subscribe to changes of a website. When phinde gets notified by the website's hub about changes, it will immediately crawl and index the changed pages.

Subscribe to a website's feed:

$ php bin/subscribe.php http://example.org/feed.atom

Phinde will determine the website's hub and send a registration request to it.

The status page will show the number of working, and the number of open subscriptions.

Unsubscribing also happens on command line:

$ php bin/unsubscribe.php http://example.org/feed.atom

About phinde

Source code

phinde's source code is available from http://git.cweiske.de/phinde.git or the mirror on github.

License

phinde is licensed under the AGPL v3 or later.

Author

phinde was written by Christian Weiske.

About

Self-hosted search engine for your static blog

License:GNU Affero General Public License v3.0


Languages

Language:PHP 88.0%Language:HTML 11.2%Language:CSS 0.8%