ryanb / xapit

High level Ruby library for interacting with Xapian, a full text search engine.

Updating existing index

ryanb opened this issue

Instead of clearing the database and recreating the index each time, there should be support for re-indexing only the changed/created records. In Rails applications this can be roughly determined by comparing each record's updated_at timestamp against the time the Xapian database was created. Possible gotchas include:

  • time zone differences between updated_at timestamp and local system time.
  • attributes changed through associated records (possible to use the "touch" method in newer versions of Rails to get around this)
  • leftover facet option records (shouldn't hurt anything)
  • the updating will likely happen in a separate process; how do we communicate this to the main process?
  • if a record is deleted, how do we mark it as needing to be removed from the index? One way is to have a separate table but that can be messy.
  • changes to xapit block won't be handled properly (no way to get around this, just communicate it)
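
To make the updated_at idea concrete, here is a minimal sketch. The `Record` struct and the `stale_records` helper are hypothetical and not part of Xapit; the sketch also assumes the time of the last index build is available (for example from the database directory's modification time). Ruby `Time` objects compare as absolute instants, so the time zone gotcha only bites when a timestamp is read back without its proper offset.

```ruby
require "time"

# Hypothetical stand-in for an ActiveRecord model; only updated_at matters here.
Record = Struct.new(:id, :updated_at)

# Returns the records modified at or after the moment the index was last built.
# Time#>= compares absolute instants, so differing UTC offsets are harmless
# as long as each Time carries the correct offset.
def stale_records(records, indexed_at)
  records.select { |record| record.updated_at >= indexed_at }
end

# Assumed time of the last index build: 12:00 at +02:00, which is 10:00 UTC.
indexed_at = Time.parse("2009-06-01 12:00:00 +0200")

records = [
  Record.new(1, Time.parse("2009-06-01 09:30:00 UTC")),   # before the build
  Record.new(2, Time.parse("2009-06-01 10:30:00 UTC")),   # after the build
  Record.new(3, Time.parse("2009-06-01 13:00:00 +0200"))  # 11:00 UTC, after the build
]

stale_ids = stale_records(records, indexed_at).map(&:id)
```

Only records 2 and 3 would be re-indexed here. Deleted records and changes made through associations are not caught by this check, which is why they are listed as gotchas above.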

An alternative solution which solves some of these issues is to keep the Xapian database loaded in a different process and use a REST API or something similar to communicate with it. This way the writable database is always loaded and can be updated on the fly as records change. Of course, the downside is that it would require a separate process...

I really like your updated_at solution. I think the biggest strength that Xapian has versus other search engines is the fact that you do not need a separate daemon, so I think a REST API would be cool, but it shouldn't be the default mode of operation. It might be cool to use a message queue though?

Right, I will probably provide both solutions in the long run. If you have a large Xapian database then it will take up much more memory because it is loaded for each separate Rails process. If you have 5 instances, it is much more efficient to keep the Xapian database under a separate process. This would also greatly simplify keeping the index up-to-date and changes would be indexed instantly. Something like this:

Config.setup(:database_path => "http://localhost:4321")

But on the implementation side this can be difficult. I would need to create Database, Document, and Query proxy objects which mimic the behavior of the Xapian ones but interact with the remote process.
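
A rough sketch of what such a proxy could look like, covering only a Database-like object. Everything here is hypothetical: the `RemoteDatabaseProxy` class, its method names, the `/documents` and `/search` paths, and the JSON payloads are made up for illustration and are not Xapit's or Xapian's API. The transport is injected as a callable so the sketch runs without a server; a real one might wrap Net::HTTP.

```ruby
require "json"

# Hypothetical proxy: looks like a local database object to the caller,
# but forwards every call to another process as a JSON request.
class RemoteDatabaseProxy
  # transport is any callable taking (path, json_body) and returning a JSON string.
  def initialize(transport)
    @transport = transport
  end

  # Ask the remote process to index (or re-index) one document.
  def add_document(id, fields)
    request("/documents", { "id" => id, "fields" => fields })
  end

  # Ask the remote process to run a search and return the matching ids.
  def search(terms)
    request("/search", { "terms" => terms }).fetch("ids")
  end

  private

  def request(path, payload)
    JSON.parse(@transport.call(path, JSON.generate(payload)))
  end
end

# Fake in-memory "server" standing in for the separate process.
store = {}
fake_transport = lambda do |path, body|
  data = JSON.parse(body)
  case path
  when "/documents"
    store[data["id"]] = data["fields"]
    JSON.generate({ "ok" => true })
  when "/search"
    matches = store.select { |_id, fields| fields.values.any? { |v| v.include?(data["terms"]) } }
    JSON.generate({ "ids" => matches.keys })
  end
end

database = RemoteDatabaseProxy.new(fake_transport)
database.add_document(1, { "title" => "Xapian indexing" })
database.add_document(2, { "title" => "Rails routing" })
matching_ids = database.search("Xapian")
```

Injecting the transport keeps the proxy testable without a network, but it says nothing about the harder part mentioned above: mimicking the full behavior of the Xapian Document and Query objects.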

Since looking at the updated_at is much easier, I'll go with that solution first. Like you said, it is also more convenient for smaller sites so it doesn't require a separate process.

Any ideas on how to handle deleted records? The best I can come up with is recording it to a separate table like acts_as_xapian does.

The way acts_as_xapian does it is quite clever, imho. But the downside is that it ties xapit to Rails and ActiveRecord, which might be undesirable. It's a shame that Workling is so heavily tied to Rails, otherwise that could have been a really nice solution. Maybe you could have some kind of pluggable system for this?

class XapitJobs < ActiveRecord::Base

  def self.update_record(record_id, state)
    self.create!(:record_id => record_id, :state => state)
  end

  def self.index_records
    all.each do |record|
      case record.state
      when "updated"
        # ...
      when "created"
        # ...
      when "deleted"
        # ...
      end
    end
  end

end

class WorklingJobs

  class MyWorklingWorker < Workling::Base
    def index(options)
      # ...
    end
  end

  def self.update_record(record_id, state)
    MyWorklingWorker.asynch_index(:record_id => record_id, :state => state)
  end

end

Xapit::Config.setup(:index_dispatcher => XapitJobs)
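
To show that the dispatcher only needs to respond to update_record, here is a dependency-free sketch of a third one that queues changes in memory. The `MemoryJobs` class is hypothetical, and a plain Hash stands in for the Xapian index; nothing here is existing Xapit behavior.

```ruby
# Hypothetical in-memory dispatcher with the same update_record interface
# as the dispatchers above. Changes are queued and applied later.
class MemoryJobs
  @queue = []

  class << self
    attr_reader :queue

    # Record that a change happened; nothing is indexed yet.
    def update_record(record_id, state)
      @queue << [record_id, state]
    end

    # Drain the queue, applying each change to the index (a plain Hash here).
    # records maps ids to their current content.
    def index_records(index, records)
      until @queue.empty?
        record_id, state = @queue.shift
        case state
        when "created", "updated"
          index[record_id] = records.fetch(record_id)
        when "deleted"
          index.delete(record_id)
        end
      end
      index
    end
  end
end

dispatcher = MemoryJobs # what the :index_dispatcher option would point at
records = { 1 => "first post", 2 => "second post" }
index = { 3 => "stale entry" }

dispatcher.update_record(1, "created")
dispatcher.update_record(2, "updated")
dispatcher.update_record(3, "deleted")
dispatcher.index_records(index, records)
```

After the queue is drained, the index holds records 1 and 2 and the deleted record 3 is gone. An in-memory queue is lost when the process exits, which is one reason a table or a message queue is the more realistic choice.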

I am closing this issue because I will be making two separate projects (each with their own solution) for handling this problem.

Xapit Sync is similar to how acts_as_xapian currently does it. It will keep track of changes in a separate table and use a separate process to update the Xapian database.

Xapit Server is similar to my solution mentioned earlier which requires a separate daemon process to handle searching and record changes through a rack server and REST API.