Solr as "sidecar" rather than as primary store?

Question

nicolasfranck opened this issue 7 years ago

Wouldn't it be better to use a store like Solr as an extra store, rather than as a primary store?
Solr is good for search, not for backend storage.

Maybe add a "sidecar store" functionality:

- Add the annotation to the "safe" backend store (an RDF store).
- Add the annotation to Solr (optional). When Solr is configured, the search API also uses this store.
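
A minimal sketch of what such "sidecar store" functionality could look like. All type and method names here (`PrimaryStore`, `SearchSidecar`, `AnnotationService`, `scan`) are hypothetical and not taken from the existing code. The in-memory class only stands in for a real RDF store and a real Solr index, and the fallback to scanning the primary store when no sidecar is configured is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types are illustrative and not part of any existing API.
class SidecarSketch {

    /** The "safe" backend store (an RDF store in the proposal). */
    interface PrimaryStore {
        void save(String id, String annotation);
        List<String> scan(String text); // slow fallback search, an assumption of this sketch
    }

    /** Optional search sidecar such as Solr. */
    interface SearchSidecar {
        void index(String id, String annotation);
        List<String> search(String text);
    }

    /** Naive substring matching stands in for both a real store and a real index. */
    static class InMemoryStore implements PrimaryStore, SearchSidecar {
        private final Map<String, String> data = new LinkedHashMap<>();

        public void save(String id, String annotation) { data.put(id, annotation); }
        public void index(String id, String annotation) { data.put(id, annotation); }
        public List<String> scan(String text) { return find(text); }
        public List<String> search(String text) { return find(text); }

        private List<String> find(String text) {
            List<String> ids = new ArrayList<>();
            for (Map.Entry<String, String> e : data.entrySet()) {
                if (e.getValue().contains(text)) ids.add(e.getKey());
            }
            return ids;
        }
    }

    static class AnnotationService {
        private final PrimaryStore primary;
        private final SearchSidecar sidecar; // null when no sidecar is configured

        AnnotationService(PrimaryStore primary, SearchSidecar sidecar) {
            this.primary = primary;
            this.sidecar = sidecar;
        }

        void add(String id, String annotation) {
            primary.save(id, annotation);                       // always written to the safe store
            if (sidecar != null) sidecar.index(id, annotation); // optional copy for search
        }

        List<String> search(String text) {
            // The search API uses the sidecar only when one is configured.
            return sidecar != null ? sidecar.search(text) : primary.scan(text);
        }
    }

    public static void main(String[] args) {
        AnnotationService service =
                new AnnotationService(new InMemoryStore(), new InMemoryStore());
        service.add("anno-1", "a comment on page 3");
        System.out.println(service.search("page 3"));
    }
}
```

Passing `null` as the second constructor argument gives a service that works without any search index, which is the point of treating Solr as optional.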

Glen Robson · Answer 1 · Fri Nov 24 2017 18:02:23 GMT+0800 (China Standard Time)

Hi Nicolas,

You're not the only one to suggest this (cc @eefahy, who has similar concerns). I'm happy to consider changing this but would like to understand the issues with using Solr as backend storage. Are there any issues you are aware of, or experiences you've had with Solr, that would affect the current approach?

The downside of having Solr as a sidecar is the added complexity and the need to keep everything in sync. I have considered using file-based storage as the primary storage mechanism for annotations and then using JMS messaging to stay generic about possible sidecars. But when should the annotation be considered 'accepted' by the annotation store: when it passes the file-level store, or when it has successfully got through all of the sidecars? If it's only the 'main/safe' annotation store, then there is a risk that the annotation fails when it gets to Solr, and by then control has already returned to the user, so they won't be informed of the failure. If you wait until all interested parties have successfully processed the annotation, you're potentially adding a delay to the storage of the annotation.

I would also have to add functionality for re-indexing in case the two storage mechanisms get out of sync.

As mentioned above, I'm happy to look at moving to the Solr sidecar approach, but I want to ensure it brings enough functionality (and reassurance to users) to justify the added complexity.

Thanks

Glen

Nicolas Franck · Answer 2 · Sat Nov 25 2017 07:35:15 GMT+0800 (China Standard Time)

Indeed, ensuring that both stores (backend and search) stay synchronized would add a lot of complexity. But I think the annotation should first be stored safely and reported to the user, and only then sent to the sidecar(s) asynchronously. Search functionality can wait; safe storage comes first. An incremental version number, for example, would help keep track of which annotations still need to be indexed or deleted.
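
To illustrate the version number idea, here is a rough sketch under my own assumptions: the primary store stamps every save or delete with an increasing version, and the sidecar remembers the last version it has processed. The names `Change`, `changesAfter` and `catchUp` are made up for this example. In a real setup `catchUp` would be called asynchronously, after the user has already received a response.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the change-log layout and all names are assumptions.
class VersionedSyncSketch {

    /** One change in the primary store; a null annotation means "deleted". */
    static final class Change {
        final long version;
        final String id;
        final String annotation;

        Change(long version, String id, String annotation) {
            this.version = version;
            this.id = id;
            this.annotation = annotation;
        }
    }

    /** Stand-in for the safe backend store; it stamps every change with a version. */
    static class PrimaryStore {
        private final List<Change> log = new ArrayList<>();
        private long nextVersion = 1;

        synchronized long save(String id, String annotation) {
            log.add(new Change(nextVersion, id, annotation));
            return nextVersion++;
        }

        synchronized long delete(String id) {
            log.add(new Change(nextVersion, id, null));
            return nextVersion++;
        }

        /** Everything the sidecar has not seen yet, oldest first. */
        synchronized List<Change> changesAfter(long version) {
            List<Change> out = new ArrayList<>();
            for (Change c : log) {
                if (c.version > version) out.add(c);
            }
            return out;
        }
    }

    /** Stand-in for the search index; it only remembers how far it has got. */
    static class SearchSidecar {
        final Map<String, String> index = new HashMap<>();
        long lastIndexedVersion = 0;

        void catchUp(PrimaryStore primary) {
            for (Change c : primary.changesAfter(lastIndexedVersion)) {
                if (c.annotation == null) {
                    index.remove(c.id);            // replay a delete
                } else {
                    index.put(c.id, c.annotation); // replay an insert or update
                }
                lastIndexedVersion = c.version;
            }
        }
    }

    public static void main(String[] args) {
        PrimaryStore primary = new PrimaryStore();
        SearchSidecar sidecar = new SearchSidecar();
        primary.save("anno-1", "first");
        primary.save("anno-2", "second");
        primary.delete("anno-1");
        sidecar.catchUp(primary);
        System.out.println(sidecar.index.keySet() + " at version " + sidecar.lastIndexedVersion);
    }
}
```

Because the sidecar only stores a single number, it can fall behind or be restarted and still work out exactly which changes it has missed.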

The reason Solr is less "safe" is that it looks less "clean" to me:

- no transaction isolation (another client can commit for you)
- indexes can become corrupt after a while (and must then be reindexed from scratch)
- when the internal Lucene format changes, you will eventually have to reindex from an external source
- Solr does not have any normalization
- some fields are created just for the sake of search functionality (tokenized, non-tokenized, ...). Changing this requires you to reindex.

I know, it's a matter of taste ;-)

Glen Robson · Answer 3 · Sat Jul 28 2018 10:23:23 GMT+0800 (China Standard Time)

I've been thinking about this more and, in particular, have come across issues with installing Solr on AWS and keeping the data safe between restarts. To solve both these issues, I am going to:

- Use Jena as the primary datastore.
- Set up an internal Java notification system that adapters can subscribe to (e.g. Solr and Elasticsearch, the latter of which seems to be better supported on AWS).
- Set up an activity stream so external services can keep up to date with annotation changes.

The activity stream should also allow a re-sync of Elasticsearch if it gets behind or corrupted. The internal Java notification system would run in a different thread from the one in which the user saves an annotation, so they won't be held up.
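
As a rough illustration of the notification part, here is a sketch using only `java.util.concurrent`. The names `AnnotationListener` and `AnnotationStore` are hypothetical, and a `ConcurrentHashMap` stands in for Jena so that the example runs without extra libraries. Saves return as soon as the primary store has the annotation. Adapters are then notified on a separate single-threaded executor, and a failing adapter is logged without affecting the save or the other adapters.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names are illustrative; the map is a stand-in for Jena.
class NotificationSketch {

    /** Implemented by each adapter (e.g. a Solr or Elasticsearch indexer). */
    interface AnnotationListener {
        void onSaved(String id, String annotation);
    }

    static class AnnotationStore {
        private final Map<String, String> primary = new ConcurrentHashMap<>();
        private final List<AnnotationListener> listeners = new CopyOnWriteArrayList<>();
        private final ExecutorService notifier = Executors.newSingleThreadExecutor();

        void subscribe(AnnotationListener listener) {
            listeners.add(listener);
        }

        /** Returns once the primary store has the annotation; adapters run later. */
        void save(String id, String annotation) {
            primary.put(id, annotation);
            notifier.submit(() -> {
                for (AnnotationListener listener : listeners) {
                    try {
                        listener.onSaved(id, annotation);
                    } catch (RuntimeException e) {
                        // A failing adapter must not affect the save or other adapters.
                        System.err.println("adapter failed for " + id + ": " + e.getMessage());
                    }
                }
            });
        }

        String get(String id) {
            return primary.get(id);
        }

        /** Stops accepting notifications and waits for pending ones to finish. */
        boolean shutdown(long timeoutSeconds) {
            notifier.shutdown();
            try {
                return notifier.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AnnotationStore store = new AnnotationStore();
        Map<String, String> fakeIndex = new ConcurrentHashMap<>();
        store.subscribe((id, annotation) -> fakeIndex.put(id, annotation));
        store.save("anno-1", "hello");
        store.shutdown(5);
        System.out.println(fakeIndex);
    }
}
```

An adapter that fails here simply misses the update, which is why something like the activity stream is still needed to bring it back in sync afterwards.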

Comments welcome!

Nicolas Franck · Answer 4 · Mon Aug 06 2018 14:51:49 GMT+0800 (China Standard Time)

A few comments:

- Document the conversion from primary storage to secondary storage, so that anyone can understand it.
- For full reindexing, a simple script should be enough. It has the advantage that you can reindex without committing after every insert. Of course anyone could do this themselves by iterating over the primary store, but it is cleaner this way.
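
Something along these lines is what I mean by a simple reindex script. This is only a sketch: `SearchIndex` is a made-up interface rather than a real Solr or Elasticsearch client, and a plain `Map` stands in for the primary store. The point is that all documents are added first and a single commit happens at the end.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: SearchIndex is a stand-in, not a real client API.
class ReindexSketch {

    interface SearchIndex {
        void deleteAll();
        void add(String id, String annotation);
        void commit();
    }

    /** In-memory index that makes changes visible only on commit and counts commits. */
    static class CountingIndex implements SearchIndex {
        final Map<String, String> visible = new LinkedHashMap<>();
        private final Map<String, String> pending = new LinkedHashMap<>();
        private boolean clearOnCommit = false;
        int commits = 0;

        public void deleteAll() {
            pending.clear();
            clearOnCommit = true;
        }

        public void add(String id, String annotation) {
            pending.put(id, annotation);
        }

        public void commit() {
            if (clearOnCommit) visible.clear();
            visible.putAll(pending);
            pending.clear();
            clearOnCommit = false;
            commits++;
        }
    }

    /** Rebuilds the index from the primary store with one commit at the end. */
    static int reindex(Map<String, String> primary, SearchIndex index) {
        index.deleteAll();
        int count = 0;
        for (Map.Entry<String, String> entry : primary.entrySet()) {
            index.add(entry.getKey(), entry.getValue()); // no commit per insert
            count++;
        }
        index.commit(); // single commit once everything has been added
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> primary = new LinkedHashMap<>();
        primary.put("anno-1", "first");
        primary.put("anno-2", "second");

        CountingIndex index = new CountingIndex();
        int indexed = reindex(primary, index);
        System.out.println(indexed + " annotations, " + index.commits + " commit(s)");
    }
}
```

Until the final commit the old index contents stay visible in this sketch, so searches keep working while the reindex runs.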