mu-search

A component to integrate authorization-aware full-text search into a mu.semte.ch stack using Elasticsearch.

Tutorials

Add mu-search to a stack

The mu-search service uses Elasticsearch as a backend. Since the Elasticsearch Docker image requires a high number of memory-mapped areas, increase the maximum map count on your system by executing the following command:

sysctl -w vm.max_map_count=262144

Next, add the mu-search and accompanying elasticsearch services to docker-compose.yml:

services:
  search:
    image: semtech/mu-search:0.9.0
    links:
      - db:database
    volumes:
      - ./config/search:/config
  elasticsearch:
    image: semtech/mu-search-elastic-backend:1.0.0
    volumes:
      - ./data/elasticsearch/:/usr/share/elasticsearch/data
    environment:
      - discovery.type=single-node

The indexes will be persisted in ./data/elasticsearch. The search service needs to be linked to an instance of the mu-authorization service.

Create the ./config/search directory and create a config.json with the following contents:

{
    "types" : [
        {
            "type" : "document",
            "on_path" : "documents",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Document",
            "properties" : {
                "title" : "http://purl.org/dc/elements/1.1/title",
                "description" : "http://purl.org/dc/elements/1.1/description"
            }
        },
        {
            "type" : "user",
            "on_path" : "users",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Person",
            "properties" : {
                "fullname" : "http://xmlns.com/foaf/0.1/name"
            }
         }
    ]
}

Finally, add the following rules to your dispatcher configuration in ./config/dispatcher.ex to make the search endpoint available:

  define_accept_types [
    json: [ "application/json", "application/vnd.api+json" ]
  ]

  @json %{ accept: %{ json: true } }

  get "/search/*path", @json do
    Proxy.forward conn, path, "http://search/"
  end

Restart the dispatcher service to pick up the new configuration:

docker-compose restart dispatcher

Restart the stack using docker-compose up -d. The elasticsearch and search services will be created.

Search queries can now be sent to the /search endpoint. Make sure the user has access to the data according to the authorization rules.
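
For example, with the configuration above, documents can be searched on their title through the dispatcher as follows (the filter syntax is described in the API reference below):

GET /search/documents/search?filter[title]=annual+report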

How-to guides

How to persist indexes on restart

By default search indexes are deleted on (re)start of the mu-search service. This guide describes how to make sure search indexes are persisted on restart. This configuration is especially recommended in production environments.

First, make sure the search indexes are written to a mounted volume by specifying a bind mount to /usr/share/elasticsearch/data on the Elasticsearch container.

services:
  elasticsearch:
    image: semtech/mu-search-elastic-backend:1.0.0
    volumes:
      - ./data/elasticsearch/:/usr/share/elasticsearch/data

Recreate the elasticsearch container by executing the following command:

docker-compose up -d

Next, enable the persistent indexes flag in the root of the search configuration file ./config/search/config.json of your project.

{
  "persist_indexes": true,
  "types": [
    // index type specifications
  ]
}

Restart the search service to pick up the new configuration.

docker-compose restart search

Search indexes will be persisted in the ./data/elasticsearch folder and will not be deleted on restart of the search service.

How to prepare a search index on startup

The search API provided by mu-search is authorization-aware. That is, search results will only contain resources the user is allowed to access. To this end, mu-search organises its search indexes per access right. Based on the user's allowed groups set on the incoming search request, mu-search determines which indexes to search in.

Indexes that don't exist yet will be created before the search operation is performed. Depending on the number of documents to index this may be a time-consuming operation.

Mu-search allows you to configure authorization groups for which the indexes should already be created on startup. This saves time when the first search query for that profile arrives.

Configuration is done via the eager_indexing_groups key in the search configuration file ./config/search/config.json. The eager indexing groups are tightly related to the GroupSpec objects configured in mu-authorization.

eager_indexing_groups is an array of group specifications. Each group specification is an array of objects in which each object consists of:

  • name: name of the group specification (GroupSpec) in mu-authorization
  • variables: array of string values used to construct the graph URI for the group. These variables should match the possible result values of the vars in case of an AccessByQuery access rule in the GroupSpec. In case of an AlwaysAccessible access rule, this should be an empty array.

Example: public data for unauthenticated users

If the application only provides public data for unauthenticated users in the graph http://mu.semte.ch/graphs/public, the following eager indexing groups must be configured:

[
  [ { "name": "public", "variables" : [] } ],
  [ { "name": "clean", "variables": [] } ]
]

Example: data per organization unit

If, next to the public data, data is organized per organization unit in graphs like http://mu.semte.ch/graphs/<unit-name>, the following eager indexing groups must be configured:

[
  [ { "name": "public", "variables" : [] }, { "name": "organization-unit", "variables" : ["finance"] } ],
  [ { "name": "public", "variables" : [] }, { "name": "organization-unit", "variables" : ["legal"] } ],
  [ { "name": "clean", "variables": [] } ]
]

In case a group contains a variable, an eager index must be configured for each possible value if you want all search indexes to be prepared upfront.

Eager indexes may be combined at search time to match the user's allowed groups. For example, if some users have access to the data of the finance department as well as the legal department, both indexes will be queried when the user performs a search operation.

How to integrate mu-search with deltas to update search indexes

This how-to guide explains how to integrate mu-search with the delta-notifier in order to automatically update search index entries when data in the triplestore is modified.

This guide assumes the mu-authorization and delta-notifier components have been added to your stack as explained in their respective installation guides.

Open the delta-notifier rules configuration ./config/delta/rules.js and add the following rule:

  {
    match: {
      // listen to all changes
    },
    callback: {
      url: 'http://search/update',
      method: 'POST'
    },
    options: {
      resourceFormat: "v0.0.1",
      gracePeriod: 10000,
      ignoreFromSelf: true
    }
  }

Enable automatic index updates (not only invalidation) in mu-search by setting the automatic_index_updates flag at the root of ./config/search/config.json.

{
  "automatic_index_updates": true,
  "types": [
     // definition of the indexed types
  ]
}

Restart the search and delta-notifier services.

docker-compose restart search delta-notifier

Any change you make in your application will now trigger a request to the /update endpoint of mu-search. Depending on the indexed resources and properties, mu-search will update the appropriate search index entries.
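
For reference, a delta message in the v0.0.1 resource format roughly looks as follows (the subject, predicate and object values below are purely illustrative):

[
  {
    "inserts": [
      {
        "subject": { "type": "uri", "value": "http://example.com/documents/123" },
        "predicate": { "type": "uri", "value": "http://purl.org/dc/elements/1.1/title" },
        "object": { "type": "literal", "value": "An updated title" }
      }
    ],
    "deletes": []
  }
]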

How to specify a file's content as property

This guide explains how to make the content of files attached to a project resource searchable in the index.

This guide assumes you have already integrated mu-search in your application and configured an index for resources of type schema:Project.

For indexing files mu-search requires a Tika server to extract the content. Add the tika service next to the search and elasticsearch services in docker-compose.yml:

services:
  search:
    ...
  elasticsearch:
    ...
  tika:
    image: apache/tika:1.25-full

Next, add the following mounted volumes to the mu-search service in docker-compose.yml:

  • /data: folder containing the files to be indexed
  • /cache: folder to persist the Tika extraction cache

services:
  search:
    image: semtech/mu-search:0.9.0
    volumes:
      - ./config/search:/config
      - ./data/files:/data
      - ./data/search/cache:/cache

Next, add a files property to the project type index configuration. The files property will hold the content and metadata of the files.

{
    "types" : [
        {
            "type" : "project",
            "on_path" : "projects",
            "rdf_type" : "http://schema.org/Project",
            "properties" : {
                "name" : "http://schema.org/name",
                "files" : {
                   "via" : [
                       "http://purl.org/dc/terms/hasPart",
                       "^http://www.semanticdesktop.org/ontologies/2007/01/19/nie#dataSource"
                   ],
                   "attachment_pipeline" : "attachment"
                 }
            }
        }
    ]
}

via expresses the path from the indexed resource to the file(s) having a URI like <share://path/to/your/file.pdf>.

Recreate the mu-search service using

docker-compose up -d

After reindexing has completed, each indexed project will contain a files property holding the content and metadata of the files linked to the project via dct:hasPart/^nie:dataSource.

Searching the file's content is done using the nested property content on the configured field name (files in this case):

GET /projects/search?filter[files.content]=open-source

How to inspect the content of a search index

The content of a search index can be inspected by running a Kibana dashboard on top of Elasticsearch.

[To be completed...]

Make sure not to expose the Kibana dashboard in a production environment!

How to reset search indexes

[To be completed...]

Reference

Search index configuration

Elasticsearch is used as a search engine. It indexes documents according to a specified configuration and provides a REST API to search documents. The mu-search service is a layer in front of Elasticsearch that allows you to specify the mapping between RDF triples and the Elasticsearch documents/properties. It also integrates with mu-authorization, ensuring users can only search for documents they're allowed to access.

This section describes how to configure the resources and properties to be indexed and how to pass Elasticsearch specific configurations and mapping in the mu-search configuration file.

Indexed resource types and properties

This section describes how the mapping between RDF triples and Elasticsearch documents can be specified in the mounted /config/config.json configuration file.

The config.json file contains a JSON object with a property types. This property contains an array of objects, one per document type that must be searchable.

{
  "types": [
    // object per searchable document type
  ]
}

Note that these types do not map one-to-one onto the search indexes in Elasticsearch. For each document type in the list, a search index will be created per authorization group.

Each type object in the types array consists of the following properties (optional settings and mappings are covered in further sections):

  • type : name of the document type
  • on_path : path on which the search endpoint for this type will be published
  • rdf_type : URI of the rdf:Class of the resources to index
  • properties : mapping of RDF predicates to properties of the Elasticsearch document

properties contains a JSON object with a key per property in the resulting Elasticsearch document. These are the properties that will be searchable via the search API for the given resource type. The value of each key defines the mapping to RDF predicates starting from the root resource.

WARNING: there are two protected fields that should not be used as property keys: uuid and uri. Both are used internally by the mu-search service to store the uuid and URI of the root resource.

Simple properties

In the simplest scenario, the properties that need to be searchable map one-to-one onto a predicate (path) of the resource.

In the example below, a search index per user group will be created for documents and users. The documents index contains resources of type foaf:Document with a title and description. The users index contains foaf:Person resources with only fullname as searchable property.

{
    "types" : [
        {
            "type" : "document",
            "on_path" : "documents",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Document",
            "properties" : {
                "title" : "http://purl.org/dc/elements/1.1/title",
                "description" : "http://purl.org/dc/elements/1.1/description"
            }
        },
        {
            "type" : "user",
            "on_path" : "users",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Person",
            "properties" : {
                "fullname" : "http://xmlns.com/foaf/0.1/name"
            }
         }
    ]
}

If multiple values are found in the triplestore for a given predicate, the resulting value for the property in the search document will be an array of all values.
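
As an illustration, a document resource with two dc:title values might result in a search document roughly like the following (the values are made up; uuid and uri are the internal fields mentioned above):

{
  "uuid": "c020b82b-61f6-4264-93c5-aba0d09812d3",
  "uri": "http://example.com/documents/123",
  "title": ["Annual report 2020", "Jaarverslag 2020"],
  "description": "Overview of the activities in 2020"
}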

A property of the search document may also map to an inverse predicate, i.e. the resource to be indexed is the object instead of the subject of the triple. An inverse predicate is indicated in the mapping by prefixing the predicate URI with ^, as in a SPARQL property path.

In the example below the users index contains a property group that maps to the inverse predicate foaf:member relating a group to a user.

{
    "types" : [
        {
            "type" : "user",
            "on_path" : "users",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Person",
            "properties" : {
                "fullname" : "http://xmlns.com/foaf/0.1/name",
                "group": "^http://xmlns.com/foaf/0.1/member"
            }
         }
    ]
}

Properties can also be mapped to a list of predicates, corresponding to a property path in RDF. In this case, the property value in the configuration is an array of strings, one string per path segment. The path starts from the indexed resource and may also include inverse predicate URIs.

In the example below the documents index contains a property topics that maps to the label of the document's primary topic and a property publishers that maps to the names of the publishers via the inverse foaf:publications predicate.

{
    "types" : [
        {
            "type" : "document",
            "on_path" : "documents",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Document",
            "properties" : {
                "title" : "http://purl.org/dc/elements/1.1/title",
                "description" : "http://purl.org/dc/elements/1.1/description",
                "topics" : [
                  "http://xmlns.com/foaf/0.1/primaryTopic",
                  "http://www.w3.org/2004/02/skos/core#prefLabel"
                ],
                "publishers": [
                  "^http://xmlns.com/foaf/0.1/publications",
                  "http://xmlns.com/foaf/0.1/name"
                ]
            }
        }
    ]
}

File content property

To make the content of a file searchable, it needs to be indexed as a property in a search index. Basic indexing of PDF, Word etc. files is provided using a local Apache Tika instance. A default ingest pipeline named attachment is created on startup of the mu-search service. Note that this is under development and liable to change.

Defining a property to index the content of a file requires the following keys:

  • via : mapping of the RDF predicate (path) that relates the resource with the file(s) to index. The file the predicate path leads to must have a URI starting with share://, indicating the location of the file, e.g. <share://path/to/your/file.pdf>.
  • attachment_pipeline : attachment pipeline to use for indexing the files. Set to attachment to use the default ingest pipeline.

The example below adds a property files in the project type index configuration. The property files will hold the contents of the files related to the project via dct:hasPart/^nie:dataSource.

{
    "types" : [
        {
            "type" : "project",
            "on_path" : "projects",
            "rdf_type" : "http://schema.org/Project",
            "properties" : {
                "name" : "http://schema.org/name",
                "files" : {
                   "via" : [
                       "http://purl.org/dc/terms/hasPart",
                       "^http://www.semanticdesktop.org/ontologies/2007/01/19/nie#dataSource"
                   ],
                   "attachment_pipeline" : "attachment"
                 }
            }
        }
    ]
}

For each file retrieved through the via definition, Tika processing results in an object containing the extracted text (as content), as well as other extracted metadata (in the future). Such an object may look like this:

{
  "content": "Extracted text here"
}

These objects are structured in the same way as the attachment objects resulting from Elasticsearch's Ingest Attachment Processor plugin. Keep in mind that this implies you need to specify the path to a specific property of the attachment object when defining an Elasticsearch mapping. E.g. mapping the file's content for the files field from the example above may look as follows:

{
  "types": [
    {
      "type": "project",
      "on_path": "projects",
      ...
      "mappings" : {
        "properties": {
          "name" : { "type" : "text" },
          "files.content" : { "type" : "text" }
        }
      }
    },
    // other type definitions
  ]
}

Currently, only indexing of local files is supported. The file's logical path as well as other metadata are expected to be in the format specified by the file-service. Files must be present in the Docker volume /data inside the container.

Attachments processed by Tika are cached in the directory /cache (keyed by the SHA-256 of the file contents). This directory must be defined as a shared volume for the cache to be persistent.

See also "How to specify a file's content as property".

[Experimental] Combining resources of multiple types into one index

It's possible to map resources of several different RDF classes onto one index where that makes sense, e.g. if they share the same properties.

In config.json:

      "rdf_type": [
        "http://data.vlaanderen.be/ns/besluit#Bestuurseenheid",
        "http://data.lblod.info/vocabularies/erediensten/CentraalBestuurVanDeEredienst",
        "http://data.lblod.info/vocabularies/erediensten/BestuurVanDeEredienst",
        "http://data.lblod.info/vocabularies/erediensten/RepresentatiefOrgaan"
      ],

Note that this is different from a composite index, where each type has its own index as well as being indexed in the composite index. Another difference is that a composite index allows mapping different properties from the sub-indexes onto one property in the composite index.
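
For illustration, a complete type definition using such a multi-class rdf_type might look as follows. The type name, path and property are hypothetical:

{
    "types" : [
        {
            "type" : "bestuur",
            "on_path" : "besturen",
            "rdf_type" : [
                "http://data.vlaanderen.be/ns/besluit#Bestuurseenheid",
                "http://data.lblod.info/vocabularies/erediensten/CentraalBestuurVanDeEredienst",
                "http://data.lblod.info/vocabularies/erediensten/BestuurVanDeEredienst",
                "http://data.lblod.info/vocabularies/erediensten/RepresentatiefOrgaan"
            ],
            "properties" : {
                "name" : "http://www.w3.org/2004/02/skos/core#prefLabel"
            }
        }
    ]
}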

Nested objects

A search document can contain nested objects up to an arbitrary depth. For example for a person you can nest the address object as a property of the person search document.

A nested object is defined by the following properties:

  • via : mapping of the RDF predicate that relates the resource with the nested object. May also be an inverse URI or a list of predicates (a property path), as for non-nested properties
  • rdf_type : URI of the rdf:Class of the nested object
  • properties : mapping of RDF predicates to properties for the nested object

Objects can be nested to arbitrary depth. The properties object is defined in the same way as the properties of the root document, but the properties of a nested object cannot contain file attachments.

Elasticsearch mappings for nested objects must be specified in the mappings object at the root type using a path expression as key.

In the example below the document's creator is nested in the author property of the search document. The nested person object contains a fullname property and the title of the person's current project as project.

{
    "types" : [
        {
            "type" : "document",
            "on_path" : "documents",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Document",
            "properties" : {
                "title" : "http://purl.org/dc/elements/1.1/title",
                "description" : "http://purl.org/dc/elements/1.1/description",
                "author" : {
                    "via" : "http://purl.org/dc/elements/1.1/creator",
                    "rdf_type" : "http://xmlns.com/foaf/0.1/Person",
                    "properties" : {
                        "fullname" : "http://xmlns.com/foaf/0.1/name",
                        "project": [
                            "http://xmlns.com/foaf/0.1/currentProject",
                            "http://purl.org/dc/elements/1.1/title"
                        ]
                    }
                }
            },
            "mappings": {
              "properties": {
                "title" : { "type" : "text" },
                "author.fullname": { "type" : "text" }
              }
            }
        }
    ]
}

NOTE: currently mu-search does not take the rdf_type of the nested object into account. In the above example, any resource linked via the dc:creator predicate would be included in the Elasticsearch document.

[Experimental] Multilingual properties

Mu-search has experimental support for multilingual values. This can be enabled by setting the type of a property to language-string. Background on this feature can be found in rfcs/multi-language-search.md.

For example:

{
    "types" : [
        {
            "type" : "document",
            "on_path" : "documents",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Document",
            "properties" : {
                "title" : {
                  "via": "http://purl.org/dc/elements/1.1/title",
                  "type": "language-string"
                }
            },
            "mappings": {
              "properties": {
                "title.default" : { "type" : "text" },
                "title.en": { "type" : "text" }
              }
            }
         }
      ]
}

When setting a property type to language-string, mu-search will include the language tag of the literal in the search index. In the above example the title field would be expanded to a language container in the document:

{
  "title": {
    "en": ["the english title"],
    "default": ["this literal had no language tag"]
  }
}

Literals without a language tag are mapped onto the "default" field.

For searching, make sure to either specify the appropriate field (filter[title.en]=xyz) or make use of a wildcard (filter[title.*]=xyz).

It is often advisable to configure a language-specific analyzer for each language; this can be done in the mappings section of the configuration.
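
A minimal sketch of such a mapping, using Elasticsearch's built-in english and dutch language analyzers on the language fields of the title property from the example above (title.nl is added here purely as an illustration):

"mappings": {
  "properties": {
    "title.default": { "type": "text" },
    "title.en": { "type": "text", "analyzer": "english" },
    "title.nl": { "type": "text", "analyzer": "dutch" }
  }
}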

[Experimental] Composite types

A search index can contain documents of different types, e.g. documents (foaf:Document) as well as creative works (schema:CreativeWork). Currently, each simple type the composite index is composed of must also be defined separately in the index configuration.

A definition of a composite type index consists of the following properties:

  • type : name of the composite type
  • composite_types : list of simple type names that constitute the index
  • on_path : path on which the search endpoint will be published
  • properties : mapping of RDF predicates to document properties for each simple type

In contrast to the properties of a simple index, the properties of a composite index are defined as an array. Each entry in the array is an object with the following properties:

  • name : name of property of the search document
  • mappings : mapping to the corresponding simple type property, per simple type. If the mapping for a simple type is absent, the same property name as in the composite document is assumed.

The example below contains 2 simple indexes for documents and creative works, and a composite index dossier containing both simple index types. The composite index contains (1) a property name mapping to the document's title and creative work's name property respectively, and (2) a property description mapping to the description property for both simple types.

{
    "types" : [
        {
            "type" : "document",
            "on_path" : "documents",
            "rdf_type" : "http://xmlns.com/foaf/0.1/Document",
            "properties" : {
                "title" : "http://purl.org/dc/elements/1.1/title",
                "description" : "http://purl.org/dc/elements/1.1/description"
            }
        },
        {
            "type" : "creative-work",
            "on_path" : "creative-works",
            "rdf_type" : "http://schema.org/CreativeWork",
            "properties" : {
                "name": "http://schema.org/name",
                "description": "http://schema.org/description"
            }
         },
         {
            "type" : "dossier",
            "composite_types" : ["document", "creative-work"],
            "on_path" : "dossiers",
            "properties" : [
                {
                    "name" : "name",
                    "mappings" : {
                        "document" : "title",
                        "creative-work" : "name"
                    }
                },
                {
                    "name" : "description",
                    "mappings" : {
                        "document" : "description"
                        // mapping for 'creative-work' is missing, hence same property name 'description' is assumed
                    }
                }
            ]
         }
    ]
}
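
Searching a composite index works the same way as searching a simple index, using the composite property names as filter fields. For example:

GET /dossiers/search?filter[name,description]=fish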

Elasticsearch settings

Elasticsearch provides a lot of index configuration settings for analysis, logging, etc. Mu-search allows this configuration to be provided for the whole domain and/or to be overridden (currently not merged!) on a per-type basis.

To specify Elasticsearch settings for all indexes, use default_settings next to the types specification:

  "types" : [
     // definition of the indexed types
  ],
  "default_settings" : {
    "analysis": {
      "analyzer": {
        "dutchanalyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "dutchstemmer"]
        }
      },
      "filter": {
        "dutchstemmer": {
          "type": "stemmer",
          "name": "dutch"
        }
      }
    }
  }
}

The content of the default_settings object is not elaborated here but can be found in the official Elasticsearch documentation. All settings provided in settings in the Elasticsearch configuration can be used verbatim in the default_settings of mu-search.

To specify Elasticsearch settings for a single type, use settings on the type index specification:

{
  "types": [
    {
      "type": "document",
      "on_path": "documents",
      ...
      "settings" : {
        "analysis": {
          "analyzer": {
            "dutchanalyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "asciifolding", "dutchstemmer"]
            }
          },
          "filter": {
            "dutchstemmer": {
              "type": "stemmer",
              "name": "dutch"
            }
          }
        }
      }
    },
    // other type definitions
  ]
}

Elasticsearch mappings

Elasticsearch provides the option to configure a mapping per index to specify how the properties of a document are stored and indexed. E.g. the type of the property value (string, date, boolean, ...), text-analysis to be applied on the value, etc.

In the mu-search configuration the Elasticsearch mappings can be passed via the mappings property per index type specification.

{
  "types": [
    {
      "type": "document",
      "on_path": "documents",
      ...
      "mappings" : {
        "properties": {
          "title" : { "type" : "text" },
          "description" : { "type" : "text" }
        }
      }
    },
    // other type definitions
  ]
}

The content of the mappings object is not elaborated here but can be found in the official Elasticsearch documentation. All settings provided in mappings.properties in the Elasticsearch configuration can be used verbatim in the mappings of a type in mu-search.

Index options

In the base scenario, indexes are created on an as-needed basis, whenever a new search profile (authorization rights and data type) is received. The first search query for a new search profile may therefore take more time to complete, because the index still needs to be built. Indexes can be manually re-indexed by triggering the POST /:type/index endpoint (see below).

Index metadata in the triple store

When an index is created, it is registered in the triplestore in the <http://mu.semte.ch/authorization> graph.

[To be completed... describe used model in the triplestore]

Persistent indexes

By default, on startup or restart of mu-search, all existing indexes are deleted, since data might have changed in the meantime. However, especially in production environments, regenerating indexes might be a costly operation.

Persistence of indexes can be enabled via the persist_indexes flag at the root of the mu-search configuration file:

{
  "persist_indexes": true,
  "types": [
    // index type specifications
  ]
}

Possible values are true and false. Defaults to false.

Note that if set to true, the indexes may be out-of-date if data has changed in the application while mu-search was down.

Eager indexes

Indexes can be configured to be pre-built when the application starts. For each user search profile for which the indexes need to be prepared, the authorization group names and their corresponding variables need to be passed.

{
  "eager_indexing_groups": [
    [ 
      { "variables": ["company-x"], "name": "organization-read" }, 
      { "variables": ["company-x"], "name": "organization-write" },
      { "variables": [], "name": "public" }
    ],
    [
      { "variables": ["company-y"], "name": "organization-read" },
      { "variables": [], "name": "public" }
    ],
    [ 
      { "variables": [], "name": "clean" }
    ]
  ],
  "types": [
    // index type specifications
  ]
}

Note that if you want to prepare indexes for all user profiles in your application, you will have to provide an entry in the eager_indexing_groups list for each possible variable value. For example, if you have an authorization group defining that a user can only access the data of their own company (hence, the company name is a variable of the authorization group), you will need to define an eager indexing group for each of the possible companies in your application.

Additive index access rights

Additive indexes are indexes that may be combined to respond to a search query in order to fully match the user's authorization groups. If a user is granted access to multiple groups, indexes will be combined to calculate the response. Therefore, it is strongly advised that the indexes contain non-overlapping data.

Only indexes that are defined in the eager_indexing_groups will be used in combinations. If no combination can be found that fully matches the user's authorization groups, a single index will be created for the request's authorization groups.

Assume your application contains a company-specific user group in the authorization configuration, two companies (company X and company Y), and mu-search contains one search index definition for documents. A search index will be generated for the documents of company X and another for the documents of company Y. If a user is granted access to the documents of company X as well as those of company Y, a search query performed by this user will be answered by combining both search indexes.

A typical group to be specified as a single eager_indexing_group is { "variables": [], "name": "clean" }. The index will not contain any data, but will be used in the combination to fully match the user's allowed groups.

Delta integration

Mu-search integrates with the deltas generated by mu-authorization and dispatched by the delta-notifier.

Follow the "How to integrate mu-seach with delta's to update search indexes" guide to setup delta notification handling for mu-search. Deltas are expected in the v0.0.1 format of the delta notifier.

Full index invalidation

By default, when a delta notification is received by mu-search, all indexes containing data related to the changes are invalidated. The index will be rebuilt the next time it is searched.

Note that a change on one resource may trigger the invalidation of multiple indexes depending on the authorization groups.

Partial index updates

As an alternative to full index invalidation, indexes can be updated dynamically on a per-document basis according to the received deltas. When a delta is received, the document corresponding to the delta is updated (or deleted) in every index corresponding to the delta. This update is not a blocking operation: an update will not lock the index, so a search request received at the same time might be run on the not-yet-updated index.

Note that a change on one resource may trigger the update of multiple indexes depending on the authorization groups.

Partial index updates are enabled by setting the automatic_index_updates flag at the root of the search configuration:

{
  "automatic_index_updates": true,
  "types": [
     // definition of the indexed types
  ]
}

Update batching and queueing

When a delta notification is handled, the update to be performed is pushed on the update queue. By default the queue is processed every minute. This timeout can be configured via update_wait_interval_minutes in the root of the search configuration:

{
  "automatic_index_updates": true,
  "update_wait_interval_minutes": 8,
  "types": [
     // definition of the indexed types
  ]
}

Increasing the interval has the advantage that multiple updates on the same document will be applied only once, but has the downside that search results will be out-of-date for a longer time. The optimal value depends on the application (number of updates, indexed properties, user expectations, etc.).

API

This section describes the REST API provided by mu-search.

In order to take access rights into account, each request requires the MU_AUTH_ALLOWED_GROUPS and MU_AUTH_USED_GROUPS headers to be present.
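
In a mu.semte.ch stack these headers are typically set by the mu-identifier and mu-authorization services. Their value is a JSON array of groups following the same {name, variables} structure as the eager indexing groups described above. An illustrative value:

MU_AUTH_ALLOWED_GROUPS: [{"name":"public","variables":[]},{"name":"organization-unit","variables":["finance"]}]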

GET /:type/search

Endpoint to search the given :type index. The request format is JSON-API compliant and intended to match the request format of mu-cl-resources. Search filters are passed using query params.

A subset of the Elasticsearch Query DSL is supported, via the filter, page, and sort query parameters. More complex queries should be sent via POST /:type/search endpoint.

Examples

To search for documents on all fields:

GET /documents/search?filter[_all]=fish

To search for documents on the field name:

GET /documents/search?filter[name]=fish

To search for documents on multiple fields, combined with 'OR':

GET /documents/search?filter[name,description]=fish

To search for documents by their URI:

GET /documents/search?filter[:uri:]=http://data.semte.ch/documents/c020b82b-61f6-4264-93c5-aba0d09812d3

Searching in a file property

To search on a field indexing a file, a specific property of the resulting attachment object must be specified as the filter key using dot-notation.

Currently the following properties are available on an attachment object:

  • content : text content of the file

For example, for a property attachment indexing a file, searching the content of the file is done using the following filter query:

GET /documents/search?filter[attachment.content]=Adobe

Supported search methods

More advanced search options, such as term, range and fuzzy searches, are supported via flags. Flags are expressed in the filter key between colons, before the field name(s). E.g. the term search flag looks as follows:

GET /documents/search?filter[:term:tag]=fish

The following sections list the flags that are currently implemented:

Identifier queries
  • :id: Filter documents by their uuid. Multiple values should be comma-separated, such as filter[:id:]=c9e0fe90-3785-4221-9c4b-bda70bd8d83b,e8cbc03a-97e0-4b97-931b-97caa720db14
  • :uri: Filter documents by their URI. Multiple values should be comma-separated.

Term-level queries
  • :term: : Term query
  • :terms: : Terms query, terms should be comma-separated, such as: filter[:terms:tag]=fish,seafood
  • :prefix: : Prefix query
  • :wildcard: : Wildcard query
  • :regexp: : Regexp query
  • :fuzzy: : Fuzzy query with fuzziness set to "AUTO", allowing multiple fields to be matched.
  • :gt:, :lt:, :gte:, :lte: : Range query
  • :lt,gt:, :lte,gte:, :lt,gte:, :lte,gt: : Combined range query, range limits should be comma-separated such as: GET /documents/search?filter[:lte,gte:importance]=3,7
  • :has: : Filter on documents having any value for the supplied field. To enable the filter, its value must be t. E.g. filter[:has:translation]=t.
  • :has-no: : Filter on documents not having a value for the supplied field. To enable the filter, its value must be t. E.g. filter[:has-no:translation]=t.

Full text queries
Custom queries

Currently, searching on multiple fields is only supported for the following flags:

  • :phrase:
  • :phrase_prefix:
  • :fuzzy:

Multiple filter parameters are supported.

Examples

GET /documents/search?filter[:common:description]=a+cat+named+Barney

GET /documents/search?filter[:common,0.002:description]=a+cat+named+Barney

GET /documents/search?filter[:common,0.002,2:description]=a+cat+named+Barney

GET /documents/search?filter[:sqs:name]=Barney&filter[:has:address]=t

Sorting

Sorting is specified using the sort query parameter, providing the field to sort on and the sort direction (asc or desc). Multiple sort query parameters may be provided.

GET /documents/search?filter[name]=fish&sort[priority]=asc&sort[budget]=desc

Flags can be used to specify Elasticsearch sort modes to sort on multi-valued fields. The following sort mode flags are supported: :min:, :max:, :sum:, :avg:, :median:.

GET /documents/search?filter[name]=fish&sort[:avg:score]=asc

Note that sorting cannot be done on text fields, unless fielddata is enabled (not recommended). Keyword and numerical data types (declared in the type mapping) are recommended for sorting.
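
For example, to allow sorting on the priority and budget fields from the example above, they could be declared with suitable types in the type's mappings (a sketch; the chosen types depend on your data):

"mappings": {
  "properties": {
    "priority": { "type": "keyword" },
    "budget": { "type": "integer" }
  }
}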

Pagination

Pagination is specified using the page[number] and page[size] query parameters:

GET /documents/search?filter[name]=fish&page[number]=2&page[size]=20

The page number is zero-based.

By default the search endpoint doesn't return exact result counts if the result set contains more than 10K items. To enable exact counts pass count=exact as query param (at the cost of some performance).
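
For example, requesting the second page of 20 results with exact counts (page numbers are zero-based):

GET /documents/search?filter[name]=fish&page[number]=1&page[size]=20&count=exact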

Highlighting

Highlighting is specified using the highlight[:fields:] query parameter, providing a comma-separated list of the fields you want highlighted. You can use * as field name to highlight all fields.

No settings are currently supported.

See also https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html.

GET /documents/search?filter[:sqs:]=fish&highlight[:fields:]=name,description
GET /documents/search?filter[:sqs:]=fish&highlight[:fields:]=*

Removing duplicate results

When querying multiple indexes (with additive indexes), identical documents may be returned multiple times. Unique results can be ensured using Elasticsearch's search result collapsing on the uuid field. Search result collapsing can be toggled using the collapse_uuids query parameter:

GET /documents/search?filter[name]=fish&collapse_uuids=t

However, note that the count property in the response still reflects the total number of non-unique results.

[Experimental] POST /:type/search

Accepts a raw Elasticsearch Query DSL as request body to search the given :type index.

This endpoint is mainly intended for testing purposes and sending more complex queries than can be expressed with the GET /:type/search endpoint.

For security reasons, the endpoint is disabled by default. It can be enabled by setting the enable_raw_dsl_endpoint flag in the root of the configuration file:

{
  "enable_raw_dsl_endpoint": true,
  "types": [
     // definition of the indexed types
  ]
}
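
Once enabled, a raw Elasticsearch query can be sent as the request body. A minimal sketch using a standard match query:

POST /documents/search

{
  "query": {
    "match": { "title": "fish" }
  }
}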

Admin endpoints

The admin endpoints can be used to manage the indexes. These endpoints should not be publicly exposed in your application, since they allow 'root' access when no authorization headers are specified on the request.

POST /:type/index

Updates the index(es) for the given :type. If the request is sent with authorization headers, only the authorized indexes are updated. Otherwise, all indexes for the type are updated.

Type _all will update all indexes.
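
For example, assuming the search service is reachable as http://search inside the stack (as in the docker-compose example above), indexes can be rebuilt as follows:

# rebuild all indexes for the 'document' type
curl -X POST http://search/documents/index

# rebuild all indexes of all types
curl -X POST http://search/_all/index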

POST /:type/invalidate

Invalidates the index(es) for the given :type. If the request is sent with authorization headers, only the authorized indexes are invalidated. Otherwise, all indexes for the type are invalidated.

Type _all will invalidate all indexes.

An invalidated index will be updated before executing a new search query on it.

Note that the search index is only marked as invalid in memory, i.e. the index is not removed from Elasticsearch nor from the triplestore. Hence, on restart of mu-search, the index will be considered valid again.

DELETE /:type

Deletes the index(es) for the given :type in Elasticsearch and the triplestore. If the request is sent with authorization headers, only the authorized indexes are deleted. Otherwise, all indexes for the type are deleted.

Type _all will delete all indexes.

A deleted index will be recreated before executing a new search query on it.

POST /update

Processes an update of the delta-notifier. See delta integration.

Currently only delta format v0.0.1 is supported.

Configuration options

This section gives an overview of all configurable options in the search configuration file config.json. Most options are explained in more depth in other sections.

  • (*) persist_indexes : flag to enable the persistence of search indexes on startup. Defaults to false. See persist indexes.
  • (*) automatic_index_updates : flag to apply automatic index updates instead of invalidating indexes on receiving deltas. Defaults to false. See delta integration.
  • eager_indexing_groups : list of user search profiles (list of authorization groups) to be indexed at startup. Defaults to []. See eager indexes.
  • (*) batch_size : number of documents loaded from the RDF store and indexed together in a single batch. Defaults to 100.
  • (*) max_batches : maximum number of batches to index. May result in an incomplete index and should therefore only be used during development. Defaults to 1.
  • (*) number_of_threads : number of threads to use during indexing. Defaults to 1.
  • (*) update_wait_interval_minutes : number of minutes to wait before applying an update. Allows to prevent duplicate updates of the same documents. Defaults to 1.
  • (*) common_terms_cutoff_frequency : default cutoff frequency for a Common terms query. Defaults to 0.0001. See supported search methods.
  • (*) enable_raw_dsl_endpoint : flag to enable the raw Elasticsearch DSL endpoint. This endpoint is disabled by default for security reasons.
  • (*) attachments_path_base : path inside the Docker container where files for the attachment pipeline are mounted. Defaults to /data.

All options prefixed with (*) can also be configured using an UPPERCASED variant as Docker environment variables on the mu-search container. E.g. the batch_size option can be set via the environment variable BATCH_SIZE. Environment variables take precedence over settings configured in config.json.
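
For example, a sketch of overriding some of these options via environment variables on the search service in docker-compose.yml (the values are arbitrary):

services:
  search:
    image: semtech/mu-search:0.9.0
    environment:
      BATCH_SIZE: "200"
      NUMBER_OF_THREADS: "2"
      UPDATE_WAIT_INTERVAL_MINUTES: "5"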

In development mode (setting the environment variable RACK_ENV to development), the application will listen for changes in config.json. Any change will trigger a complete reload of the application, including deleting existing indexes and building the indexes specified in eager_indexing_groups. This behaviour overrules the persist_indexes flag.

Logging

Log messages are logged in a specific scope. A different log level can be configured per scope via environment variables of the form LOG_SCOPE_{scopeName}.

E.g.

search:
  environment:
     LOG_SCOPE_TIKA: "warn"
     LOG_SCOPE_DELTA: "debug"

The following scopes are known:

  • SETUP: system setup and initialization (default: info)
  • INDEX_MGMT: creation, updates and deletion of indexes (default: info)
  • INDEXING: indexing of documents (default: info)
  • SEARCH: execution of search queries (default: warn)
  • TIKA: extraction and indexing of file content using Tika (default: warn)
  • ELASTICSEARCH: all communication with Elasticsearch (default: error)
  • SPARQL: all communication with the database (default: warn)
  • AUTHORIZATION: incoming access rights on requests (default: warn)
  • DELTA: handling of incoming deltas (default: info)
  • UPDATE_HANDLER: processing of the updates triggered by deltas (default: info)

The same log levels as the mu-ruby-template are available:

  • debug
  • info
  • warn
  • error
  • fatal

Environment variables

This section gives an overview of all options that are configurable via environment variables. The options that can be configured in the config.json file as well are not repeated here. This list contains options that can only be configured via environment variables.

  • MAX_REQUEST_URI_LENGTH : maximum length of an incoming request URL. Defaults to 10240.
  • MAX_REQUEST_HEADER_LENGTH : maximum length of the headers of an incoming request. Defaults to 1024000.
  • MAXIMUM_FILE_SIZE : maximum size in bytes of files to extract and index content from. Defaults to 209715200.
  • ELASTIC_READ_TIMEOUT : timeout in seconds of requests to Elasticsearch. Defaults to 180.

Discussions

Why a custom Elasticsearch docker image?

The mu-semtech/search-elastic-backend is a custom Docker image based on the official Elasticsearch image. Providing a custom image allows better control over the version of Elasticsearch (currently v7.2.0) used in combination with the mu-search service.

The custom image also makes sure the required Elasticsearch plugins, such as the ingest-attachment plugin, are already pre-installed, making the integration of mu-search in your stack a lot easier.

Authorization groups vs indexes

Access rights are determined according to the contents of two headers, MU_AUTH_ALLOWED_GROUPS and MU_AUTH_USED_GROUPS.

Currently, a separate Elasticsearch index is created for each combination of document type and authorization group.

[To be completed...]
