allegro / solr-ids-export-plugin

IdsExportPlugin

A plugin (more precisely, a set of plugins) for Solr that allows time-efficient export of the Ids of all found documents (or the values of any DocValues-enabled field) in comma-separated format, without sorting. Skipping the sort makes it significantly faster than Solr's built-in /export endpoint.

Note: the plugin is developed and tested on a standalone Solr instance; no promises or guarantees are made about Solr Cloud.

Requirements

  • Solr version > 7.2 (tested with 7.2.1)
  • Solr running in standalone mode (Solr Cloud not supported)

Motivation

The initial motivation for creating this plugin was the ability to produce output that can be used as direct input for the Terms Query Parser in another Solr request. Example:

First, search the Car Brands index for the IDs of all brands that sell in Poland:

http://localhost:8080/car_brands/select?q=availability:pl&fq={!ids field=brand_id}&wt=ids
Output: vw,opel,audi

Then, search the Car Models index for models of those brands with an electric engine:

http://localhost:8080/car_models/select?q=engine:electric&fq={!terms f=brand_id}vw,opel,audi
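
The two-step flow above can be sketched on the client side as follows. This is a minimal illustration, not part of the plugin: the class and method names are hypothetical, it assumes the main query is passed in the q parameter, and it only builds the URL of the second request from the body returned by the first one. Percent-encoding matters here because characters such as {, } and ! are not valid in a raw URL.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper; host, core and field names come from the example above.
final class TermsQueryChaining {

    /** Percent-encodes a single query-string value. */
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Turns the comma-separated Ids of the first response into a Terms Query Parser filter. */
    static String termsFilter(String field, String commaSeparatedIds) {
        return "{!terms f=" + field + "}" + commaSeparatedIds.trim();
    }

    /** Builds the URL of the follow-up request against the car_models core. */
    static String carModelsUrl(String idsResponseBody) {
        return "http://localhost:8080/car_models/select"
                + "?q=" + enc("engine:electric")
                + "&fq=" + enc(termsFilter("brand_id", idsResponseBody));
    }

    public static void main(String[] args) {
        // "vw,opel,audi" stands for the body returned by the first request (wt=ids).
        System.out.println(carModelsUrl("vw,opel,audi"));
    }
}
```

Because the output of the first request is already in the format expected by the Terms Query Parser, no parsing or re-joining of the Ids is needed between the two requests.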

Other possible use cases include:

  • simplifying batch jobs that do some calculation over a full result set and don't require any document order (e.g. recalculating popularity for all products from Poland every day) - removes the need for paging
  • creating reports - finding all documents matching criteria
  • replacing /export endpoint when sorting is not required

Basic concepts

IdsExportPlugin consists of:

  • IdsExportFilter
  • IdsExportSearchComponent
  • IdsExportResponseWriter

The idea behind IdsExportPlugin is to use a post-filter (IdsExportFilter) as the last filter during the request processing phase, which will collect all found Document Ids in an optimized data structure. Then IdsExportSearchComponent will write those Ids to the response, and IdsExportResponseWriter will output them in comma-separated format.

IdsExportFilter

IdsExportFilter is a Solr post-filter. In Solr terminology, a filter is a piece of code that decides whether a document matches the search criteria and should be included in the response. A post-filter is executed after the regular filters, so it only sees the limited set of documents that have already passed them.

IdsExportFilter implements the post-filter interface, but it doesn't really decide whether a document matches the search criteria: it accepts all documents. Instead, it collects the values of a certain field from those documents and stores them in a data structure. The field name is defined in the request URL or in the configuration.

This filter was initially designed to read the values of the documents' unique key, but in fact it can read the values of any field that has DocValues enabled. In this document we will refer to those values as Ids.

Internally, Ids are stored in a data structure:

  • for fields with Numeric or Sorted Numeric DocValues, Ids (which are longs) are stored in a com.carrotsearch.hppc.LongArrayList (a data structure backed by an array of primitive longs)
  • for fields with Binary, Sorted or SortedSet DocValues, Ids (which are Strings) are stored in an ArrayList of org.apache.lucene.util.BytesRef (Lucene's type for binary values, mainly used for Strings)

Ids don't need to be unique - a repeated value is stored once for every occurrence.
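
To make the collection step more concrete, here is a minimal stand-in for the numeric case. It is not the plugin's code and does not use HPPC or the Solr post-filter API; the class name LongIdBuffer and the 50% growth policy are assumptions made for this sketch. It only shows the behaviour described above: values are appended to an array of primitive longs in the order in which they are encountered, and duplicates are kept.

```java
import java.util.Arrays;

// Illustrative stand-in for the collection step; NOT the plugin's implementation.
final class LongIdBuffer {
    private long[] values;
    private int size;

    LongIdBuffer(int initialCapacity) {
        values = new long[Math.max(1, initialCapacity)];
    }

    /** Appends one Id; meant to be called once per matching document, in encounter order. */
    void add(long id) {
        if (size == values.length) {
            // Grow by 50% when full (the growth policy is an assumption of this sketch).
            values = Arrays.copyOf(values, values.length + Math.max(1, values.length >> 1));
        }
        values[size++] = id;
    }

    int size() {
        return size;
    }

    /** Returns a copy holding only the collected Ids. */
    long[] toArray() {
        return Arrays.copyOf(values, size);
    }

    public static void main(String[] args) {
        LongIdBuffer buffer = new LongIdBuffer(2);
        for (long id : new long[] {7, 3, 7, 1}) {
            buffer.add(id);
        }
        System.out.println(Arrays.toString(buffer.toArray())); // prints [7, 3, 7, 1]
    }
}
```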

IdsExportSearchComponent

IdsExportSearchComponent is a search component (a piece of code that executes after request processing but before the response is sent) which simply adds the collected Ids to the Solr response under the key defined in the configuration. After this step, the response contains an additional list with the Ids of all found documents.

IdsExportResponseWriter

The last component, IdsExportResponseWriter, transforms the Solr response into a comma-separated list of Ids. All other response elements are skipped. The MIME type of the response is set to text/plain and the encoding to UTF-8.

Note: using IdsExportResponseWriter is optional. If you don't need the comma-separated format and are fine with the standard Solr JSON/XML/etc. response, you don't have to use IdsExportResponseWriter.
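
The output format described above amounts to writing the values one after another with a separator in between. The sketch below illustrates that idea with plain java.io types; it does not use Solr's response-writer API, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

// Illustrative sketch only; it does not use Solr's response-writer API.
final class SeparatedIdsWriter {

    /** Streams the Ids to the writer, putting the separator between consecutive values. */
    static void write(Writer out, Iterable<?> ids, String separator) throws IOException {
        boolean first = true;
        for (Object id : ids) {
            if (!first) {
                out.write(separator);
            }
            out.write(String.valueOf(id));
            first = false;
        }
    }

    /** Convenience wrapper that renders the whole body into a String. */
    static String render(Iterable<?> ids, String separator) {
        StringWriter out = new StringWriter();
        try {
            write(out, ids, separator);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(List.of(1L, 2L, 3L), ","));           // prints 1,2,3
        System.out.println(render(List.of("vw", "opel", "audi"), ";")); // prints vw;opel;audi
    }
}
```

Streaming the values directly to the writer avoids building one large intermediate String for big result sets.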

Installation

  1. Add the JAR file to Solr's classpath (see https://lucene.apache.org/solr/guide/7_2/lib-directives-in-solrconfig.html)

  2. Add the following to solrconfig.xml:

    <queryParser name="ids" class="pl.allegro.search.solr.ids.filter.IdsExportFilterParserPlugin">
        <int name="bufferInitialSize">100000</int>
    </queryParser>
    <searchComponent name="ids" class="pl.allegro.search.solr.ids.searchcomponent.IdsExportSearchComponent">
        <str name="responseKey">ids</str>
    </searchComponent>
    <queryResponseWriter name="ids" class="pl.allegro.search.solr.ids.responsewriter.IdsExportResponseWriter">
        <str name="responseKey">ids</str>
    </queryResponseWriter>

    The exact meaning of the configuration parameters is described in the Configuration section.

    Each of these components may be registered under any valid name.

    • The name of the IdsExportFilterParserPlugin (which is a factory for IdsExportFilter) appears in the Solr URL (you use it in requests to activate the plugin), so give it a reasonable name. This document assumes the name ids:
      http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}
    • For simplicity, we strongly recommend giving IdsExportSearchComponent the same name as IdsExportFilterParserPlugin.
    • The name of the IdsExportResponseWriter appears in the Solr URL (you use it to change the output format), so give it a reasonable name. For simplicity, we recommend the same name as for IdsExportFilterParserPlugin. This document assumes the name ids:
      http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&wt=ids

Usage examples

http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&wt=ids

# Outputs the comma-separated values of the `product_id` field from all documents in the index.

Example response:
1,2,3,4,5,6

http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&rows=2

# Outputs the list of values of the `product_id` field as an additional attribute of the standard Solr response.

Example response:
{
    "responseHeader": {
        "status": 0,
        "QTime": 2,
        "params": {
            "q": "*:*",
            "fq": "{!ids field=product_id}",
            "rows": "2"
        }
    },
    "response": {
        "numFound":6,
        "start": 0,
        "docs": [
            {
            "product_name": "Test 0",
            "product_id": "0"
            },
            {
            "product_name": "Test 1",
            "product_id": "1"
            }
        ]
    },
    "ids": [
        "0",
        "1",
        "2",
        "3",
        "4",
        "5",
        "6"
    ]
}

# Note: the ids list doesn't respect the rows/start parameters - it always contains everything found.
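
A batch job can consume the comma-separated body of the first example with very little code. The sketch below is a hypothetical client-side helper, not part of the plugin; it assumes a numeric field, the default comma separator, and that an empty result set yields an empty body.

```java
import java.util.Arrays;

// Hypothetical client-side helper for the wt=ids output.
final class IdsResponseParser {

    /** Parses a body such as "1,2,3" into primitive longs (assumes the default "," separator). */
    static long[] parseLongIds(String body) {
        String trimmed = body.trim();
        if (trimmed.isEmpty()) {
            // Assumption of this sketch: no matching documents means an empty body.
            return new long[0];
        }
        String[] parts = trimmed.split(",");
        long[] ids = new long[parts.length];
        for (int i = 0; i < parts.length; i++) {
            ids[i] = Long.parseLong(parts[i].trim());
        }
        return ids;
    }

    public static void main(String[] args) {
        long[] ids = parseLongIds("1,2,3,4,5,6");
        System.out.println(Arrays.toString(ids)); // prints [1, 2, 3, 4, 5, 6]
    }
}
```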

Configuration

IdsExportFilterParserPlugin configuration options available in solrconfig.xml:

  • bufferInitialSize - initial size (in number of items) of the buffer for storing Ids. It should be a bit bigger than the estimated average response size. Generally any number will work, however:
    • if set too low, the buffer will be extended several times during request processing, resulting in increased CPU and memory consumption
    • if set too high, you will unnecessarily allocate a lot of memory
    Default value: 100 000.
  • defaultIndexField - name of the field where Ids are stored. This can also be configured on a per-request basis via the URL parameter field; if the URL parameter is missing, the default configured here is used. Default value: doc_id.
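
The bufferInitialSize trade-off can be estimated with a back-of-the-envelope model. The sketch below is not taken from the plugin's source: it assumes a buffer of primitive longs that grows by 50% whenever it is full, which may differ from the growth policy of the actual data structures. The class and method names are hypothetical.

```java
// Back-of-the-envelope model of the trade-off; NOT taken from the plugin's source.
final class BufferSizing {

    /** Number of re-allocations needed to hold `items` values, assuming 1.5x growth. */
    static int resizes(int initialSize, long items) {
        long capacity = Math.max(1, initialSize);
        int count = 0;
        while (capacity < items) {
            capacity += Math.max(1, capacity >> 1);
            count++;
        }
        return count;
    }

    /** Bytes allocated up front, per request, for a buffer of primitive longs. */
    static long initialBytes(int initialSize) {
        return (long) initialSize * Long.BYTES;
    }

    public static void main(String[] args) {
        // Default initial size against a result set of 1 841 320 documents.
        System.out.println(resizes(100_000, 1_841_320)); // prints 8
        System.out.println(initialBytes(100_000));       // prints 800000
    }
}
```

Under these assumptions, a larger initial size removes the re-allocations (and the copying they involve) at the cost of a bigger allocation for every request, including those that match only a few documents.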

IdsExportFilter configuration options available in URL:

  • field - name of the field where Ids are stored. Default value: the value of defaultIndexField configured in solrconfig.xml
    http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}

IdsExportSearchComponent configuration options available in solrconfig.xml:

  • responseKey - key in the Solr response where Ids should be stored. Default value: ids.

IdsExportResponseWriter configuration options available in solrconfig.xml:

  • responseKey - key in the Solr response where Ids are stored. The final Solr output will contain only the separated values from this key. Default value: ids.
  • separator - a separator (char or String) placed between values in the final Solr output. This document assumes a comma, hence the phrase "comma-separated" used throughout, but it can be changed. Default value: , (comma)

Performance

Single query time comparison

In this test scenario, a single Solr instance processed one request at a time. Each query was sent to Solr three times:

  1. To the /select endpoint, with rows=0 and IdsExportPlugin enabled
  2. To the /export endpoint, with sorting by Ids (sorting was obligatory)
  3. To the /select endpoint, with rows set to the expected size of the result set and sorting by Ids, without IdsExportPlugin

Note: the given times are total request times, including sending the HTTP request, searching and downloading the HTTP response. Technically, times were measured with the Linux time command, which measured the execution time of curl for a given query. This is not a "clean" benchmark of the plugin itself: it includes the overhead of downloading a potentially large response, which also favors IdsExportPlugin thanks to its very concise output format. It is, however, the closest to the actual use cases of the plugin.

Results (times in seconds):

numFound    IdsExportPlugin    /export    /select
2           0.036              0.012      0.008
1082        0.012              0.128      0.136
12957       0.02               1.956      1.949
225816      0.149              55.105     59.068
1841320     0.681              393.532    396.918
5971685     2.232              831.853    822.736

Multi-threaded performance

In this test scenario, a single Solr instance was processing requests incoming via multiple connections concurrently. Each request was sent to two endpoints:

  1. To /select endpoint, with rows=0, and IdsExportPlugin enabled
  2. To /export endpoint, with sorting set to Ids (sorting was obligatory)

The test scenario has been divided into three test cases. In each test case a set of unique phrases has been used, selected to give the expected number of results:

  1. between 20 000 and 50 000 (phrases giving "small" result sets)
  2. between 50 000 and 280 000 (phrases giving "medium sized" result sets)
  3. between 280 000 and 3 100 000 (phrases giving "large" result sets)

This test scenario was carried out using Apache JMeter. All results presented below come from JMeter results.

Results:

concurrent   requests per  total request  result set size   IdsExportPlugin                /export
connections  connection    count          per request       RPS      avg (ms)  max (ms)    RPS    avg (ms)   max (ms)
30           80            2400           20000-50000       489.50   47.00     190.00      3.00   9414.00    26972.00
30           80            2400           50000-280000      199.10   127.00    325.00      0.80   35663.00   126313.00
30           22            660            280000-3100000    30.00    796.00    2669.00     0.10   230305.00  812294.00

Performance - summary

The presented results clearly show that IdsExportPlugin greatly speeds up Ids export from Solr: for larger result sets, response time and throughput may be a couple of hundred times better than with Solr's built-in /export or /select endpoints.

The largest performance killer for /export and /select is sorting of the result set. IdsExportPlugin does not perform any sorting; it just outputs all found Ids in the order in which Solr processes them.
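
The "couple of hundred times" figure can be checked against the single-query results. The snippet below only divides the /export times by the IdsExportPlugin times copied from the table above; the class name is hypothetical.

```java
// Numbers are copied from the single-query results table above (times in seconds).
final class SpeedupCheck {
    static final long[] NUM_FOUND = {2, 1082, 12957, 225816, 1841320, 5971685};
    static final double[] PLUGIN = {0.036, 0.012, 0.02, 0.149, 0.681, 2.232};
    static final double[] EXPORT = {0.012, 0.128, 1.956, 55.105, 393.532, 831.853};

    /** How many times longer /export took than IdsExportPlugin for the given table row. */
    static double speedup(int row) {
        return EXPORT[row] / PLUGIN[row];
    }

    public static void main(String[] args) {
        for (int i = 0; i < NUM_FOUND.length; i++) {
            System.out.printf("%d documents: /export took %.1f times as long%n", NUM_FOUND[i], speedup(i));
        }
    }
}
```

The ratio ranges from roughly 0.3 for the 2-document query, where the plugin was actually slower, to roughly 580 for the query matching 1 841 320 documents.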

Memory consumption

Memory consumption of IdsExportPlugin is not higher than that of the standard /export endpoint.

On the one hand, IdsExportPlugin requires a data structure whose size is proportional to the number of found documents, so the bigger the result set, the more memory is required for processing.

On the other hand, the standard /export endpoint also requires a data structure with size proportional to the result set size, for sorting purposes. Therefore the overall memory footprint of IdsExportPlugin will not be higher than that of /export.

Pro tip:

Generally it's best to use IdsExportPlugin with fields that have DocValues of type Numeric or SortedNumeric - in this case the data structure is com.carrotsearch.hppc.LongArrayList, which internally relies on an array of primitive longs.

All other field types store their Ids in an ArrayList of org.apache.lucene.util.BytesRef - an optimized way of storing Strings.

Build

./gradlew clean build

License

This software is published under Apache License 2.0.
