algolia / search-bundle

Seamless integration of Algolia Search into your Symfony project.

[RFC] Split Large Records Support

kiler129 opened this issue

Background

Algolia recommends splitting large records (e.g. blog posts) into smaller chunks for better search relevance. There currently seems to be no support for this in the Symfony bundle.

Suggestion

I think this functionality should be implemented in a flexible way, allowing anyone to define decoupled business logic around it. From my perspective, a couple of basic principles have to be met for that (a sketch of a resulting chunk record follows the list):

  • object IDs should contain the FQCN as well as the original ID (similar to aggregators)
  • chunk splitting should not collide with aggregations
  • the original ID should be persisted in a separate field in the index (configurable per index)
  • entities should not be required to contain the splitting logic
  • each index should be able to define the field used for splitting
  • chunks should be easily invalidated
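
To make the ID and separate-field principles concrete, here is a hypothetical shape of a single indexed chunk; the ID format and field names are illustrative, not part of the proposal:

<?php

// Hypothetical shape of one indexed chunk record. The objectID embeds the
// FQCN and the original ID (as aggregators do) plus a chunk sequence; the
// original ID is also persisted in its own configurable field so that all
// chunks of one entity can be filtered and invalidated together.
$chunkRecord = [
    'objectID'    => 'App\Entity\Foo::42::0',
    'original_id' => 42,
    'title'       => 'My blog post',
    'body'        => 'First paragraph of the post ...',
];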

Architecture

So far this is more a rough idea than a plan. First of all, I believe the chunk splitting should happen after all normalizers have been executed. The bundle should not interfere with normalization, especially since, AFAIK, you cannot emit multiple objects for a single normalized object.

I suggest a configuration format similar to:

algolia_search:
    prefix: '%kernel.environment%_'
    settingsDirectory: /config/algolia_search
    chunk_id_transformer: app.foo.id_transformer # by default set to a service provided by the bundle, can be customized 

    indices:
        - name: foos
          class: App\Entity\Foo
          chunking:
              enabled: true
              id_transformer: app.foo.id_transformer # by default set to 'algolia_search.chunk_id_transformer'
              body_transformer: app.foo.chunk_transformer
              context:
                  - custom_marker
                  - something more
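
For completeness, the transformers referenced above would be plain Symfony services implementing the interfaces below. A minimal, hypothetical registration (the service IDs match the config above; the class names are made up):

# config/services.yaml -- illustrative only; class names are hypothetical
services:
    app.foo.id_transformer:
        class: App\Search\FooIdTransformer

    app.foo.chunk_transformer:
        class: App\Search\FooChunkTransformer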

Additionally, two new interfaces should be introduced:
ChunkTransformerInterface

<?php

interface ChunkTransformerInterface
{
    /**
     * Transforms a normalized entity into a stream of chunks.
     *
     * @param object $entity Original entity
     * @param array  $normalized Normalized form of the entity
     * @param array  $chunkContext Context defined in the configuration
     * @param array  $serializationContext Context used during serialization
     *
     * @return iterable|array Stream of chunks (a plain array of chunks on older PHP versions)
     */
    public function transformToChunks($entity, array $normalized, array $chunkContext, array $serializationContext);
}
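
To illustrate the contract, here is a minimal sketch of a transformer that splits the normalized body field into paragraph-sized chunks; the class name and the 'body' field are assumptions, not part of the proposal:

<?php

// Minimal sketch: one chunk per paragraph, repeating the remaining
// normalized attributes (title, author, ...) on every chunk.
class ParagraphChunkTransformer implements ChunkTransformerInterface
{
    public function transformToChunks($entity, array $normalized, array $chunkContext, array $serializationContext)
    {
        foreach (preg_split('/\n{2,}/', $normalized['body']) as $paragraph) {
            yield array_merge($normalized, ['body' => trim($paragraph)]);
        }
    }
}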

IdTransformerInterface
Responsible for generating a new object ID based on a chunk. It is called after chunk transformation, once per chunk.

<?php

interface IdTransformerInterface
{
    /**
     * Transforms the original entity ID into a chunk ID.
     *
     * @param object $entity Original entity
     * @param int    $chunkSequence Position of the chunk in the stream
     * @param array  $chunk The chunk produced by the chunk transformer
     * @param array  $context Context defined in the configuration
     *
     * @return string
     */
    public function transformId($entity, $chunkSequence, array $chunk, array $context = []);
}
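
The bundle-provided default could mirror the aggregator scheme. A sketch, assuming the entity exposes a getId() getter and the FQCN::id::sequence format from the principles above (the exact format is not decided):

<?php

// Sketch of a default ID transformer: FQCN + original ID + chunk sequence.
class DefaultIdTransformer implements IdTransformerInterface
{
    public function transformId($entity, $chunkSequence, array $chunk, array $context = [])
    {
        return sprintf('%s::%s::%d', get_class($entity), $entity->getId(), $chunkSequence);
    }
}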

RFC

My suggestion allows for any business scenario around chunking and a variety of implementations. One can simply implement ChunkTransformerInterface on the entity itself with a very simplistic body like return explode('.', $normalized['body']);, while others may use advanced NL-aware strategies (which is actually closer to what we need).
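
For completeness, the simplistic variant mentioned above, with the entity acting as its own transformer (each sentence is wrapped into a record array so the result matches the interface):

<?php

// Very simplistic: one chunk per sentence. Real code would use an NL-aware
// strategy; this only shows that an entity can be its own transformer.
class Foo implements ChunkTransformerInterface
{
    public function transformToChunks($entity, array $normalized, array $chunkContext, array $serializationContext)
    {
        foreach (explode('.', $normalized['body']) as $sentence) {
            yield array_merge($normalized, ['body' => trim($sentence)]);
        }
    }
}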

I would like to hear from you guys (@nunomaduro @alcaeus ? ;)) what you think about the whole proposition as well as the suggested implementation. I will probably be able to offer a PR for this if the change is desired.

I need to read your RFC carefully, but FYI we have this feature in the Laravel integration: https://www.algolia.com/doc/framework-integration/laravel/advanced-use-cases/split-large-records.

@nunomaduro I saw that docs page. My proposition is similar, but more flexible: it allows for a clear separation of concerns and avoids nasty switches on the entity class inside the splitter itself (which also makes the process faster).

Additionally, giving access to the context allows writing a universal splitter for e.g. articles and comments, where the only difference is the split length (say 1000 words vs. 100 words); see the sketch below.
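
A sketch of that idea, assuming the per-index context is passed to the transformer as a key-value map (max_words is a made-up key):

<?php

// Hypothetical universal splitter: the chunk size comes from the per-index
// context, so one service can split articles (~1000 words) and comments
// (~100 words) alike.
class WordCountChunkTransformer implements ChunkTransformerInterface
{
    public function transformToChunks($entity, array $normalized, array $chunkContext, array $serializationContext)
    {
        $words = preg_split('/\s+/', $normalized['body']);
        $limit = isset($chunkContext['max_words']) ? $chunkContext['max_words'] : 1000;

        foreach (array_chunk($words, $limit) as $chunkWords) {
            yield array_merge($normalized, ['body' => implode(' ', $chunkWords)]);
        }
    }
}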

I also deliberately omitted support for a magic splitBody() method, since it introduces a learning curve and creates potential conflicts, while at the same time requiring detection logic. In my proposition, I assumed the entity itself can implement ChunkTransformerInterface.

@nunomaduro Any word on this? I can poke at an implementation, but it doesn't make sense if it will be left rotting due to architectural doubts.

@kiler129 Your idea seems fine, but I don't want to rush things. Let's hold this for a few weeks.